Architecture
MySQL 8.0.17
- Connect to MySQL and load the sample data
mysql -h $mysqlhost -u dbadmin -p
MySQL [(none)]> source warehouse-catelog_sales.sql
MySQL [testdb]> SELECT * FROM catalog_sales LIMIT 10;
MySQL [testdb]> SELECT * FROM warehouse LIMIT 10;
# The w_warehouse_sk column in warehouse and the cs_warehouse_sk column in catalog_sales are the join keys used in the data processing that follows.
- Add a JDBC connection
Connection name: rds-mysql8.0
# JDBC URL
jdbc:mysql://$mysqlhost:3306/testdb
# Security group
Open port 3306 in the security group.
For the security group and VPC settings, see Setting Up a VPC to Connect to JDBC Data Stores.
See Glue Connection Properties for details.
- Test the JDBC connection
Do not test the connection after creating it: the test will fail because Glue's default MySQL JDBC driver does not support MySQL 8.0.
Instead, specify the MySQL 8.0 JDBC driver in the ETL job itself; see Bring your own JDBC drivers to your Glue Spark ETL jobs.
Connection Types and Options for ETL in AWS Glue
For example, for Oracle 18:
# Connection options pointing at a custom Oracle JDBC driver stored on S3
connection_oracle18_options = {
    "url": "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd",
    "customJdbcDriverS3Path": "s3://path/ojdbc10.jar",
    "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
df_oracle18 = glueContext.create_dynamic_frame.from_options(
    connection_type="oracle",
    connection_options=connection_oracle18_options)
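The same pattern works for MySQL 8.0. A minimal sketch, assuming mysql-connector-java-8.0.17.jar has already been uploaded to S3 (host, table, credentials, and paths are placeholders):

# Connection options pointing at a custom MySQL 8.0 JDBC driver stored on S3
connection_mysql8_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/testdb",
    "dbtable": "catalog_sales",
    "user": "dbadmin",
    "password": "pwd",
    "customJdbcDriverS3Path": "s3://path/mysql-connector-java-8.0.17.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
df_mysql8 = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_mysql8_options)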
- Create the Glue sample database mysql_ingest
- Create the dummy table (used as a dummy data source in the ETL job)
- Because Glue's default MySQL JDBC driver does not support MySQL 8.0, you cannot point a crawler at the MySQL tables to build table metadata in the AWS Glue Data Catalog as a data source. Instead, we will edit the ETL job directly.
- Create the ETL job mysql8_catalog_sales
Add the Job Parameters
{
"--tablename":"catalog_sales",
"--dbuser": "dbadmin",
"--dbpassword": "YOUR_PASSWORD",
"--dburl":"jdbc:mysql://$mysqlhost:3306/testdb",
"--jdbcS3path":"s3://xxxx/mysqljdbc/",
"--s3OutputPath":"s3://xxx/mysql8_ingest/"
}
- Download mysql-connector-java-8.0.17.jar and upload it to the S3 path specified by jdbcS3path
- Edit the ETL job script; see glue-etl-connect-parameters and the sample ETL script
Sample catalog_sales_etl.py
Important: Make sure you select the rds-mysql8.0 connection you created
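A minimal sketch of what catalog_sales_etl.py does, reading the job parameters above and copying the table to S3 as Parquet (illustrative only; the sample script is authoritative):

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Read the job parameters defined above
args = getResolvedOptions(sys.argv,
                          ["tablename", "dbuser", "dbpassword",
                           "dburl", "jdbcS3path", "s3OutputPath"])

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the MySQL 8.0 table through the custom JDBC driver
connection_options = {
    "url": args["dburl"],
    "dbtable": args["tablename"],
    "user": args["dbuser"],
    "password": args["dbpassword"],
    "customJdbcDriverS3Path": args["jdbcS3path"] + "mysql-connector-java-8.0.17.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql", connection_options=connection_options)

# Write the table to S3 as Parquet for the crawler to pick up
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": args["s3OutputPath"] + args["tablename"]},
    format="parquet")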
- Create a similar ETL job mysql8_warehouse
Add the Job Parameters
{
"--tablename":"warehouse",
"--dbuser": "dbadmin",
"--dbpassword": "YOUR_PASSWORD",
"--dburl":"jdbc:mysql://$mysqlhost:3306/testdb",
"--jdbcS3path":"s3://xxxx/mysqljdbc/",
"--s3OutputPath":"s3://xxx/mysql8_ingest/"
}
Sample warehourse_etl.py
Set up the crawler and populate the table metadata in the AWS Glue Data Catalog for the S3 Parquet files ingested from the MySQL 8.0 tables.
Once the data is stored on S3, the table structure from the source database is lost. The crawler infers the table structure of the data in S3, which lets engines such as Apache Presto or Amazon Athena query the data in S3 directly.
Configure a crawler to crawl the data structures of catalog_sales and warehouse.
After the crawler runs successfully, you can see the two tables in the Glue Data Catalog.
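The crawler can also be created programmatically. A minimal boto3 sketch, assuming a hypothetical crawler name and IAM role, with the placeholder S3 output paths used above:

import boto3

glue = boto3.client("glue")

# Crawl the Parquet output of both ingest jobs into the mysql_ingest database
glue.create_crawler(
    Name="mysql8-ingest-crawler",       # hypothetical name
    Role="AWSGlueServiceRole-mysql8",   # hypothetical IAM role
    DatabaseName="mysql_ingest",
    Targets={"S3Targets": [
        {"Path": "s3://xxx/mysql8_ingest/catalog_sales/"},
        {"Path": "s3://xxx/mysql8_ingest/warehouse/"}]})
glue.start_crawler(Name="mysql8-ingest-crawler")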
Create an aggregation job, warehourse_catalog_sales_join_etl, that joins the two tables to generate an aggregate table for the report query.
- Run the SQL on Athena to make sure the query succeeds:
SELECT w_warehouse_name,
       w_city,
       count(*) AS cnt_sales,
       sum(cs_list_price) AS total_revenue,
       sum(cs_net_profit_double) AS total_net_profit,
       sum(cs_wholesale_cost) AS total_cost
FROM mysql_ingest.catalog_sales
JOIN mysql_ingest.warehouse ON cs_warehouse_sk = w_warehouse_sk
GROUP BY w_warehouse_name, w_city
- Create the warehourse_catalog_sales_join_etl job
Add the Job Parameters
{
"--s3OutputPath":"s3://xxx/mysql8_ingest/"
}
Sample warehourse_catalog_sales_join_etl.py
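A minimal sketch of the aggregation logic, mirroring the Athena query above and reading both tables from the crawled mysql_ingest database (illustrative only; the sample script is authoritative):

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["s3OutputPath"])
glueContext = GlueContext(SparkContext.getOrCreate())

# Read both tables from the Data Catalog
sales = glueContext.create_dynamic_frame.from_catalog(
    database="mysql_ingest", table_name="catalog_sales").toDF()
warehouse = glueContext.create_dynamic_frame.from_catalog(
    database="mysql_ingest", table_name="warehouse").toDF()

# Join on the warehouse surrogate key and aggregate per warehouse and city
report = (sales
          .join(warehouse, sales.cs_warehouse_sk == warehouse.w_warehouse_sk)
          .groupBy("w_warehouse_name", "w_city")
          .agg(F.count(F.lit(1)).alias("cnt_sales"),
               F.sum("cs_list_price").alias("total_revenue"),
               F.sum("cs_net_profit_double").alias("total_net_profit"),
               F.sum("cs_wholesale_cost").alias("total_cost")))

# Write the aggregate table back to S3 as Parquet
report.write.mode("overwrite").parquet(args["s3OutputPath"] + "report/")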
If you need to manage dependencies between tasks in a complex data processing pipeline, Glue's workflow feature can handle upstream and downstream dependencies and scheduling.
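For example, a workflow could start both ingest jobs on demand and run the join job only after both succeed. A minimal boto3 sketch, with hypothetical workflow and trigger names:

import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="mysql8-ingest-workflow")  # hypothetical name

# On-demand trigger that starts both ingest jobs
glue.create_trigger(
    Name="start-ingest",
    WorkflowName="mysql8-ingest-workflow",
    Type="ON_DEMAND",
    Actions=[{"JobName": "mysql8_catalog_sales"},
             {"JobName": "mysql8_warehouse"}])

# Conditional trigger: run the join job only after both ingest jobs succeed
glue.create_trigger(
    Name="run-join-after-ingest",
    WorkflowName="mysql8-ingest-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {"LogicalOperator": "EQUALS",
             "JobName": "mysql8_catalog_sales", "State": "SUCCEEDED"},
            {"LogicalOperator": "EQUALS",
             "JobName": "mysql8_warehouse", "State": "SUCCEEDED"}]},
    Actions=[{"JobName": "warehourse_catalog_sales_join_etl"}])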
In some cases, running an AWS Glue ETL job over a large database table results in out-of-memory (OOM) errors because all the data is read into a single executor. To avoid this situation, you can optimize the number of Apache Spark partitions and parallel JDBC connections that are opened during the job execution.
After crawling a database table, follow these steps to tune the parameters.
In the Data Catalog, edit the table and add the partitioning parameters hashexpression or hashfield.
- hashexpression: If your database table contains a column with numeric values, such as a unique ID, use that column name as the value of hashexpression.
- hashfield: If no suitable numeric column is available, find a column containing string values with a good even distribution (high cardinality) and use that column name as the value of hashfield.
- hashpartitions: Provide a numeric value for hashpartitions (the default is 7). This parameter determines the number of Spark partitions and the resulting number of parallel JDBC connections opened against the target database, as shown in the sketch below.
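These parameters can also be passed directly in the JDBC connection options of the ingest job. Continuing the ingest sketch above; cs_order_number as the hash column is an assumption, so pick a numeric, high-cardinality column from your own table:

# Parallel JDBC read: Glue splits the query on the assumed numeric,
# high-cardinality column cs_order_number into 10 partitions
connection_options = {
    "url": args["dburl"],
    "dbtable": "catalog_sales",
    "user": args["dbuser"],
    "password": args["dbpassword"],
    "hashexpression": "cs_order_number",
    "hashpartitions": "10",
    "customJdbcDriverS3Path": args["jdbcS3path"] + "mysql-connector-java-8.0.17.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql", connection_options=connection_options)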
How to access and analyze on-premises data stores using AWS Glue
Building AWS Glue Spark ETL jobs by bringing your own JDBC drivers for Amazon RDS