Architecture
MySQL 8.0.17
- Connect to MySQL and load the sample data
mysql -h $mysqlhost -u dbadmin -p
MySQL [(none)]> source warehouse-catelog_sales.sql
MySQL [testdb]> SELECT * FROM catalog_sales LIMIT 10;
MySQL [testdb]> SELECT * FROM warehouse LIMIT 10;
# The w_warehouse_sk column in warehouse and the cs_warehouse_sk column in catalog_sales are the join keys used in the data processing that follows.
- Add a JDBC connection
Connection name: rds-mysql8.0
# JDBC URL
jdbc:mysql://$mysqlhost:3306/testdb
# Security group
Open port 3306 in the security group.
For the security group and VPC settings, see Setting Up a VPC to Connect to JDBC Data Stores.
See Glue Connection Properties for details.
- Test the JDBC connection
Do not test the connection after creating it: the test will fail because Glue's default MySQL JDBC driver does not support MySQL 8.0.
Instead, specify the MySQL 8.0 JDBC driver in the ETL job itself; see Bring your own JDBC drivers to your Glue Spark ETL jobs.
Connection Types and Options for ETL in AWS Glue
For example, for Oracle 18:
# Connection options pointing at a custom Oracle JDBC driver stored on S3
connection_oracle18_options = {
    "url": "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd",
    "customJdbcDriverS3Path": "s3://path/ojdbc10.jar",
    "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
df_oracle18 = glueContext.create_dynamic_frame.from_options(
    connection_type="oracle",
    connection_options=connection_oracle18_options)
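The same pattern works for MySQL 8.0. A minimal sketch, assuming mysql-connector-java-8.0.17.jar has already been uploaded to S3 (host, table, credentials, and paths are placeholders):

# Connection options pointing at a custom MySQL 8.0 JDBC driver stored on S3
connection_mysql8_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/testdb",
    "dbtable": "catalog_sales",
    "user": "dbadmin",
    "password": "pwd",
    "customJdbcDriverS3Path": "s3://path/mysql-connector-java-8.0.17.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
df_mysql8 = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_mysql8_options)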
- Create the Glue sample database mysql_ingest
- Create the dummy table (used as a dummy data source in the ETL job)
- Because Glue's default MySQL JDBC driver does not support MySQL 8.0, you cannot point a crawler at the MySQL tables to build table metadata in the AWS Glue Data Catalog as a data source. Instead, we will edit the ETL job directly.
- Create the ETL job mysql8_catalog_sales
Add the Job Parameters
{
"--tablename":"catalog_sales",
"--dbuser": "dbadmin",
"--dbpassword": "YOUR_PASSWORD",
"--dburl":"jdbc:mysql://$mysqlhost:3306/testdb",
"--jdbcS3path":"s3://xxxx/mysqljdbc/",
"--s3OutputPath":"s3://xxx/mysql8_ingest/"
}
- Download mysql-connector-java-8.0.17.jar and upload it to the S3 path specified by jdbcS3path
- Edit the ETL job script; see glue-etl-connect-parameters and the sample ETL script
Sample catalog_sales_etl.py
Important: Make sure you select the rds-mysql8.0 connection you created
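A minimal sketch of what catalog_sales_etl.py does, reading the job parameters above and copying the table to S3 as Parquet (illustrative only; the sample script is authoritative):

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Read the job parameters defined above
args = getResolvedOptions(sys.argv,
                          ["tablename", "dbuser", "dbpassword",
                           "dburl", "jdbcS3path", "s3OutputPath"])

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the MySQL 8.0 table through the custom JDBC driver
connection_options = {
    "url": args["dburl"],
    "dbtable": args["tablename"],
    "user": args["dbuser"],
    "password": args["dbpassword"],
    "customJdbcDriverS3Path": args["jdbcS3path"] + "mysql-connector-java-8.0.17.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql", connection_options=connection_options)

# Write the table to S3 as Parquet for the crawler to pick up
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": args["s3OutputPath"] + args["tablename"]},
    format="parquet")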
- Create a similar ETL job mysql8_warehouse
Add the Job Parameters
{
"--tablename":"warehouse",
"--dbuser": "dbadmin",
"--dbpassword": "YOUR_PASSWORD",
"--dburl":"jdbc:mysql://$mysqlhost:3306/testdb",
"--jdbcS3path":"s3://xxxx/mysqljdbc/",
"--s3OutputPath":"s3://xxx/mysql8_ingest/"
}
Sample warehourse_etl.py
Set up the crawler and populate the table metadata in the AWS Glue Data Catalog for the S3 Parquet files ingested from the MySQL 8.0 tables.
Once the data is stored on S3, the table structure from the source database is lost. The crawler infers the table structure of the data in S3, which lets engines such as Apache Presto or Amazon Athena query the data in S3 directly.
Configure a crawler to crawl the data structures of catalog_sales and warehouse.
After the crawler runs successfully, you can see the two tables in the Glue Data Catalog.
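The crawler can also be created programmatically. A minimal boto3 sketch, assuming a hypothetical crawler name and IAM role, with the placeholder S3 output paths used above:

import boto3

glue = boto3.client("glue")

# Crawl the Parquet output of both ingest jobs into the mysql_ingest database
glue.create_crawler(
    Name="mysql8-ingest-crawler",       # hypothetical name
    Role="AWSGlueServiceRole-mysql8",   # hypothetical IAM role
    DatabaseName="mysql_ingest",
    Targets={"S3Targets": [
        {"Path": "s3://xxx/mysql8_ingest/catalog_sales/"},
        {"Path": "s3://xxx/mysql8_ingest/warehouse/"}]})
glue.start_crawler(Name="mysql8-ingest-crawler")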
Create an aggregation job, warehourse_catalog_sales_join_etl, that joins the two tables to generate an aggregate table for the report query.
- Run the SQL on Athena to make sure the query succeeds:
SELECT w_warehouse_name,
       w_city,
       count(*) AS cnt_sales,
       sum(cs_list_price) AS total_revenue,
       sum(cs_net_profit_double) AS total_net_profit,
       sum(cs_wholesale_cost) AS total_cost
FROM mysql_ingest.catalog_sales
JOIN mysql_ingest.warehouse ON cs_warehouse_sk = w_warehouse_sk
GROUP BY w_warehouse_name, w_city
- Create the warehourse_catalog_sales_join_etl job
Add the Job Parameters
{
"--s3OutputPath":"s3://xxx/mysql8_ingest/"
}
Sample warehourse_catalog_sales_join_etl.py
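A minimal sketch of the aggregation logic, mirroring the Athena query above and reading both tables from the crawled mysql_ingest database (illustrative only; the sample script is authoritative):

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["s3OutputPath"])
glueContext = GlueContext(SparkContext.getOrCreate())

# Read both tables from the Data Catalog
sales = glueContext.create_dynamic_frame.from_catalog(
    database="mysql_ingest", table_name="catalog_sales").toDF()
warehouse = glueContext.create_dynamic_frame.from_catalog(
    database="mysql_ingest", table_name="warehouse").toDF()

# Join on the warehouse surrogate key and aggregate per warehouse and city
report = (sales
          .join(warehouse, sales.cs_warehouse_sk == warehouse.w_warehouse_sk)
          .groupBy("w_warehouse_name", "w_city")
          .agg(F.count(F.lit(1)).alias("cnt_sales"),
               F.sum("cs_list_price").alias("total_revenue"),
               F.sum("cs_net_profit_double").alias("total_net_profit"),
               F.sum("cs_wholesale_cost").alias("total_cost")))

# Write the aggregate table back to S3 as Parquet
report.write.mode("overwrite").parquet(args["s3OutputPath"] + "report/")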
If you need to manage dependencies between tasks in a complex data processing pipeline, Glue's workflow feature can handle upstream and downstream dependencies and scheduling.
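For example, a workflow could start both ingest jobs on demand and run the join job only after both succeed. A minimal boto3 sketch, with hypothetical workflow and trigger names:

import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="mysql8-ingest-workflow")  # hypothetical name

# On-demand trigger that starts both ingest jobs
glue.create_trigger(
    Name="start-ingest",
    WorkflowName="mysql8-ingest-workflow",
    Type="ON_DEMAND",
    Actions=[{"JobName": "mysql8_catalog_sales"},
             {"JobName": "mysql8_warehouse"}])

# Conditional trigger: run the join job only after both ingest jobs succeed
glue.create_trigger(
    Name="run-join-after-ingest",
    WorkflowName="mysql8-ingest-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {"LogicalOperator": "EQUALS",
             "JobName": "mysql8_catalog_sales", "State": "SUCCEEDED"},
            {"LogicalOperator": "EQUALS",
             "JobName": "mysql8_warehouse", "State": "SUCCEEDED"}]},
    Actions=[{"JobName": "warehourse_catalog_sales_join_etl"}])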
In some cases, running an AWS Glue ETL job over a large database table results in out-of-memory (OOM) errors because all the data is read into a single executor. To avoid this situation, you can optimize the number of Apache Spark partitions and parallel JDBC connections that are opened during the job execution.
After crawling a database table, follow these steps to tune the parameters.
In the Data Catalog, edit the table and add the partitioning parameters hashexpression or hashfield.
- hashexpression: If your database table contains a column with numeric values, such as a unique ID, use that column name as the value of hashexpression.
- hashfield: If no suitable numeric column is available, find a column containing string values with a good even distribution (high cardinality) and use that column name as the value of hashfield.
- hashpartitions: Provide a numeric value for hashpartitions (the default is 7). This parameter determines the number of Spark partitions and the resulting number of parallel JDBC connections opened against the target database, as shown in the sketch below.
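These parameters can also be passed directly in the JDBC connection options of the ingest job. Continuing the ingest sketch above; cs_order_number as the hash column is an assumption, so pick a numeric, high-cardinality column from your own table:

# Parallel JDBC read: Glue splits the query on the assumed numeric,
# high-cardinality column cs_order_number into 10 partitions
connection_options = {
    "url": args["dburl"],
    "dbtable": "catalog_sales",
    "user": args["dbuser"],
    "password": args["dbpassword"],
    "hashexpression": "cs_order_number",
    "hashpartitions": "10",
    "customJdbcDriverS3Path": args["jdbcS3path"] + "mysql-connector-java-8.0.17.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql", connection_options=connection_options)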
How to access and analyze on-premises data stores using AWS Glue
Building AWS Glue Spark ETL jobs by bringing your own JDBC drivers for Amazon RDS