In this project we will:
- Use AWS S3 as data lake storage
- Build an ETL pipeline in Python for music history data stored in the data lake
- Load data from S3 and submit Spark jobs to an EMR cluster, using Cloud9 as the environment to run the scripts
A music streaming startup, Sparkify, has grown its user base and song database even more and wants to move its data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
As a data engineer, you are tasked with building an ETL pipeline that extracts their data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights into what songs their users are listening to.
There are two datasets:
Song dataset: all JSON files are nested in subdirectories under s3a://s3bucket-spark-input/song_data.
Sample:
{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0}
Log dataset: all JSON files are nested in subdirectories under s3a://s3bucket-spark-input/log_data.
Sample:
{"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
For this project, we use a simple star schema consisting of one fact table and four dimension tables:
- songplay (fact table)
- song (dimension table)
- users (dimension table)
- artists (dimension table)
- time (dimension table)
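The sketch below shows how these tables might be derived from the two DataFrames read above. The column selections and the join condition are inferred from the sample records and standard star-schema practice; they are illustrative, not the project's final code.

```python
from pyspark.sql import functions as F

# Dimension tables: select the relevant columns and drop duplicates
song_table = song_df.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates(["song_id"])

artists_table = song_df.select(
    "artist_id", "artist_name", "artist_location",
    "artist_latitude", "artist_longitude"
).dropDuplicates(["artist_id"])

# Only "NextSong" events represent actual song plays
plays = log_df.filter(F.col("page") == "NextSong")

users_table = plays.select(
    "userId", "firstName", "lastName", "gender", "level"
).dropDuplicates(["userId"])

# time table derived from the event timestamp (milliseconds since epoch)
time_table = (
    plays.select((F.col("ts") / 1000).cast("timestamp").alias("start_time"))
    .withColumn("year", F.year("start_time"))
    .withColumn("month", F.month("start_time"))
    .dropDuplicates(["start_time"])
)

# Fact table: join log events to song metadata on title, artist, and duration
songplay_table = plays.join(
    song_df,
    (plays.song == song_df.title)
    & (plays.artist == song_df.artist_name)
    & (plays.length == song_df.duration),
    "left",
).select(
    (F.col("ts") / 1000).cast("timestamp").alias("start_time"),
    F.col("userId").alias("user_id"),
    "level", "song_id", "artist_id",
    F.col("sessionId").alias("session_id"),
    "location", "userAgent",
)
```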
The project includes 4 files:

spark_emr_execute_job.sh
- creates an EMR cluster with 1 master and 3 slave nodes and submits the etl_music_job.py script

etl_music_job.py
- reads song_data and log_data from S3, transforms them to create the five tables above, and writes them to partitioned parquet files in table directories on S3 (a sketch of the write step follows this list)

emr_bootstrap.sh
- installs the necessary libraries on the EMR cluster nodes when the cluster is launched

terminate.sh
- terminates the cluster
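Below is a minimal sketch of the parquet write step performed by etl_music_job.py. The output bucket name (s3bucket-spark-output) and the partition columns are illustrative assumptions, not the project's exact configuration.

```python
# Minimal sketch of the write step; the output bucket name and the
# partition columns are illustrative assumptions.
output = "s3a://s3bucket-spark-output/"

song_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id").parquet(output + "song/")

time_table.write.mode("overwrite") \
    .partitionBy("year", "month").parquet(output + "time/")

users_table.write.mode("overwrite").parquet(output + "users/")
artists_table.write.mode("overwrite").parquet(output + "artists/")
songplay_table.write.mode("overwrite").parquet(output + "songplay/")
```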
- Upload the emr_bootstrap.sh and etl_music_job.py scripts to the S3 bucket under s3a://s3bucket-spark-input (a boto3 sketch of this upload is shown below).
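If you prefer to script this upload rather than use the console, a minimal boto3 sketch is shown below; it assumes the two files sit in your current directory and that the bucket already exists.

```python
# Minimal sketch: uploading the two scripts with boto3.
# Assumes the files are in the current directory and the bucket already exists.
import boto3

s3 = boto3.client("s3")
for script in ["emr_bootstrap.sh", "etl_music_job.py"]:
    s3.upload_file(Filename=script, Bucket="s3bucket-spark-input", Key=script)
```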
- Create a Cloud9 environment in AWS, upload the spark_emr_execute_job.sh and terminate.sh scripts to the project folder that you created, and execute the commands below.
chmod +x spark_emr_execute_job.sh
./spark_emr_execute_job.sh
- Now go to the EMR cluster in the AWS console and wait until it finishes all jobs.
As soon as you run the spark_emr_execute_job.sh script, the following can be observed in the AWS EMR console: once the cluster is created, you can check the status of the Spark job.
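The step status can also be checked programmatically rather than in the console. The boto3 sketch below assumes you have the cluster ID at hand; both the cluster ID and the region are placeholders.

```python
# Minimal sketch: checking the status of the EMR steps with boto3.
# "j-XXXXXXXXXXXXX" and the region are placeholders, not project values.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"

for step in emr.list_steps(ClusterId=cluster_id)["Steps"]:
    print(step["Name"], step["Status"]["State"])
```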
Check the target S3 bucket to confirm that the tables have been written, and inspect the parquet files. As you can see, the processed data is stored back in S3 as parquet files.
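As a quick sanity check, one of the written tables can be read back with Spark; the output path below reuses the illustrative bucket name from the write sketch above.

```python
# Minimal sketch: reading one of the written tables back to verify the output.
# The output bucket name is the same illustrative assumption as above.
song_check = spark.read.parquet("s3a://s3bucket-spark-output/song/")
song_check.printSchema()
print(song_check.count(), "rows written to the song table")
```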