Data Modeling With Postgres is the first project of the Udacity Data Engineering Nanodegree. It requires creating a Postgres database for a music streaming app and writing an ETL pipeline that transfers data from files in two local directories into this database using Python and SQL.
The project has the following goals:
- Model a relational database with Postgres using a star schema
- Define fact and dimension tables
- Insert data into tables
- Create an ETL pipeline with Python
The project includes a `data` folder with song and log datasets, as well as the following files:

- `test.ipynb`: displays the first few rows of each table to check the database.
- `create_tables.py`: drops and creates tables.
- `etl.ipynb`: reads and processes a single file from `data/song_data` and `data/log_data` and loads the data into tables. This notebook contains detailed instructions on the ETL process for each of the tables.
- `etl.py`: reads and processes files from `data/song_data` and `data/log_data` and loads them into tables.
- `sql_queries.py`: contains all SQL queries.
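For orientation, here is a minimal sketch of the kind of statements `sql_queries.py` holds; the column definitions below are illustrative, not the project's full schema:

```python
# sql_queries.py (illustrative excerpt) -- the real file defines DROP,
# CREATE, and INSERT statements for all five tables.

# DROP statement, run first so tables can be recreated cleanly.
songplay_table_drop = "DROP TABLE IF EXISTS songplays"

# CREATE statement for the fact table (columns shown are illustrative).
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

# Parameterized INSERT statement for use with psycopg2.
songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id,
                       artist_id, session_id, location, user_agent)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
"""
```

Keeping every statement in one module lets `create_tables.py` and `etl.py` share the same table definitions.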
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their music streaming app. The sparkifydb database is designed to optimize queries on song play analysis.
The data resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in the app.
The song dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
Below is an example of what a single song file, `data/song_data/A/A/B/TRAABJV128F1460C49.json`, looks like:
```json
{
    "num_songs": 1,
    "artist_id": "ARIK43K1187B9AE54C",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "Beverly Hills, CA",
    "artist_name": "Lionel Richie",
    "song_id": "SOBONFF12A6D4F84D8",
    "title": "Tonight Will Be Alright",
    "duration": 307.3824,
    "year": 1986
}
```
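To make the song-side ETL concrete, here is a minimal sketch of reading one song file with pandas and pulling out the values destined for the songs and artists tables; the column selections are illustrative and mirror the schema described below:

```python
import pandas as pd

# Each song file holds a single JSON object on one line, so lines=True
# yields a one-row DataFrame (path is the example file above).
df = pd.read_json("data/song_data/A/A/B/TRAABJV128F1460C49.json", lines=True)

# Values for the songs dimension table.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()

# Values for the artists dimension table.
artist_data = df[["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]].values[0].tolist()
```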
The log dataset consists of log files in JSON format generated by the eventsim event simulator, based on the songs in the song dataset. These files simulate activity logs from a music streaming app according to specified configurations.
Below is an example of what a single line of a single log file, `data/log_data/2018/11/2018-11-09-events.json`, looks like:
```json
{
    "artist": "Beastie Boys",
    "auth": "Logged In",
    "firstName": "Harper",
    "gender": "M",
    "itemInSession": 2,
    "lastName": "Barrett",
    "length": 161.56689,
    "level": "paid",
    "location": "New York-Newark-Jersey City, NY-NJ-PA",
    "method": "PUT",
    "page": "NextSong",
    "registration": 1540685364796.0,
    "sessionId": 275,
    "song": "Lighten Up",
    "status": 200,
    "ts": 1541722186796,
    "userAgent": "\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
    "userId": "42"
}
```
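On the log side, only `NextSong` events represent song plays, and the millisecond `ts` field has to be converted to a timestamp before it can populate the time table. A minimal sketch, assuming pandas:

```python
import pandas as pd

# Each line of a log file is a separate JSON event.
df = pd.read_json("data/log_data/2018/11/2018-11-09-events.json", lines=True)

# Keep only song-play events.
df = df[df["page"] == "NextSong"]

# Convert the millisecond epoch timestamp into a datetime, then break it
# into the units stored in the time table.
t = pd.to_datetime(df["ts"], unit="ms")
time_data = list(zip(t, t.dt.hour, t.dt.day, t.dt.isocalendar().week,
                     t.dt.month, t.dt.year, t.dt.weekday))
```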
The sparkifydb is a Postgres database filled with data from the song and log datasets. It is designed with a star schema to optimize queries on song play analysis. The sparkifydb database contains the following tables:

- Fact table
  - songplays: records in log data associated with song plays, i.e. records with page `NextSong`
- Dimension tables
  - users: users in the app
  - songs: songs in the music database
  - artists: artists in the music database
  - time: timestamps of records in songplays broken down into specific units
The sparkifydb schema is shown below. Primary and foreign keys are marked as `PK` and `FK`, respectively.
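As an illustration of the kind of song play analysis a star schema makes convenient, here is a hypothetical query joining the fact table to the artists dimension; the connection settings and the `name` column are assumptions for the sketch, not taken from this repo:

```python
import psycopg2

# Connect to the local sparkifydb (credentials assumed for this sketch).
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Example analysis: the ten most-played artists. One join from the
# songplays fact table to the artists dimension table is enough.
cur.execute("""
    SELECT a.name, COUNT(*) AS plays
    FROM songplays sp
    JOIN artists a ON sp.artist_id = a.artist_id
    GROUP BY a.name
    ORDER BY plays DESC
    LIMIT 10;
""")
for name, plays in cur.fetchall():
    print(name, plays)

conn.close()
```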
To run the ETL pipeline locally:

- Clone this repo.
- `cd` into the project directory.
- Run `create_tables.py` to create the database and tables:

  ```
  python create_tables.py
  ```

  To confirm that the tables were created with the correct columns, run `test.ipynb`. Make sure to click *Restart kernel* to close the connection to the database after running this notebook.

- Run `etl.py` to fill the tables:

  ```
  python etl.py
  ```

  Remember to run `create_tables.py` before running `etl.py` to reset the tables.
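For reference, a sketch of the flow `create_tables.py` typically implements for this kind of project; the connection settings and the `drop_table_queries`/`create_table_queries` lists imported from `sql_queries.py` are assumptions about the file's layout:

```python
import psycopg2
from sql_queries import drop_table_queries, create_table_queries  # assumed layout

def main():
    # Recreate the sparkifydb database from a default database
    # (connection settings assumed for this sketch).
    conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
    conn.set_session(autocommit=True)
    cur = conn.cursor()
    cur.execute("DROP DATABASE IF EXISTS sparkifydb")
    cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
    conn.close()

    # Connect to the fresh database and (re)create all tables.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    main()
```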
Alexandra Baturina