A fake startup called Sparkify wants to analyze the data they've been collecting on their new music streaming app about:
- songs;
- user activity.

The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app (folder `log_data`), as well as a directory with JSON metadata on the songs in their app (folder `song_data`).

They'd like a data engineer to create:
- a PostgreSQL database with tables designed to optimize queries on song play analysis, using fact and dimension tables for a star schema;
- an ETL pipeline that transfers data from files in two local directories into these tables in PostgreSQL.
There are two datasets: the song dataset and the log dataset.

The song dataset:
- A subset of real data from the Million Song Dataset.
- Each file is in JSON format and contains metadata about a song and the artist of that song.
- Example content of file `song_data/A/A/A/TRAAAAW128F429D538.json`:

  ```json
  {
      "num_songs": 1,
      "artist_id": "ARD7TVE1187B99BFB1",
      "artist_latitude": null,
      "artist_longitude": null,
      "artist_location": "California - LA",
      "artist_name": "Casual",
      "song_id": "SOMZWCG12A8C13C480",
      "title": "I Didn't Mean To",
      "duration": 218.93179,
      "year": 0
  }
  ```

- The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:

  ```
  song_data/A/B/C/TRABCEI128F424C983.json
  song_data/A/A/B/TRAABJL12903CDCF1A.json
  ```
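Because the song files sit several directories deep, the ETL code has to collect them recursively. A minimal sketch of that step (the `data/song_data` root matches the repository layout shown further below; `get_json_files` is an illustrative helper, not necessarily the name used in the project):

```python
import glob
import os

def get_json_files(root):
    """Recursively collect all .json files under a dataset root."""
    return sorted(glob.glob(os.path.join(root, "**", "*.json"), recursive=True))

song_files = get_json_files("data/song_data")
print(len(song_files), song_files[:2])
```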
The log dataset:
- Log files in JSON format generated by this event simulator, based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
- Example content of the first two lines of file `log_data/2018/11/2018-11-01-events.json`:

  ```json
  {"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
  {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
  ```

- The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset:

  ```
  log_data/2018/11/2018-11-12-events.json
  log_data/2018/11/2018-11-13-events.json
  ```
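The `ts` field appears to be a Unix timestamp in milliseconds (1541105830796 falls on 2018-11-01, matching the file name), which is what lets the ETL break it into the units of the `time` table described below. A sketch of that conversion with pandas, under that assumption:

```python
import pandas as pd

# `ts` values from the example log lines above, assumed to be epoch milliseconds.
ts = pd.Series([1541105830796, 1541106106796])
t = pd.to_datetime(ts, unit="ms")

time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,  # pandas >= 1.1; use t.dt.week on older versions
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})
print(time_df)
```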
The database uses a star schema with one fact table and four dimension tables.

Fact table:
- `songplays`: records in log data associated with song plays, i.e., records with `page = NextSong`.
  Columns: `songplay_id`, `start_time`, `user_id`, `level`, `song_id`, `artist_id`, `session_id`, `location`, `user_agent`.

Dimension tables:
- `users`: users in the app.
  Columns: `user_id`, `first_name`, `last_name`, `gender`, `level`.
- `songs`: songs in the music database.
  Columns: `song_id`, `title`, `artist_id`, `year`, `duration`.
- `artists`: artists in the music database.
  Columns: `artist_id`, `name`, `location`, `latitude`, `longitude`.
- `time`: timestamps of records in `songplays` broken down into specific units.
  Columns: `start_time`, `hour`, `day`, `week`, `month`, `year`, `weekday`.
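For illustration, the fact-table definition in `sql_queries.py` could look roughly like the sketch below. The column names match the schema above, but the exact types and constraints here are assumptions; the real `sql_queries.py` is authoritative.

```python
# Sketch of the songplays DDL in the style of sql_queries.py.
# Types and constraints are illustrative assumptions.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```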
```
├── README.md: this file
├── constants.py: database constants
├── create_tables.py: create a clean database with all tables. Run this file to reset the tables
|                     before running the ETL scripts.
├── data
│   ├── log_data: JSON log files
│   └── song_data: JSON song files
├── etl.ipynb: read and process a single file from `song_data` and `log_data`, and load the data
|              into the tables. This notebook contains detailed instructions on the ETL process
|              for each of the tables, and served as the basis for `etl.py`.
├── etl.py: read and process the JSON files and insert the data into the dimension and fact tables
├── requirements.txt: project requirements (Python libraries)
├── requirements_dev.txt: additional requirements used for development
├── sql_queries.py: SQL commands (DROP TABLE, CREATE TABLE, INSERT INTO, SELECT); imported in
|                   `create_tables.py`, `etl.py` and `etl.ipynb`
└── test.ipynb: display the first few rows of each table to check if the database is correct
```
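To make the flow in `etl.py` concrete, here is a condensed sketch of the song-file side of the pipeline. It assumes `psycopg2` and `pandas` and a simplified `INSERT`; it approximates the real script rather than copying it.

```python
import glob

import pandas as pd
import psycopg2

def process_song_file(cur, filepath):
    """Read one song JSON file and insert rows into songs and artists."""
    df = pd.read_json(filepath, lines=True)
    song = df[["song_id", "title", "artist_id", "year", "duration"]].values[0]
    cur.execute(
        "INSERT INTO songs (song_id, title, artist_id, year, duration) "
        "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO NOTHING",
        song.tolist(),
    )
    artist = df[["artist_id", "artist_name", "artist_location",
                 "artist_latitude", "artist_longitude"]].values[0]
    cur.execute(
        "INSERT INTO artists (artist_id, name, location, latitude, longitude) "
        "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (artist_id) DO NOTHING",
        artist.tolist(),
    )

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()
# Log files are processed analogously: filter page == 'NextSong', expand ts.
for path in sorted(glob.glob("data/song_data/**/*.json", recursive=True)):
    process_song_file(cur, path)
conn.close()
```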
Create a `conda` environment called `etl-env` and install the requirements in it:

```bash
conda create -yn etl-env python=3.7 --file requirements.txt
```

Use the new `conda` environment in Jupyter Notebook:

```bash
conda install -n base nb_conda_kernels
conda activate etl-env
python -m pip install -r requirements_dev.txt
conda deactivate
jupyter notebook
```

Now you can run the notebooks `etl.ipynb` and `test.ipynb`.
Install PostgreSQL:
- Install Homebrew (macOS): http://brew.sh/
- Install PostgreSQL (macOS):

  ```bash
  brew install postgresql
  ```

Start the service and create the default user and database:

- Start PostgreSQL with homebrew-services:

  ```bash
  brew services run postgresql
  ```

  Alternative:

  ```bash
  pg_ctl -D /usr/local/var/postgres start
  ```

  Note: `brew services start <service>` starts the `<service>` at login, while `brew services run <service>` only runs the `<service>` but doesn't start it at login (nor at boot).

- Connect to the default PostgreSQL database and create the user:

  ```
  $ psql --dbname postgres
  postgres=# \du
  postgres=# CREATE ROLE student WITH LOGIN CREATEDB PASSWORD 'student';
  postgres=# \du
  postgres=# \quit
  ```

- Connect to the default PostgreSQL database as the new user and create the database:

  ```
  $ psql --dbname postgres --username student
  postgres=> \list
  postgres=> CREATE DATABASE studentdb;
  postgres=> \list
  postgres=> GRANT ALL PRIVILEGES ON DATABASE studentdb TO student;
  postgres=> \list
  postgres=> \connect studentdb
  studentdb=> \quit
  ```
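You can sanity-check the setup from Python with a minimal sketch like the one below. The connection parameters mirror the role and database created above; in the project itself they live in `constants.py`.

```python
import psycopg2

# Credentials match the student role and studentdb database created above.
conn = psycopg2.connect(
    host="127.0.0.1",
    dbname="studentdb",
    user="student",
    password="student",
)
cur = conn.cursor()
cur.execute("SELECT version();")
print(cur.fetchone()[0])
conn.close()
```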
Run the table creation script and the ETL pipeline:

```bash
python create_tables.py && python etl.py
```

To debug, connect to the database: `psql --dbname sparkifydb --username student`.

If you receive the error `DETAIL: There is 1 other session using the database.`, close all other connections except the current one. It is probably a running Jupyter notebook, so restart it, or run this query:

```sql
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = current_database() AND pid <> pg_backend_pid();
```
- Remove the Python environment:

  ```bash
  conda env remove -n etl-env
  ```

- Remove all tables:

  ```
  $ psql --dbname sparkifydb
  sparkifydb=# \dn+
  sparkifydb=# DROP SCHEMA public CASCADE;
  sparkifydb=# \dn+
  sparkifydb=# CREATE SCHEMA public;
  sparkifydb=# GRANT ALL ON SCHEMA public TO public;
  sparkifydb=# COMMENT ON SCHEMA public IS 'standard public schema';
  sparkifydb=# \dn+
  sparkifydb=# \quit
  ```

  Or remove the database and the user:

  ```
  $ psql --dbname postgres
  postgres=# DROP DATABASE IF EXISTS sparkifydb;
  postgres=# DROP USER student;
  postgres=# \quit
  ```

- Stop the PostgreSQL service:

  ```bash
  brew services stop postgresql
  ```

  Alternative:

  ```bash
  pg_ctl -D /usr/local/var/postgres stop
  ```

  Check by listing all services managed by `brew services` (`postgresql` should be `stopped`):

  ```bash
  brew services list
  ```
Examples of queries:

- Top 10 most common years when songs were published:

  ```sql
  SELECT year, COUNT(*) FROM songs GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  ```

- Distribution of song duration:

  ```sql
  SELECT percentile,
         PERCENTILE_DISC(percentile) WITHIN GROUP (ORDER BY duration) AS duration
  FROM songs, GENERATE_SERIES(0, 1, 0.25) AS percentile
  GROUP BY 1;
  ```

- Top 10 most common first names and last names of users:

  ```sql
  SELECT first_name, COUNT(*) FROM users GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  SELECT last_name, COUNT(*) FROM users GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  ```

- Top 10 most common artist locations:

  ```sql
  SELECT location, COUNT(*) FROM artists GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  ```

- Percentage of users per gender and level:

  ```sql
  SELECT gender,
         ROUND(100 * CAST(COUNT(*) AS DECIMAL) / (SELECT COUNT(*) FROM users), 2) AS percentage
  FROM users
  GROUP BY 1
  ORDER BY 1;

  SELECT level,
         ROUND(100 * CAST(COUNT(*) AS DECIMAL) / (SELECT COUNT(*) FROM users), 2) AS percentage
  FROM users
  GROUP BY 1
  ORDER BY 1;

  WITH per_gender_level AS (
      SELECT gender, level, COUNT(*) AS count FROM users GROUP BY 1, 2),
  per_gender AS (
      SELECT gender, COUNT(*) AS count FROM users GROUP BY 1)
  SELECT per_gender_level.gender,
         per_gender_level.level,
         ROUND(100 * (CAST(per_gender_level.count AS DECIMAL) / per_gender.count), 2) AS percentage
  FROM per_gender_level
  INNER JOIN per_gender ON per_gender.gender = per_gender_level.gender
  ORDER BY 1, 2;
  ```

- OLAP cube for users' gender and level:

  ```sql
  SELECT gender, level, COUNT(*)
  FROM users
  WHERE gender IS NOT NULL AND level IS NOT NULL
  GROUP BY CUBE (1, 2)
  ORDER BY 1, 2;
  ```

- Description of the number of songs per session:

  ```sql
  SELECT MIN(n_songs), AVG(n_songs), MAX(n_songs)
  FROM (SELECT session_id, COUNT(*) AS n_songs FROM songplays GROUP BY 1) AS songs_per_session;
  ```

- Top 10 most active users:

  ```sql
  SELECT user_id, COUNT(*) FROM songplays GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  ```

- Top 10 most played songs and artists:

  ```sql
  SELECT song_id, COUNT(*) FROM songplays GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  SELECT artist_id, COUNT(*) FROM songplays GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  ```