A fake startup called Sparkify wants to analyze the data they've been collecting on their new music streaming app about:
- songs;
- user activity.

The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app (folder `log_data`), as well as a directory with JSON metadata on the songs in their app (folder `song_data`).

They'd like a data engineer to create:
- a PostgreSQL database with tables designed to optimize queries on song play analysis, using fact and dimension tables for a star schema;
- an ETL pipeline that transfers data from files in two local directories into these tables in PostgreSQL.
There are two datasets: the song dataset and the log dataset.

The song dataset:
- A subset of real data from the Million Song Dataset.
- Each file is in JSON format and contains metadata about a song and the artist of that song.
- Example content of file `song_data/A/A/A/TRAAAAW128F429D538.json`:

  ```json
  {
      "num_songs": 1,
      "artist_id": "ARD7TVE1187B99BFB1",
      "artist_latitude": null,
      "artist_longitude": null,
      "artist_location": "California - LA",
      "artist_name": "Casual",
      "song_id": "SOMZWCG12A8C13C480",
      "title": "I Didn't Mean To",
      "duration": 218.93179,
      "year": 0
  }
  ```

- The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:

  ```
  song_data/A/B/C/TRABCEI128F424C983.json
  song_data/A/A/B/TRAABJL12903CDCF1A.json
  ```
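Because the song files sit several directories deep, the ETL code has to collect them recursively. A minimal sketch of that step (the `data/song_data` root matches the repository layout shown further below; `get_json_files` is an illustrative helper, not necessarily the name used in the project):

```python
import glob
import os

def get_json_files(root):
    """Recursively collect all .json files under a dataset root."""
    return sorted(glob.glob(os.path.join(root, "**", "*.json"), recursive=True))

song_files = get_json_files("data/song_data")
print(len(song_files), song_files[:2])
```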
The log dataset:
- Log files in JSON format generated by this event simulator, based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
- Example content of the first two lines of file `log_data/2018/11/2018-11-01-events.json`:

  ```json
  {"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
  {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
  ```

- The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset:

  ```
  log_data/2018/11/2018-11-12-events.json
  log_data/2018/11/2018-11-13-events.json
  ```
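The `ts` field appears to be a Unix timestamp in milliseconds (1541105830796 falls on 2018-11-01, matching the file name), which is what lets the ETL break it into the units of the `time` table described below. A sketch of that conversion with pandas, under that assumption:

```python
import pandas as pd

# `ts` values from the example log lines above, assumed to be epoch milliseconds.
ts = pd.Series([1541105830796, 1541106106796])
t = pd.to_datetime(ts, unit="ms")

time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,  # pandas >= 1.1; use t.dt.week on older versions
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})
print(time_df)
```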
The database uses a star schema with one fact table and four dimension tables.

Fact table:
- `songplays`: records in log data associated with song plays, i.e., records with `page = NextSong`.
  Columns: `songplay_id`, `start_time`, `user_id`, `level`, `song_id`, `artist_id`, `session_id`, `location`, `user_agent`.

Dimension tables:
- `users`: users in the app.
  Columns: `user_id`, `first_name`, `last_name`, `gender`, `level`.
- `songs`: songs in the music database.
  Columns: `song_id`, `title`, `artist_id`, `year`, `duration`.
- `artists`: artists in the music database.
  Columns: `artist_id`, `name`, `location`, `latitude`, `longitude`.
- `time`: timestamps of records in `songplays` broken down into specific units.
  Columns: `start_time`, `hour`, `day`, `week`, `month`, `year`, `weekday`.
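For illustration, the fact-table definition in `sql_queries.py` could look roughly like the sketch below. The column names match the schema above, but the exact types and constraints here are assumptions; the real `sql_queries.py` is authoritative.

```python
# Sketch of the songplays DDL in the style of sql_queries.py.
# Types and constraints are illustrative assumptions.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```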
```
├── README.md: this file
├── constants.py: database constants
├── create_tables.py: create a clean database with all tables. Run this file to reset the tables
|                     before running the ETL scripts.
├── data
│   ├── log_data: JSON log files
│   └── song_data: JSON song files
├── etl.ipynb: read and process a single file from `song_data` and `log_data`, and load the data
|              into the tables. This notebook contains detailed instructions on the ETL process
|              for each of the tables, and served as the basis for `etl.py`.
├── etl.py: read and process the JSON files and insert the data into the dimension and fact tables
├── requirements.txt: project requirements (Python libraries)
├── requirements_dev.txt: additional requirements used for development
├── sql_queries.py: SQL commands (DROP TABLE, CREATE TABLE, INSERT INTO, SELECT); imported in
|                   `create_tables.py`, `etl.py` and `etl.ipynb`
└── test.ipynb: display the first few rows of each table to check if the database is correct
```
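To make the flow in `etl.py` concrete, here is a condensed sketch of the song-file side of the pipeline. It assumes `psycopg2` and `pandas` and a simplified `INSERT`; it approximates the real script rather than copying it.

```python
import glob

import pandas as pd
import psycopg2

def process_song_file(cur, filepath):
    """Read one song JSON file and insert rows into songs and artists."""
    df = pd.read_json(filepath, lines=True)
    song = df[["song_id", "title", "artist_id", "year", "duration"]].values[0]
    cur.execute(
        "INSERT INTO songs (song_id, title, artist_id, year, duration) "
        "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO NOTHING",
        song.tolist(),
    )
    artist = df[["artist_id", "artist_name", "artist_location",
                 "artist_latitude", "artist_longitude"]].values[0]
    cur.execute(
        "INSERT INTO artists (artist_id, name, location, latitude, longitude) "
        "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (artist_id) DO NOTHING",
        artist.tolist(),
    )

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()
# Log files are processed analogously: filter page == 'NextSong', expand ts.
for path in sorted(glob.glob("data/song_data/**/*.json", recursive=True)):
    process_song_file(cur, path)
conn.close()
```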
Create a `conda` environment called `etl-env` and install the requirements in it:

```bash
conda create -yn etl-env python=3.7 --file requirements.txt
```

Use the new `conda` environment in Jupyter Notebook:

```bash
conda install -n base nb_conda_kernels
conda activate etl-env
python -m pip install -r requirements_dev.txt
conda deactivate
jupyter notebook
```

Now you can run the notebooks `etl.ipynb` and `test.ipynb`.
Install PostgreSQL:
- Install Homebrew (macOS): http://brew.sh/
- Install PostgreSQL (macOS):

  ```bash
  brew install postgresql
  ```

Start the service and create the default user and database:

- Start PostgreSQL with homebrew-services:

  ```bash
  brew services run postgresql
  ```

  Alternative:

  ```bash
  pg_ctl -D /usr/local/var/postgres start
  ```

  Note: `brew services start <service>` starts the `<service>` at login, while `brew services run <service>` only runs the `<service>` but doesn't start it at login (nor at boot).

- Connect to the default PostgreSQL database and create the user:

  ```
  $ psql --dbname postgres
  postgres=# \du
  postgres=# CREATE ROLE student WITH LOGIN CREATEDB PASSWORD 'student';
  postgres=# \du
  postgres=# \quit
  ```

- Connect to the default PostgreSQL database as the new user and create the database:

  ```
  $ psql --dbname postgres --username student
  postgres=> \list
  postgres=> CREATE DATABASE studentdb;
  postgres=> \list
  postgres=> GRANT ALL PRIVILEGES ON DATABASE studentdb TO student;
  postgres=> \list
  postgres=> \connect studentdb
  studentdb=> \quit
  ```
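You can sanity-check the setup from Python with a minimal sketch like the one below. The connection parameters mirror the role and database created above; in the project itself they live in `constants.py`.

```python
import psycopg2

# Credentials match the student role and studentdb database created above.
conn = psycopg2.connect(
    host="127.0.0.1",
    dbname="studentdb",
    user="student",
    password="student",
)
cur = conn.cursor()
cur.execute("SELECT version();")
print(cur.fetchone()[0])
conn.close()
```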
Run the table creation script and the ETL pipeline:

```bash
python create_tables.py && python etl.py
```

To debug, connect to the database: `psql --dbname sparkifydb --username student`.

If you receive the error `DETAIL: There is 1 other session using the database.`, close all other connections except the current one. It is probably a running Jupyter notebook, so restart it, or run this query:

```sql
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = current_database() AND pid <> pg_backend_pid();
```
- Remove the Python environment:

  ```bash
  conda env remove -n etl-env
  ```

- Remove all tables:

  ```
  $ psql --dbname sparkifydb
  sparkifydb=# \dn+
  sparkifydb=# DROP SCHEMA public CASCADE;
  sparkifydb=# \dn+
  sparkifydb=# CREATE SCHEMA public;
  sparkifydb=# GRANT ALL ON SCHEMA public TO public;
  sparkifydb=# COMMENT ON SCHEMA public IS 'standard public schema';
  sparkifydb=# \dn+
  sparkifydb=# \quit
  ```

  Or remove the database and the user:

  ```
  $ psql --dbname postgres
  postgres=# DROP DATABASE IF EXISTS sparkifydb;
  postgres=# DROP USER student;
  postgres=# \quit
  ```

- Stop the PostgreSQL service:

  ```bash
  brew services stop postgresql
  ```

  Alternative:

  ```bash
  pg_ctl -D /usr/local/var/postgres stop
  ```

  Check by listing all services managed by `brew services` (`postgresql` should be `stopped`):

  ```bash
  brew services list
  ```
Examples of queries:

- Top 10 most common years when songs were published:

  ```sql
  SELECT year, COUNT(*) FROM songs GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  ```

- Distribution of song duration:

  ```sql
  SELECT percentile,
         PERCENTILE_DISC(percentile) WITHIN GROUP (ORDER BY duration) AS duration
  FROM songs, GENERATE_SERIES(0, 1, 0.25) AS percentile
  GROUP BY 1;
  ```

- Top 10 most common first names and last names of users:

  ```sql
  SELECT first_name, COUNT(*) FROM users GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  SELECT last_name, COUNT(*) FROM users GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  ```

- Top 10 most common artist locations:

  ```sql
  SELECT location, COUNT(*) FROM artists GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  ```

- Percentage of users per gender and level:

  ```sql
  SELECT gender,
         ROUND(100 * CAST(COUNT(*) AS DECIMAL) / (SELECT COUNT(*) FROM users), 2) AS percentage
  FROM users
  GROUP BY 1
  ORDER BY 1;

  SELECT level,
         ROUND(100 * CAST(COUNT(*) AS DECIMAL) / (SELECT COUNT(*) FROM users), 2) AS percentage
  FROM users
  GROUP BY 1
  ORDER BY 1;

  WITH per_gender_level AS (
      SELECT gender, level, COUNT(*) AS count FROM users GROUP BY 1, 2),
  per_gender AS (
      SELECT gender, COUNT(*) AS count FROM users GROUP BY 1)
  SELECT per_gender_level.gender,
         per_gender_level.level,
         ROUND(100 * (CAST(per_gender_level.count AS DECIMAL) / per_gender.count), 2) AS percentage
  FROM per_gender_level
  INNER JOIN per_gender ON per_gender.gender = per_gender_level.gender
  ORDER BY 1, 2;
  ```

- OLAP cube for users' gender and level:

  ```sql
  SELECT gender, level, COUNT(*)
  FROM users
  WHERE gender IS NOT NULL AND level IS NOT NULL
  GROUP BY CUBE (1, 2)
  ORDER BY 1, 2;
  ```

- Description of the number of songs per session:

  ```sql
  SELECT MIN(n_songs), AVG(n_songs), MAX(n_songs)
  FROM (SELECT session_id, COUNT(*) AS n_songs FROM songplays GROUP BY 1) AS songs_per_session;
  ```

- Top 10 most active users:

  ```sql
  SELECT user_id, COUNT(*) FROM songplays GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  ```

- Top 10 most played songs and artists:

  ```sql
  SELECT song_id, COUNT(*) FROM songplays GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  SELECT artist_id, COUNT(*) FROM songplays GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  ```