Data Modeling and ETL pipeline for SparkifyDB

Introduction

In this project, we are going to create a Data pipeline, where we will create Star schema for Sparkify Data Analysis. After creating this pipeline extract the logs and songs data, then we will process this data, and stored this data into our created database.

Database Schema

Songplays table:

    CREATE TABLE IF NOT EXISTS songplays
    (
        songplay_id SERIAL PRIMARY KEY,
        start_time TIMESTAMP NOT NULL,
        user_id INTEGER NOT NULL,
        level VARCHAR,
        song_id VARCHAR,
        artist_id VARCHAR,
        session_id INTEGER NOT NULL,
        location VARCHAR,
        user_agent VARCHAR NOT NULL
    );

Users table

    CREATE TABLE IF NOT EXISTS users
    (
        user_id INTEGER PRIMARY KEY,
        first_name VARCHAR,
        last_name VARCHAR,
        gender VARCHAR,
        level VARCHAR
    );

Songs table

    CREATE TABLE IF NOT EXISTS songs
    (
        song_id VARCHAR PRIMARY KEY,
        title VARCHAR NOT NULL,
        artist_id VARCHAR NOT NULL,
        year INTEGER,
        duration NUMERIC NOT NULL
    );

Artists table

    CREATE TABLE IF NOT EXISTS artists
    (
        artist_id VARCHAR PRIMARY KEY,
        name VARCHAR NOT NULL,
        location VARCHAR,
        latitude FLOAT,
        longitude FLOAT
    );

Timetable table

    CREATE TABLE IF NOT EXISTS timetable
    (
        start_time TIMESTAMP PRIMARY KEY,
        hour INTEGER,
        day INTEGER,
        week INTEGER,
        month INTEGER,
        year INTEGER,
        weekday INTEGER
    );

Project Structure

In this project, we have multiple project files, according to the functionalities.

create_tables.py This file contains code for creating the database Schema.
sql_queries.py This file contains all the queries, inlcuding creating tables, and inserting, and getting values from the tables.
etl.py This file contains different functions of loading the dataset, process the dataset and in the end storing that information into the database.

Usage

You can first Create the DB Schema by running the following command.

    python create_tables.py

Then you can perform the ETL tasks, using the following command.

    python etl.py

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data		data
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
sql_queries.py		sql_queries.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Modeling and ETL pipeline for SparkifyDB

Introduction

Database Schema

Project Structure

Usage

About

Releases

Packages

Languages

License

saudBinHabib/data_modeling_with_postgres

Folders and files

Latest commit

History

Repository files navigation

Data Modeling and ETL pipeline for SparkifyDB

Introduction

Database Schema

Project Structure

Usage

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages