Code Review: Data Modeling

By Drew White

Code Review - Demonstrating the use of Data Modeling through dimension and fact diagrams and doing an ETL pipeline with that information.

Technologies Used

Google Cloud
Big Query
Draw.io
SQL
Python
Pandas
Jupyter Notebooks
Data Modeling

Description:

Inspect our main tickets source file: data/air_travel/tickets.json

This file include json row records. Each record includes an issued flight ticket such as:

{
  "eticket_num": "498-938211-0795",
  "confirmation": "ZVFDC4",
  "ticket_date": "2022-03-23",
  "price": 723.42,
  "seat": "31I",
  "status": "active",

  "airline": {
    "name": "China Eastern Airlines",
    "iata": "MU",
    "icao": "CES",
    "callsign": "CHINA EASTERN",
    "country": "China"
  },

  "origin": {
    "name": "Montreal / Pierre Elliott Trudeau International Airport",
    "city": "Montreal",
    "country": "Canada",
    "iata": "YUL",
    "icao": "CYUL",
    "latitude": 45.47,
    "longitude": -73.74,
    "altitude": 118,
    "tz_timezone": "America/Toronto"
  },

  "destination": {
    "name": "Chicago Midway International Airport",
    "city": "Chicago",
    "country": "United States",
    "iata": "MDW",
    "icao": "KMDW",
    "latitude": 41.79,
    "longitude": -87.75,
    "altitude": 620,
    "tz_timezone": "America/Chicago"
  },

  "passenger": {
    "first_name": "Robert",
    "last_name": "Brown",
    "gender": "M",
    "birth_date": "1969-02-17",
    "email": "robert.brown.69@hotmail.com",
    "street": "5007 Thomas Way",
    "city": "Lake Hollystad",
    "state": "DC",
    "zip": "20027"
  }
}

Exercise 1: Data Modeling

Using draw.io create a data model. Your data model MUST meet the following requirements:

Contain a tickets fact table
Contain the following dimensions: airlines, airports, and passengers
Develop passengers as an SCD Type2 dimension:
- Passenger email can be used as the natural key
- Be sure to add a surrogate key and effective start/end dates
- You can optionally add an active column
IATA codes can be used as the primary key for both airlines and airports
Use the t-ticket number as the primary key for the tickets fact

Data model created as outlined in exercise. Using data/air_travel/tickets.json

Exercise 2: Data Loading and Normalization

Develop an ETL pipeline that loads our dimensions and facts from the source file. You pipeline MUST meet the following requirements:

General:

Load all dimensions in order: airlines, airports, and passengers
Load the tickets fact table after loading dimensions
Your pipeline can drop/replace tables
You can assume only inserts at this state. No updates, deletes, or merges

Airlines Dim:

Identify unique airlines
Use IATA code as the dimension key

Airports Dim:

Identify unique airports from both origin and destination fields
Use IATA code as the dimension key

Passengers Dim:

Identify unique passengers
Use the passenger email as the dimension natural key
Generate UUIDs for the dimension surrogate keys
Set the effective start date to any date. You can either use the ticket date, current date, or a fixed set date in the past
Set the effective end date to None
Optionally set your active flag to 'Y'
Passenger address columns are considered SCD Type 2 columns
All other columns are SCD Type 1

Tickets Fact:

Link to airlines and airports dimensions by their IATA codes.
You don't need a lookup for airlines and airports since we use their IATA as dimension keys
Link to the passengers dimension by its surrogate key
You need to perform a lookup to the passengers dimension
Load all the tickets

Setup/Installation Requirements

Clone by inputting following into terminal:

git clone https://github.com/Drewrwhite/data_week_8.git

Navigate to directory:
```
cd <directory>
```
Create a virtual environment:
```
python3.7 -m venv venv
```
Activate virtual environment:
```
source venv/bin/activate
```
Install requirements:
```
pip install -r requirements.txt
```
Open directory in VSCode:
```
code .
```

Known Bugs

No known bugs

License

MIT

If you find any issues, please reach out at: d.white0002@gmail.com.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
data/air_travel		data/air_travel
images		images
.gitignore		.gitignore
README.md		README.md
license.txt		license.txt
main.ipynb		main.ipynb
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Code Review: Data Modeling

By Drew White

Code Review - Demonstrating the use of Data Modeling through dimension and fact diagrams and doing an ETL pipeline with that information.

Technologies Used

Description:

Exercise 1: Data Modeling

Exercise 2: Data Loading and Normalization

Setup/Installation Requirements

Known Bugs

License

About

Releases

Packages

Languages

License

Drewrwhite/data_week_8

Folders and files

Latest commit

History

Repository files navigation

Code Review: Data Modeling

By Drew White

Code Review - Demonstrating the use of Data Modeling through dimension and fact diagrams and doing an ETL pipeline with that information.

Technologies Used

Description:

Exercise 1: Data Modeling

Exercise 2: Data Loading and Normalization

Setup/Installation Requirements

Known Bugs

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages