Code Review - Demonstrating the use of Data Modeling through dimension and fact diagrams and doing an ETL pipeline with that information.
- Google Cloud
- Big Query
- Draw.io
- SQL
- Python
- Pandas
- Jupyter Notebooks
- Data Modeling
Inspect our main tickets source file: data/air_travel/tickets.json
This file include json row records. Each record includes an issued flight ticket such as:
{
"eticket_num": "498-938211-0795",
"confirmation": "ZVFDC4",
"ticket_date": "2022-03-23",
"price": 723.42,
"seat": "31I",
"status": "active",
"airline": {
"name": "China Eastern Airlines",
"iata": "MU",
"icao": "CES",
"callsign": "CHINA EASTERN",
"country": "China"
},
"origin": {
"name": "Montreal / Pierre Elliott Trudeau International Airport",
"city": "Montreal",
"country": "Canada",
"iata": "YUL",
"icao": "CYUL",
"latitude": 45.47,
"longitude": -73.74,
"altitude": 118,
"tz_timezone": "America/Toronto"
},
"destination": {
"name": "Chicago Midway International Airport",
"city": "Chicago",
"country": "United States",
"iata": "MDW",
"icao": "KMDW",
"latitude": 41.79,
"longitude": -87.75,
"altitude": 620,
"tz_timezone": "America/Chicago"
},
"passenger": {
"first_name": "Robert",
"last_name": "Brown",
"gender": "M",
"birth_date": "1969-02-17",
"email": "robert.brown.69@hotmail.com",
"street": "5007 Thomas Way",
"city": "Lake Hollystad",
"state": "DC",
"zip": "20027"
}
}
Using draw.io create a data model. Your data model MUST meet the following requirements:
- Contain a tickets fact table
- Contain the following dimensions: airlines, airports, and passengers
- Develop passengers as an SCD Type2 dimension:
- Passenger email can be used as the natural key
- Be sure to add a surrogate key and effective start/end dates
- You can optionally add an active column
- IATA codes can be used as the primary key for both airlines and airports
- Use the t-ticket number as the primary key for the tickets fact
Data model created as outlined in exercise. Using data/air_travel/tickets.json
Develop an ETL pipeline that loads our dimensions and facts from the source file. You pipeline MUST meet the following requirements:
General:
- Load all dimensions in order: airlines, airports, and passengers
- Load the tickets fact table after loading dimensions
- Your pipeline can drop/replace tables
- You can assume only inserts at this state. No updates, deletes, or merges
Airlines Dim:
- Identify unique airlines
- Use IATA code as the dimension key
Airports Dim:
- Identify unique airports from both origin and destination fields
- Use IATA code as the dimension key
Passengers Dim:
- Identify unique passengers
- Use the passenger email as the dimension natural key
- Generate UUIDs for the dimension surrogate keys
- Set the effective start date to any date. You can either use the ticket date, current date, or a fixed set date in the past
- Set the effective end date to None
- Optionally set your active flag to 'Y'
- Passenger address columns are considered SCD Type 2 columns
- All other columns are SCD Type 1
Tickets Fact:
- Link to airlines and airports dimensions by their IATA codes.
- You don't need a lookup for airlines and airports since we use their IATA as dimension keys
- Link to the passengers dimension by its surrogate key
- You need to perform a lookup to the passengers dimension
- Load all the tickets
- Clone by inputting following into terminal:
git clone https://github.com/Drewrwhite/data_week_8.git
- Navigate to directory:
cd <directory>
- Create a virtual environment:
python3.7 -m venv venv
- Activate virtual environment:
source venv/bin/activate
- Install requirements:
pip install -r requirements.txt
- Open directory in VSCode:
code .
- No known bugs
If you find any issues, please reach out at: d.white0002@gmail.com.
Copyright (c) 2023 Drew White