This project analyzes Uber data using SQL on Google BigQuery to gain insights into ride-sharing patterns and optimize operations.
- Explore and understand the Uber data schema (star schema with fact and dimension tables)
- Build an ETL pipeline using Mage.ai to load data from Google Cloud Storage to BigQuery
- Utilize SQL queries in BigQuery to generate reports and answer business questions
- Create an analytics table for building interactive dashboards in Looker
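The star schema referenced above can be sketched as a fact table keyed to dimension tables. The DDL below is a minimal illustration using SQLite in place of BigQuery; the table and column names are assumptions drawn from the TLC data dictionary, and the real schema is the one modeled in Lucidchart.

```python
import sqlite3

# Minimal star-schema sketch (SQLite stand-in for BigQuery).
# Table/column names are illustrative, not the project's exact schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datetime_dim (
    datetime_id          INTEGER PRIMARY KEY,
    tpep_pickup_datetime TEXT,
    pickup_hour          INTEGER,
    pickup_weekday       INTEGER
);
CREATE TABLE payment_type_dim (
    payment_type_id   INTEGER PRIMARY KEY,
    payment_type      INTEGER,
    payment_type_name TEXT
);
CREATE TABLE fact_table (
    trip_id         INTEGER PRIMARY KEY,
    datetime_id     INTEGER REFERENCES datetime_dim(datetime_id),
    payment_type_id INTEGER REFERENCES payment_type_dim(payment_type_id),
    fare_amount     REAL,
    trip_distance   REAL
);
""")

# List the tables that were created.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['datetime_dim', 'fact_table', 'payment_type_dim']
```

Each dimension holds descriptive attributes behind a surrogate key, while the fact table keeps only measures and foreign keys, which is what makes the later BigQuery aggregations cheap.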
- Data Source: NYC TLC Trip Record Data: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Data Dictionary: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
- Data Modeling: Lucidchart (star schema)
- Data Transformation: Jupyter Notebook
- Data Storage: Google Cloud Storage
- Data Management: Google Compute Engine, Mage.ai
- Data Analysis: BigQuery (SQL)
- Data Visualization: Looker
- Download the Uber data from https://github.com/KanikaGaikwad/Uber-data-engineering-project-ETL-pipeline/blob/main/uber_data.csv
- Define the star schema for the data using Lucidchart.
- Create Jupyter notebooks to transform the data and create fact and dimension tables.
- Upload the data to Google Cloud Storage.
- Set up a Google Compute Engine virtual machine and install Mage.ai on it.
- Build an ETL pipeline in Mage.ai to load the tables into BigQuery.
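The transformation step above (flat CSV into fact and dimension tables) can be sketched with pandas. The rows and column names below are illustrative samples following the TLC data dictionary, not data from `uber_data.csv`:

```python
import pandas as pd

# Toy version of the Jupyter transformation: split a flat trip table into a
# payment-type dimension and a fact table. Sample rows are made up.
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2016-03-01 00:00:00", "2016-03-01 00:00:00",
                             "2016-03-01 01:30:00"],
    "payment_type": [1, 2, 1],
    "fare_amount": [9.0, 7.5, 12.0],
    "trip_distance": [2.5, 1.8, 4.1],
})

# Dimension: one row per distinct payment type, with a surrogate key.
payment_type_dim = (trips[["payment_type"]]
                    .drop_duplicates()
                    .reset_index(drop=True))
payment_type_dim["payment_type_id"] = payment_type_dim.index

# Fact: measures plus the dimension's surrogate key instead of the raw value.
fact_table = trips.merge(payment_type_dim, on="payment_type")[
    ["payment_type_id", "fare_amount", "trip_distance"]]

print(len(payment_type_dim), len(fact_table))
```

The same pattern repeats for each dimension (datetime, rate code, pickup/dropoff location), after which the resulting DataFrames are the tables the Mage.ai pipeline exports to BigQuery.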
- This project requires familiarity with SQL and BigQuery; Looker is needed for the dashboarding steps.
- The Jupyter notebooks contain the data transformation code.
- The Mage.ai pipeline automates the data loading process to BigQuery.
- The SQL queries written for BigQuery are available in the project repository.
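To give a flavor of those report queries, here is one answering a typical business question (average fare by payment type), run on SQLite with toy data purely for illustration; the real queries run in BigQuery against the fact and dimension tables:

```python
import sqlite3

# Toy report query: average fare per payment type, highest first.
# Table name and values are illustrative, not the project's actual data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_table (payment_type_id INTEGER, fare_amount REAL)")
conn.executemany("INSERT INTO fact_table VALUES (?, ?)",
                 [(1, 10.0), (1, 14.0), (2, 8.0)])

rows = conn.execute("""
    SELECT payment_type_id, AVG(fare_amount) AS avg_fare
    FROM fact_table
    GROUP BY payment_type_id
    ORDER BY avg_fare DESC
""").fetchall()
print(rows)  # [(1, 12.0), (2, 8.0)]
```

The same `GROUP BY`/aggregate shape carries over to BigQuery unchanged; only the table references differ.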
- Explore specific business questions related to Uber's ride-sharing data.
- Utilize advanced SQL functions for deeper data analysis.
- Develop interactive dashboards in Looker to showcase insights.
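As one example of the "advanced SQL functions" mentioned above, window functions such as `RANK() OVER (...)` are supported by BigQuery. The sketch below uses SQLite (3.25+) and a made-up `pickup_zone` column for illustration; the TLC data actually uses numeric location IDs:

```python
import sqlite3

# Window-function sketch: rank toy pickup zones by trip count.
# "pickup_zone" is a hypothetical column used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_table (pickup_zone TEXT)")
conn.executemany("INSERT INTO fact_table VALUES (?)",
                 [("JFK",), ("JFK",), ("JFK",),
                  ("Midtown",), ("Midtown",), ("SoHo",)])

rows = conn.execute("""
    SELECT pickup_zone,
           COUNT(*) AS trips,
           RANK() OVER (ORDER BY COUNT(*) DESC) AS zone_rank
    FROM fact_table
    GROUP BY pickup_zone
    ORDER BY zone_rank
""").fetchall()
print(rows)  # [('JFK', 3, 1), ('Midtown', 2, 2), ('SoHo', 1, 3)]
```

Rankings like this are exactly the kind of precomputed result worth materializing into the analytics table that feeds the Looker dashboards.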