Transport for London (TfL) is the organization responsible for managing and operating London's transportation system, including the London Underground, buses, trams, river services, and taxis. Its mission is to provide a safe, efficient, and sustainable transportation system that connects Londoners to jobs, services, and communities. TfL plans, invests in, manages, and maintains the transport infrastructure in Greater London while regulating and licensing transport services.
This project aims to build an end-to-end orchestrated data pipeline. The pipeline will fetch cycling data from https://cycling.data.tfl.gov.uk and export it to Google Cloud Storage. Then, the data will be filtered, transformed to the desired data types, and uploaded to BigQuery. In BigQuery, DBT will be used to transform the data. Finally, the data will be visualised in a dashboard.
- What is the average cycling duration?
- What are the top start stations
- what are the top 4 end stations
- Cloud:
Google Cloud
- Infrastructure:
Terraform
- Orchestration:
Prefect
- Data lake:
Google Cloud Storage
- Data transformation:
DBT
- Data warehouse:
BigQuery
- Data visualization:
Google Looker Studio
Column | Description |
---|---|
Rental_id | The rental id in database |
Bike_id | The rental bike id |
Start_Date | The start date a bike was rented |
StartStation_id | The station id a bike was rented |
StartStation_Name | The station name a bike was rented |
End_Date | The date the bike was returned |
EndStation_id | The station id the bike was returned |
EndStation_Name | The name of the station the bike was returned |
Duration | the rental time taken in seconds |
-
Partitioned on the
start_date
column as to allow for efficient data filtering based on specific date ranges. -
Clustered on the
startstation_id
column to improve query performance and data locality.
Please check the tutotial to recreate the project
Click here to access my dashboard on Looker.