I completed a Data Engineering Bootcamp (Zoomcamp) and tracked my progress here with some rough notes.
Each week I worked through a series of videos and followed this up with homework exercises.
See my final project here. I've condensed much of my learning from the bootcamp into this project, documenting the steps I took.
The goal is to develop a data pipeline following the architecture below. We looked at New York City Taxi data.
We used a range of tools:
- Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
- Google Cloud Storage (GCS): Data Lake
- BigQuery: Data Warehouse
- Terraform: Infrastructure-as-Code (IaC)
- Docker: Containerization
- SQL: Data Analysis & Exploration
- Airflow: Pipeline Orchestration
- DBT: Data Transformation
- Spark: Distributed Processing
- Kafka: Streaming
-
PostgreSQL | Terraform | Docker | Google Cloud Platform
This week we were introduced to Docker, a framework for managing containers. We created containers for PostgreSQL and pgAdmin, before finally creating our own image which, when run, created and populated tables in a PostgreSQL database.
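For a flavour of what that image did, here is a minimal sketch of an ingestion script of that kind: it reads the taxi CSV in chunks with pandas and writes it into PostgreSQL via SQLAlchemy. The connection string, file name, and table name are placeholders rather than the exact values from the course.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection details for the PostgreSQL container.
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Read the taxi CSV in chunks so the whole file never needs to fit in memory.
df_iter = pd.read_csv("yellow_tripdata_2021-01.csv", iterator=True, chunksize=100_000)

first_chunk = next(df_iter)

# Create the table from the first chunk's schema, then append the data.
first_chunk.head(0).to_sql("yellow_taxi_data", engine, if_exists="replace", index=False)
first_chunk.to_sql("yellow_taxi_data", engine, if_exists="append", index=False)

for chunk in df_iter:
    chunk.to_sql("yellow_taxi_data", engine, if_exists="append", index=False)
```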
Next up we learned about Google Cloud Platform (GCP), a suite of cloud computing resources from Google. Here we set up a service account (more or less a user account for services running in GCP) as well as a virtual machine, connecting to it over SSH from the command line.
We were also introduced to Terraform, an infrastructure-as-code tool, which we used to provision BigQuery and Google Cloud Storage resources on GCP.
I enjoyed this week, although it was heavy going: a lot of late nights trying to understand new concepts and fix unexpected Docker-related bugs. I now feel significantly more confident in understanding and utilising this tool.
-
Airflow | Docker
This week we learned about Airflow, an orchestration tool.
Here we set up a Docker container running Airflow, then wrote a few basic DAGs. Each of these extracted CSV data from a website, converted it to Parquet format, and loaded it into our GCP data lake.
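As a rough illustration (not the exact course code), a DAG along these lines might look as follows; the download URL, bucket name, and file paths are placeholders I've made up for the sketch.

```python
from datetime import datetime

import pyarrow.csv as pv
import pyarrow.parquet as pq
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from google.cloud import storage


def csv_to_parquet(src: str, dest: str):
    # Convert the downloaded CSV to Parquet using pyarrow.
    pq.write_table(pv.read_csv(src), dest)


def upload_to_gcs(bucket: str, object_name: str, local_file: str):
    # Push the Parquet file into the GCS data lake.
    storage.Client().bucket(bucket).blob(object_name).upload_from_filename(local_file)


with DAG(
    dag_id="nyc_taxi_ingestion",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    # Placeholder URL standing in for the taxi data source.
    download = BashOperator(
        task_id="download_csv",
        bash_command="curl -sSL https://example.com/yellow_tripdata.csv -o /tmp/data.csv",
    )
    convert = PythonOperator(
        task_id="csv_to_parquet",
        python_callable=csv_to_parquet,
        op_kwargs={"src": "/tmp/data.csv", "dest": "/tmp/data.parquet"},
    )
    upload = PythonOperator(
        task_id="upload_to_gcs",
        python_callable=upload_to_gcs,
        op_kwargs={
            "bucket": "my-data-lake-bucket",
            "object_name": "raw/data.parquet",
            "local_file": "/tmp/data.parquet",
        },
    )

    download >> convert >> upload
```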
This week was easier than last week, but still challenging. It feels good to understand Airflow at a basic level and implement some of my own DAGs. The configuration with Docker was a little tricky, but I plan on spending a bit more time going through the code to understand it all.
-
BigQuery
This week was focused on Data Warehousing, specifically BigQuery.
This was a more relaxed week. Not as much to take in, giving me a chance to catch up. It mostly consisted of BigQuery basics, ingesting more data into it, and playing around with partitioning and clustering.
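As an example of the kind of experiment this involved, the sketch below uses the BigQuery Python client to create a partitioned and clustered copy of a table; the project, dataset, table, and column names are placeholders, not the exact ones from the course.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create a copy of the trips table, partitioned by pickup date and
# clustered by vendor, to compare query cost against the unpartitioned version.
query = """
CREATE OR REPLACE TABLE `my-project.trips_data_all.yellow_tripdata_partitioned`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `my-project.trips_data_all.yellow_tripdata_external`;
"""

client.query(query).result()  # Runs the DDL and waits for completion.
```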
I don't feel like I learned as much this week, and have made a note to spend more time on Data Warehousing and Dimensional Modeling in my own time.
-
DBT | Google Data Studio
This week we looked into DBT and Analytics Engineering.
We learned that DBT sits on top of the data warehouse and can be used to develop pipelines using SELECT statements, as well as to test and document our models.
Most of the week was spent writing DBT models, before eventually pushing them to production.
We also gained some exposure to Google Data Studio, which we used to generate a simple dashboard.
This was an interesting week, and it was good to see what a more modern data stack might look like.
-
Spark | Batch Processing
pending...
-
Kafka | Stream Processing
pending...