This is the second project in the Data Engineering Zoomcamp curriculum. It demonstrates a pipeline for batch processing of data from the IMDb Non-Commercial Datasets.
IMDb, the Internet Movie Database, is an extensive online repository of information about movies, TV shows, and video games. It covers cast and crew, plot summaries, ratings, and reviews, making it an indispensable resource for movie enthusiasts and industry professionals alike.
Our primary objective is to build a robust data pipeline that retrieves raw data from the IMDb datasets, stores it, improves its quality, and presents it in an intuitive dashboard. The pipeline lets us answer questions such as how films are distributed across genres, how long the longest movie runs in hours, what proportion of titles falls into each category, how many movies are in the database, and how films are distributed by release year.
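To give a feel for these questions, the snippet below is a minimal, illustrative PySpark aggregation (films per genre) over IMDb's `title.basics.tsv.gz`. It is not one of the project's transformation scripts, and the local file path is a placeholder.

```python
# Illustrative only: count titles per genre from IMDb's title.basics dataset.
# Assumes title.basics.tsv.gz has been downloaded locally; the path is a placeholder.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("imdb-genre-count").getOrCreate()

basics = (
    spark.read
    .option("header", True)
    .option("sep", "\t")
    .option("nullValue", "\\N")   # IMDb uses \N for missing values
    .csv("title.basics.tsv.gz")   # placeholder path
)

movies_per_genre = (
    basics
    .filter(F.col("titleType") == "movie")
    .withColumn("genre", F.explode(F.split(F.col("genres"), ",")))
    .groupBy("genre")
    .count()
    .orderBy(F.desc("count"))
)

movies_per_genre.show(10, truncate=False)
```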
According to the IMDb data specifications, the dataset is updated daily. Our pipeline is therefore scheduled to run every day at 1 PM UTC so that the latest data is integrated into our system.
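In Airflow terms, that schedule corresponds to the cron expression `0 13 * * *`. The skeleton below is only a sketch of how such a daily DAG can be declared; the DAG id, task, and callable are placeholders, not the project's actual DAG.

```python
# Illustrative Airflow DAG skeleton showing a daily 13:00 UTC schedule.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_imdb_datasets():
    # Placeholder: fetch the IMDb .tsv.gz files here.
    pass


with DAG(
    dag_id="imdb_pipeline_example",       # placeholder DAG id
    schedule_interval="0 13 * * *",       # every day at 1 PM UTC
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(
        task_id="download_imdb_datasets",
        python_callable=download_imdb_datasets,
    )
```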
- Dataset: IMDb Non-Commercial Datasets
- Infrastructure as Code: Terraform
- Workflow Orchestration: Airflow
- Data Lake: Google Cloud Storage
- Data Warehouse: Google BigQuery
- Managed service for Apache Spark: Google Dataproc
- Transformation: PySpark
- Visualisation: Metabase
- Programming Languages: Python and SQL
The cloud infrastructure is provisioned with Terraform, giving us a scalable and reliable foundation for our applications and services, while Airflow, our workflow orchestrator, runs locally in Docker.
- A Google Cloud Platform account.
- Install VSCode or Zed or any other IDE that works for you.
- Install Terraform
- Install Docker Desktop
- Install Google Cloud SDK
- Clone this repository onto your local machine.
- Go to Google Cloud and create a new project.
- Get the project ID and set the `GCP_PROJECT_ID` environment variable in the `.env` file located in the root directory.
- Create a service account with the following roles:
  - BigQuery Admin
  - Storage Admin
  - Viewer
  - Compute Admin
  - Dataproc Administrator
- Download the service account credentials and store them in `$HOME/.google/credentials/`.
- Activate the following APIs for your project:
  - Cloud Storage API
  - BigQuery API
- Assign the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your JSON credentials file, so that `GOOGLE_APPLICATION_CREDENTIALS` is `$HOME/.google/credentials/<authkeys_filename>.json`:
  - Add this line to the end of your `.bashrc` file: `export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/credentials/<authkeys_filename>.json`
  - Activate the environment variable by running `source ~/.bashrc`.
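Optionally, you can verify that the credentials are picked up correctly with a quick check using the Google Cloud client library (this assumes `google-cloud-storage` is installed; it is not required by the project setup itself):

```python
# Quick sanity check: the storage client reads GOOGLE_APPLICATION_CREDENTIALS
# automatically; listing buckets will fail if the credentials are not found.
from google.cloud import storage

client = storage.Client()
print("Authenticated for project:", client.project)
for bucket in client.list_buckets():
    print(bucket.name)
```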
- Using Zed or VSCode, open the cloned project `IMDB-pipeline-project`.
- To customize the default values of `variable "project"` and `variable "region"` to your preferred project ID and region, you have two options: either edit `variables.tf` directly and modify the default values, or set the environment variables `TF_VAR_project` and `TF_VAR_region`.
- Open a terminal at the root of the project and change into the terraform folder: `cd terraform`.
- Set an alias: `alias tf='terraform'`
- Initialise Terraform: `tf init`
- Plan the infrastructure: `tf plan`
- Apply the changes: `tf apply`
- Confirm that the following environment variables are configured in `.env` in the root directory of the project:
  - `AIRFLOW_UID`. The default value is 50000.
  - `GCP_PROJECT_ID`. Set this to the project ID from the "Create a Google Cloud Project" section.
  - `GCP_IMDB_BUCKET=imdb_datalake_<GCP project id>`
  - `GCP_PROJECT_REGION`. The region where Dataproc is deployed.
  - `GCP_IMDB_WH_DATASET=imdb_analytics`
  - `GCP_IMDB_DATAPROC_TEMP_BUCKET`. Check the Google Cloud Storage (GCS) section of the console to find the Dataproc temp bucket name; that is the value you need here.
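For reference, a completed `.env` might look like the sample below; the project ID, region, and bucket names are examples only, so substitute your own values.

```
# Example values only; substitute your own project ID, region, and bucket names.
AIRFLOW_UID=50000
GCP_PROJECT_ID=my-imdb-project
GCP_PROJECT_REGION=europe-west1
GCP_IMDB_BUCKET=imdb_datalake_my-imdb-project
GCP_IMDB_WH_DATASET=imdb_analytics
# Copy the Dataproc temp bucket name from the GCS section of the console.
GCP_IMDB_DATAPROC_TEMP_BUCKET=dataproc-temp-example-bucket
```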
- Run `docker-compose up`.
- Access the Airflow dashboard by visiting `http://localhost:8080/` in your web browser. Log in with the username and password `airflow`.
- Visit `http://localhost:1460` in your web browser to access the Metabase dashboard. You will need to sign up before you can use the UI.
Once you've completed all the steps in the previous section, you should be able to view the Airflow dashboard in your web browser, showing the list of available DAGs.

To run the DAG, click the play button.

The IMDB DAG manages this process. All transformation scripts are located in the `data_airflow/dags/scripts` folder. Airflow copies these scripts to Google Cloud Storage (GCS) and submits the transformation jobs to Dataproc.
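As a rough sketch of that pattern (not the project's actual DAG), the snippet below uses the Airflow Google provider to upload a script to GCS and submit it as a PySpark job on Dataproc. The cluster name, bucket, paths, project ID, and region are all placeholders.

```python
# Illustrative sketch: upload a PySpark script to GCS, then submit it to Dataproc.
# All names, paths, and IDs below are placeholders, not the project's values.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

with DAG(
    dag_id="imdb_transform_example",      # placeholder DAG id
    schedule_interval="0 13 * * *",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    upload_script = LocalFilesystemToGCSOperator(
        task_id="upload_transform_script",
        src="/opt/airflow/dags/scripts/transform_imdb.py",  # placeholder local path
        dst="scripts/transform_imdb.py",
        bucket="imdb_datalake_example",                      # placeholder bucket
    )

    submit_transform = DataprocSubmitJobOperator(
        task_id="submit_transform_job",
        project_id="example-project-id",   # placeholder
        region="europe-west1",             # placeholder
        job={
            "reference": {"project_id": "example-project-id"},
            "placement": {"cluster_name": "imdb-dataproc-cluster"},  # placeholder
            "pyspark_job": {
                "main_python_file_uri": "gs://imdb_datalake_example/scripts/transform_imdb.py"
            },
        },
    )

    upload_script >> submit_transform
```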
Please watch the provided video tutorial to configure your Metabase database connection to BigQuery. You can customize the dashboard to your preferences. The PDF linked below contains a full screenshot of the dashboard I created.
Twitter: @iamraphson
🦅