This project was built as part of the Data Engineering Zoomcamp. It implements a data pipeline that loads and processes a Kaggle dataset containing bookstore information for a book recommendation system. The dataset can be accessed on Kaggle.
This dataset offers book ratings from users of various ages and geographical locations. It comprises three files: User.csv, which includes age and location data of bookstore users; Books.csv, containing information such as authors, titles, and ISBNs; and Ratings.csv, which details the ratings given by users for each book. Additional information about the dataset is available on Kaggle.
The primary objective of this project is to establish a streamlined data pipeline for obtaining, storing, cleansing, and visualizing data automatically. This pipeline aims to address various queries, such as identifying top-rated publishers and authors, and analyzing ratings based on geographical location.
Given that the data is static, the data pipeline operates as a one-time process.
- Dataset repo: Kaggle
- Infrastructure as Code: Terraform
- Workflow Orchestration: Airflow
- Data Lake: Google Cloud Storage
- Data Warehouse: Google BigQuery
- Transformation: DBT
- Visualisation: Metabase
- Programming Language: Python and SQL
Cloud infrastructure is set up with Terraform.
Airflow runs in a local Docker container.
- A Google Cloud Platform account.
- A Kaggle account.
- Install VSCode or Zed or any other IDE that works for you.
- Install Terraform
- Install Docker Desktop
- Install Google Cloud SDK
- Clone this repository onto your local machine.
- Go to Google Cloud and create a new project.
- Get the project ID and define the environment variable `GCP_PROJECT_ID` in the `.env` file located in the root directory.
- Create a Service Account with the following roles:
  - BigQuery Admin
  - Storage Admin
  - Storage Object Admin
  - Viewer
- Download the Service Account credentials and store them in `$HOME/.google/credentials/`.
- You need to activate the following APIs here:
  - Cloud Storage API
  - BigQuery API
- Assign the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your JSON credentials file, so that `GOOGLE_APPLICATION_CREDENTIALS` will be `$HOME/.google/credentials/<authkeys_filename>.json`:
  - Add this line to the end of the `.bashrc` file: `export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/google_credentials.json`
  - Activate the environment variable by running `source .bashrc`.
- A detailed description of how to authenticate is found here.
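As a quick sanity check, the snippet below is a minimal sketch (assuming the `google-cloud-storage` package is installed, `GOOGLE_APPLICATION_CREDENTIALS` points at your key file, and `GCP_PROJECT_ID` from your `.env` is exported in the shell) that verifies the service account can reach your project:

```python
import os

from google.cloud import storage  # pip install google-cloud-storage

# GOOGLE_APPLICATION_CREDENTIALS must point at the service account JSON key.
print("Using key file:", os.environ["GOOGLE_APPLICATION_CREDENTIALS"])

# The client picks up the credentials from the environment variable automatically.
client = storage.Client(project=os.environ["GCP_PROJECT_ID"])

# Listing buckets exercises the Storage/Viewer roles granted above.
for bucket in client.list_buckets():
    print("Found bucket:", bucket.name)
```

If this prints your buckets (or an empty list before Terraform has run), the credentials and roles are wired up correctly.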
- Define the environment variables `KAGGLE_USER` and `KAGGLE_TOKEN` in the `.env` file located in the root directory. Note: `KAGGLE_TOKEN` is the same as `KAGGLE_KEY`.
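For reference, downloading the dataset with the official `kaggle` Python package looks roughly like the sketch below. The dataset slug `arashnic/book-recommendation-dataset` and the target folder are assumptions, and the Airflow DAG in this repo handles the download for you; this is only to show how the credentials are used.

```python
import os

# The kaggle client reads KAGGLE_USERNAME and KAGGLE_KEY from the environment,
# so map the project's .env names onto the ones the library expects before importing it.
os.environ.setdefault("KAGGLE_USERNAME", os.environ.get("KAGGLE_USER", ""))
os.environ.setdefault("KAGGLE_KEY", os.environ.get("KAGGLE_TOKEN", ""))

from kaggle.api.kaggle_api_extended import KaggleApi  # pip install kaggle

api = KaggleApi()
api.authenticate()

# Downloads and unzips the three CSV files into ./data.
api.dataset_download_files(
    "arashnic/book-recommendation-dataset",  # assumed dataset slug
    path="data",
    unzip=True,
)
```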
- Using Zed or VSCode, open the cloned project `DE-2024-project-bookrecommendation`.
- To customize the default values of `variable "project"` and `variable "region"` to your preferred project ID and region, you have two options: either edit the `variables.tf` file in Terraform directly and modify the values, or set the environment variables `TF_VAR_project` and `TF_VAR_region`.
- Open a terminal in the root directory of the project.
- Change to the terraform folder using the command `cd terraform`.
- Set an alias: `alias tf='terraform'`
- Initialise Terraform: `tf init`
- Plan the infrastructure: `tf plan`
- Apply the changes: `tf apply`
- Please confirm that the following environment variables are configured in `.env` in the root directory of the project:
  - `AIRFLOW_UID`. The default value is 50000.
  - `KAGGLE_USERNAME`. This should be set from the Set up Kaggle section.
  - `KAGGLE_TOKEN`. This should be set from the Set up Kaggle section too.
  - `GCP_PROJECT_ID`. This should be set from the Create a Google Cloud Project section.
  - `GCP_BOOK_RECOMMENDATION_BUCKET=book_recommendation_datalake_<GCP project id>`
  - `GCP_BOOK_RECOMMENDATION_WH_DATASET=book_recommendation_analytics`
  - `GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET=book_recommendataion_wh`
- Run `docker-compose up`.
- Access the Airflow dashboard by visiting `http://localhost:8080/` in your web browser. Use `airflow` as both the username and password to log in.
- Visit `http://localhost:1460` in your web browser to access the Metabase dashboard. You will need to sign up to use the UI.
Once you've completed all the steps outlined in the previous section, you should be able to view the Airflow dashboard in your web browser. It shows the list of DAGs, and from there you can open the DAG's graph. To run the DAG, click on the play button.
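For orientation only, here is a minimal Airflow 2-style sketch of a one-shot pipeline with this shape; the DAG id, task ids, and helper callables are illustrative assumptions, not the repo's actual DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Illustrative callables -- the real DAG in this repo wires up its own tasks.
def download_from_kaggle(**context):
    """Pull the Kaggle dataset onto local disk."""


def upload_to_gcs(**context):
    """Push the files into the GCS data lake bucket."""


def load_into_bigquery(**context):
    """Create the external/warehouse tables in BigQuery."""


with DAG(
    dag_id="book_recommendation_pipeline",  # assumed name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@once",  # the dataset is static, so the pipeline runs once
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_dataset", python_callable=download_from_kaggle)
    to_gcs = PythonOperator(task_id="upload_to_gcs", python_callable=upload_to_gcs)
    to_bq = PythonOperator(task_id="load_into_bigquery", python_callable=load_into_bigquery)

    download >> to_gcs >> to_bq
```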
- Navigate to the root directory of the project in the terminal, then change to the data_dbt folder using the command `cd data_dbt`.
- Create a `profiles.yml` file within `${HOME}/.dbt` and define a profile for this project as shown below:
```yaml
data_dbt_book_recommendation:
  outputs:
    dev:
      dataset: book_recommendation_analytics
      fixed_retries: 1
      keyfile: <location_google_auth_key>
      location: <preferred project region>
      method: service-account
      priority: interactive
      project: <preferred project id>
      threads: 6
      timeout_seconds: 300
      type: bigquery
  target: dev
```
- To run all models, run `dbt run -t dev`.
- Navigate to your Google BigQuery project by clicking on this link. There, you'll find all the tables and views created by DBT.
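If you prefer to check from the command line rather than the console, here is a minimal sketch using the `google-cloud-bigquery` client (the dataset name is taken from the `.env` values above, and `GOOGLE_APPLICATION_CREDENTIALS` is assumed to be set as in the GCP setup section) to list what dbt built:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Uses the project and credentials from GOOGLE_APPLICATION_CREDENTIALS.
client = bigquery.Client()

# book_recommendation_analytics is the warehouse dataset dbt writes to.
for table in client.list_tables("book_recommendation_analytics"):
    print(f"{table.table_type}: {table.dataset_id}.{table.table_id}")
```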
Please watch the provided video tutorial to configure your Metabase database connection with BigQuery. You have the flexibility to customize your dashboard according to your preferences. Additionally, the PDF linked below contains the complete screenshot of the dashboard I created.
Twitter: @iamraphson
I would like to extend my heartfelt gratitude to the organizers of the Data Engineering Zoomcamp for providing such a valuable course. The insights I gained have been instrumental in broadening my understanding of the field of Data Engineering. I also want to express my appreciation to the colleagues with whom I took the course. Thank you all for your support and collaboration throughout this journey.
🦅