This project was built as part of the Data Engineering Zoomcamp. It implements a data pipeline that loads and processes a Kaggle dataset containing bookstore information for a book recommendation system. The dataset can be accessed on Kaggle.
This dataset offers book ratings from users of various ages and geographical locations. It comprises three files: User.csv, which includes age and location data of bookstore users; Books.csv, containing information such as authors, titles, and ISBNs; and Ratings.csv, which details the ratings given by users for each book. Additional information about the dataset is available on Kaggle.
The primary objective of this project is to establish a streamlined data pipeline for obtaining, storing, cleansing, and visualizing data automatically. This pipeline aims to address various queries, such as identifying top-rated publishers and authors, and analyzing ratings based on geographical location.
Given that the data is static, the data pipeline operates as a one-time process.
- Dataset repo: Kaggle
- Infrastructure as Code: Terraform
- Workflow Orchestration: Airflow
- Data Lake: Google Cloud Storage
- Data Warehouse: Google BigQuery
- Transformation: DBT
- Visualisation: Metabase
- Programming Language: Python and SQL
Cloud infrastructure is set up with Terraform.
Airflow runs in a local Docker container.
- A Google Cloud Platform account.
- A Kaggle account.
- Install VSCode or Zed or any other IDE that works for you.
- Install Terraform
- Install Docker Desktop
- Install Google Cloud SDK
- Clone this repository onto your local machine.
- Go to Google Cloud and create a new project.
- Get the project ID and define the environment variable `GCP_PROJECT_ID` in the `.env` file located in the root directory.
- Create a Service Account with the following roles:
  - BigQuery Admin
  - Storage Admin
  - Storage Object Admin
  - Viewer
- Download the Service Account credentials and store them in `$HOME/.google/credentials/`.
- You need to activate the following APIs here:
  - Cloud Storage API
  - BigQuery API
- Assign the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your JSON credentials file, so that `GOOGLE_APPLICATION_CREDENTIALS` will be `$HOME/.google/credentials/<authkeys_filename>.json`:
  - Add this line to the end of the `.bashrc` file: `export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/google_credentials.json`
  - Activate the environment variable by running `source .bashrc`.
- A detailed description of how to authenticate is found here.
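As a quick sanity check, the snippet below is a minimal sketch (assuming the `google-cloud-storage` package is installed, `GOOGLE_APPLICATION_CREDENTIALS` points at your key file, and `GCP_PROJECT_ID` from your `.env` is exported in the shell) that verifies the service account can reach your project:

```python
import os

from google.cloud import storage  # pip install google-cloud-storage

# GOOGLE_APPLICATION_CREDENTIALS must point at the service account JSON key.
print("Using key file:", os.environ["GOOGLE_APPLICATION_CREDENTIALS"])

# The client picks up the credentials from the environment variable automatically.
client = storage.Client(project=os.environ["GCP_PROJECT_ID"])

# Listing buckets exercises the Storage/Viewer roles granted above.
for bucket in client.list_buckets():
    print("Found bucket:", bucket.name)
```

If this prints your buckets (or an empty list before Terraform has run), the credentials and roles are wired up correctly.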
- Define the environment variables `KAGGLE_USER` and `KAGGLE_TOKEN` in the `.env` file located in the root directory. Note: `KAGGLE_TOKEN` is the same as `KAGGLE_KEY`.
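For reference, downloading the dataset with the official `kaggle` Python package looks roughly like the sketch below. The dataset slug `arashnic/book-recommendation-dataset` and the target folder are assumptions, and the Airflow DAG in this repo handles the download for you; this is only to show how the credentials are used.

```python
import os

# The kaggle client reads KAGGLE_USERNAME and KAGGLE_KEY from the environment,
# so map the project's .env names onto the ones the library expects before importing it.
os.environ.setdefault("KAGGLE_USERNAME", os.environ.get("KAGGLE_USER", ""))
os.environ.setdefault("KAGGLE_KEY", os.environ.get("KAGGLE_TOKEN", ""))

from kaggle.api.kaggle_api_extended import KaggleApi  # pip install kaggle

api = KaggleApi()
api.authenticate()

# Downloads and unzips the three CSV files into ./data.
api.dataset_download_files(
    "arashnic/book-recommendation-dataset",  # assumed dataset slug
    path="data",
    unzip=True,
)
```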
- Using Zed or VSCode, open the cloned project `DE-2024-project-bookrecommendation`.
- To customize the default values of `variable "project"` and `variable "region"` to your preferred project ID and region, you have two options: either edit the `variables.tf` file in Terraform directly and modify the values, or set the environment variables `TF_VAR_project` and `TF_VAR_region`.
- Open a terminal in the root directory of the project.
- Change to the terraform folder using the command `cd terraform`.
- Set an alias: `alias tf='terraform'`
- Initialise Terraform: `tf init`
- Plan the infrastructure: `tf plan`
- Apply the changes: `tf apply`
- Please confirm that the following environment variables are configured in `.env` in the root directory of the project:
  - `AIRFLOW_UID`. The default value is 50000.
  - `KAGGLE_USERNAME`. This should be set from the Set up Kaggle section.
  - `KAGGLE_TOKEN`. This should be set from the Set up Kaggle section too.
  - `GCP_PROJECT_ID`. This should be set from the Create a Google Cloud Project section.
  - `GCP_BOOK_RECOMMENDATION_BUCKET=book_recommendation_datalake_<GCP project id>`
  - `GCP_BOOK_RECOMMENDATION_WH_DATASET=book_recommendation_analytics`
  - `GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET=book_recommendataion_wh`
- Run `docker-compose up`.
- Access the Airflow dashboard by visiting `http://localhost:8080/` in your web browser. Use `airflow` as both the username and password to log in.
- Visit `http://localhost:1460` in your web browser to access the Metabase dashboard. You will need to sign up to use the UI.
Once you've completed all the steps outlined in the previous section, you should be able to view the Airflow dashboard in your web browser. It shows the list of DAGs, and from there you can open the DAG's graph. To run the DAG, click on the play button.
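For orientation only, here is a minimal Airflow 2-style sketch of a one-shot pipeline with this shape; the DAG id, task ids, and helper callables are illustrative assumptions, not the repo's actual DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Illustrative callables -- the real DAG in this repo wires up its own tasks.
def download_from_kaggle(**context):
    """Pull the Kaggle dataset onto local disk."""


def upload_to_gcs(**context):
    """Push the files into the GCS data lake bucket."""


def load_into_bigquery(**context):
    """Create the external/warehouse tables in BigQuery."""


with DAG(
    dag_id="book_recommendation_pipeline",  # assumed name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@once",  # the dataset is static, so the pipeline runs once
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_dataset", python_callable=download_from_kaggle)
    to_gcs = PythonOperator(task_id="upload_to_gcs", python_callable=upload_to_gcs)
    to_bq = PythonOperator(task_id="load_into_bigquery", python_callable=load_into_bigquery)

    download >> to_gcs >> to_bq
```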
- Navigate to the root directory of the project in the terminal, then change to the data_dbt folder using the command `cd data_dbt`.
- Create a `profiles.yml` file within `${HOME}/.dbt` and define a profile for this project as shown below:
```yaml
data_dbt_book_recommendation:
  outputs:
    dev:
      dataset: book_recommendation_analytics
      fixed_retries: 1
      keyfile: <location_google_auth_key>
      location: <preferred project region>
      method: service-account
      priority: interactive
      project: <preferred project id>
      threads: 6
      timeout_seconds: 300
      type: bigquery
  target: dev
```
- To run all models, run `dbt run -t dev`.
- Navigate to your Google BigQuery project by clicking on this link. There, you'll find all the tables and views created by DBT.
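If you prefer to check from the command line rather than the console, here is a minimal sketch using the `google-cloud-bigquery` client (the dataset name is taken from the `.env` values above, and `GOOGLE_APPLICATION_CREDENTIALS` is assumed to be set as in the GCP setup section) to list what dbt built:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Uses the project and credentials from GOOGLE_APPLICATION_CREDENTIALS.
client = bigquery.Client()

# book_recommendation_analytics is the warehouse dataset dbt writes to.
for table in client.list_tables("book_recommendation_analytics"):
    print(f"{table.table_type}: {table.dataset_id}.{table.table_id}")
```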
Please watch the provided video tutorial to configure your Metabase database connection with BigQuery. You have the flexibility to customize your dashboard according to your preferences. Additionally, the PDF linked below contains the complete screenshot of the dashboard I created.
Twitter: @iamraphson
I would like to extend my heartfelt gratitude to the organizers of the Data Engineering Zoomcamp for providing such a valuable course. The insights I gained have been instrumental in broadening my understanding of the field of Data Engineering. I also want to express my appreciation to the colleagues with whom I took the course. Thank you all for your support and collaboration throughout this journey.
🦅