Data Engineering Hotel Reviews

This is a personal data engineering project based on a hotel reviews Kaggle dataset.

Below you can find some instructions to understand the project content. Feel free to ⭐ and clone this repo 😉

Tech Stack and Tools

Visual Studio Code, Jupyter Notebook, Python, Pandas, Anaconda, Apache Spark, Prefect, dbt, Linux Ubuntu, Google Cloud, Looker Studio, Terraform, Git

  • Data Analysis & Exploration: SQL/Python
  • Cloud: Google Cloud Platform
    • Data Lake: Google Cloud Storage
    • Data Warehouse: BigQuery
  • Infrastructure as Code (IaC): Terraform
  • Workflow Orchestration: Prefect
  • Distributed Processing: Spark
  • Data Transformation: dbt
  • Data Visualization: Looker Studio
  • CI/CD: GitHub Actions, dbt

Project Structure

The project has been structured with the following folders and files:

  • .github: contains the CI/CD files (GitHub Actions)
  • data: raw dataset, saved Parquet files, and data processed with Spark
  • dbt: data transformation and CI/CD pipeline using dbt
  • flows: workflow orchestration pipeline
  • images: screenshots of results
  • looker: reports from Looker Studio
  • notebooks: EDA performed at the beginning of the project to establish a baseline
  • spark: batch processing pipeline using Spark
  • terraform: IaC for the stream-based pipeline infrastructure in GCP using Terraform
  • Makefile: set of execution tasks
  • .pre-commit-config.yaml: pre-commit configuration file
  • pre-commit.md: readme file of the pre-commit hooks
  • pyproject.toml: linting and formatting
  • requirements.txt: project requirements

Project Description

The dataset was obtained from Kaggle and contains various columns with hotel details and reviews for six countries (Austria, France, Italy, Netherlands, Spain, UK). To prepare the data, an Exploratory Data Analysis was conducted. The following actions are performed, using either pandas or Spark, to get a clean dataset:

  • Remove rows with NaN
  • Remove duplicates
  • Create a new column with the country name
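
The cleaning steps above can be sketched with pandas. This is a minimal sketch, not the project's actual code: the `Hotel_Address` column name and the country list are assumptions based on the Kaggle dataset description, and the country is derived from the end of the address string.

```python
import pandas as pd

# Assumed country names as they appear at the end of each hotel address.
COUNTRIES = ["Austria", "France", "Italy", "Netherlands", "Spain", "United Kingdom"]


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Drop NaN rows and duplicates, then derive a country column."""
    df = df.dropna().drop_duplicates()

    def extract_country(address: str) -> str:
        # Assumption: the address string ends with the country name.
        return next((c for c in COUNTRIES if address.endswith(c)), "Unknown")

    return df.assign(country=df["Hotel_Address"].map(extract_country))
```

The same three steps map directly onto Spark's `dropna`, `dropDuplicates`, and `withColumn` for the batch pipeline.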

Afterwards, a subset of columns is selected and the final clean data is ingested into a GCP bucket and BigQuery. This is done using either Prefect (see flows folder), dbt (see dbt folder), or Spark (see spark folder).

Prefect Data Ingestion

 

dbt Data Ingestion

 

Spark Data Ingestion

 

Visualization

CI/CD

Finally, to streamline the development process, a fully automated CI/CD pipeline was created using GitHub Actions and dbt as well:

dbt CI/CD

 

GitHub Actions CI/CD

 

Project Set Up

The Python version used for this project is Python 3.9.

  1. Clone the repo (or download it as zip):

    git clone https://github.com/benitomartin/de-hotel-reviews.git
  2. Create the virtual environment named main-env using Conda with Python version 3.9:

    conda create -n main-env python=3.9
    conda activate main-env
  3. Install the project dependencies listed in requirements.txt:

    pip install -r requirements.txt
    
    or
    
    make install
  4. Install Terraform:

     conda install -c conda-forge terraform

Each project folder contains a README.md file with instructions about how to run the code. I highly recommend creating a virtual environment for each one. Additionally, please note that a GCP Account, credentials, and proper IAM roles are necessary for the scripts to function correctly. The following IAM Roles have been used for this project:

  • BigQuery Admin
  • BigQuery Data Editor
  • BigQuery Job User
  • BigQuery User
  • Dataproc Administrator
  • Storage Admin
  • Storage Object Admin
  • Storage Object Creator
  • Storage Object Viewer
  • Viewer

Best Practices

The following best practices have been implemented:

  • ✅ Makefile
  • ✅ CI/CD pipeline
  • ✅ Linter and code formatter
  • ✅ Pre-commit hooks