This is a personal data engineering project based on a hotel reviews Kaggle dataset.
Below you can find some instructions to understand the project content. Feel free to ⭐ and clone this repo 😉
- Data Analysis & Exploration: SQL/Python
- Cloud: Google Cloud Platform
- Data Lake: Google Cloud Storage
- Data Warehouse: BigQuery
- Infrastructure as Code (IaC): Terraform
- Workflow Orchestration: Prefect
- Distributed Processing: Spark
- Data Transformation: dbt
- Data Visualization: Looker Studio
- CI/CD: GitHub Actions, dbt
The project has been structured with the following folders and files:
- `.github`: CI/CD files (GitHub Actions)
- `data`: raw dataset, saved Parquet files, and data processed using Spark
- `dbt`: data transformation and CI/CD pipeline using dbt
- `flows`: workflow orchestration pipeline
- `images`: printouts of results
- `looker`: reports from Looker Studio
- `notebooks`: EDA performed at the beginning of the project to establish a baseline
- `spark`: batch processing pipeline using Spark
- `terraform`: IaC stream-based pipeline infrastructure in GCP using Terraform
- `Makefile`: set of execution tasks
- `.pre-commit-config.yaml`: pre-commit configuration file
- `pre-commit.md`: readme file for the pre-commit hooks
- `pyproject.toml`: linting and formatting configuration
- `requirements.txt`: project requirements
The dataset was obtained from Kaggle and contains various columns with hotel details and reviews from six countries ('Austria', 'France', 'Italy', 'Netherlands', 'Spain', 'UK'). To prepare the data, an Exploratory Data Analysis was conducted. The following actions are performed, using either pandas or Spark, to get a clean dataset:
- Remove rows with NaN
- Remove duplicates
- Create a new column with the country name
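The three cleaning steps above can be sketched with pandas roughly as follows (the column name `Hotel_Address` is taken from the Kaggle dataset; the helper and the address parsing are illustrative, not the project's exact code):

```python
import pandas as pd

# Hotel addresses in the dataset end with the country name,
# with "United Kingdom" spelled out for UK hotels.
COUNTRIES = ["Austria", "France", "Italy", "Netherlands", "Spain", "United Kingdom"]

def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Drop NaN rows, drop duplicates, and derive a country column."""
    df = df.dropna()           # remove rows with NaN
    df = df.drop_duplicates()  # remove duplicated rows
    # create a new column with the country name parsed from the address
    df["country"] = df["Hotel_Address"].apply(
        lambda addr: next((c for c in COUNTRIES if addr.endswith(c)), None)
    )
    return df

sample = pd.DataFrame({
    "Hotel_Address": ["1 Rue X Paris France", "2 Via Y Milan Italy",
                      "2 Via Y Milan Italy", None],
    "Reviewer_Score": [8.8, 9.1, 9.1, 7.0],
})
print(clean_reviews(sample))
```

The same logic maps directly onto Spark's `dropna`, `dropDuplicates`, and `withColumn` for the batch pipeline.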
Afterwards, a subset of columns is selected and the final clean data is ingested into a GCP bucket and BigQuery. This is done using either Prefect (see the `flows` folder), dbt (see the `dbt` folder), or Spark (see the `spark` folder).
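On the dbt side, the column selection can be expressed as a staging model; a minimal sketch might look like this (the model, source, and column names are hypothetical, not the project's actual files):

```sql
-- models/staging/stg_hotel_reviews.sql (hypothetical model name)
select
    hotel_name,
    country,
    reviewer_score,
    review_date
from {{ source('hotel_reviews', 'raw_reviews') }}
```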
Finally, to streamline the development process, a fully automated CI/CD pipeline was created using GitHub Actions and dbt.
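A minimal GitHub Actions workflow of this kind might look as follows (file name, trigger, and dbt commands are illustrative assumptions, not the project's exact pipeline; credentials setup is omitted):

```yaml
# .github/workflows/dbt-ci.yml (hypothetical file name)
name: dbt CI
on:
  pull_request:
    branches: [main]
jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - run: pip install dbt-bigquery
      # GCP credentials would be injected from repository secrets here
      - run: dbt build --project-dir dbt
```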
The Python version used for this project is Python 3.9.

1. Clone the repo (or download it as a zip):

   ```bash
   git clone https://github.com/benitomartin/de-hotel-reviews.git
   ```

2. Create a virtual environment named `main-env` using Conda with Python 3.9:

   ```bash
   conda create -n main-env python=3.9
   conda activate main-env
   ```

3. Install the project dependencies listed in `requirements.txt`:

   ```bash
   pip install -r requirements.txt
   # or
   make install
   ```

4. Install Terraform:

   ```bash
   conda install -c conda-forge terraform
   ```
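For orientation, the Terraform infrastructure provisions the data lake bucket and the BigQuery dataset; a minimal sketch might look like this (resource and variable names are illustrative, not the project's actual configuration):

```hcl
# main.tf (illustrative resource names)
provider "google" {
  project = var.project_id
  region  = var.region
}

resource "google_storage_bucket" "data_lake" {
  name     = "${var.project_id}-hotel-reviews"
  location = var.region
}

resource "google_bigquery_dataset" "warehouse" {
  dataset_id = "hotel_reviews"
  location   = var.region
}
```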
Each project folder contains a README.md file with instructions about how to run the code. I highly recommend creating a virtual environment for each one. Additionally, please note that a GCP Account, credentials, and proper IAM roles are necessary for the scripts to function correctly. The following IAM Roles have been used for this project:
- BigQuery Admin
- BigQuery Data Editor
- BigQuery Job User
- BigQuery User
- Dataproc Administrator
- Storage Admin
- Storage Object Admin
- Storage Object Creator
- Storage Object Viewer
- Viewer
The following best practices have been implemented:
- ✅ Makefile
- ✅ CI/CD pipeline
- ✅ Linter and code formatter
- ✅ Pre-commit hooks