Data Engineering Project - Football Stadiums

This project demonstrates a data engineering pipeline that extracts data from Wikipedia, processes and stores it in Azure Data Lake Gen2, and performs further transformations and analyses using Azure Data Factory and Databricks. The final data is visualized using Tableau, Power BI, and Looker Studio.

Architecture

System Architecture

Components

Wikipedia: The source of raw data.
Apache Airflow: Orchestrates the data pipeline, fetching data from Wikipedia and storing it in Azure Data Lake Gen2.
Azure Data Lake Gen2: Stores raw and processed data.
Azure Data Factory: Manages the ETL (Extract, Transform, Load) processes.
Databricks: Performs data processing and transformation.

Visualization Tools

Tableau
Power BI
Looker Studio

Prerequisites

Azure Subscription
Apache Airflow
Azure Data Lake Gen2
Azure Data Factory
Databricks
Tableau
Power BI
Looker Studio

Setup

Apache Airflow

Install Apache Airflow:
```
pip install apache-airflow
```
Create a DAG:
- Define a DAG to fetch data from Wikipedia and store it in Azure Data Lake Gen2.
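
As a rough illustration of the fetch step, the sketch below parses the first HTML table in a page into a list of row dictionaries, using only the Python standard library. The function name `extract_table_rows`, the helper class, and the sample HTML are hypothetical and not taken from this repository's `dags` folder. In a real DAG, a function like this could be passed as the `python_callable` of an Airflow `PythonOperator`, with the page HTML downloaded beforehand.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Collects the cell text of the first <table> in a page, row by row.

    Nested tables are not handled; this is a sketch, not a robust scraper.
    """

    def __init__(self):
        super().__init__()
        self.rows = []           # finished rows, each a list of cell strings
        self._row = None         # cells of the row being read, or None
        self._cell = None        # text fragments of the cell being read, or None
        self._tables_seen = 0
        self._in_table = False   # True only while inside the first table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
            self._in_table = self._tables_seen == 1
        elif not self._in_table:
            return
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if not self._in_table:
            return
        if tag == "table":
            self._in_table = False
        elif tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_table_rows(html):
    """Return the first table as a list of dicts keyed by the header row."""
    parser = _TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, cells)) for cells in body]


# Hypothetical sample input; the real column names depend on the Wikipedia page.
SAMPLE_HTML = """
<table>
  <tr><th>Stadium</th><th>Capacity</th></tr>
  <tr><td>Example Park</td><td>50,000</td></tr>
  <tr><td>Sample Arena</td><td>42,500</td></tr>
</table>
"""
rows = extract_table_rows(SAMPLE_HTML)
print(rows)
```

Keeping the parsing in a plain function makes it possible to test the extraction logic without a running Airflow instance.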

Azure Data Lake Gen2

Create a Storage Account:
- Create a storage account in your Azure subscription, with the hierarchical namespace enabled so that it serves as Data Lake Storage Gen2.
Create a Container:
- Create a container to store raw and processed data.
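
One way to keep raw and processed data apart inside a single container is to give each its own folder prefix. The sketch below builds `abfss://` URIs in the form used by Azure Data Lake Storage Gen2. The account name, container name, file name, and the `raw`/`processed` folder names are placeholders, not values from this project.

```python
from datetime import date


def adls_uri(account, container, layer, filename, run_date=None):
    """Build an abfss:// URI for a file in the given layer ("raw" or "processed")."""
    if layer not in ("raw", "processed"):
        raise ValueError(f"unknown layer: {layer!r}")
    parts = [layer]
    if run_date is not None:
        # Partition by load date so reruns do not overwrite earlier files.
        parts.append(run_date.isoformat())
    parts.append(filename)
    path = "/".join(parts)
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


# Placeholder names, for illustration only.
print(adls_uri("examplestorage", "footballdata", "raw", "stadiums.csv", date(2024, 1, 15)))
```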

Azure Data Factory

Create a Data Factory:
- Create a Data Factory instance in the same Azure subscription, for example through the Azure portal.
Create Pipelines:
- Define pipelines to perform ETL processes.

Databricks

Create a Databricks Workspace:
- Create a Databricks workspace, for example through the Azure portal if you use Azure Databricks.
Create Notebooks:
- Define notebooks to process and transform the data.
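
The transformation logic depends on the columns in the extracted table, which this README does not list. As a hedged illustration only, the sketch below cleans a hypothetical `Capacity` column by dropping footnote markers such as `[1]` and thousands separators. In a Databricks notebook the same logic would more likely be written with Spark DataFrame operations; plain Python is used here so the example runs anywhere.

```python
import re


def parse_capacity(raw):
    """Convert a string such as "50,000[1]" to an int, or None if it has no number."""
    # Drop bracketed footnote markers like "[1]" or "[note 2]".
    text = re.sub(r"\[[^\]]*\]", "", raw)
    # Take the first run of digits and commas, ignoring any trailing remarks.
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


def clean_rows(rows):
    """Return copies of the rows with the hypothetical "Capacity" column as int or None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        new_row["Capacity"] = parse_capacity(row.get("Capacity", ""))
        cleaned.append(new_row)
    return cleaned


# Hypothetical sample rows, not data from this project.
print(clean_rows([{"Stadium": "Example Park", "Capacity": "50,000[1]"}]))
```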

Visualization

Tableau: Connect Tableau to Azure Data Lake Gen2 or Databricks to visualize the data.
Power BI: Connect Power BI to Azure Data Lake Gen2 or Databricks to visualize the data.
Looker Studio: Connect Looker Studio to Azure Data Lake Gen2 or Databricks to visualize the data.

Running the Pipeline

Start Apache Airflow (run each command in its own terminal, since both keep running):
```
airflow webserver -p 8080
airflow scheduler
```

Trigger the DAG:
- Trigger the DAG to start fetching data from Wikipedia.
Monitor Data Factory:
- Monitor the ETL processes in Azure Data Factory.
Run Databricks Notebooks:
- Execute Databricks notebooks for data processing and transformation.
Visualize Data:
- Use Tableau, Power BI, or Looker Studio to create visualizations from the processed data.
- Link for Power BI: https://abrir.link/JqqDM

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

