Building-streaming-ETL-Data-pipeline


Building a streaming data pipeline using Apache Airflow, Kafka, Spark, and container-based object storage (MinIO S3 bucket)

1. Project overview and architecture

In this project, we build a real-time ETL (Extract, Transform, and Load) data pipeline, using an open API as the data source. Building a streaming ETL pipeline involves ingesting real-time data, processing and transforming it, and loading it into a data storage or analytics system. This overview outlines the process of building such a pipeline using Apache Kafka for data ingestion, Apache Spark for data processing, and MinIO S3-compatible object storage for data storage.



Our project is composed of several services:

a. Apache Kafka

  • Set up a Kafka cluster: Deploy a Kafka cluster with multiple brokers for high availability and scalability.

  • Create Kafka topics: Define topics to categorize and organize the incoming data streams based on their sources or types.

  • Configure Kafka producers: Integrate Kafka producers to send data from the open API to the appropriate Kafka topic; a minimal producer sketch follows this list.
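
As a rough illustration, the sketch below shows what such a producer could look like. It assumes the kafka-python package, a broker reachable at localhost:9092 (adjust to your Compose port mapping), and a placeholder open API URL; it is not the project's actual producer code.

```python
# Minimal Kafka producer sketch (assumptions: kafka-python installed,
# broker reachable at localhost:9092, placeholder open API URL).
import json
import time

import requests
from kafka import KafkaProducer

API_URL = "https://example.com/open-api/data"  # placeholder: replace with the actual open API endpoint

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # adjust to your broker host:port
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    record = requests.get(API_URL, timeout=10).json()   # pull one payload from the open API
    producer.send("streaming-topic", value=record)      # topic created in step 4.2 below
    producer.flush()                                     # block until the message is delivered
    time.sleep(5)                                        # simple polling interval
```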



b. Automation and Orchestration: Apache Airflow

Leverage automation and orchestration tools (e.g., Apache Airflow) to manage and coordinate the various components of the pipeline, enabling efficient deployment, scheduling, and maintenance.
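
For orientation, here is a minimal, hypothetical Airflow DAG that schedules an ingestion task. The DAG id, schedule, and task callable are illustrative assumptions (Airflow 2.x syntax), not the project's actual DAG definition.

```python
# Hypothetical Airflow DAG sketch (assumptions: Airflow 2.x, illustrative names).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def produce_to_kafka():
    """Placeholder for the Kafka producer logic from section a."""
    pass


with DAG(
    dag_id="streaming_etl_pipeline",      # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",          # how often the ingestion task is triggered
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="produce_api_data_to_kafka",
        python_callable=produce_to_kafka,
    )
```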



c. Data Processing with Apache Spark

Apache Spark is a powerful open-source distributed processing framework that excels at processing large-scale data streams. In this pipeline, Spark consumes data from Kafka topics, performs transformations and computations, and prepares the data for storage in MinIO's S3-compatible object storage.

  • Configure Spark Streaming: Set up a Spark Structured Streaming application to consume data from the Kafka topic in real time.
  • Define transformations: Implement the necessary transformations and computations on the incoming data streams using Spark's APIs. This may include data cleaning, filtering, aggregations, and enrichment from other data sources.
  • Integrate with MinIO S3: Configure Spark to write the processed data to MinIO S3 object storage in a suitable format (e.g., Parquet, Avro, or CSV) for efficient storage and querying. A sketch of such a streaming job follows this list.
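
The following sketch shows one way to wire these pieces together with PySpark Structured Streaming. The broker address (broker-1:9092), the bucket name streaming-data, the schema fields, and the reuse of the MinIO console credentials as S3 access keys are all assumptions for illustration; the application actually submitted by the scripts in this repository may differ.

```python
# Sketch of a Spark Structured Streaming job (assumptions: PySpark with the
# spark-sql-kafka and hadoop-aws JARs on the classpath, broker at broker-1:9092,
# MinIO API at http://minio:9000, hypothetical bucket "streaming-data").
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder.appName("streaming-etl")
    # MinIO (S3-compatible) configuration
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "MINIOAIRFLOW01")
    .config("spark.hadoop.fs.s3a.secret.key", "AIRFLOW123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Illustrative schema: replace with the fields returned by the open API.
schema = StructType([StructField("id", StringType()), StructField("value", StringType())])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "streaming-topic")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers bytes; decode the value column and parse the JSON payload.
parsed = raw.select(from_json(col("value").cast("string"), schema).alias("data")).select("data.*")

query = (
    parsed.writeStream.format("parquet")
    .option("path", "s3a://streaming-data/events/")              # hypothetical bucket/prefix
    .option("checkpointLocation", "s3a://streaming-data/checkpoints/")
    .start()
)
query.awaitTermination()
```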

d. Data Storage in MinIO S3

MinIO is a high-performance, S3-compatible object store. A MinIO bucket is equivalent to an S3 bucket: the fundamental container used to store objects (files) in object storage. In this pipeline, MinIO serves as the final destination for the processed data streams.

  • Create S3 bucket: Set up a MinIO S3 bucket to store the processed data in real time.

  • Define data organization: Determine the appropriate folder structure and naming conventions for organizing the data in the bucket based on factors such as time, source, or data type.

  • Configure access and permissions: Create an appropriate access key, secret key, and permissions for the MinIO object storage to ensure data security and compliance with organizational policies. A short bucket-setup sketch follows this list.
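
As an illustration, the snippet below creates a bucket with the MinIO Python SDK. It assumes the minio package, the API endpoint exposed at localhost:9000, the root credentials from section 4 reused as access/secret keys, and a hypothetical bucket name streaming-data; in practice you may prefer a dedicated access key created in the MinIO console.

```python
# Bucket-setup sketch (assumptions: `minio` package installed, API at localhost:9000,
# root credentials used as keys, hypothetical bucket name "streaming-data").
from minio import Minio

client = Minio(
    "localhost:9000",               # MinIO API endpoint (not the console port 9001)
    access_key="MINIOAIRFLOW01",
    secret_key="AIRFLOW123",
    secure=False,                   # the local deployment uses plain HTTP
)

if not client.bucket_exists("streaming-data"):
    client.make_bucket("streaming-data")
```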



2. Getting Started

Prerequisites

  • Understanding of Docker, Docker Compose, and container networking
  • An S3 bucket created: we will use MinIO object storage
  • Basic understanding of Python and Apache Spark Structured Streaming
  • Knowledge of how Kafka works: topics, brokers, partitions, and Kafka streaming
  • Basic understanding of distributed systems

3. Setting up project environment:

  • Make sure Docker is running: from a terminal, run docker --version

  • Clone the repository and navigate to the project directory

 git clone https://github.com/fermat01/Building-streaming-ETL-Data-pipeline.git

and

 cd Building-streaming-ETL-Data-pipeline

Create all services using docker compose

docker compose up -d 



4. Access the services:

  1. Access the Airflow UI at http://localhost:8080 using the given credentials, username: $\color{orange}{airflow01}$ and password: $\color{orange}{airflow01}$



  2. Access the Kafka UI at http://localhost:8888 and create a topic named $\color{orange}{streaming-topic}$ with number of partitions: $\color{orange}{6}$ (alternatively, the topic can be created programmatically; see the sketch after this list)



  3. Access the MinIO UI at http://127.0.0.1:9001 with credentials username: $\color{orange}{MINIOAIRFLOW01}$ and password: $\color{orange}{AIRFLOW123}$
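
If you prefer to create the topic from code rather than through the Kafka UI, a sketch using kafka-python's admin client is shown below. The bootstrap address localhost:9092 and the replication factor of 1 are assumptions; adjust them to your broker setup.

```python
# Optional: create the topic programmatically (assumptions: kafka-python installed,
# broker reachable at localhost:9092, replication factor 1 for a local setup).
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics(
    new_topics=[NewTopic(name="streaming-topic", num_partitions=6, replication_factor=1)]
)
admin.close()
```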


5. Spark application

Before submitting the Spark application, it is important to understand how Spark communicates with Apache Kafka and the MinIO container-based object storage when running under Docker. Make sure to verify the broker ports and hostnames. The required JAR files must be downloaded.

  1. Make sure MinIO exposes the correct API URL, http://minio:9000, so that Spark can communicate with it.



  2. The Spark version can be verified using: /opt/bitnami/spark/bin/spark-submit --version
  3. The Kafka version can be checked from the log of one of the broker containers: docker logs broker-1 | grep -i "kafka.*version"


  4. Download the required JAR files and submit your Spark application. The scripts folder contains three files.

  5. From the scripts folder, run this command in your terminal to download all required JAR files:

     ./download_jars.sh

  6. From the same scripts folder, run this command in your terminal to submit the Spark application:

     ./run_spark_submit.sh

  7. Go back to the MinIO bucket to ensure that the data has been uploaded; a small verification sketch follows below.

And voilà, it worked!
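
To double-check the upload from code rather than the MinIO console, a small listing sketch is shown below, reusing the assumed credentials and the hypothetical streaming-data bucket from the earlier sketches.

```python
# Verification sketch (assumptions: `minio` package, endpoint localhost:9000,
# hypothetical bucket/prefix from the Spark sketch above).
from minio import Minio

client = Minio(
    "localhost:9000",
    access_key="MINIOAIRFLOW01",
    secret_key="AIRFLOW123",
    secure=False,
)

for obj in client.list_objects("streaming-data", prefix="events/", recursive=True):
    print(obj.object_name, obj.size)   # Parquet part files written by the streaming job
```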



6. Conclusion

This project successfully demonstrates the construction of a real-time ETL (Extract, Transform, Load) data pipeline using Apache Kafka for data ingestion, Apache Spark for data processing, and a MinIO S3 bucket for data storage. By leveraging open APIs, we were able to ingest real-time data, process and transform it efficiently, and load it into a robust storage system for further analysis.

The use of Apache Kafka provided a scalable and fault-tolerant platform for data ingestion, ensuring that data streams were handled effectively. Apache Spark enabled real-time data processing and transformation, offering powerful capabilities for handling large datasets with low latency. Finally, MinIO S3 object storage served as a reliable and scalable storage solution, allowing for seamless integration with various analytics tools.

Throughout this project, we highlighted the importance of selecting appropriate tools and technologies to meet the specific requirements of real-time data processing. The integration of these components resulted in a flexible, scalable, and efficient ETL pipeline capable of handling diverse data sources and formats.