Building a Scalable RSS Feed Pipeline with Apache Airflow, Kafka, MongoDB, and a Flask API

Python 3.8

About

In today's data-driven world, processing large volumes of data in real time has become essential for many organizations. The Extract, Transform, Load (ETL) process is a common way to manage the flow of data between systems. In this article, we'll walk through how to build a scalable ETL pipeline using Apache Airflow, Kafka, Python, MongoDB, and Flask.

Architecture

[Architecture diagram]

Scenario

In this pipeline, the RSS feeds are scraped using a Python library called feedparser, which parses the XML data in the feeds and extracts the relevant information. The parsed data is then transformed into a standardized JSON format using Python's built-in json library. This format includes fields such as title, summary, link, published_date, and language, which make the data easier to analyze and consume.
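A minimal sketch of this parse-and-transform step is shown below (the feed URL and the helper name are illustrative assumptions; the DAG code in the repository is the authoritative implementation):

import json

import feedparser

def parse_feed(feed_url, language="en"):
    """Parse one RSS feed and return a list of standardized JSON records."""
    parsed = feedparser.parse(feed_url)  # fetch and parse the XML feed
    records = []
    for entry in parsed.entries:
        record = {
            "title": entry.get("title", ""),
            "summary": entry.get("summary", ""),
            "link": entry.get("link", ""),
            "published_date": entry.get("published", ""),
            "language": language,
        }
        records.append(json.dumps(record))  # one JSON document per feed entry
    return records

# Example (hypothetical feed URL):
# messages = parse_feed("https://example.com/rss", language="en")

Each JSON record can then be published to a Kafka topic, from which the MongoDB sink connector configured later in this guide writes it to the rss_feeds_collection.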

Prerequisites

Docker and Docker Compose
Git
Python 3.8 (to run the Flask API locally)

Setup

Clone the project to your desired location:

$ git clone https://github.com/Stefen-Taime/Scalable-RSS-Feed-Pipeline.git

Execute the following command to create the .env file containing the Airflow UID needed by docker-compose:

$ echo -e "AIRFLOW_UID=$(id -u)" > .env
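For example, on a machine where id -u returns 1000, the generated .env file will contain a single line:

AIRFLOW_UID=1000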

Build Docker:

$ docker-compose build 

Initialize Airflow database:

$ docker-compose up airflow-init

Start Containers:

$ docker-compose up -d

When everything is done, you can check all the containers running:

$ docker ps

Airflow Interface

You can now access the Airflow web interface at http://localhost:8080 with the default user defined in docker-compose.yml (username/password: airflow). From there, we can trigger our DAG and see all the tasks running.

[Airflow DAG run]
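If you prefer the command line, the DAG can also be listed and triggered from inside the webserver container (the service name airflow-webserver assumes the stock Airflow docker-compose layout, and rss_feed_pipeline is only a placeholder for whatever DAG id appears in your UI):

$ docker-compose exec airflow-webserver airflow dags list
$ docker-compose exec airflow-webserver airflow dags trigger rss_feed_pipeline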

Note: start the mongo-kafka docker-compose stack (next section) before running the DAG.

Setup Kafka and Mongo

Navigate to the mongo-kafka directory and start the containers:

$ cd mongo-kafka
$ docker-compose up -d

Execute the following command to create the MongoDB sink connector:

$ curl -X POST \
    -H "Content-Type: application/json" \
    --data @mongo-sink.json \
    http://localhost:8083/connectors
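To confirm the connector was registered, you can query the Kafka Connect REST API (the connector name comes from mongo-sink.json, so mongo-sink below is an assumption):

$ curl http://localhost:8083/connectors
$ curl http://localhost:8083/connectors/mongo-sink/status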



Connect to MongoDB and check that the feeds have been ingested:

$ docker exec -ti mongo mongo -u debezium -p dbz --authenticationDatabase admin localhost:27017/demo

> show collections;
> db.rss_feeds_collection.find();

[MongoDB query output]

Run the API

$ python api.py

[API response]
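Under the hood, the API is a Flask app that reads from MongoDB. A minimal sketch of the kind of endpoint it exposes is shown below (the route, port, and connection details are assumptions based on the setup above; see api.py for the real implementation):

from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)

# Connection details mirror the mongo-kafka setup used earlier (assumed here)
client = MongoClient("mongodb://debezium:dbz@localhost:27017/?authSource=admin")
collection = client["demo"]["rss_feeds_collection"]

@app.route("/feeds")
def get_feeds():
    # Return all stored RSS items, dropping Mongo's internal _id field
    docs = list(collection.find({}, {"_id": 0}))
    return jsonify(docs)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)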
