Real-time User Data Streaming

Developing a data pipeline to stream user data from a user generator API, apply necessary transformations, and seamlessly insert the processed data into a distributed storage system.

In this project, I have used the following technologies:

  • DOCKER
  • APACHE AIRFLOW
  • APACHE KAFKA with schema registry & Control Center
  • APACHE CASSANDRA
  • APACHE SPARK CLUSTER [PYSPARK]
  • POSTGRESQL

Project Architecture


This project aims to create a data streaming pipeline using the Kappa architecture, fully deployed within Docker containers for easy management and scalability. The pipeline begins with streaming user data generated by a Random Generator API into a Kafka broker. The data is structured according to a predefined schema stored in the Schema Registry, ensuring consistency and compatibility across the pipeline.
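As a rough illustration of the ingestion step, the sketch below polls a random user API and publishes each record to a Kafka topic. The API URL, broker address, topic name, and selected fields are assumptions for illustration only; the actual pipeline serializes records against the schema registered in the Schema Registry rather than plain JSON.

```python
import json
import requests
from kafka import KafkaProducer

# Assumed endpoint, broker address, and topic name -- adjust to the actual Docker Compose setup.
API_URL = "https://randomuser.me/api/"
BOOTSTRAP_SERVERS = ["localhost:9092"]
TOPIC = "users_created"

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def fetch_user() -> dict:
    """Fetch one random user and keep only the fields the pipeline expects."""
    res = requests.get(API_URL, timeout=10).json()["results"][0]
    return {
        "first_name": res["name"]["first"],
        "last_name": res["name"]["last"],
        "email": res["email"],
        "country": res["location"]["country"],
    }

if __name__ == "__main__":
    producer.send(TOPIC, value=fetch_user())
    producer.flush()
```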

Once the data is ingested into a Kafka topic, it is processed in real time by a Spark cluster. Spark applies the necessary transformations to the incoming data streams. After processing, the data is loaded into a Cassandra keyspace for storage and querying.
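A minimal PySpark Structured Streaming sketch of this stage is shown below: it reads the topic, parses the JSON payload against an explicit schema, and writes the result to Cassandra. The host names, topic, keyspace, table, and column set are assumptions, and the job presumes the spark-cassandra-connector (2.5+) is on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Host names, topic, keyspace, and table below are illustrative assumptions.
spark = (
    SparkSession.builder
    .appName("UserStreamProcessing")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

user_schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
    StructField("country", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")
    .option("subscribe", "users_created")
    .option("startingOffsets", "earliest")
    .load()
)

# Decode the Kafka value bytes and project the JSON fields into columns.
users = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), user_schema).alias("data"))
    .select("data.*")
)

# Streaming Cassandra sink; requires spark-cassandra-connector 2.5+ on the classpath.
query = (
    users.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("checkpointLocation", "/tmp/checkpoints/users")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .start()
)
query.awaitTermination()
```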

Apache Airflow plays a crucial role by orchestrating the entire data pipeline, managing and scheduling the various tasks involved. This ensures that each component runs in the correct sequence and that dependencies between tasks are handled efficiently.
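A condensed example of what such an orchestration DAG might look like is given below. The DAG id, schedule, and task body are placeholders; the real DAG wraps the API-to-Kafka streaming step (and any other pipeline tasks) with its own names and schedule.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stream_user_data():
    # Placeholder task body: in the real pipeline this polls the user
    # generator API and publishes the records to the Kafka topic.
    print("streaming user data to Kafka...")


# DAG id, owner, and schedule below are illustrative placeholders.
with DAG(
    dag_id="user_data_streaming",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"owner": "airflow"},
) as dag:
    stream_task = PythonOperator(
        task_id="stream_user_data",
        python_callable=stream_user_data,
    )
```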

The entire data pipeline is deployed within Docker containers, providing an isolated and consistent environment for each component. By leveraging the Kappa architecture, this pipeline focuses on processing real-time data streams, ensuring that the system can efficiently handle large volumes of user-generated data.

This project showcases the integration of Kafka for distributed data streaming, Spark for real-time processing, Cassandra for scalable storage, and Apache Airflow for workflow orchestration, all orchestrated within Docker for a streamlined and easily deployable solution.
