Real-time User Data Streaming

Developing a data pipeline to stream user data from a user generator API, apply necessary transformations, and seamlessly insert the processed data into a distributed storage system.

In this project, I have used the following technologies:

  • DOCKER
  • APACHE AIRFLOW
  • APACHE KAFKA with schema registry & Control Center
  • APACHE CASSANDRA
  • APACHE SPARK CLUSTER [PYSPARK]
  • POSTGRESQL

Project Architecture


This project aims to create a data streaming pipeline using the Kappa architecture, fully deployed within Docker containers for easy management and scalability. The pipeline begins with streaming user data generated by a Random Generator API into a Kafka broker. The data is structured according to a predefined schema stored in the Schema Registry, ensuring consistency and compatibility across the pipeline.
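As a rough illustration of the ingestion step, the sketch below polls a random user API and publishes each record to a Kafka topic. The API URL, broker address, topic name, and selected fields are assumptions for illustration only; the actual pipeline serializes records against the schema registered in the Schema Registry rather than plain JSON.

```python
import json
import requests
from kafka import KafkaProducer

# Assumed endpoint, broker address, and topic name -- adjust to the actual Docker Compose setup.
API_URL = "https://randomuser.me/api/"
BOOTSTRAP_SERVERS = ["localhost:9092"]
TOPIC = "users_created"

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def fetch_user() -> dict:
    """Fetch one random user and keep only the fields the pipeline expects."""
    res = requests.get(API_URL, timeout=10).json()["results"][0]
    return {
        "first_name": res["name"]["first"],
        "last_name": res["name"]["last"],
        "email": res["email"],
        "country": res["location"]["country"],
    }

if __name__ == "__main__":
    producer.send(TOPIC, value=fetch_user())
    producer.flush()
```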

Once the data is ingested into a Kafka topic, it is processed in real time by a Spark cluster. Spark applies the necessary transformations to the incoming data streams. After processing, the data is loaded into a Cassandra keyspace for storage and querying.
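A minimal PySpark Structured Streaming sketch of this stage is shown below: it reads the topic, parses the JSON payload against an explicit schema, and writes the result to Cassandra. The host names, topic, keyspace, table, and column set are assumptions, and the job presumes the spark-cassandra-connector (2.5+) is on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Host names, topic, keyspace, and table below are illustrative assumptions.
spark = (
    SparkSession.builder
    .appName("UserStreamProcessing")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

user_schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
    StructField("country", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")
    .option("subscribe", "users_created")
    .option("startingOffsets", "earliest")
    .load()
)

# Decode the Kafka value bytes and project the JSON fields into columns.
users = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), user_schema).alias("data"))
    .select("data.*")
)

# Streaming Cassandra sink; requires spark-cassandra-connector 2.5+ on the classpath.
query = (
    users.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("checkpointLocation", "/tmp/checkpoints/users")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .start()
)
query.awaitTermination()
```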

Apache Airflow plays a crucial role by orchestrating the entire data pipeline, managing and scheduling the various tasks involved. This ensures that each component runs in the correct sequence and that dependencies between tasks are handled efficiently.
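A condensed example of what such an orchestration DAG might look like is given below. The DAG id, schedule, and task body are placeholders; the real DAG wraps the API-to-Kafka streaming step (and any other pipeline tasks) with its own names and schedule.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stream_user_data():
    # Placeholder task body: in the real pipeline this polls the user
    # generator API and publishes the records to the Kafka topic.
    print("streaming user data to Kafka...")


# DAG id, owner, and schedule below are illustrative placeholders.
with DAG(
    dag_id="user_data_streaming",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"owner": "airflow"},
) as dag:
    stream_task = PythonOperator(
        task_id="stream_user_data",
        python_callable=stream_user_data,
    )
```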

The entire data pipeline is deployed within Docker containers, providing an isolated and consistent environment for each component. By leveraging the Kappa architecture, this pipeline focuses on processing real-time data streams, ensuring that the system can efficiently handle large volumes of user-generated data.

This project showcases the integration of Kafka for distributed data streaming, Spark for real-time processing, Cassandra for scalable storage, and Apache Airflow for workflow orchestration, all orchestrated within Docker for a streamlined and easily deployable solution.
