This is the graduate thesis work of Hector Miguel Rodriguez Sosa for the Computer Science class of 2025 at the University of Havana.
It is a data engineering project focused on a modular architecture for a data pipeline. The project is divided into two main parts:
- Research and Architecture Documentation: All the research and architecture design documents are located in the `docs` folder. These documents are written in Typst, so you need Typst installed on your system to compile them. Use `make` to compile all the files, `make clean` to delete the .pdf files, and `make watch <filepath>` to watch a specific file.
- Implementation: The implementation of the architecture is located in the `src` directory. It applies the architecture to a ride data pipeline app, using the following technologies:
  - Java for the core application
  - Scala for Kafka and MongoDB connector and future Spark jobs
  - Flink for stream processing
  - Kafka for event streaming
  - MongoDB for data storage
The current use case is a cab share data pipeline that processes streaming data using Flink, streams events with Kafka, and stores the data in MongoDB.
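The modular idea behind this flow can be sketched in plain Java. This is only an illustration, not the code in `src`: the names `RideEvent`, `EventSource`, `StreamProcessor`, `DataSink`, `Pipeline`, and `RidePipelineSketch` are hypothetical, the event fields and sample values are made up, and in-memory stand-ins take the place of Kafka, Flink, and MongoDB so the example runs with the standard library alone.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Demo entry point; every stage here is an in-memory stand-in (assumption). */
final class RidePipelineSketch {
    /** Wires stand-in stages together and returns what the sink stored. */
    static Map<String, Double> runDemo() {
        // Stand-in for the ingestion stage (Kafka's role in the project).
        EventSource source = () -> List.of(
                new RideEvent("r1", "A", 4.5),
                new RideEvent("r2", "B", 3.0),
                new RideEvent("r3", "A", 5.5));

        // Stand-in for the processing stage (Flink's role): total fare per zone.
        StreamProcessor totalFarePerZone = events -> {
            Map<String, Double> totals = new TreeMap<>();
            for (RideEvent e : events) {
                totals.merge(e.zone, e.fare, Double::sum);
            }
            return totals;
        };

        // Stand-in for the storage stage (MongoDB's role): keep results in a map.
        Map<String, Double> stored = new TreeMap<>();
        DataSink sink = stored::putAll;

        new Pipeline(source, totalFarePerZone, sink).runOnce();
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints {A=10.0, B=3.0}
    }
}

/** Hypothetical ride event; the real schema may differ. */
final class RideEvent {
    final String rideId;
    final String zone;
    final double fare;

    RideEvent(String rideId, String zone, double fare) {
        this.rideId = rideId;
        this.zone = zone;
        this.fare = fare;
    }
}

/** Ingestion stage: supplies a batch of events. */
interface EventSource {
    List<RideEvent> poll();
}

/** Processing stage: turns events into an aggregated result. */
interface StreamProcessor {
    Map<String, Double> process(List<RideEvent> events);
}

/** Storage stage: persists the processed result. */
interface DataSink {
    void write(Map<String, Double> result);
}

/** Connects the stages through their interfaces only, so each can be swapped. */
final class Pipeline {
    private final EventSource source;
    private final StreamProcessor processor;
    private final DataSink sink;

    Pipeline(EventSource source, StreamProcessor processor, DataSink sink) {
        this.source = source;
        this.processor = processor;
        this.sink = sink;
    }

    /** Moves one batch from the source, through the processor, into the sink. */
    void runOnce() {
        sink.write(processor.process(source.poll()));
    }
}
```

Because `Pipeline` depends only on the three interfaces, a stage could in principle be replaced (for example, a different sink) without touching the others, which is the property a modular architecture aims for.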
Future enhancements to this project include:
- Adding a Monitoring Layer
- Adding a Presentation Layer
- Adding a Batch Layer using Spark