The architecture consists of the following components:
- Producer: Reads frames from video files or live streams and publishes them to the first Kafka server. Each frame is sent to a topic named after the video file (see the sketch after this list).
- Kafka Server: Stores frames in their respective topics.
- Spark Consumer: Consumes frames from Kafka, applies a user-defined function (UDF), such as a face detector, and pushes processed frames to a second Kafka server.
- Final Kafka Consumer: Writes frames according to the topic name and saves the processed videos to the output folder.
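To make the producer step concrete, here is a minimal sketch (assuming OpenCV and confluent-kafka) of reading frames from a file and publishing them to a topic named after the video. The file path, broker port, and JPEG serialization are illustrative assumptions; the repo's confluentKafkaProducer script remains the authoritative version.

```python
import os
import cv2
from confluent_kafka import Producer

# Hypothetical input file and broker port; adjust to your setup.
video_path = "videos/camera1.mp4"
producer = Producer({"bootstrap.servers": "localhost:9093"})  # first Kafka server

# Topic name = video file name (without extension).
topic = os.path.splitext(os.path.basename(video_path))[0]

cap = cv2.VideoCapture(video_path)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Encode each frame as JPEG bytes before publishing it.
    _, jpeg = cv2.imencode(".jpg", frame)
    producer.produce(topic, value=jpeg.tobytes())
    producer.poll(0)  # serve delivery callbacks and free queue space

cap.release()
producer.flush()
```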
Spin up Kafka containers for two servers (listening on ports 9093 and 9095) and Zookeeper (listening on port 2181) using Docker Compose:
docker-compose up -d
Kafka:
- Run Kafka using kafka_start.sh.
- Create two separate server.properties files in the conf directory, one per broker, each with its own broker ID and listener port (e.g. 9093 and 9095, matching the Docker Compose setup above).
Spark:
- Download and install Spark from Apache Spark Downloads.
- Alternatively, use the provided Dockerfile for Spark installation.
Python Libraries:
- Create a Conda environment and install the required libraries from requirements.txt:
pip install -r requirements.txt
- Start the Producer:
python confluentKafkaProducer.py
- Start the Spark Consumer:
  - Source the bash profile:
  source ~/.bash_profile
  - Run Spark with the following command:
  spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 sparkConsumer.py
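For orientation, a minimal sketch of what the Spark Structured Streaming consumer can look like is shown below: it reads frames from the first Kafka server, applies a UDF to each frame, and writes the results to the second Kafka server under the same topic names. The broker ports, subscribe pattern, pass-through UDF, and checkpoint path are assumptions; sparkConsumer.py in the repo is the actual implementation.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType

spark = SparkSession.builder.appName("frame-processor").getOrCreate()

def process_frame(frame_bytes):
    # Placeholder for the real UDF (e.g. a face detector);
    # here it simply passes the frame bytes through unchanged.
    return frame_bytes

process_udf = udf(process_frame, BinaryType())

# Read frames from all topics on the first Kafka server.
frames = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9093")
    .option("subscribePattern", ".*")
    .load()
)

# Keep the topic column so processed frames are routed to the same topic name.
processed = frames.select(
    frames.topic.alias("topic"),
    process_udf(frames.value).alias("value"),
)

# Write the processed frames to the second Kafka server.
query = (
    processed.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9095")
    .option("checkpointLocation", "/tmp/spark-frame-checkpoint")
    .start()
)
query.awaitTermination()
```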
- Start the Kafka Consumer:
python kafkaConsumer.py
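Below is a minimal sketch of the final consumer, assuming frames arrive as JPEG bytes and that one video file per topic is written with OpenCV. The broker port, group id, topic list, frame rate, and output folder are illustrative assumptions; kafkaConsumer.py in the repo is the actual implementation.

```python
import cv2
import numpy as np
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9095",  # second Kafka server
    "group.id": "video-writer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["camera1"])  # hypothetical processed topic(s)

writers = {}  # one cv2.VideoWriter per topic
fourcc = cv2.VideoWriter_fourcc(*"mp4v")

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        # Decode the JPEG bytes back into a frame.
        frame = cv2.imdecode(np.frombuffer(msg.value(), dtype=np.uint8), cv2.IMREAD_COLOR)
        topic = msg.topic()
        if topic not in writers:
            height, width = frame.shape[:2]
            # Output file is named after the topic; 25 fps is assumed.
            writers[topic] = cv2.VideoWriter(f"output/{topic}.mp4", fourcc, 25.0, (width, height))
        writers[topic].write(frame)
finally:
    for writer in writers.values():
        writer.release()
    consumer.close()
```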
- To ensure Spark can access the Conda environment's libraries, set these environment variables:
export PYSPARK_PYTHON=$(which python)
export PYSPARK_DRIVER_PYTHON=$(which python)
- To list the existing Kafka topics (replace PORT with the broker's port, e.g. 9093 or 9095):
bin/kafka-topics.sh --list --bootstrap-server localhost:PORT
- To delete a Kafka topic:
kafka-topics.sh --bootstrap-server localhost:PORT --delete --topic your_topic_name
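If the Kafka CLI scripts are not on the PATH, the same housekeeping can be done from Python with confluent-kafka's AdminClient; the broker port and topic name below are assumptions.

```python
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9093"})

# List the existing topics.
metadata = admin.list_topics(timeout=10)
print(list(metadata.topics.keys()))

# Delete a topic; delete_topics() returns futures, so wait on them to confirm.
futures = admin.delete_topics(["your_topic_name"])
for topic, future in futures.items():
    future.result()  # raises an exception if the deletion failed
    print(f"deleted {topic}")
```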
Some suggestions:
- Here I have used only two brokers with a replication factor of 2; you can adjust this to your requirements.
- I have used only one partition per topic; you can increase the partition count for faster processing.
- You can use the Kafka Streams API instead of Spark for processing the frames.
- You can work on tracking objects across frames; the basic code is in the repo.
- I am using a .csv file to read camera metadata; you can use a database for storing camera details instead.