Efficient Pandas and Ray Kafka producer for Python, built on the actor model.

ujjawal-khare-27/ray-kafka-producer

Kafka Producer for Efficient Data Streaming to Kafka

This Python Kafka producer streams data from a Ray DataFrame or a Pandas DataFrame to Kafka. It is optimized for throughput and provides approximately 3-4x better performance than a standard Kafka producer.

Installation

Install the package using pip:

pip3 install ray_kafka_producer@git+https://github.com/ujjawal-khare-27/ray-kafka-producer@main --force-reinstall

Usage

  1. Import the package
from ray_kafka_producer.producer_manager import KafkaProducerManager
  2. Create an instance of KafkaProducerManager
# actor_pool_size is the number of actors created to send data to Kafka
# num_cpu is the number of CPUs allocated to each actor
kafka_producer_manager = KafkaProducerManager(
    bootstrap_servers="localhost:9092",
    topic="test",
    actor_pool_size=12,
    num_cpu=0.25,
)
  3. Send messages to Kafka (Ray DataFrame)
kafka_producer_manager.flush_ray_df(df=ray_df, is_actor=True)
  4. Send messages to Kafka (Pandas DataFrame)
kafka_producer_manager.flush_pandas_df(df=pandas_df)
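
To give a feel for what fanning rows out across a pool of workers can look like, here is a minimal sketch. It is not this package's implementation: a standard-library thread pool stands in for Ray actors, the "send" step only serializes rows to JSON bytes instead of talking to Kafka, and the names split_into_chunks, send_chunk and fan_out are hypothetical. The input is a list of row dicts, such as what pandas' DataFrame.to_dict("records") returns.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Split records into at most n_chunks contiguous, near-equal chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be >= 1")
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional row each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few rows
            chunks.append(records[start:end])
        start = end
    return chunks


def send_chunk(chunk):
    """Stand-in for one worker: serialize each row to JSON bytes.

    Assumption: a real worker would pass each payload to a Kafka client
    instead of returning it.
    """
    return [json.dumps(row, sort_keys=True).encode("utf-8") for row in chunk]


def fan_out(records, pool_size):
    """Process chunks concurrently and return payloads in the original row order."""
    chunks = split_into_chunks(records, pool_size)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # pool.map yields results in the same order as the input chunks.
        results = list(pool.map(send_chunk, chunks))
    return [payload for batch in results for payload in batch]


# Example rows, shaped like pandas_df.to_dict("records") output.
records = [{"id": i, "value": i * i} for i in range(10)]
payloads = fan_out(records, pool_size=4)
```

In this sketch pool_size plays a role loosely analogous to actor_pool_size above: more workers means smaller chunks per worker, at the cost of more scheduling overhead.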