
Kafka client Java API wrapper #753

Closed
wants to merge 10 commits

Conversation

merlimat
Contributor

@merlimat merlimat commented Sep 9, 2017

Motivation

Add a Kafka API wrapper based on the Pulsar client library.

This will allow existing applications using the Kafka client library to publish to or subscribe to a Pulsar topic without any code change.

This first iteration is targeting the Kafka high-level consumer with managed offsets, with or without auto-commit.

Examples and documentation on the website will follow this PR.

Modifications

  • Add implementation of Kafka Producer and Consumer interfaces that internally use Pulsar client library
  • Use the Maven shade plugin to replace the KafkaProducer class with PulsarKafkaProducer in the jar

Result

The org.apache.pulsar:pulsar-client-kafka-compact artifact will be a drop-in replacement for org.apache.kafka:kafka-clients.
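For illustration, a minimal sketch of what unchanged application code could look like against the shaded artifact (the service URL, topic name, and the assumption that bootstrap.servers carries the Pulsar URL are placeholders, not taken from this PR):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExistingKafkaApp {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumption for this sketch: the wrapper reads the Pulsar service URL
        // from the standard bootstrap.servers property.
        props.put("bootstrap.servers", "pulsar://localhost:6650");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Unchanged Kafka code: with the shaded jar on the classpath, this
        // KafkaProducer reference is relocated to the Pulsar-backed implementation.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "hello")).get();
        }
    }
}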

@merlimat merlimat added the type/feature The PR added a new feature or issue requested a new feature label Sep 9, 2017
@merlimat merlimat added this to the 1.20.0-incubating milestone Sep 9, 2017
@merlimat merlimat self-assigned this Sep 9, 2017
</excludes>
</filter>
</filters>
<relocations>
Member

Just out of curiosity - that means the PulsarKafkaProducer needs to have exactly the same constructors as the original KafkaProducer, and the same applies to the consumer, right? How hard is it to keep such compatibility going forward?

Contributor Author

Yes, we need to use the same constructor as the original class.

I think we can maintain compatibility up to a specific kafka-clients version. That should be noted in the documentation. So far, I haven't seen differences in the constructors between the 0.9 and 0.11 versions.

If Kafka adds a new constructor in future releases, the client app would still work unless it tries to use that new constructor form.
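To make the compatibility constraint concrete, here is a sketch (class name and constructor bodies are illustrative, not from this PR) of the constructor shapes the wrapper has to mirror from kafka-clients 0.9–0.11:

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serializer;

public class PulsarKafkaProducerSketch<K, V> {
    public PulsarKafkaProducerSketch(Map<String, Object> configs) {
        this(configs, null, null);
    }

    public PulsarKafkaProducerSketch(Map<String, Object> configs,
            Serializer<K> keySerializer, Serializer<V> valueSerializer) {
        // translate the Kafka-style config map into a Pulsar client/producer configuration
    }

    public PulsarKafkaProducerSketch(Properties properties) {
        this(properties, null, null);
    }

    public PulsarKafkaProducerSketch(Properties properties,
            Serializer<K> keySerializer, Serializer<V> valueSerializer) {
        // same as above, reading from Properties
    }
}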

}

@Override
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
Member

callback is not used?

Contributor Author

Fixed

}

@Override
public void flush() {
Member

The flush semantics here are a bit problematic: flush might wait for records sent after the flush call, so with a high send rate it could potentially spin for a long time.

If the Pulsar producer guarantees ordering when producing messages, you might be able to just keep track of the last outstanding send and wait for its callback.

Contributor Author

Good point. We can just keep a hashmap with the future associated with the last send operation.
Then, in flush(), we can iterate over a snapshot of that map.
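A minimal sketch of that idea (names are illustrative, not the PR's): record the future of the last send per underlying producer, and have flush() wait only on a snapshot of those futures.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class LastSendTracker {
    private final ConcurrentHashMap<Object, CompletableFuture<?>> lastSendFuture =
            new ConcurrentHashMap<>();

    // Called from send(): remember the most recent outstanding write per producer.
    public void recordSend(Object producer, CompletableFuture<?> sendFuture) {
        lastSendFuture.put(producer, sendFuture);
    }

    // Called from flush(): wait only for the sends outstanding when flush() was
    // invoked, so later sends cannot extend the wait indefinitely.
    public void flush() {
        List<CompletableFuture<?>> snapshot = new ArrayList<>(lastSendFuture.values());
        CompletableFuture.allOf(snapshot.toArray(new CompletableFuture[0])).join();
    }
}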


@Override
public List<PartitionInfo> partitionsFor(String topic) {
return Collections.emptyList();
Member

Are you planning to implement this in the future? Add an issue and a TODO item here?

Contributor Author

We can easily get the partition list info, though most of the other fields won't apply here:

public class PartitionInfo {

    private final String topic;
    private final int partition;
    private final Node leader;
    private final Node[] replicas;
    private final Node[] inSyncReplicas;
...

I think we should just throw UnsupportedOperationException here for now

Member

yeah sounds good to me

try {
outstandingWrites.wait();
} catch (InterruptedException e) {
throw new RuntimeException(e);
Member

flush() is expecting an InterruptedException, so you can just re-throw e?

Contributor Author

The signature for flush() doesn't declare any checked exceptions:

/**
 * Flush any accumulated records from the producer. Blocks until all sends are complete.
 */
public void flush();

Member

Ah, I see. The Kafka producer was written in Scala, so it is able to throw InterruptedException without declaring it in the method signature.

V value = valueDeserializer.deserialize(topic, msg.getData());

ConsumerRecord<K, V> consumerRecord = new ConsumerRecord<>(topic, partition, offset, msg.getPublishTime(),
TimestampType.CREATE_TIME, -1, msg.hasKey() ? msg.getKey().length() : 0, msg.getData().length, key,
Member

we can support EVENT_TIME as well?

Contributor Author

Yes, fixed that on both producer and consumer, to pass and retrieve the event time.
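A small sketch of the consumer-side half of that change, assuming the Pulsar Message API of this release exposes getEventTime() and getPublishTime() accessors (hedged; not copied from the PR):

import org.apache.pulsar.client.api.Message;

public class TimestampResolver {
    // Prefer the event time when the producer set one; otherwise fall back to
    // the broker publish time (getEventTime() is assumed to return 0 when unset).
    public static long resolveTimestamp(Message msg) {
        long eventTime = msg.getEventTime();
        return eventTime > 0 ? eventTime : msg.getPublishTime();
    }
}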

@Override
public ConsumerRecords<K, V> poll(long timeoutMillis) {
try {
QueueItem item = receivedMessages.poll(timeoutMillis, TimeUnit.MILLISECONDS);
Member

This can be improved in the future: you can pull all the data already in the buffer, and only wait if there is no data in the buffer.

Contributor Author

Yes, that is the current behavior on the blocking queue.
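For reference, a sketch of the drain-then-wait pattern being discussed (QueueItem is a stand-in for the PR's internal type): block only when the buffer is empty, then take everything that is already buffered without further waiting.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DrainingPoller<QueueItem> {
    private final LinkedBlockingQueue<QueueItem> receivedMessages = new LinkedBlockingQueue<>();

    public List<QueueItem> poll(long timeoutMillis) throws InterruptedException {
        List<QueueItem> items = new ArrayList<>();
        QueueItem first = receivedMessages.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (first == null) {
            return items; // nothing arrived within the timeout
        }
        items.add(first);
        receivedMessages.drainTo(items); // grab whatever is already buffered, no extra wait
        return items;
    }
}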

records.put(tp, Lists.newArrayList(consumerRecord));

// Update last message id seen by application
lastMessageId.put(item.consumer, msgId);
Member

If the broker redelivers the messages, how does that impact things here?

Contributor Author

The lastMessageId will roll back as well, and the next commit() operation will do a cumulative acknowledge on that earlier message id.

If it was already acknowledged, it will become a no-op in the broker.

}

@Override
public void commitSync() {
Member

Can commitSync be a shortcut for commitAsync that waits for the callback?

Contributor Author

I'll try to refactor it. As it is, commitAsync() doesn't return anything or take a callback, so I'll add a private method and use it from the API methods.
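A sketch of that refactoring (the doCommitOffsets method name is hypothetical): both public commit variants funnel into one private async path, and commitSync() simply blocks on the returned future.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

abstract class CommitSupport {
    // Single place that performs the cumulative acknowledge for each partition.
    protected abstract CompletableFuture<Void> doCommitOffsets(
            Map<TopicPartition, OffsetAndMetadata> offsets);

    public void commitSync(Map<TopicPartition, OffsetAndMetadata> offsets) {
        doCommitOffsets(offsets).join();
    }

    public void commitAsync(Map<TopicPartition, OffsetAndMetadata> offsets) {
        doCommitOffsets(offsets); // fire-and-forget; a callback variant would chain here
    }
}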


@Override
public long position(TopicPartition partition) {
throw new UnsupportedOperationException();
Member

I think this can be implemented, no? You kept the lastMessageId?

Contributor Author

Yes, it can be implemented, and the committed() method as well. I just need to change the map, since the key is currently a Consumer instance.

@merlimat
Contributor Author

Addressed comments. Still a few more unit tests to add.

Member

@sijie sijie left a comment

LGTM +1 well done

@merlimat
Contributor Author

Updated. I have added a separate module to verify the shaded library within the tests.

@rdhabalia Please take a look.

@merlimat
Contributor Author

retest this please

throw new RuntimeException(e);
}

pulsarProducerConf = new ProducerConfiguration();
Contributor

How about setting compression.type? Pulsar and Kafka both support lz4 and gzip.

Contributor Author

Good point, we can easily do that.
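A sketch of that mapping (the default handling and the gzip-to-ZLIB choice are assumptions, since Pulsar's CompressionType offers ZLIB rather than gzip itself):

import org.apache.pulsar.client.api.CompressionType;

public class CompressionMapping {
    public static CompressionType fromKafka(String compressionType) {
        switch (compressionType == null ? "none" : compressionType) {
        case "lz4":
            return CompressionType.LZ4;
        case "gzip":
            return CompressionType.ZLIB; // closest available codec in Pulsar
        case "none":
            return CompressionType.NONE;
        default:
            throw new IllegalArgumentException("Unsupported compression.type: " + compressionType);
        }
    }
}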


pulsarProducerConf = new ProducerConfiguration();
pulsarProducerConf.setBatchingEnabled(true);
pulsarProducerConf.setBatchingMaxPublishDelay(1, TimeUnit.MILLISECONDS);
Contributor

Instead of defaulting to 1, how about getting the linger.ms property?

Contributor Author

The semantics of linger.ms are a bit different, in the sense that 0 doesn't mean "no batching" in Kafka.

We could use the linger.ms value if provided, and otherwise use a very low default, e.g. 500us.
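A sketch of that behavior (the property lookup and the 500us fallback are illustrative), using the ProducerConfiguration setters already shown in the diff:

import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.ProducerConfiguration;

public class BatchingDelayConfig {
    public static void configureBatching(Properties props, ProducerConfiguration conf) {
        conf.setBatchingEnabled(true);
        String lingerMs = props.getProperty("linger.ms");
        if (lingerMs != null) {
            // Honor the application-provided linger.ms value
            conf.setBatchingMaxPublishDelay(Long.parseLong(lingerMs), TimeUnit.MILLISECONDS);
        } else {
            // Otherwise use a very low default so batching stays effectively transparent
            conf.setBatchingMaxPublishDelay(500, TimeUnit.MICROSECONDS);
        }
    }
}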


private Message getMessage(ProducerRecord<K, V> record) {
if (record.partition() != null) {
throw new UnsupportedOperationException("");
Contributor

Does it mean this doesn't support Kafka topics with partitions?

Contributor Author

No, it's just that you cannot manually specify which partition to publish to. That would require some gymnastics to support. We could do it later.

Contributor

OK.. I saw that PulsarKafkaConsumer creates an individual Consumer for each partition. In a similar way, here we could create a separate producer for each partition and then send the message using the key.

Contributor Author

Yes, that would be an option as well. I did the producer part first and used the regular partitioned producer, though it might make sense to deal with the partitions manually.

Contributor Author

I tried doing manual partition management on the producer end, but it gets complicated to deal with the Kafka Partitioner interface. The tricky part is that it relies on a Cluster view, which is overly Kafka-specific.

I'll leave it for later if there is a need to support these features too.

// Combine ledger id and entry id to form offset
// Use less than 32 bits to represent entry id since it will get
// rolled over way before overflowing the max int range
long offset = (ledgerId << 28) | entryId;
Contributor

Now that we have started using 64-bit ledgerIds in BK, a ledgerId > 2^36 will give an incorrect value?

Contributor Author

That's correct, though 2^36 is ~68 billion ledgers.
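For reference, the packing and its inverse as a self-contained sketch: the entry id takes the low 28 bits and the ledger id the high bits, so the scheme holds as long as ledger ids stay below 2^36.

public class OffsetCodec {
    public static long toOffset(long ledgerId, long entryId) {
        // Same packing as in the diff: ledger id in the high bits, entry id in the low 28 bits
        return (ledgerId << 28) | entryId;
    }

    public static long ledgerId(long offset) {
        return offset >>> 28;
    }

    public static long entryId(long offset) {
        return offset & 0x0FFFFFFFL;
    }
}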

Contributor

@rdhabalia rdhabalia left a comment

LGTM.. just a few minor comments..

@merlimat
Contributor Author

@rdhabalia updated


// Wait for all consumers to be ready
for (int i = 0; i < partitionMetadata.partitions; i++) {
consumers.putIfAbsent(new TopicPartition(topic, i), futures.get(i).get());
Contributor

What if one of the partitioned consumers fails to subscribe?
E.g.: out of c1, c2, c3 -> c2 fails, so do we have to close c1 and c3? And should we remove c1 from consumers because we already added it?

@merlimat
Contributor Author

retest this please


// Wait for all consumers to be ready
for (int i = 0; i < partitionMetadata.partitions; i++) {
consumers.putIfAbsent(new TopicPartition(topic, i), futures.get(i).get());
Contributor

I think the previous comment is not showing up, so adding it here:
What if one of the partitioned consumers fails to subscribe?
E.g.: out of c1, c2, c3 -> c2 fails, so do we have to close c1 and c3? And should we remove c1 from consumers because we already added it?

Contributor Author

Oh yes, let me refactor that to make sure we clean up all intermediate consumers.
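A sketch of that cleanup (structure illustrative, not the PR's code): wait on the per-partition futures, and if any of them fails, close whatever consumers were already created before propagating the error.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClientException;

public class PartitionSubscribeHelper {
    public static List<Consumer> waitForAll(List<CompletableFuture<Consumer>> futures) {
        List<Consumer> created = new ArrayList<>();
        try {
            for (CompletableFuture<Consumer> future : futures) {
                created.add(future.join());
            }
            return created;
        } catch (RuntimeException e) {
            // One subscribe failed: close the consumers that did come up, then rethrow
            for (Consumer consumer : created) {
                try {
                    consumer.close();
                } catch (PulsarClientException ignore) {
                    // best-effort cleanup
                }
            }
            throw e;
        }
    }
}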

@merlimat
Contributor Author

@rdhabalia Addressed last comment

Contributor

@rdhabalia rdhabalia left a comment

👍

@merlimat
Contributor Author

retest this please

@merlimat
Contributor Author

For some reason the Jenkins build is not getting triggered again. I'll create a new PR.

@merlimat merlimat closed this Sep 16, 2017
@merlimat merlimat removed this from the 1.20.0-incubating milestone Sep 17, 2017