PIP 94: Message converter at broker level #11962
Comments
Do we need to add a Converter on the publish side?
We should define some constraints on what the Converter can do with the input ByteBuf. Should it keep the reader/writer indexes? What about the refcount?
I would add an "init(ServerConfiguration configuration)" method and a close() method, in order to let it be configurable and to let it release resources.
Protocol handler can be used for the publish side.
For example, in KoP, if messages were produced from a Pulsar producer, when KoP reads entries from BK, it converts the message to Pulsar format so that the Pulsar consumer can receive the message.
See streamnative/kop#632 for details.
Thanks,
Yunze Xu
… On 8 Sep 2021, at 14:03, Enrico Olivelli ***@***.***> wrote:
do we need do add a Converter on the Publish side ?
IIUC this Converter is for the Consumer
Thanks for your suggestion. It makes sense to me. When we need to load converters from other systems, the load and release phases are necessary.
If I use a Kafka client to produce messages and a Pulsar client to consume messages, the performance of message publish could be improved, but the performance of message dispatch may be degraded.
@wangjialing218 The conversion is unavoidable. We have to sacrifice either producer or consumer side. But before this proposal, if we're going to support both Kafka and Pulsar clients, the conversion was:
After this proposal:
For your second question
Currently yes. But I'm not sure if the implementation can be improved. In my demo implementation I performed the conversion in |
BTW, here is my demo implementation in https://github.com/BewareMyPower/pulsar/tree/bewaremypower/payload-converter (will be deleted after pushing the final implementation). The unit tests were not pushed but I've added them in my local env. And the interface was also a little different. |
Yes. I'll add more description. |
@BewareMyPower Just a idea. Current there is one ManagedLedger(ledger) associated with
|
@wangjialing218 It could be an enhancement. I think we can submit another PIP for the design details. It sounds like a solution that each

For example, in KoP, there's still a

@eolivelli BTW, I just found a problem from @wangjialing218's idea. If we passed

Here is an example if the converter needs to initialize some resources.

```java
KeyMessageConverter converter = new KeyMessageConverter(remoteServerAddress);
converter.init(); // request key from remote server once or periodically
// call accept() and convert() using the dynamically changed key...
converter.close(); // close the connection to key server
```
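To make the lifecycle concern concrete, here is a rough sketch of what such a `KeyMessageConverter` could look like. Everything below is hypothetical: the class only appears in the example above, and a real converter would also implement the `accept`/`convert` methods discussed in this thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class KeyMessageConverter {
    private final String remoteServerAddress;
    private volatile byte[] key;
    private ScheduledExecutorService scheduler;

    public KeyMessageConverter(String remoteServerAddress) {
        this.remoteServerAddress = remoteServerAddress;
    }

    public void init() {
        // Fetch the key once and refresh it periodically from the remote server.
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::refreshKey, 0, 5, TimeUnit.MINUTES);
    }

    public void close() {
        // Release the resources created in init().
        scheduler.shutdownNow();
    }

    private void refreshKey() {
        // Request the latest key from remoteServerAddress (omitted here);
        // accept()/convert() would read the volatile 'key' field.
        key = new byte[0];
    }
}
```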
I've updated the |
We should avoid converting (serialization and deserialization) the data at the broker side; this will put a very heavy burden on the broker GC. In my opinion, we should do the data conversion at the client side. We can have diverse data format implementations and, by default, the Pulsar client only has the Pulsar data format processor. If users want to consume data with the Kafka data format, they can add a separate dependency such as

The data sent to the Kafka client should not be considered by the Pulsar broker; KoP should handle it. If the data is in Kafka format, KoP can send it directly to the Kafka consumer; if the data is in Pulsar format, KoP needs to convert it to the Kafka format.

For the storage layer (BookKeeper and tiered storage), the data might be read directly, bypassing the broker, such as by PulsarSQL or Flink (in the future). This also needs to be considered if we do the data conversion at the broker side: we might need another implementation to read data with multiple data formats from BookKeeper/tiered storage.
We can't use multiple managed ledgers for a topic; this would break the FIFO semantics if you have Kafka producers and Pulsar producers publishing data to the topic, and Kafka consumers and Pulsar consumers consuming data from the topic.
If KoP wants to convert the data on the publish path, KoP (or other protocol handlers) can implement it directly; is there any reason to introduce the converter in the broker for data publishing? And I think if using the kafka format for KoP, KoP will convert the data on the publish side for now, which is an inefficient way.
This is not always possible because downstream users may not be aware of the format of the data. BTW, this will be very hard in big enterprises: requiring a client-side plugin to be added to every possible client.
@eolivelli Both KoP and the client-side converter are optional plugins, not required. If users tend to convert the data at the broker side, the existing KoP implementation already supports this requirement: they can just convert the data at the publish side, with no need to convert it multiple times on the consumption side.

I think we need to clarify the purpose of this proposal: we are trying to find a way to support efficient data conversion between different formats such as Kafka and Pulsar. The initial motivation is that we have seen the performance issue on KoP before, because we did the data conversion at the broker side, so we introduced the data format (Kafka or Pulsar) in KoP to avoid the data conversion. This will not prevent users from using the broker-side data conversion; it just provides a more efficient way to handle it. It is just one more choice for users, but if we did the data conversion at the broker side, users could only choose one solution. (Essentially, there is no difference between the publishing converter and the consumption converter, and the consumption converter is more expensive than the publishing side, as @wangjialing218 has mentioned here #11962 (comment).)
@codelipenghui The GC problem for KoP is caused by the heap memory usage, not the conversion; the converter only affects CPU theoretically, because all memory usage of messages comes from direct memory.
It's not. Introducing the data format in KoP is to avoid the data conversion, right. Because we tend to push users to use the pure Kafka client when using KoP. However, if they want to switch to the Pulsar client later, it could be impossible because there's no way for a Pulsar consumer to consume the messages produced by a Kafka producer. Actually, it's right that the conversion should be performed at the client side. But upgrading the client might not always be easy. @eolivelli was right that this is not always possible because downstream users may not be aware of the format of the data.
No. It requires configuring the

In conclusion, supporting the message converter at both the broker level and the client level is necessary. Both have pros and cons.
It's a priority comparison about availability vs. performance. I chose availability, so I tried to implement the message converter at the broker side first. The message converter here can also be applied to the Pulsar client. I'm not sure if it's proper to include two tasks in one PIP. If yes, I can add the client-side support in this proposal.
@codelipenghui You can see streamnative/kop#673 for more details about the GC problem. In this PR, I used the direct memory allocator and the GC problem was fixed. The remaining heap memory usage is a KoP-side problem, not related to data format conversion. Before this PR, deserializing Kafka records also didn't use heap memory.
Because we are converting the Kafka format data to Pulsar format data, right? If yes, what is the difference when we convert Kafka format data to Pulsar format data on the consumption path?
If using for the Kafka client publish: kafka format -> pulsar format; for the Kafka client consume: pulsar format -> kafka format (improved by streamnative/kop#673).

After introducing the message converter and for the Kafka client publish: kafka format -> kafka format; for the Kafka client consume: kafka format -> kafka format.

Is my understanding above correct @BewareMyPower? I said
@codelipenghui The fact is that most existing KoP users prefer
Even without these concerns, you can see the following table for the comparison.
After PIP 94, the Not Available entry would become Low. The main difference is, for all Kafka producers,
After PIP 94, the "No conversion" in the first row would become a check of the entry buffer.
If the user only had Kafka clients for now, they can choose

If the user already had Kafka clients and Pulsar clients for the same topics, they must choose
@codelipenghui We could consider how to keep FIFO semantics when using multiple managed ledgers. For example, only one primary ledger is writable (storing the original messages from producers), the other ledgers are readable (storing the conversion result for each message from the primary ledger), and the message order is kept the same in all ledgers. A consumer could select the primary ledger or one readable ledger to consume messages from.

I have also considered converting messages at the broker level for other motivations (not only for protocol handlers). The purpose of multiple managed ledgers for a topic is to do the message conversion asynchronously. This will cost more storage, but there is no need to sacrifice performance on either the producer or the consumer side.
@BewareMyPower And if we try to introduce the data format processor at the client side again, users will have 3 options. In my opinion,
The proposal looks like an |
@wangjialing218 This looks like it needs to copy data between different managed ledgers? And we would have multiple copies across the primary managed ledger and the other managed ledgers.
I think it's not the same storage as PIP-94, right? PIP 94 converts the data encoding format; it will not touch the user's data. If you want to convert the user's data, the broker needs to deserialize the data, which will bring more GC workload to the broker. Yes, doing the data conversion on the broker side can reduce the network workload, but it increases the CPU workload and the burden on the JVM GC; you might get a more unstable broker.
But there are also many disadvantages: for each managed ledger we need to maintain the metadata, there are more data copies, and more entries are written to bookies.
That's right. But this proposal is mainly for the first case, not the second case. What if the user switches to the Pulsar client later? Without the converter, they have to discard the old messages. If that's not acceptable, they would never consider switching to the Pulsar client.
It's right because in this case users prefer to store messages in Pulsar format. However, as I've said, the message converter is mainly for the

Both
We should cover all use cases. |
Yes, I mean converting the user's data on the broker side. Thanks for your advice; I'll consider the disadvantages and see if there is a solution.
Could we record the primary ledger id in the other managed ledgers' metadata and write only one data copy to bookies? If the data is lost, we read the data from the primary ledger (which has multiple copies) with the same entry id and do the conversion again.
@BewareMyPower Users should avoid using

In any case, if the user wants better performance, they need to make some changes, either broker-side or client-side. Why not consider a more efficient way? In other words, if they change the format to kafka, users are looking forward to higher performance, but after we increase the burden on the broker, will this meet the user's performance requirements? If users set up an incorrect format before, they should do the topic migration, as before we provided the solution.
@codelipenghui The default
It's still the priority comparison I mentioned before: availability vs. performance. We should tell users not to use older Pulsar clients. I accept both cases; we just need to change PIP 94 to
It's nearly impossible. The |
I just thought of a problem. If we added the converter on the client side, only the Java client (>= 2.9.0) would be able to consume these messages. There is still no way for other clients to consume them. We would need to write converters for the other language clients, which could take much effort because we cannot reuse the Java classes.
After discussing with @codelipenghui and @hangc0276, I think it's better to add the converter on the client side. The main reason is that we cannot find a good way to insert the converter in the implementation. Currently, we must convert entries in the callback of (lines 446 to 450 in 3cd5b9e)

After inserting the converter, it could be like:

```java
public synchronized void readEntriesComplete(List<Entry> entries, Object ctx) {
    entries = converterList.convert(entries); // wrapper of converter's methods
    ReadType readType = (ReadType) ctx;
    // ... (rest of the original method)
}
```
I'll open a new PIP to add the message converter for the Java client. For other clients, it may need extra effort. /cc @eolivelli
Makes sense to me. cc @merlimat
Motivation
The initial motivation was from Kafka's protocol handler, i.e. KoP (https://github.com/streamnative/kop). KoP allows a Kafka producer to configure the entry format to `kafka`, which means the messages from the Kafka producer can be written to bookies directly without any conversion between Kafka's message format and Pulsar's message format. This improves the performance significantly. However, it also introduces the limitation that a Pulsar consumer cannot consume this topic because it cannot recognize the entry's format.

This proposal tries to introduce a message converter, which is responsible for converting the buffer of an entry to a format that a Pulsar consumer can recognize. Once a converter is configured, before dispatching messages to Pulsar consumers, the converter checks whether the buffer needs to be converted and then performs the conversion if necessary. We can configure multiple converters because we can configure multiple protocol handlers as well; each protocol handler could write the entry in its own format.
The benefit is, after this change:
Before this change, if we wanted Pulsar consumers to interact with other clients:
This proposal is mainly for protocol handlers because they can access `PersistentTopic` and write bytes to bookies directly. In a rare case, if users want to write something to the topic's ledger directly with a BookKeeper client, the converter can also handle that case.

Goal
This proposal's goal is only adding the message converter at the broker level. Once the related broker configs are enabled, the converters are applied to all topics. This brings an overhead to topics that are only created for Pulsar clients, because we need to check whether the buffer needs to be converted; see the `MessageConverter#accept` method in the next section.

In the future, we can configure the message converters at the namespace level or topic level. We could even configure the message converter for the Pulsar client so that the conversion only happens at the client side and the CPU overhead of the broker can be reduced.
API changes
First, an interface is added under the package `org.apache.pulsar.common.api.raw`.
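The interface body is not included in this excerpt. Based on the `accept`/`convert` methods described in the implementation section and the `init`/`close` lifecycle suggested earlier in this thread, a minimal sketch could look like the following (method names and signatures are assumptions, not the final API):

```java
package org.apache.pulsar.common.api.raw;

import io.netty.buffer.ByteBuf;

public interface MessageConverter {

    // Optional lifecycle hooks, as suggested earlier in this thread, so a
    // converter can set up and release resources.
    default void init() {}

    default void close() {}

    // Return true if this converter recognizes the entry's format and the
    // buffer needs to be converted before dispatching.
    boolean accept(ByteBuf entryBuffer);

    // Convert the entry buffer into a format that Pulsar consumers can
    // recognize. Constraints on reader/writer indexes and refcount still need
    // to be defined (see the discussion above).
    ByteBuf convert(ByteBuf entryBuffer);
}
```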
Then a new configuration is added.
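The configuration snippet is not included here either. Given the `messageConverters` name used in the implementation section, it could be a broker.conf entry along these lines (the converter class name is purely illustrative):

```properties
# Comma-separated list of MessageConverter implementation classes to load at startup.
# Empty (the default) means no conversion is performed before dispatching.
messageConverters=org.example.KafkaMessageConverter
```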
Implementation
For `MessageConverter`, add a class `MessageConverterValidator` to validate whether the implementation is valid.

The implementation is simple. When the broker starts, load all classes that implement the `MessageConverter` interface from the `messageConverters` config. Then we can pass the converters to `ServerCnx`. Each time a dispatcher dispatches messages to a consumer, it will eventually call the `ServerCnx#newMessageAndIntercept` method, in which we can perform the conversion.
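A rough sketch of how the broker could load and initialize the converters at startup, assuming the hypothetical interface above (the `MessageConverterValidator` mentioned earlier would perform similar checks):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.pulsar.common.api.raw.MessageConverter;

public class MessageConverterLoader {

    // Load the classes listed in the messageConverters config, verify that they
    // implement MessageConverter, and initialize each instance.
    public static List<MessageConverter> load(List<String> classNames) throws Exception {
        List<MessageConverter> converters = new ArrayList<>();
        for (String className : classNames) {
            Class<?> clazz = Class.forName(className);
            if (!MessageConverter.class.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(className + " does not implement MessageConverter");
            }
            MessageConverter converter = (MessageConverter) clazz.getDeclaredConstructor().newInstance();
            converter.init(); // lifecycle hook from the interface sketch above
            converters.add(converter);
        }
        return converters;
    }
}
```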
For unit tests, we can test the following converters:

- `RejectAllConverter`: `accept` returns false so that no conversion is performed.
- `EchoConverter`: `accept` returns true and `convert` simply returns the original buffer.
- `BytesConverter`: an example of a real-world converter. The message format has the `MessageMetadata` part with the `entry.format=bytes` property, and the payload part is only the raw bytes without `SingleMessageMetadata`. `BytesConverter#convert` will convert the raw bytes to the format that a Pulsar consumer can recognize.
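For illustration, here is a minimal sketch of the `EchoConverter` described above, again assuming the hypothetical `MessageConverter` interface sketched in the API changes section (not the final API):

```java
import io.netty.buffer.ByteBuf;

import org.apache.pulsar.common.api.raw.MessageConverter;

// Test converter: accepts every entry and returns the buffer unchanged.
public class EchoConverter implements MessageConverter {

    @Override
    public boolean accept(ByteBuf entryBuffer) {
        return true;
    }

    @Override
    public ByteBuf convert(ByteBuf entryBuffer) {
        return entryBuffer;
    }
}
```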