Acked messages unexpectedly redelivered when others are negatively acked #5969

gmethvin · 2019-12-31T04:48:45Z

Describe the bug

We've encountered an issue in which acknowledged messages are redelivered one or more times after other messages are negatively acknowledged. This seems to occur when messages are produced in batches. This happens in the absence of any known broker or connection failures.

To Reproduce

I've modified the NegativeAcksTest to test for the correct behavior here: master...gmethvin:negative-ack-duplicates

As the test demonstrates, in some configurations positively acknowledged messages are redelivered. This is similar to a situation we see in production.

Expected behavior

Only the negatively acknowledged messages should be redelivered. Positively acknowledged messages should not be redelivered, at least not in a typical situation with no failures.

We produce messages in batches, but both the APIs and the documentation suggest that both acks and negative acks act on a per-message level. If negative acks act on batches, then the APIs and documentation should be changed to clearly indicate that.


sijie · 2020-01-01T01:21:55Z

I think the fundamental problem behind #5891 and #5969 is that the cursor is tracked at the batch level, not at the message level. Hence failures can result in the whole batch being redelivered. To address this, we need to improve cursor tracking on the broker side so that it works at the message level.

Loop in @jiazhai @codelipenghui in the discussion.
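
To illustrate the batch-level tracking described above, here is a minimal toy model. It is not Pulsar code; the class and method names are hypothetical, and it assumes an entry counts as acked only once every message in the batch has been acked.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of one batch entry whose ack state is tracked per entry only (hypothetical, not Pulsar code). */
class EntryLevelTracking {
    /**
     * Returns the batch indexes the consumer would receive again on redelivery
     * when the cursor only knows whether the whole entry is acked.
     */
    static List<Integer> redeliveredIndexes(int batchSize, BitSet ackedByClient) {
        List<Integer> redelivered = new ArrayList<>();
        // Assumption: the entry counts as acked only when every message in it is acked.
        boolean entryAcked = ackedByClient.cardinality() == batchSize;
        if (!entryAcked) {
            // No per-message state exists at this level, so the whole entry comes back.
            for (int i = 0; i < batchSize; i++) {
                redelivered.add(i);
            }
        }
        return redelivered;
    }

    public static void main(String[] args) {
        BitSet acked = new BitSet();
        acked.set(0);
        acked.set(1); // index 2 was negatively acked
        System.out.println(redeliveredIndexes(3, acked)); // prints [0, 1, 2]
    }
}
```

In this model, nacking a single message out of three brings back the two acked ones as well, which matches the symptom reported in this issue.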

jerrypeng · 2020-01-01T21:16:14Z

@sijie @gmethvin we could track cursors at the batch-index level, i.e. at the granularity of LedgerId:EntryId:BatchIndex. However, there are performance implications if we want to redeliver only the messages that are NOT ACKed or NACKed. Doing so would require us to deserialize BK entries on the broker side, filter the messages in the entry, re-serialize the messages, and send them to the client, and finally the client would have to deserialize them. This would entail more latency and 2X the CPU effort for serialization and deserialization. We have generally avoided entry inspection in the broker for this reason.

The question I have is: if a user really wants to get only the messages that are NOT ACKed or NACKed, can the user just turn off batching? Or if the user wants fewer duplicates from redelivered messages/batches, can the user just tune the batch size to be smaller? Are these good enough workflows or knobs to turn for users to satisfy these use cases?

Or is this also just a documentation / education issue? The docs are NOT very clear on what the expected behavior of ACKs or NACKs is when batching is enabled.

@merlimat can you also chime in?

sijie · 2020-01-02T01:43:46Z

@jerrypeng there are many ways to avoid serialization and deserialization. The broker can deliver a bit of metadata along with the batches to tell the client which messages are acked. Clients can then filter out the already acked messages during deserialization on the client side.

The ability to control cursor tracking at the message level is a requirement for transactions anyway. We'd resolve the problem in a single place.
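
A rough sketch of the metadata idea above, under stated assumptions: the acked batch indexes are encoded as a `long[]` bitmap that travels with the untouched batch payload, and the client skips those indexes while unpacking. `AckSetFilter`, `encodeAckSet`, and `filterUnacked` are hypothetical names for illustration, not Pulsar APIs.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Sketch of client-side filtering driven by ack metadata sent with a batch (hypothetical names). */
class AckSetFilter {
    /** Broker side: encode the acked batch indexes as a compact long[] to attach to the entry. */
    static long[] encodeAckSet(BitSet ackedIndexes) {
        return ackedIndexes.toLongArray();
    }

    /** Client side: keep only the payloads whose batch index is not marked as acked. */
    static List<String> filterUnacked(List<String> batchPayloads, long[] ackSet) {
        BitSet acked = BitSet.valueOf(ackSet);
        List<String> toDeliver = new ArrayList<>();
        for (int i = 0; i < batchPayloads.size(); i++) {
            if (!acked.get(i)) {
                toDeliver.add(batchPayloads.get(i));
            }
        }
        return toDeliver;
    }
}
```

With this shape the broker never inspects or rewrites the batch itself; it only attaches the bitmap, and the full batch still crosses the network.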

jerrypeng · 2020-01-02T16:37:02Z

@sijie isn't that design just trading CPU usage for network usage? If you don't filter on the broker side and resend the whole batch to be filtered on the client side, there will be network implications, especially if batches are big.

sijie · 2020-01-03T00:09:31Z

@jerrypeng yes, it is a trade-off here: whose resources are used for this task. However, I think the core problem here isn't the resource problem so far. It is a semantic issue. Redelivering batches is mostly okay. But people don't know which messages have been acknowledged when a batch is redelivered. The semantic issue is what most people care about first.

gmethvin · 2020-01-03T01:17:54Z

I agree @sijie. The main issue is that the API semantics don't match what users expect. If the negativeAcknowledge API acts on a single message, then it is a very surprising behavior for all the messages in the batch to be redelivered. If on the other hand the negativeAcknowledge API applies to a batch, then the current behavior would be more understandable. I would argue that single-message semantics are much more useful for the user though.

cdbartholomew · 2020-01-03T17:35:35Z

@jerrypeng @sijie @gmethvin I agree that redelivering batches is OK in most cases. Redelivery of unacked/nacked messages is the exception, not the rule, but when it occurs it should behave in a way that makes sense to the user of the API.

It's not that hard for the client to filter out the already acked messages from the batch. It is already keeping track of this so it knows when it can ack the batch back to broker. It's just a matter of using this information to filter received messages.

That approach doesn't cover all failure cases (e.g. if the client restarts), but in the case where the application is NACKing a message, it would give reasonable behavior. Plus it will reduce the number of duplicate messages that get sent to the client when using batch messages.

@zzzming and I are happy to work on a PR for this if everyone agrees this is the right approach.
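
A minimal sketch of the client-side filtering proposed above, assuming the consumer keeps an in-memory bitmap of acked indexes per batch entry. `LocalBatchAckTracker` and its methods are hypothetical and are not the actual Pulsar client implementation.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of an in-memory, per-consumer record of acked batch indexes (hypothetical, not the Pulsar client). */
class LocalBatchAckTracker {
    // Key is "ledgerId:entryId"; the value marks the batch indexes acked so far.
    private final Map<String, BitSet> ackedByEntry = new HashMap<>();

    /** Records an ack and reports whether the whole batch can now be acked to the broker. */
    boolean ack(long ledgerId, long entryId, int batchIndex, int batchSize) {
        String key = ledgerId + ":" + entryId;
        BitSet acked = ackedByEntry.computeIfAbsent(key, k -> new BitSet(batchSize));
        acked.set(batchIndex);
        if (acked.cardinality() == batchSize) {
            ackedByEntry.remove(key); // nothing left to remember for this entry
            return true;
        }
        return false;
    }

    /** Returns the batch indexes of a redelivered entry that the application has not acked yet. */
    List<Integer> indexesToDeliver(long ledgerId, long entryId, int batchSize) {
        BitSet acked = ackedByEntry.getOrDefault(ledgerId + ":" + entryId, new BitSet());
        List<Integer> pending = new ArrayList<>();
        for (int i = acked.nextClearBit(0); i < batchSize; i = acked.nextClearBit(i + 1)) {
            pending.add(i);
        }
        return pending;
    }
}
```

Because the state lives only in the consumer's memory, a freshly created tracker (for example after a client restart) reports every index as pending, which is the failure case noted above.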

jiazhai · 2020-01-04T01:17:01Z

Oh, since it is also related to the transaction implementation, Penghui and I are already writing a PIP for this. We are going to share the PIP soon.

Master issue: #6253
Fixes #5969

### Motivation

Add support for acknowledging batch messages at the local index level. It can be disabled on the broker side by setting batchIndexAcknowledgeEnable=false in broker.conf.

PIP-54 documentation will be created soon.

### Modifications

1. The managed cursor supports tracking and persisting the local index of batch messages.
2. The client supports sending batch index acks to the broker.
3. Batch messages are dispatched to the client with index ack information.
4. The client skips the acked indexes.

### Verifying this change

New unit tests added


gmethvin added the type/bug label Dec 31, 2019

gmethvin mentioned this issue Dec 31, 2019

Dead letters incorrect processed for messages batch #5891

Closed

jiazhai added the triage/week-1 label Jan 2, 2020

sijie mentioned this issue Jan 2, 2020

ISSUE-5969: Acked messages unexpectedly redelivered when others are negatively acked streamnative/pulsar-archived#522

Closed

zzzming mentioned this issue Jan 4, 2020

[Issue 5969] prevent redelivery of acked batch message at the client api #5990

Closed

codelipenghui mentioned this issue Jan 14, 2020

[PIP-54] Support acknowledgment at batch index level #6052

Merged

codelipenghui mentioned this issue Feb 7, 2020

Acknowledgement for batch message local index #6253

Closed

sijie mentioned this issue Feb 7, 2020

ISSUE-6253: Acknowledgement for batch message local index streamnative/pulsar-archived#669

Closed

lhotari mentioned this issue May 4, 2020

Negative acknowledgement doesn't remove the message id from UnAckedMessageTracker when message id is instance of BatchMessageIdImpl #6869

Closed

sijie closed this as completed in #6052 Jun 2, 2020

KannarFr mentioned this issue Jul 28, 2021

All batch is redelivered even if some messages are acked CleverCloud/pulsar4s#345

Closed
