Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Revert 5895: fix redeliveryCount #17060

Conversation

michaeljmarshall
Copy link
Member

@michaeljmarshall michaeljmarshall commented Aug 11, 2022

Reverts: #5881

Motivation

The redeliveryCount was introduced in PIP 22 with this PR #2508. It is an extra field on a message that indicates how many times a message has been redelivered. In the original design, it was only incremented for shared subscriptions when the consumer sent REDELIVER_UNACKNOWLEDGED_MESSAGES to the broker.

In #5881, this field's logic changed so that it is incremented each time a broker delivers a message to a consumer (after the initial delivery). The problem with this logic is that it counts messages that are sent to a consumer's receiveQueue, but not actually received by the client application, as "delivered" messages. This is especially problematic for the DLQ implementation because it relies on the counter to track deliveries, and this eager incrementing of the redeliveryCount could lead to fewer retries than an application would like.

This PR returns the broker's behavior to the original state before #5881.

Note that the DLQ logic is only triggered by messages that hit their ack timeout or are negatively acknowledged. This means that in some cases, a message could be delivered many times to a receiveQueue and once to the application and then sent to the DLQ. Given that our DLQ implementation has an intentional preference towards over delivery instead of under delivery, I think this logic should be fixed.

One of the consequences of this PR is that the message filter logic for redelivering messages triggers this logic for incrementing redeliveryCount. See this code here:

if (CollectionUtils.isNotEmpty(entriesToRedeliver)) {
this.subscription.getTopic().getBrokerService().getPulsar().getExecutor()
.schedule(() -> {
// simulate the Consumer rejected the message
subscription
.redeliverUnacknowledgedMessages(consumer, entriesToRedeliver);
}, serviceConfig.getDispatcherEntryFilterRescheduledMessageDelay(), TimeUnit.MILLISECONDS);
}

I'll need feedback from someone more familiar with message filtering to understand if this is a problematic change. If it is, I think we might need to revisit the logic in filterEntriesForConsumer.

Modifications

Verifying this change

This change includes modifications to tests as well as existing test coverage.

Does this pull request potentially affect one of the following parts:

This change is a break in current behavior, so I will send an email to the dev mailing list: https://lists.apache.org/thread/ts9d6zbtlz3y5xtv7p0c3dslk0vljpj2.

Documentation

  • doc-not-needed

@codelipenghui codelipenghui added the release/important-notice The changes which are important should be mentioned in the release note label Aug 15, 2022
@codelipenghui codelipenghui added this to the 2.12.0 milestone Aug 15, 2022
Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
I support this revert.

I have checked the server side filtering feature code (PIP-105)
in case of filtered messages (REJECT outcome from the filter) we call PersistentSubscription#acknowledgeMessage(List<Position> positions, AckType ackType, Map<String, Long> properties

public void acknowledgeMessage(List<Position> positions, AckType ackType, Map<String, Long> properties) {

in case of RESCHEDULED messages we call PersistentSubscription#redeliverUnacknowledgedMessages

public void redeliverUnacknowledgedMessages(Consumer consumer, List<PositionImpl> positions) {

that goes down to redeliveryTracker.addIfAbsent(position);
that (IIUC) it does not increment the delivery count

trackerCache.putIfAbsent(positionImpl.getLedgerId(), positionImpl.getEntryId(), 0, 0L);

we could add more testing on the delivery count in case of RESCHEDULED messages (we can improve the existing tests).
by the way, RESCHEDULED should not increment the delivery count because the message is not sent to the Consumer

@michaeljmarshall
Copy link
Member Author

@eolivelli - your analysis matches my understanding as well. I agree that incrementing the redelivery count is not the right design for the message filters. I am going to merge this as is and then we can work on fixing that design in a future PR.

@michaeljmarshall michaeljmarshall merged commit 2fd3509 into apache:master Aug 24, 2022
@michaeljmarshall michaeljmarshall deleted the only-inc-redelivery-on-explicit-redelivery branch August 24, 2022 17:04
michaeljmarshall added a commit to datastax/pulsar that referenced this pull request Aug 24, 2022
Reverts: apache#5881

### Motivation

The `redeliveryCount` was introduced in [PIP 22](https://github.com/apache/pulsar/wiki/PIP-22%3A-Pulsar-Dead-Letter-Topic) with this PR apache#2508. It is an extra field on a message that indicates how many times a message has been redelivered. In the original design, it was only incremented for shared subscriptions when the consumer sent `REDELIVER_UNACKNOWLEDGED_MESSAGES` to the broker.

In apache#5881, this field's logic changed so that it is incremented each time a broker delivers a message to a consumer (after the initial delivery). The problem with this logic is that it counts messages that are sent to a consumer's `receiveQueue`, but not actually received by the client application, as "delivered" messages. This is especially problematic for the DLQ implementation because it relies on the counter to track deliveries, and this eager incrementing of the `redeliveryCount` could lead to fewer retries than an application would like.

This PR returns the broker's behavior to the original state before apache#5881.

Note that the DLQ logic is only triggered by messages that hit their ack timeout or are negatively acknowledged. This means that in some cases, a message could be delivered many times to a `receiveQueue` and once to the application and then sent to the DLQ. Given that our DLQ implementation has an intentional preference towards over delivery instead of under delivery, I think this logic should be fixed.

One of the consequences of this PR is that the message filter logic for redelivering messages triggers this logic for incrementing `redeliveryCount`. See this code here: https://github.com/apache/pulsar/blob/b1a29b520d34d60e60160e3a7b9b0e26926063ee/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractBaseDispatcher.java#L198-L206

I'll need feedback from someone more familiar with message filtering to understand if this is a problematic change. If it is, I think we might need to revisit the logic in `filterEntriesForConsumer`.

### Modifications

* Revert the relevant changes from apache#5895. I kept the test that was added in the PR and modified the assertion.
* Fix test assertion ordering and modify expected value to align with new paradigm.

### Verifying this change

This change includes modifications to tests as well as existing test coverage.

### Does this pull request potentially affect one of the following parts:

This change is a break in current behavior, so I will send an email to the dev mailing list: https://lists.apache.org/thread/ts9d6zbtlz3y5xtv7p0c3dslk0vljpj2.

### Documentation

- [x] `doc-not-needed`

(cherry picked from commit 2fd3509)
Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Aug 29, 2022
Reverts: apache#5881

### Motivation

The `redeliveryCount` was introduced in [PIP 22](https://github.com/apache/pulsar/wiki/PIP-22%3A-Pulsar-Dead-Letter-Topic) with this PR apache#2508. It is an extra field on a message that indicates how many times a message has been redelivered. In the original design, it was only incremented for shared subscriptions when the consumer sent `REDELIVER_UNACKNOWLEDGED_MESSAGES` to the broker.

In apache#5881, this field's logic changed so that it is incremented each time a broker delivers a message to a consumer (after the initial delivery). The problem with this logic is that it counts messages that are sent to a consumer's `receiveQueue`, but not actually received by the client application, as "delivered" messages. This is especially problematic for the DLQ implementation because it relies on the counter to track deliveries, and this eager incrementing of the `redeliveryCount` could lead to fewer retries than an application would like.

This PR returns the broker's behavior to the original state before apache#5881.

Note that the DLQ logic is only triggered by messages that hit their ack timeout or are negatively acknowledged. This means that in some cases, a message could be delivered many times to a `receiveQueue` and once to the application and then sent to the DLQ. Given that our DLQ implementation has an intentional preference towards over delivery instead of under delivery, I think this logic should be fixed.

One of the consequences of this PR is that the message filter logic for redelivering messages triggers this logic for incrementing `redeliveryCount`. See this code here: https://github.com/apache/pulsar/blob/b1a29b520d34d60e60160e3a7b9b0e26926063ee/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractBaseDispatcher.java#L198-L206

I'll need feedback from someone more familiar with message filtering to understand if this is a problematic change. If it is, I think we might need to revisit the logic in `filterEntriesForConsumer`.

### Modifications

* Revert the relevant changes from apache#5895. I kept the test that was added in the PR and modified the assertion.
* Fix test assertion ordering and modify expected value to align with new paradigm.

### Verifying this change

This change includes modifications to tests as well as existing test coverage.

### Does this pull request potentially affect one of the following parts:

This change is a break in current behavior, so I will send an email to the dev mailing list: https://lists.apache.org/thread/ts9d6zbtlz3y5xtv7p0c3dslk0vljpj2.

### Documentation
  
- [x] `doc-not-needed`
@asafm
Copy link
Contributor

asafm commented Oct 3, 2022

I'm "riding" on this topic to ask a tangent question. We have "redeliveryCount" meta field in the message as written in this PR description, and we also have RECONSUMETIMES as written here. What's the relationship between those?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs release/important-notice The changes which are important should be mentioned in the release note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants