Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][client] Fix multi-topic consumer stuck after redeliver messages #18491

Merged

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Nov 16, 2022

Fixes #18480

Motivation

The multi-topic consumer's mechanism for pulling messages looks like this:

  • If the incoming queue is not full, do loop executes receiveMessageFromConsumer.
  • Call resumeReceivingFromPausedConsumersIfNeeded when the incoming queue is full.
    • If the incoming queue is (High light)half full, stop the loop
    • else call receiveMessageFromConsumer
  • If we call redeliver after the loop task stopped, will trigger a new resumeReceivingFromPausedConsumersIfNeeded.

But there has a race condition when doing redeliver: and

  • the task-redeliver runs asynchrony.
    • (High light) race condition-1 clear incoming queue
  • the task which triggers a new resumeReceivingFromPausedConsumersIfNeeded runs in another thread.
    • (High light) race condition-2 ensure incoming queue is not half full,then call receiveMessageFromConsumer

Modifications

Make this two task above runs in the same thread.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Nov 16, 2022
@poorbarcode poorbarcode force-pushed the fix/multi_consumer_receive_null branch from 75581fe to 5af3256 Compare November 16, 2022 06:35
@poorbarcode poorbarcode changed the title [do not merge] [fix] [client] Multi topic consumer receive null even if has backlog [fix] [client] Multi topic consumer receive null even if has backlog Nov 16, 2022
@codelipenghui codelipenghui added this to the 2.12.0 milestone Nov 16, 2022
Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

The PR title should change to

[fix][client] Fix multi-topic consumer stuck after redeliver messages

* see https://github.com/apache/pulsar/pull/18491
*/
@Test
public void testMultiTopicConsumerConcurrentRedeliverAndReceive() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried the test, but it seems unable to cover the change or reproduce the issue without this fix.

After changing the test like this

diff --git a/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/NegativeAcksTest.java b/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/NegativeAcksTest.java
index 56f2d3f29ec..c476cb516aa 100644
--- a/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/NegativeAcksTest.java
+++ b/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/NegativeAcksTest.java
@@ -474,9 +474,12 @@ public class NegativeAcksTest extends ProducerConsumerBase {
 
         Awaitility.await().until(() -> consumer.incomingMessages.size() == receiverQueueSize);
 
+        // For testing the race condition of issue #18491
+        // We need to inject a delay for the pinned internal thread
+        injectDelayToInternalThread(consumer, 1000L);
         consumer.redeliverUnacknowledgedMessages();
-        waitMultiTopicConsumerRedeliverFinish(consumer);
-
+        // Make sure the message redelivery is completed. The incoming queue will be cleaned up during the redelivery.
+        waitForAllTasksForInternalThread(consumer);
         Set<Integer> receivedMsgs = new HashSet<>();
         for (;;){
             Message<Integer> msg = consumer.receive(2, TimeUnit.SECONDS);
@@ -491,7 +494,18 @@ public class NegativeAcksTest extends ProducerConsumerBase {
     /**
      * If the task after "redeliver" finish, means task-redeliver finish.
      */
-    private void waitMultiTopicConsumerRedeliverFinish(MultiTopicsConsumerImpl consumer){
+    private void injectDelayToInternalThread(MultiTopicsConsumerImpl<?> consumer, long delayInMillis){
+        ExecutorService internalPinnedExecutor =
+                WhiteboxImpl.getInternalState(consumer, "internalPinnedExecutor");
+        internalPinnedExecutor.execute(() -> {
+            try {
+                Thread.sleep(delayInMillis);
+            } catch (InterruptedException ignore) {
+            }
+        });
+    }
+
+    private void waitForAllTasksForInternalThread(MultiTopicsConsumerImpl<?> consumer) {
         ExecutorService internalPinnedExecutor =
                 WhiteboxImpl.getInternalState(consumer, "internalPinnedExecutor");
         CompletableFuture<Void> taskAfterRedeliver = new CompletableFuture<>();

The issue can be reproduced without the fix and passed with the fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is not a 100% hit test problem, I tried to fix it

截屏2022-11-16 16 55 42

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, maybe it's better to add a sleep before receive() messages from the consumer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already add a sleep before receiving messages.

I have tried to increase the probability of test failure, but can't get to 100%. Because that:

  • method receiveMessageFromConsumer will call consumerImpl.batchReceiveAsync and consumerImpl.receiveAsync .
  • When consumerImpl.batchReceiveAsync and consumerImpl.receiveAsync is finished, it will trigger another receiveMessageFromConsumer.
  • consumerImpl.batchReceiveAsync and consumerImpl.receiveAsync calls each other
    • if receive messages is empty call consumerImpl.receiveAsync
    • else call consumerImpl.batchReceiveAsync
  • consumerImpl.batchReceiveAsync and consumerImpl.receiveAsync executed asynchronously.

@BewareMyPower
Copy link
Contributor

BewareMyPower commented Nov 16, 2022

Could you also explain why the flaky test is caused by this reason? I added some logs:

diff --git a/pulsar-client/src/main/java/org/apache/pulsar/client/impl/MultiTopicsConsumerImpl.java b/pulsar-client/src/main/java/org/apache/pulsar/client/im
pl/MultiTopicsConsumerImpl.java
index 1853be55f9..809a590451 100644
--- a/pulsar-client/src/main/java/org/apache/pulsar/client/impl/MultiTopicsConsumerImpl.java
+++ b/pulsar-client/src/main/java/org/apache/pulsar/client/impl/MultiTopicsConsumerImpl.java
@@ -705,6 +705,7 @@ public class MultiTopicsConsumerImpl<T> extends ConsumerBase<T> {
     @Override
     public void redeliverUnacknowledgedMessages() {
         internalPinnedExecutor.execute(() -> {
+            log.info("XYZ rededeliverUnacked");
             CONSUMER_EPOCH.incrementAndGet(this);
             consumers.values().stream().forEach(consumer -> {
                 consumer.redeliverUnacknowledgedMessages();
@@ -713,6 +714,7 @@ public class MultiTopicsConsumerImpl<T> extends ConsumerBase<T> {
             clearIncomingMessages();
             unAckedMessageTracker.clear();
         });
+        log.info("XYZ resumeReceiving");
         resumeReceivingFromPausedConsumersIfNeeded();
     }

And in my local env, resumeReceivingFromPausedConsumersIfNeeded and clearIncomingMessages were all executed in the same thread:

2022-11-16T17:24:38,714 - INFO  - [pulsar-client-internal-39-1:MultiTopicsConsumerImpl@717] - XYZ resumeReceiving
2022-11-16T17:24:38,718 - INFO  - [pulsar-client-internal-39-1:MultiTopicsConsumerImpl@708] - XYZ rededeliverUnacked

It's because redeliverUnacknowledgedMessages is called in the callback of receiveAsync, which should be run in the internal executor as well.

@poorbarcode poorbarcode changed the title [fix] [client] Multi topic consumer receive null even if has backlog [fix][client] Fix multi-topic consumer stuck after redeliver messages Nov 16, 2022
@poorbarcode
Copy link
Contributor Author

Hi @BewareMyPower

And in my local env, resumeReceivingFromPausedConsumersIfNeeded and clearIncomingMessages were all executed in the same thread:

if resumeReceivingFromPausedConsumersIfNeeded and clearIncomingMessages runs in the same thread, the cmd resumeReceivingFromPausedConsumersIfNeeded will always executed before clearIncomingMessages, this leads to a higher probability of problems.

@BewareMyPower
Copy link
Contributor

this leads to a higher probability of problems.

Could you explain why? Or is there a way to make testFailoverConsumerBatchCumulateAck more likely to fail in local env? This test is hard to fail in local env but easy to fail in CI.

@poorbarcode
Copy link
Contributor Author

Hi @BewareMyPower

this leads to a higher probability of problems.

Could you explain why?

The reason in Menu-Motivation

Or is there a way to make testFailoverConsumerBatchCumulateAck more likely to fail in local env? This test is hard to fail in local env but easy to fail in CI.

I guess that in CI the incoming queue can be filled quickly enough before the receive has not started. We can't make it runs faster locally, but we can make receive execution slower, you can change the test like this which makes the test fail in the local env easily:

截屏2022-11-17 00 18 32

@BewareMyPower
Copy link
Contributor

Got it. Thanks. The motivation part is a little confusing. The root cause is that the consumer won't be removed from pausedConsumers if the incoming messages are greater than half of the receiver queue size. Therefore, resumeReceivingFromPausedConsumersIfNeeded must be called after clearIncomingMessages.

BTW, it's not a race condition in redeliverUnacknowledgedMessages for previous code:

    public void redeliverUnacknowledgedMessages() {
        internalPinnedExecutor.execute(() -> {
            /* ... */
            clearIncomingMessages();
            unAckedMessageTracker.clear();
        });
        resumeReceivingFromPausedConsumersIfNeeded();
    }

internalPinnedExecutor is a single thread executor, when redeliverUnacknowledgedMessages is called in internalPinnedExecutor, resumeReceivingFromPausedConsumersIfNeeded will always be called before clearIncomingMessages().

Copy link
Contributor

@BewareMyPower BewareMyPower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify the tests like

        ExecutorService internalPinnedExecutor =
                WhiteboxImpl.getInternalState(consumer, "internalPinnedExecutor");
        internalPinnedExecutor.execute(() -> {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        consumer.redeliverUnacknowledgedMessages();
        // Make sure the message redelivery is completed. The incoming queue will be cleaned up during the redelivery.
        internalPinnedExecutor.submit(() -> {}).get();

Because we reused the internal executor. Then we can remove injectDelayToInternalThread and waitForAllTasksForInternalThread and the code will be more clear.

Copy link
Contributor

@BewareMyPower BewareMyPower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid sleeping by calling redeliverUnacknowledgedMessages in the internal pinned executor.

        ExecutorService internalPinnedExecutor =
                WhiteboxImpl.getInternalState(consumer, "internalPinnedExecutor");
        CountDownLatch redeliverDone = new CountDownLatch(1);
        internalPinnedExecutor.execute(() -> {
            consumer.redeliverUnacknowledgedMessages();
            redeliverDone.countDown();
        });
        redeliverDone.await();
        internalPinnedExecutor.submit(() -> {}).get();

@poorbarcode
Copy link
Contributor Author

poorbarcode commented Nov 17, 2022

Hi @BewareMyPower

We can avoid sleeping by calling redeliverUnacknowledgedMessages in the internal pinned executor.

Yes, you are right. This is definitely a good idea, and I tried it in the previous commit, but it didn't make more present of failure than sleep(Here's why:#18491 (comment)), so I ended up using sleep instead.

截屏2022-11-18 00 08 17

@BewareMyPower BewareMyPower dismissed congbobo184’s stale review November 18, 2022 05:03

The issue mentioned is not related to this PR

@codecov-commenter
Copy link

codecov-commenter commented Nov 18, 2022

Codecov Report

Merging #18491 (6c391d9) into master (7975023) will increase coverage by 1.34%.
The diff coverage is 57.57%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #18491      +/-   ##
============================================
+ Coverage     45.62%   46.96%   +1.34%     
- Complexity    10075    10392     +317     
============================================
  Files           697      697              
  Lines         68024    68015       -9     
  Branches       7293     7285       -8     
============================================
+ Hits          31033    31946     +913     
+ Misses        33413    32498     -915     
+ Partials       3578     3571       -7     
Flag Coverage Δ
unittests 46.96% <57.57%> (+1.34%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...he/pulsar/client/impl/MultiTopicsConsumerImpl.java 22.26% <0.00%> (ø)
...nt/impl/PulsarClientImplementationBindingImpl.java 72.41% <ø> (-0.47%) ⬇️
...ar/client/impl/conf/ConsumerConfigurationData.java 92.55% <ø> (+2.12%) ⬆️
...he/pulsar/client/impl/TypedMessageBuilderImpl.java 30.76% <59.37%> (+3.56%) ⬆️
.../apache/pulsar/broker/admin/impl/PackagesBase.java 54.12% <0.00%> (-13.77%) ⬇️
...ce/schema/validator/StructSchemaDataValidator.java 52.38% <0.00%> (-9.53%) ⬇️
...rg/apache/pulsar/broker/web/PulsarWebResource.java 55.34% <0.00%> (-2.95%) ⬇️
...pache/pulsar/broker/admin/v2/PersistentTopics.java 71.68% <0.00%> (-2.14%) ⬇️
.../org/apache/pulsar/broker/admin/v2/Namespaces.java 57.36% <0.00%> (-1.93%) ⬇️
...he/pulsar/broker/admin/v2/NonPersistentTopics.java 60.64% <0.00%> (-1.39%) ⬇️
... and 70 more

@BewareMyPower BewareMyPower merged commit 7a93ff9 into apache:master Nov 18, 2022
congbobo184 pushed a commit that referenced this pull request Nov 18, 2022
congbobo184 pushed a commit that referenced this pull request Nov 18, 2022
@poorbarcode poorbarcode deleted the fix/multi_consumer_receive_null branch November 18, 2022 08:43
congbobo184 pushed a commit that referenced this pull request Nov 18, 2022
congbobo184 pushed a commit that referenced this pull request Nov 27, 2022
liangyepianzhou pushed a commit that referenced this pull request Dec 5, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Dec 6, 2022
@congbobo184 congbobo184 added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Dec 8, 2022
codelipenghui pushed a commit that referenced this pull request Dec 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky-test: NegativeAcksTest.testFailoverConsumerBatchCumulateAck
6 participants