[fix][broker] Fix the publish latency spike from the contention of MessageDeduplication #20647
Conversation
/pulsarbot run-failure-checks
LGTM
https://github.com/apache/pulsar/actions/runs/5373765736/jobs/9748615177?pr=20647 Please help check this failed job.
@@ -455,23 +455,17 @@ public synchronized void producerRemoved(String producerName) {
public synchronized void purgeInactiveProducers() {
Do we need to keep this `synchronized`?
I don't think we need it, and initially I also wanted to remove it in this PR. But since it is not related to the publish latency spike issue, we'd better keep this PR focused on that issue only. I can create a follow-up PR to remove this `synchronized`, with no need to cherry-pick it to the release branches.
After doing a benchmark test
@codelipenghui exception is
@codelipenghui
@mattisonchao I need to use an iterator if ConcurrentHashMap is the final decision.
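For reference, a minimal sketch of what an iterator-based purge over a ConcurrentHashMap could look like (the class name, the timeout value, and the method bodies are illustrative placeholders, not the actual MessageDeduplication code):

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

class InactiveProducerPurgeSketch {
    // producer name -> timestamp of when the producer was last seen
    private final Map<String, Long> inactiveProducers = new ConcurrentHashMap<>();
    // illustrative timeout, not the broker's configured value
    private final long inactiveTimeoutMs = TimeUnit.MINUTES.toMillis(5);

    void purgeInactiveProducers() {
        long minTimestamp = System.currentTimeMillis() - inactiveTimeoutMs;
        Iterator<Map.Entry<String, Long>> it = inactiveProducers.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() < minTimestamp) {
                // ConcurrentHashMap iterators are weakly consistent, so removing
                // through the iterator is safe while other threads update the map.
                it.remove();
            }
        }
    }
}
```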
+1
Nice catch!
…ssageDeduplication (force-pushed from 618e055 to b0cebfd)
Codecov Report
@@ Coverage Diff @@
## master #20647 +/- ##
=============================================
+ Coverage 33.58% 73.12% +39.54%
- Complexity 12127 32016 +19889
=============================================
Files 1613 1867 +254
Lines 126241 138683 +12442
Branches 13770 15240 +1470
=============================================
+ Hits 42396 101410 +59014
+ Misses 78331 29249 -49082
- Partials 5514 8024 +2510
Flags with carried forward coverage won't be shown.
…ssageDeduplication (apache#20647) (cherry picked from commit fa68bf3)
Motivation
This issue occurred with a topic that has many producers: the broker's P99 publish latency increases to hundreds of milliseconds when many producers connect to or disconnect from the topic. This level of latency is unacceptable for a messaging system.
In this case, each producer add or remove operation goes through MessageDeduplication to update a map of inactive producers, regardless of whether message deduplication is enabled or disabled. My initial impression is that if message deduplication is disabled, the operation should not touch MessageDeduplication at all. However, users can enable message deduplication on an active topic, which may be the reason it works this way. @merlimat, do you have any additional context on this? We may also need to find a way to avoid any operations on MessageDeduplication when message deduplication is disabled in the future.
This PR provides a fix without introducing any changes in behavior, which will give us more confidence to cherry-pick it to release branches.
broker_lock_0621_1.html.txt
Modifications
Use ConcurrentHashMap instead of a synchronized HashMap to reduce contention between IO threads (a minimal sketch of the change follows below).
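A minimal sketch of the kind of change this describes, assuming the map tracks producer name to last-seen timestamp (the class name and method bodies are illustrative, not the exact MessageDeduplication code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InactiveProducerTrackerSketch {
    // ConcurrentHashMap lets many IO threads add and remove producers
    // without serializing on a single monitor lock.
    private final Map<String, Long> inactiveProducers = new ConcurrentHashMap<>();

    void producerAdded(String producerName) {
        // A producer that is active again is no longer a purge candidate.
        inactiveProducers.remove(producerName);
    }

    void producerRemoved(String producerName) {
        // Record when the producer disconnected so it can be purged later.
        inactiveProducers.put(producerName, System.currentTimeMillis());
    }
}
```

With per-entry (internally striped) updates instead of one shared lock, a burst of producer connects or disconnects no longer blocks the publish path on a single monitor.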
Here are the results from the benchmark test (a minimal harness sketch follows the legend):
CLHM_CON (ConcurrentOpenHashMap)
CHM_CON (ConcurrentHashMap)
HM_CON (synchronized HashMap)
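The benchmark harness itself is not shown in the PR description; below is a minimal JMH-style sketch of how the CHM_CON and HM_CON variants could be compared under contention (the thread count, key space, and method names are assumptions, and the Pulsar ConcurrentOpenHashMap variant is omitted here):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@State(Scope.Benchmark)
public class InactiveProducerMapBenchmark {

    private final Map<String, Long> chm = new ConcurrentHashMap<>();
    private final Map<String, Long> syncHashMap = Collections.synchronizedMap(new HashMap<>());

    @Benchmark
    @Threads(16) // simulate many IO threads hitting the map at once
    public void chmPutRemove() {
        String producer = "producer-" + ThreadLocalRandom.current().nextInt(1024);
        chm.put(producer, System.currentTimeMillis());
        chm.remove(producer);
    }

    @Benchmark
    @Threads(16)
    public void synchronizedHashMapPutRemove() {
        String producer = "producer-" + ThreadLocalRandom.current().nextInt(1024);
        syncHashMap.put(producer, System.currentTimeMillis());
        syncHashMap.remove(producer);
    }
}
```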
Verifying this change
The existing tests can cover the new changes.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete