Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Deprecated aggregatePublisherStatsByProducerName config #18254

Closed
wants to merge 1 commit into from

Conversation

heesung-sn
Copy link
Contributor

@heesung-sn heesung-sn commented Oct 28, 2022

Motivation

The index-based publisher stat aggregation(configured by aggregatePublisherStatsByProducerName=false, default) can burst memory or wrongly aggregate publisher metrics if each partition stat returns a different size or order of the publisher stat list.
In the worst case, if there are many partitions and publishers created and closed concurrently, the current code can create PublisherStatsImpl objects exponentially, and this can cause a high GC time or OOM.

Issue Code reference:

2c428f7#diff-02e50674125a597f8ae3405a884590759f2fdaa10104cea511d5ea44b6ff6490R224-R247

Modifications

  • Deprecated the aggregatePublisherStatsByProducerName broker config because the default, the index-based aggregation is inherently wrong in a highly concurrent producer environment(where the order and size of the publisher stat list are not guaranteed to be the same). The publisher stats need to be aggregated by a unique key, the producer name(aggregatePublisherStatsByProducerName=true).
  • If PublisherStatsImpl.publisherName happens to be null, it will aggregate it with the stat object with "null" name. (However, this is unlikely because the latest version of the brokers generates a globally unique name If not given.)
  • Use HashMap instead List to aggregate publisher stats.
  • Removed sorting from getPublisher(), as this can be expensive for a large number of publishers.
  • Simplified the stat aggregation code.

Verifying this change

  • Make sure that the change passes the CI checks.

This change added unit tests.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • [] doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: heesung-sn#13

… stat aggregation (deprecated aggregatePublisherStatsByProducerName)
@github-actions github-actions bot added the doc-required Your PR changes impact docs and you will update later. label Oct 28, 2022
@codelipenghui codelipenghui added this to the 2.12.0 milestone Oct 31, 2022
@Technoboy- Technoboy- closed this Nov 1, 2022
@Technoboy- Technoboy- reopened this Nov 1, 2022
public boolean supportsPartialProducer;
/** Whether partial producer is supported at client. supportsPartialProducer is true always. */
@Deprecated
public boolean supportsPartialProducer = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why still keep this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The getter of this member var is already exposed to the client interface. Are we allowed to make a breaking change?

@@ -44,6 +44,7 @@ public interface PublisherStats {
long getProducerId();

/** Whether partial producer is supported at client. */
@Deprecated
boolean isSupportsPartialProducer();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not remove this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already exposed to the client interface. Are we allowed to make a breaking change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the correct change. If it has already been released, we cannot remove it without first deprecating it. I don't know that we have documented guidance on how long to keep methods like this marked as deprecated.

@codecov-commenter
Copy link

Codecov Report

Merging #18254 (6228a5c) into master (6c65ca0) will increase coverage by 3.96%.
The diff coverage is 35.90%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #18254      +/-   ##
============================================
+ Coverage     34.91%   38.88%   +3.96%     
- Complexity     5707     8307    +2600     
============================================
  Files           607      685      +78     
  Lines         53396    67330   +13934     
  Branches       5712     7215    +1503     
============================================
+ Hits          18644    26180    +7536     
- Misses        32119    38132    +6013     
- Partials       2633     3018     +385     
Flag Coverage Δ
unittests 38.88% <35.90%> (+3.96%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...org/apache/bookkeeper/mledger/LedgerOffloader.java 0.00% <ø> (ø)
...che/bookkeeper/mledger/LedgerOffloaderFactory.java 0.00% <ø> (ø)
...pache/bookkeeper/mledger/LedgerOffloaderStats.java 0.00% <ø> (ø)
...ookkeeper/mledger/LedgerOffloaderStatsDisable.java 0.00% <ø> (ø)
...a/org/apache/bookkeeper/mledger/ManagedCursor.java 85.71% <ø> (ø)
...che/bookkeeper/mledger/ManagedLedgerException.java 41.17% <ø> (ø)
...bookkeeper/mledger/ManagedLedgerFactoryConfig.java 86.66% <ø> (ø)
...g/apache/bookkeeper/mledger/ManagedLedgerInfo.java 22.22% <ø> (ø)
...he/bookkeeper/mledger/OffloadedLedgerMetadata.java 0.00% <ø> (ø)
...ava/org/apache/bookkeeper/mledger/ScanOutcome.java 0.00% <ø> (ø)
... and 729 more

@poorbarcode
Copy link
Contributor

/pulsarbot rerun-failure-checks

@equanz
Copy link
Contributor

equanz commented Nov 7, 2022

In my understanding, old partitioned producer clients are not aggregated by producerName completely. So, this PR introduce breaking changes to them.
Partitioned producers have the same producerName between partitions as default from #10279 commit.
Therefore, I kept a list-indexed aggregation for backward compatibility.
#12401
I think we should not introduce breaking changes, at least between the same major versions.

In addition, I think this approach doesn't deprecate a feature but removes a feature.
Users can't disable aggregatePublisherStatsByProducerName from config.

@heesung-sn
Copy link
Contributor Author

Yes, this pr is not backward-compatible to deprecate the broken feature.

@equanz
Copy link
Contributor

equanz commented Nov 11, 2022

For example, when enabling aggregatePublisherStatsByProducerName in the v2.10.1 server

% grep 'aggregatePublisherStatsByProducerName' conf/standalone.conf
aggregatePublisherStatsByProducerName=true

and create the partitioned producer to the partitioned topic by the v2.8.4 java client,

 ./bin/pulsar-perf produce -r 1 -s 10 persistent://public/default/pt
...
11:28:34.053 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Starting Pulsar perf producer with config: {
...
  "producerName" : null,
...
11:28:34.969 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-0] [standalone-0-1] Created producer on cnx [id: 0x0e5ab2e4, L:/127.0.0.1:49990 - R:localhost/127.0.0.1:6650]
11:28:34.992 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-1] [standalone-0-0] Created producer on cnx [id: 0x32a3290d, L:/127.0.0.1:49989 - R:localhost/127.0.0.1:6650]
...

then the broker returns partitioned topic stats like the one below.

% ./bin/pulsar-admin topics partitioned-stats persistent://public/default/pt
...
  "publishers" : [ {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : false
  } ],
...

However, when running with your fixes on the server side and creating a producer to a partitioned topic by the v2.8.4 java client,

% ./bin/pulsar-perf produce -r 1 -s 10 persistent://public/default/pt
...
  "producerName" : null,
...
11:30:37.289 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-1] [standalone-0-1] Created producer on cnx [id: 0x393b0652, L:/127.0.0.1:50035 - R:localhost/127.0.0.1:6650]
11:30:37.307 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-0] [standalone-0-0] Created producer on cnx [id: 0x9b73aa95, L:/127.0.0.1:50034 - R:localhost/127.0.0.1:6650]
...

then the broker returns partitioned topic stats like the one below. As you can see, publishers can't be aggregated by producerName even if the partitioned producer is alone in the partitioned topic. It is breaking changes.

% ./bin/pulsar-admin topics partitioned-stats persistent://public/default/pt
...
  "publishers" : [ {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : true,
    "producerName" : "standalone-0-1"
  }, {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : true,
    "producerName" : "standalone-0-0"
  } ],
...

I understand the current index-based aggregation procedure has an issue you mentioned in this PR but are these braking changes allowed on this project?

@heesung-sn
Copy link
Contributor Author

heesung-sn commented Nov 11, 2022

@equanz,

Thank you for sharing the example.

Although it is not desirable, I think we could be lenient on this change in the stat API response(assuming this publishers stat struct is used for human admins only for ad-hoc checks).

If this change is not acceptable, I guess we could push this fix to the next major version.

Also, when there are thousands of producers(consumers) for a (partitioned-)topic, it is expensive to aggregate each publisher(subscriptions)'s stats on-the-fly across the brokers.

So, alternatively for the next major version, I think we could further define producers(subscriptions)' API like the below and drop the publishers and subscriptions structs from topics (partitioned-)stats returns.

pulsar-admin publishers list my-topic --last-pagination-key xyz
pulsar-admin publishers stats my-producer

# similarly for subscriptions

@heesung-sn
Copy link
Contributor Author

heesung-sn commented Nov 14, 2022

@equanz,

To fix the issue, do you have any suggestions other than what I mentioned above?

@equanz
Copy link
Contributor

equanz commented Nov 15, 2022

I have no other suggestions. Why don't you discuss or vote on the dev@ ML before introducing this feature?

@heesung-sn
Copy link
Contributor Author

Opened a pulsar community discussion thread here.

https://lists.apache.org/thread/vofv1oz0wvzlwk4x9vk067rhkscn8bqo

@heesung-sn heesung-sn closed this Dec 2, 2022
@heesung-sn heesung-sn deleted the bug-fix-topic-stats branch April 2, 2024 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-required Your PR changes impact docs and you will update later. ready-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants