Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][storage] Autorecovery default reppDnsResolverClass to ZkBookieRackAffinityMapping #15640

Conversation

michaeljmarshall
Copy link
Member

@michaeljmarshall michaeljmarshall commented May 18, 2022

Fixes: #18012

Motivation

The current Bookkeeper configuration defaults to using org.apache.bookkeeper.net.ScriptBasedMapping for the DNSToSwitchMapping implementation. However, this default configuration does not align with the Broker's default configuration, which is org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping. As such, the default configuration for a Pulsar cluster does not lead to ideal rack awareness when ledgers need to be recovered. The result is that a user can configure a cluster for rack awareness and the brokers will honor that configuration, but the autorecovery process will not because it does not have the correct bookkeeper cluster topology view.

I propose we configure bookkeeper to use the broker's ZkBookieRackAffinityMapping class. That way, autorecovery will honor the operator's configured rack awareness policies out of the box.

Modifications

  • Add default value for reppDnsResolverClass to the conf/bookkeeper.conf configuration. This change effectively switches the default from org.apache.bookkeeper.net.ScriptBasedMapping to org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping.

Verifying this change

I manually verified that the ZkBookieRackAffinityMapping works by running some tests in a minikube cluster deployed with the DataStax helm chart. I set up 3 racks, 4 bookies, and a topic with a E=2, Qw=2, and Qa=2. I then verified that the autorecovery pod correctly discovered the racks and then identified when an ensemble was not following the rack placement policy after two bookies were removed. I documented my testing a bit more here: datastax/pulsar-helm-chart#214.

Does this pull request potentially affect one of the following parts:

It changes a default value. The tradeoff is that a user relying on the ScriptBasedMapping default might accidentally get switched to using the ZkBookieRackAffinityMapping implementation. Given that ScriptBasedMapping doesn't work out of the box, and that the broker's default to ZkBookieRackAffinityMapping, I think this is an acceptable tradeoff.

  • doc

@github-actions
Copy link

@michaeljmarshall:Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@michaeljmarshall michaeljmarshall added doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. and removed doc-label-missing labels May 18, 2022
Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codelipenghui
Copy link
Contributor

Please have a discussion under the dev mailing list first, the PR changed the default value.
Here is an example of enabling system topic https://lists.apache.org/thread/styoxk31q82bdp3qcv2hwxg7wvsmk8dy

I support this change, but we should let it happen on the mailing list first.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline with @codelipenghui we should discuss about this change on the ML.

It is kind of a breaking change for people who changed the configuration on the broker.conf files.

@rdhabalia FYI

@michaeljmarshall
Copy link
Member Author

I'll put together a PIP on Thursday or Friday. I agree that we should discuss this on the mailing list.

@github-actions
Copy link

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Jun 18, 2022
@codelipenghui codelipenghui modified the milestones: 2.11.0, 2.12.0 Jul 26, 2022
@lhotari
Copy link
Member

lhotari commented Aug 9, 2022

Related explanation about ZkBookieRackAffinityMapping: #151 (comment) .

On the broker side, ZkBookieRackAffinityMapping will get used for the bookie client when bookkeeperClientRackawarePolicyEnabled=true in broker.conf.

conf/bookkeeper.conf Outdated Show resolved Hide resolved
@michaeljmarshall michaeljmarshall force-pushed the pulsar-autorecovery-rack-awareness branch from b7f19be to 108a0d4 Compare August 9, 2022 04:28
@github-actions github-actions bot removed the Stale label Aug 10, 2022
@michaeljmarshall
Copy link
Member Author

@codelipenghui @eolivelli - I agree that this PR changes a default. However, I think the current default in the bookkeeper.conf should be considered a bug because it does not align with the broker default. I am fine with discussing this change on the mailing list, but I don't think that it should require a PIP. Further, I think that we should cherry pick it to all active branches of Pulsar in order to ensure that autorecovery correctly uses rack awareness. Let me know what you think, thanks!

@michaeljmarshall michaeljmarshall force-pushed the pulsar-autorecovery-rack-awareness branch from edaafe7 to 804b250 Compare August 23, 2022 05:55
@michaeljmarshall michaeljmarshall added doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. and removed doc-label-missing Stale labels Oct 11, 2022
@michaeljmarshall michaeljmarshall changed the title [Autorecovery] Default reppDnsResolverClass to ZkBookieRackAffinityMapping [fix][Autorecovery] Default reppDnsResolverClass to ZkBookieRackAffinityMapping Oct 11, 2022
@michaeljmarshall
Copy link
Member Author

The build is failing with the following error:

Error: Unknown scope "Autorecovery" found in pull request title "[fix][Autorecovery] Default reppDnsResolverClass to ZkBookieRackAffinityMapping". Use one of the available scopes: admin, broker, build, ci, cli, client, doc, fn, io, meta, misc, monitor, offload, proxy, schema, sec, site, sql, storage, test, txn, ws.

I wonder if we need to add a new scope? None of the available ones seem appropriate for this setting. I assume I'm supposed to use "storage", but it seems like it would be reasonable to indicate that the change is BK specific.

@michaeljmarshall michaeljmarshall changed the title [fix][Autorecovery] Default reppDnsResolverClass to ZkBookieRackAffinityMapping [fix][storage] Autorecovery default reppDnsResolverClass to ZkBookieRackAffinityMapping Oct 11, 2022
@michaeljmarshall
Copy link
Member Author

/pulsarbot rerun-failure-checks

@michaeljmarshall michaeljmarshall force-pushed the pulsar-autorecovery-rack-awareness branch from 804b250 to d090bc2 Compare October 13, 2022 03:51
@michaeljmarshall
Copy link
Member Author

Rebased and pushed with force to get new CI to run.

@codecov-commenter
Copy link

codecov-commenter commented Oct 13, 2022

Codecov Report

Merging #15640 (d090bc2) into master (6c65ca0) will increase coverage by 17.65%.
The diff coverage is 100.00%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #15640       +/-   ##
=============================================
+ Coverage     34.91%   52.56%   +17.65%     
- Complexity     5707     7269     +1562     
=============================================
  Files           607      393      -214     
  Lines         53396    43418     -9978     
  Branches       5712     4462     -1250     
=============================================
+ Hits          18644    22824     +4180     
+ Misses        32119    18046    -14073     
+ Partials       2633     2548       -85     
Flag Coverage Δ
unittests 52.56% <100.00%> (+17.65%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...g/apache/pulsar/compaction/CompactedTopicImpl.java 81.42% <100.00%> (+70.71%) ⬆️
...ersistentStickyKeyDispatcherMultipleConsumers.java 44.71% <0.00%> (-11.06%) ⬇️
...ervice/persistent/MessageRedeliveryController.java 56.09% <0.00%> (-4.88%) ⬇️
.../apache/pulsar/broker/namespace/LookupOptions.java 87.50% <0.00%> (ø)
...ulsar/broker/namespace/NamespaceEphemeralData.java 86.36% <0.00%> (ø)
...ent/opensearch/RandomExponentialBackoffPolicy.java
...java/org/apache/pulsar/io/kinesis/fbs/Message.java
...n/java/org/apache/pulsar/io/mongodb/MongoSink.java
.../apache/pulsar/io/hdfs2/sink/HdfsAbstractSink.java
...l/presto/decoder/avro/PulsarAvroColumnDecoder.java
... and 385 more

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@michaeljmarshall michaeljmarshall merged commit 9812297 into apache:master Oct 24, 2022
@michaeljmarshall michaeljmarshall deleted the pulsar-autorecovery-rack-awareness branch October 24, 2022 18:17
michaeljmarshall added a commit that referenced this pull request Oct 24, 2022
…ackAffinityMapping (#15640)

* [Autorecovery] Default reppDnsResolverClass to ZkBookieRackAffinityMapping

* Improve documentation

Fixes: #18012

### Motivation

The current Bookkeeper configuration defaults to using `org.apache.bookkeeper.net.ScriptBasedMapping` for the `DNSToSwitchMapping` implementation. However, this default configuration does not align with the Broker's default configuration, which is `org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping`. As such, the default configuration for a Pulsar cluster does not lead to ideal rack awareness when ledgers need to be recovered. The result is that a user can configure a cluster for rack awareness and the brokers will honor that configuration, but the autorecovery process will not because it does not have the correct bookkeeper cluster topology view.

I propose we configure bookkeeper to use the broker's `ZkBookieRackAffinityMapping` class. That way, autorecovery will honor the operator's configured rack awareness policies out of the box.

### Modifications

* Add default value for `reppDnsResolverClass` to the `conf/bookkeeper.conf` configuration. This change effectively switches the default from `org.apache.bookkeeper.net.ScriptBasedMapping` to `org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping`.

### Verifying this change

I manually verified that the `ZkBookieRackAffinityMapping` works by running some tests in a minikube cluster deployed with the DataStax helm chart. I set up 3 racks, 4 bookies, and a topic with a E=2, Qw=2, and Qa=2. I then verified that the autorecovery pod correctly discovered the racks and then identified when an ensemble was not following the rack placement policy after two bookies were removed. I documented my testing a bit more here: datastax/pulsar-helm-chart#214.

### Does this pull request potentially affect one of the following parts:

It changes a default value. The tradeoff is that a user relying on the `ScriptBasedMapping` default might accidentally get switched to using the `ZkBookieRackAffinityMapping` implementation. Given that `ScriptBasedMapping` doesn't work out of the box, and that the broker's default to `ZkBookieRackAffinityMapping`, I think this is an acceptable tradeoff.

- [x] `doc`

(cherry picked from commit 9812297)
michaeljmarshall added a commit that referenced this pull request Oct 24, 2022
…ackAffinityMapping (#15640)

* [Autorecovery] Default reppDnsResolverClass to ZkBookieRackAffinityMapping

* Improve documentation

Fixes: #18012

### Motivation

The current Bookkeeper configuration defaults to using `org.apache.bookkeeper.net.ScriptBasedMapping` for the `DNSToSwitchMapping` implementation. However, this default configuration does not align with the Broker's default configuration, which is `org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping`. As such, the default configuration for a Pulsar cluster does not lead to ideal rack awareness when ledgers need to be recovered. The result is that a user can configure a cluster for rack awareness and the brokers will honor that configuration, but the autorecovery process will not because it does not have the correct bookkeeper cluster topology view.

I propose we configure bookkeeper to use the broker's `ZkBookieRackAffinityMapping` class. That way, autorecovery will honor the operator's configured rack awareness policies out of the box.

### Modifications

* Add default value for `reppDnsResolverClass` to the `conf/bookkeeper.conf` configuration. This change effectively switches the default from `org.apache.bookkeeper.net.ScriptBasedMapping` to `org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping`.

### Verifying this change

I manually verified that the `ZkBookieRackAffinityMapping` works by running some tests in a minikube cluster deployed with the DataStax helm chart. I set up 3 racks, 4 bookies, and a topic with a E=2, Qw=2, and Qa=2. I then verified that the autorecovery pod correctly discovered the racks and then identified when an ensemble was not following the rack placement policy after two bookies were removed. I documented my testing a bit more here: datastax/pulsar-helm-chart#214.

### Does this pull request potentially affect one of the following parts:

It changes a default value. The tradeoff is that a user relying on the `ScriptBasedMapping` default might accidentally get switched to using the `ZkBookieRackAffinityMapping` implementation. Given that `ScriptBasedMapping` doesn't work out of the box, and that the broker's default to `ZkBookieRackAffinityMapping`, I think this is an acceptable tradeoff.

- [x] `doc`

(cherry picked from commit 9812297)
michaeljmarshall added a commit that referenced this pull request Oct 24, 2022
…ackAffinityMapping (#15640)

* [Autorecovery] Default reppDnsResolverClass to ZkBookieRackAffinityMapping

* Improve documentation

Fixes: #18012

### Motivation

The current Bookkeeper configuration defaults to using `org.apache.bookkeeper.net.ScriptBasedMapping` for the `DNSToSwitchMapping` implementation. However, this default configuration does not align with the Broker's default configuration, which is `org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping`. As such, the default configuration for a Pulsar cluster does not lead to ideal rack awareness when ledgers need to be recovered. The result is that a user can configure a cluster for rack awareness and the brokers will honor that configuration, but the autorecovery process will not because it does not have the correct bookkeeper cluster topology view.

I propose we configure bookkeeper to use the broker's `ZkBookieRackAffinityMapping` class. That way, autorecovery will honor the operator's configured rack awareness policies out of the box.

### Modifications

* Add default value for `reppDnsResolverClass` to the `conf/bookkeeper.conf` configuration. This change effectively switches the default from `org.apache.bookkeeper.net.ScriptBasedMapping` to `org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping`.

### Verifying this change

I manually verified that the `ZkBookieRackAffinityMapping` works by running some tests in a minikube cluster deployed with the DataStax helm chart. I set up 3 racks, 4 bookies, and a topic with a E=2, Qw=2, and Qa=2. I then verified that the autorecovery pod correctly discovered the racks and then identified when an ensemble was not following the rack placement policy after two bookies were removed. I documented my testing a bit more here: datastax/pulsar-helm-chart#214.

### Does this pull request potentially affect one of the following parts:

It changes a default value. The tradeoff is that a user relying on the `ScriptBasedMapping` default might accidentally get switched to using the `ZkBookieRackAffinityMapping` implementation. Given that `ScriptBasedMapping` doesn't work out of the box, and that the broker's default to `ZkBookieRackAffinityMapping`, I think this is an acceptable tradeoff.

- [x] `doc`

(cherry picked from commit 9812297)
michaeljmarshall added a commit that referenced this pull request Oct 24, 2022
…ackAffinityMapping (#15640)

* [Autorecovery] Default reppDnsResolverClass to ZkBookieRackAffinityMapping

* Improve documentation

Fixes: #18012

### Motivation

The current Bookkeeper configuration defaults to using `org.apache.bookkeeper.net.ScriptBasedMapping` for the `DNSToSwitchMapping` implementation. However, this default configuration does not align with the Broker's default configuration, which is `org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping`. As such, the default configuration for a Pulsar cluster does not lead to ideal rack awareness when ledgers need to be recovered. The result is that a user can configure a cluster for rack awareness and the brokers will honor that configuration, but the autorecovery process will not because it does not have the correct bookkeeper cluster topology view.

I propose we configure bookkeeper to use the broker's `ZkBookieRackAffinityMapping` class. That way, autorecovery will honor the operator's configured rack awareness policies out of the box.

### Modifications

* Add default value for `reppDnsResolverClass` to the `conf/bookkeeper.conf` configuration. This change effectively switches the default from `org.apache.bookkeeper.net.ScriptBasedMapping` to `org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping`.

### Verifying this change

I manually verified that the `ZkBookieRackAffinityMapping` works by running some tests in a minikube cluster deployed with the DataStax helm chart. I set up 3 racks, 4 bookies, and a topic with a E=2, Qw=2, and Qa=2. I then verified that the autorecovery pod correctly discovered the racks and then identified when an ensemble was not following the rack placement policy after two bookies were removed. I documented my testing a bit more here: datastax/pulsar-helm-chart#214.

### Does this pull request potentially affect one of the following parts:

It changes a default value. The tradeoff is that a user relying on the `ScriptBasedMapping` default might accidentally get switched to using the `ZkBookieRackAffinityMapping` implementation. Given that `ScriptBasedMapping` doesn't work out of the box, and that the broker's default to `ZkBookieRackAffinityMapping`, I think this is an acceptable tradeoff.

- [x] `doc`

(cherry picked from commit 9812297)
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Oct 25, 2022
…ackAffinityMapping (apache#15640)

* [Autorecovery] Default reppDnsResolverClass to ZkBookieRackAffinityMapping

* Improve documentation

Fixes: apache#18012

### Motivation

The current Bookkeeper configuration defaults to using `org.apache.bookkeeper.net.ScriptBasedMapping` for the `DNSToSwitchMapping` implementation. However, this default configuration does not align with the Broker's default configuration, which is `org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping`. As such, the default configuration for a Pulsar cluster does not lead to ideal rack awareness when ledgers need to be recovered. The result is that a user can configure a cluster for rack awareness and the brokers will honor that configuration, but the autorecovery process will not because it does not have the correct bookkeeper cluster topology view.

I propose we configure bookkeeper to use the broker's `ZkBookieRackAffinityMapping` class. That way, autorecovery will honor the operator's configured rack awareness policies out of the box.

### Modifications

* Add default value for `reppDnsResolverClass` to the `conf/bookkeeper.conf` configuration. This change effectively switches the default from `org.apache.bookkeeper.net.ScriptBasedMapping` to `org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping`.

### Verifying this change

I manually verified that the `ZkBookieRackAffinityMapping` works by running some tests in a minikube cluster deployed with the DataStax helm chart. I set up 3 racks, 4 bookies, and a topic with a E=2, Qw=2, and Qa=2. I then verified that the autorecovery pod correctly discovered the racks and then identified when an ensemble was not following the rack placement policy after two bookies were removed. I documented my testing a bit more here: datastax/pulsar-helm-chart#214.

### Does this pull request potentially affect one of the following parts:

It changes a default value. The tradeoff is that a user relying on the `ScriptBasedMapping` default might accidentally get switched to using the `ZkBookieRackAffinityMapping` implementation. Given that `ScriptBasedMapping` doesn't work out of the box, and that the broker's default to `ZkBookieRackAffinityMapping`, I think this is an acceptable tradeoff.

- [x] `doc`

(cherry picked from commit 9812297)
(cherry picked from commit fc692c3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/config cherry-picked/branch-2.8 Archived: 2.8 is end of life cherry-picked/branch-2.9 Archived: 2.9 is end of life cherry-picked/branch-2.10 cherry-picked/branch-2.11 doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. ready-to-test release/2.8.5 release/2.9.4 release/2.10.3 release/2.11.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PIP-212: Default reppDnsResolverClass to ZkBookieRackAffinityMapping
5 participants