[fix][broker] Fix lookup heartbeat and sla namespace bundle when using extensible load manager #21213

Conversation

Demogorgon314
Member

@Demogorgon314 Demogorgon314 commented Sep 21, 2023

Motivation

Currently, if a cluster with multiple brokers is going through a rolling restart, the lookup result for a heartbeat namespace topic might be wrong. The ExtensibleLoadManagerImpl does not check the candidate broker when looking up heartbeat and SLA namespace bundles, so these bundles are not guaranteed to be owned by the broker they belong to.

This ownership selection is wrong, because the heartbeat bundle of broker-0 is assigned to broker-2:

2023-09-20T08:27:32,188+0000 [ForkJoinPool.commonPool-worker-1] INFO  org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl - Selected new owner broker: broker-2:8080 for bundle: pulsar/broker-0:8080/0x00000000_0xffffffff.

After this ownership assignment, broker-0 fails to start:

pulsar-broker 2023-09-20T08:27:54,707+0000 [main] INFO  org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl - Try acquiring ownership for bundle: pulsar/broker-0:8080/0x00000000_0xffffffff - broker-0:8080.
pulsar-broker 2023-09-20T08:27:54,707+0000 [main] ERROR org.apache.pulsar.broker.namespace.NamespaceService - namespace already owned by other broker : ns=pulsar/broker-0:8080 expected=pulsar://broker-0:6650 actual=pulsar://broker-2:6650
pulsar-broker java.lang.IllegalStateException: namespace already owned by other broker : ns=pulsar/broker-0:8080 expected=pulsar://broker-0:6650 actual=pulsar://broker-2:6650
pulsar-broker     at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:400) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:343) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:863) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:276) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:356) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker 2023-09-20T08:27:54,710+0000 [main] ERROR org.apache.pulsar.broker.PulsarService - Failed to start Pulsar service: java.lang.IllegalStateException: namespace already owned by other broker : ns=pulsar/broker-0:8080 expected=pulsar://broker-0:6650 actual=pulsar://broker-2:6650
pulsar-broker org.apache.pulsar.broker.PulsarServerException: java.lang.IllegalStateException: namespace already owned by other broker : ns=pulsar/broker-0:8080 expected=pulsar://broker-0:6650 actual=pulsar://broker-2:6650
pulsar-broker     at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:403) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:343) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:863) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:276) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:356) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker Caused by: java.lang.IllegalStateException: namespace already owned by other broker : ns=pulsar/broker-0:8080 expected=pulsar://broker-0:6650 actual=pulsar://broker-2:6650
pulsar-broker     at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:400) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     ... 4 more
pulsar-broker 2023-09-20T08:27:54,710+0000 [main] ERROR org.apache.pulsar.PulsarBrokerStarter - Failed to start pulsar service.
pulsar-broker org.apache.pulsar.broker.PulsarServerException: org.apache.pulsar.broker.PulsarServerException: java.lang.IllegalStateException: namespace already owned by other broker : ns=pulsar/broker-0:8080 expected=pulsar://broker-0:6650 actual=pulsar://broker-2:6650
pulsar-broker     at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:938) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:276) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:356) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker Caused by: org.apache.pulsar.broker.PulsarServerException: java.lang.IllegalStateException: namespace already owned by other broker : ns=pulsar/broker-0:8080 expected=pulsar://broker-0:6650 actual=pulsar://broker-2:6650
pulsar-broker     at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:403) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:343) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:863) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     ... 2 more
pulsar-broker Caused by: java.lang.IllegalStateException: namespace already owned by other broker : ns=pulsar/broker-0:8080 expected=pulsar://broker-0:6650 actual=pulsar://broker-2:6650
pulsar-broker     at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:400) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:343) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:863) ~[io.streamnative-pulsar-broker-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
pulsar-broker     ... 2 more
pulsar-broker 2023-09-20T08:27:54,711+0000 [main] WARN  org.apache.pulsar.common.util.ShutdownUtil - Triggering immediate shutdown of current process with status 1

Modifications

  • Fix the lookup of heartbeat and SLA namespace bundles when using the extensible load manager.
  • When updating the topK bundles, filter out the heartbeat and SLA namespace bundles.
  • Skip overriding orphan heartbeat namespace bundles.
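
A minimal sketch of the idea behind the first two modifications, not the code in this PR. It assumes the heartbeat bundle naming seen in the log above (pulsar/<brokerId>/<range>); the class StaticBundleSketch and its methods are hypothetical names, and SLA monitor namespaces would need an analogous pattern. A bundle that encodes a broker id is pinned to that broker instead of going through normal candidate selection, and the same check keeps such bundles out of the topK list.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not the actual Pulsar implementation. */
final class StaticBundleSketch {
    // Assumed format, taken from the log above: "pulsar/<host>:<port>" with an
    // optional "/0x........_0x........" bundle range suffix.
    private static final Pattern HEARTBEAT_BUNDLE =
            Pattern.compile("^pulsar/([^/]+:\\d+)(?:/0x[0-9a-f]{8}_0x[0-9a-f]{8})?$");

    /** Returns the broker id encoded in a heartbeat bundle name, if there is one. */
    static Optional<String> pinnedBroker(String bundle) {
        Matcher m = HEARTBEAT_BUNDLE.matcher(bundle);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /**
     * Picks the owner for a bundle. A heartbeat bundle always goes to the broker
     * named in it; any other bundle falls back to the first candidate, which
     * stands in for the real selection strategy. Candidates must be non-empty.
     */
    static String selectOwner(String bundle, List<String> candidates) {
        return pinnedBroker(bundle).orElseGet(() -> candidates.get(0));
    }

    /** Drops heartbeat bundles so they are never reported as topK candidates. */
    static List<String> filterTopK(List<String> bundles) {
        return bundles.stream()
                .filter(b -> !pinnedBroker(b).isPresent())
                .collect(Collectors.toList());
    }
}
```

With this sketch, selectOwner("pulsar/broker-0:8080/0x00000000_0xffffffff", ...) yields broker-0:8080 regardless of the candidates, which is the behavior the log above shows was missing.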

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@Demogorgon314 Demogorgon314 added type/bug The PR fixed a bug or issue reported a bug area/broker release/3.0.2 release/3.1.1 labels Sep 21, 2023
@Demogorgon314 Demogorgon314 self-assigned this Sep 21, 2023
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 21, 2023
this.isolationPoliciesHelper = new IsolationPoliciesHelper(policies);
this.brokerFilterPipeline.add(new BrokerIsolationPoliciesFilter(isolationPoliciesHelper));

createSystemTopic(pulsar, BROKER_LOAD_DATA_STORE_TOPIC);
Member Author

@Demogorgon314 Demogorgon314 Sep 22, 2023

@heesung-sn Somehow, the topic creation might time out, so the broker shuts down. Because the brokerRegistry is not closed, the zk path still exists, and the next broker start fails again.

Member Author

BTW, we might need to add a retry for system topic creation, and ignore the topic already exists exception.
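
One way such a retry could look, as a rough sketch only. SystemTopicRetrySketch, createWithRetry and TopicAlreadyExistsException are hypothetical names used for illustration, not Pulsar APIs, and the linear backoff is an assumption.

```java
import java.util.concurrent.Callable;

/** Hypothetical retry helper for illustration; not part of Pulsar. */
final class SystemTopicRetrySketch {

    /** Hypothetical stand-in for a "topic already exists" error. */
    static final class TopicAlreadyExistsException extends Exception {
        TopicAlreadyExistsException(String message) {
            super(message);
        }
    }

    /**
     * Runs createTopic up to maxAttempts times. An already-exists error counts
     * as success; any other failure is retried after a linearly growing pause,
     * and the last failure is rethrown once the attempts are used up.
     */
    static void createWithRetry(Callable<Void> createTopic, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                createTopic.call();
                return;
            } catch (TopicAlreadyExistsException e) {
                return; // The topic is already there, so there is nothing left to do.
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;
    }
}
```

The caller would wrap the existing creation call in the Callable, so a transient timeout no longer shuts the broker down on the first failure.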

Contributor

We do skip if it already exists. If the system topic creation fails here, k8s should restart the broker.

Member Author

Do you know why the topic creation times out? I have seen it several times, but after the broker restarts, it starts successfully.

Contributor

I am not sure. Do we have any logs or stack traces?

@Technoboy- Technoboy- added this to the 3.2.0 milestone Sep 26, 2023
@@ -207,6 +210,36 @@ public Set<NamespaceBundle> getOwnedServiceUnits() {
var bundle = entry.getKey();
return getNamespaceBundle(pulsar, bundle);
}).collect(Collectors.toSet());
// Add heartbeat and SLA monitor namespace bundle.
NamespaceName heartbeatNamespace = NamespaceService.getHeartbeatNamespace(brokerId, pulsar.getConfiguration());
Contributor

Why do we need to add these static bundles (bundles that are not stored in BSC) for getOwnedServiceUnits?

Member Author

Yes, getOwnedServiceUnits needs to know about the ownership of the heartbeat and SLA monitor bundles. This just keeps the behavior the same as before.

orphanServiceUnits.put(serviceUnit, stateData);
}
} else if (now - stateData.timestamp() > semiTerminalStateWaitingTimeInMillis) {
if (isActiveState(state) && StringUtils.isNotBlank(srcBroker) && !activeBrokers.contains(srcBroker)) {
Contributor

nit: the isActiveState(state) check is repeated in the cases below.

Member Author

I prefer to return early so that the logic is easier to understand. Of course, I can revert this change since it is not relevant to this PR. What do you think?

Contributor

I'm fine with this change.

@Demogorgon314 Demogorgon314 force-pushed the Demogorgon314/fix-lookup-heartbeat-and-sla-namespace-bundle branch from 6586a05 to decf7fa Compare September 30, 2023 02:22
@Demogorgon314 Demogorgon314 force-pushed the Demogorgon314/fix-lookup-heartbeat-and-sla-namespace-bundle branch from decf7fa to a812637 Compare October 7, 2023 08:50
@codecov-commenter

Codecov Report

Merging #21213 (a812637) into master (643428b) will increase coverage by 36.27%.
Report is 3 commits behind head on master.
The diff coverage is 80.00%.

@@              Coverage Diff              @@
##             master   #21213       +/-   ##
=============================================
+ Coverage     36.96%   73.23%   +36.27%     
- Complexity    12294    32505    +20211     
=============================================
  Files          1698     1887      +189     
  Lines        130510   140223     +9713     
  Branches      14260    15435     +1175     
=============================================
+ Hits          48240   102694    +54454     
+ Misses        75982    29435    -46547     
- Partials       6288     8094     +1806     
Flag       Coverage Δ
inttests   24.17% <6.11%> (-0.16%) ⬇️
systests   24.69% <5.00%> (-0.20%) ⬇️
unittests  72.51% <80.00%> (+40.39%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...org/apache/bookkeeper/mledger/util/RangeCache.java 93.10% <100.00%> (+19.23%) ⬆️
...n/java/org/apache/pulsar/broker/PulsarService.java 82.69% <100.00%> (+13.80%) ⬆️
...ker/loadbalance/extensions/models/TopKBundles.java 90.78% <100.00%> (+90.78%) ⬆️
...org/apache/pulsar/broker/loadbalance/LoadData.java 91.66% <0.00%> (+25.00%) ⬆️
...ache/pulsar/broker/namespace/NamespaceService.java 71.94% <75.00%> (+28.17%) ⬆️
...xtensions/channel/ServiceUnitStateChannelImpl.java 84.48% <58.82%> (+83.94%) ⬆️
...dbalance/extensions/ExtensibleLoadManagerImpl.java 77.75% <82.39%> (+75.66%) ⬆️

... and 1446 files with indirect coverage changes

@Demogorgon314 Demogorgon314 merged commit f85e0dc into apache:master Oct 8, 2023
45 checks passed
@Demogorgon314 Demogorgon314 deleted the Demogorgon314/fix-lookup-heartbeat-and-sla-namespace-bundle branch October 8, 2023 00:52
Demogorgon314 added a commit to Demogorgon314/pulsar that referenced this pull request Oct 8, 2023
…g extensible load manager (apache#21213)

(cherry picked from commit f85e0dc)
liangyuanpeng pushed a commit to liangyuanpeng/pulsar that referenced this pull request Oct 11, 2023
vinayakmalik95 pushed a commit to tmdc-io/pulsar that referenced this pull request Oct 12, 2023
Demogorgon314 added a commit that referenced this pull request Oct 18, 2023
@Technoboy-
Contributor

Cherry-picked by #21314

continue;
}

if (now - stateData.timestamp() > semiTerminalStateWaitingTimeInMillis) {
Contributor

Sorry. I missed this.
This should be if (!isActiveState(state) && now - stateData.timestamp() > semiTerminalStateWaitingTimeInMillis). Otherwise, we will clean active states, including Owned.
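
To make the difference concrete, here is a small sketch that compares the merged condition with the suggested one. The State enum and the set of active states are simplified assumptions for illustration (only Owned being active is stated in this thread), and OrphanCleanupSketch is a hypothetical name.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of the cleanup check; not the actual channel code. */
final class OrphanCleanupSketch {
    // Assumed, simplified state set for illustration.
    enum State { Init, Free, Owned, Assigning, Releasing, Splitting, Deleted }

    // Assumption: Owned plus the in-flight states count as active.
    private static final Set<State> ACTIVE =
            EnumSet.of(State.Owned, State.Assigning, State.Releasing, State.Splitting);

    static boolean isActiveState(State state) {
        return ACTIVE.contains(state);
    }

    /** Condition as merged: any sufficiently old entry is cleaned, even Owned. */
    static boolean shouldCleanAsMerged(State state, long now, long timestamp, long waitMillis) {
        return now - timestamp > waitMillis;
    }

    /** Suggested condition: only old entries in non-active states are cleaned. */
    static boolean shouldCleanSuggested(State state, long now, long timestamp, long waitMillis) {
        return !isActiveState(state) && now - timestamp > waitMillis;
    }
}
```

For an Owned entry older than the waiting time, shouldCleanAsMerged returns true while shouldCleanSuggested returns false, so a long-lived ownership is no longer removed.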

nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 20, 2023
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 20, 2023
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 29, 2024
…le when using extensible load manager (apache#21213) (apache#21314)

(cherry picked from commit 0454410)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 6, 2024
…le when using extensible load manager (apache#21213) (apache#21314)

(cherry picked from commit 0454410)