Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix ownership loss #23515

Merged
merged 1 commit into from
Nov 7, 2024
Merged

Conversation

nodece
Copy link
Member

@nodece nodece commented Oct 25, 2024

Motivation

While testing Pulsar 3.0, I encountered a failure in the org.apache.pulsar.client.api.BrokerServiceLookupTest#testLookupConnectionNotCloseIfGetUnloadingExOrMetadataEx test, both locally and in the ascentstream/pulsar CI environment.

Ownership assignment and verification flow

Step 1: Ownership verification (Lookup phase)

The lookup mechanism first checks if the bundle is present in ownedBundlesCache. If absent, it queries ZooKeeper (zk) to determine the ownership status. Details on this process can be found in NamespaceService#findBrokerServiceUrl and OwnershipCache#getOwnerAsync.

Step 2: Assigning ownership

If no owner is identified:

  1. NamespaceService#searchForCandidateBroker is used to find a broker.
  2. If the candidateBroker matches the local broker ID (pulsar.getBrokerId()), Pulsar caches the bundle by calling ownershipCache.tryAcquiringOwnership(bundle).

If an owner is found, the lookup result is returned to the client.

Step 3: Ownership verification for topic loading or creation

During topic loading or creation, BrokerService#loadOrCreatePersistentTopic invokes BrokerService#checkTopicNsOwnership, which verifies the bundle's presence in ownedBundlesCache. If missing, an error is raised, indicating the namespace bundle is not managed by the current broker.

Ownership loss

Direct deletion of ownership information from ownedBundlesCache or zk is generally avoided. However, the test BrokerServiceLookupTest#testLookupConnectionNotCloseIfGetUnloadingExOrMetadataEx simulates such conditions to verify error handling in the lookup process.

  1. Simulating ownership release
    Using bundleOfTopic.releaseBundleLockAndMakeAcquireFail():

    • Clears the ownedBundlesCache.
    • Deletes zk data.
    • Triggers an OPERATIONTIMEOUT error:
    2024-11-04T17:44:55,479 - INFO  - [metadata-store-2-1:ResourceLockImpl] - Lock on resource /namespace/public/default/0x00000000_0xffffffff was invalidated. state Valid
    2024-11-04T17:44:55,484 - WARN  - [metadata-store-2-1:ResourceLockImpl] - Failed to revalidate the lock at /namespace/public/default/0x00000000_0xffffffff: org.apache.zookeeper.KeeperException$OperationTimeoutException: KeeperErrorCode = OperationTimeout for /namespace/public/default/0x00000000_0xffffffff - Retrying in 0.1 seconds
    
  2. Restoring zk
    The method bundleOfTopic.makeAcquireBundleLockSuccess() removes the OPERATIONTIMEOUT error:

    2024-11-04T17:44:55,597 - INFO  - [metadata-store-2-1:ResourceLockImpl] - Acquired resource lock on /namespace/public/default/0x00000000_0xffffffff
    2024-11-04T17:44:55,598 - INFO  - [metadata-store-2-1:ResourceLockImpl] - Successfully re-acquired missing lock at /namespace/public/default/0x00000000_0xffffffff
    2024-11-04T17:44:55,598 - INFO  - [metadata-store-2-1:ResourceLockImpl] - Successfully revalidated the lock on /namespace/public/default/0x00000000_0xffffffff
    

This action restores the ownership in zk, though ownedBundlesCache remains empty.

Loop triggered by ownership check during client connection

When a client connects and the bundle is missing from ownedBundlesCache, the following log message appears:

2024-11-04T17:44:55,606 - WARN  - [pulsar-io-8-4:BrokerService] - Namespace bundle for topic (persistent://public/default/tp-d9153b61-42fa-49da-8f01-76a544bccf09) not served by this instance:localhost:50265. Please redo the lookup. Request is denied: namespace=public/default

This triggers the client to retry, resulting in a loop of repeated ownership verification and lookup attempts. Since zk shows ownership, NamespaceService#searchForCandidateBroker is bypassed, causing persistent retries.

Modifications

  • If the ownership belongs to the current broker in the getOwnerAsync, the broker will try to acquire ownership by the ownedBundlesCache to avoid the ownership loss in the cache. In one case, the user may delete the ownership on the zk. This changes can fix that to help the user recover ownership.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Oct 25, 2024
@lhotari lhotari added the triage/lhotari/important lhotari's triaging label for important issues or PRs label Oct 26, 2024
@nodece
Copy link
Member Author

nodece commented Oct 29, 2024

org.apache.pulsar.client.api.BrokerServiceLookupTest#testLookupConnectionNotCloseIfGetUnloadingExOrMetadataEx should be a flaky test.

Signed-off-by: Zixuan Liu <nodeces@gmail.com>
@nodece
Copy link
Member Author

nodece commented Nov 4, 2024

@lhotari @BewareMyPower @poorbarcode @heesung-sn This PR description has been updated, could you have a chance to review this PR?

@heesung-sn
Copy link
Contributor

Please notice that the org.apache.pulsar.client.api.BrokerServiceLookupTest.BundleOfTopic class in the test.

    private void releaseBundleLockAndMakeAcquireFail() throws Exception {
        ownedBundlesCache.synchronous().invalidateAll();
        mockZooKeeper.delete(ServiceUnitUtils.path(namespaceBundle), -1);
        mockZooKeeper.setAlwaysFail(KeeperException.Code.OPERATIONTIMEOUT);
    }

This method directly discards the cache and deletes zk data, which may break the test case.

I will double-check the ownership mechanism to avoid the racing condition.

This sounds like this ownership metadata and cache inconsistency issue only occurs from the wrong test setup. Should we just fix the test code in this case?

@nodece
Copy link
Member Author

nodece commented Nov 7, 2024

This sounds like this ownership metadata and cache inconsistency issue only occurs from the wrong test setup.

@heesung-sn Sure, this is the root cause.

Should we just fix the test code in this case?

In one case, the user may delete the ownership on the zk. This PR can fix this issue and help the user recover ownership.

What do you think that?

@heesung-sn
Copy link
Contributor

In one case, the user may delete the ownership on the zk. This PR can fix this issue and help the user recover ownership.

Please consider adding comments to explain this.

I think this PR is useful to fix the possible inconsistency state, although I think that we better fix the root cause that makes this inconsistency state, if possible.

@nodece
Copy link
Member Author

nodece commented Nov 7, 2024

Ok, I'll try to fix the root cause.

@nodece nodece merged commit 576d341 into apache:master Nov 7, 2024
54 checks passed
@nodece nodece deleted the fix-ownership-loss branch November 7, 2024 03:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test release/3.0.8 release/3.3.3 release/4.0.1 triage/lhotari/important lhotari's triaging label for important issues or PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants