Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve] Introduce the sync() API to ensure consistency on reads during critical metadata operation paths #18518

Merged
merged 1 commit into from
Dec 27, 2022

Conversation

eolivelli
Copy link
Contributor

@eolivelli eolivelli commented Nov 17, 2022

This is only a POC, opened for discussion

Explanation

When a operation is redirected to another broker, it may happen that the local view over metadata on the other broker is not up-to-date with the view seen by the initial broker.
In order to guarantee casual consistency among different peers you have to ways using the ZooKeeper APIs:

  • perform a write
  • use the sync() API

Both of the two operations force (or wait for) the ZooKeeper server to which the client is connected to be in sync with the Leader ZooKeeper server and also that the local ZooKeeper client is up-to-date.

See more at https://zookeeper.apache.org/doc/r3.8.0/zookeeperProgrammers.html at the "Consistency Guarantees" paragraph.

When the control passes from one broker to another broker the second broker needs to ensure that its local view is not stale.
And we can do it by using the sync() API and then invalidating the local MetadataCache.
The guarantee here is that the operation executed on the initial broker "happened before" the operation on the second broker.

A good example of the need for this feature is described on PR #18193.
at comment https://github.com/apache/pulsar/pull/18193/files#r1005919247

Modifications

Add to MetadataStore the support for using the sync() API and add refreshAndGetAsync() to MetadataCache.
Fix the problem reported by https://github.com/apache/pulsar/pull/18193/files#r1005919247 regarding Partitioned Topic deletion (the same fix should be applied to Namespace deletion, about the Policies#deleted flag)

PR in the forked repository: eolivelli#20

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@eolivelli eolivelli self-assigned this Nov 17, 2022
@eolivelli eolivelli added doc-not-needed Your PR changes do not impact docs ready-to-test and removed doc-label-missing labels Nov 17, 2022
@eolivelli eolivelli added this to the 2.12.0 milestone Nov 17, 2022
@github-actions github-actions bot added doc-label-missing and removed doc-not-needed Your PR changes do not impact docs labels Nov 17, 2022
@eolivelli eolivelli changed the title [proof-of-concept] Introduce the sync() API to ensure consistency on reads during critical metadata operation paths [improvement] Introduce the sync() API to ensure consistency on reads during critical metadata operation paths Nov 17, 2022
@eolivelli eolivelli changed the title [improvement] Introduce the sync() API to ensure consistency on reads during critical metadata operation paths [improve] Introduce the sync() API to ensure consistency on reads during critical metadata operation paths Nov 17, 2022
@eolivelli eolivelli marked this pull request as ready for review November 17, 2022 12:49
@apache apache deleted a comment from github-actions bot Nov 17, 2022
@eolivelli eolivelli added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Nov 17, 2022
@github-actions github-actions bot added doc-label-missing and removed doc-not-needed Your PR changes do not impact docs labels Nov 17, 2022
@github-actions
Copy link

@eolivelli Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@codecov-commenter
Copy link

codecov-commenter commented Nov 17, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 47.21%. Comparing base (b42aed1) to head (c7b7c31).
Report is 1903 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #18518      +/-   ##
============================================
+ Coverage     47.11%   47.21%   +0.09%     
- Complexity    10595    10685      +90     
============================================
  Files           710      711       +1     
  Lines         69423    69455      +32     
  Branches       7449     7452       +3     
============================================
+ Hits          32709    32792      +83     
+ Misses        33037    32993      -44     
+ Partials       3677     3670       -7     
Flag Coverage Δ
unittests 47.21% <ø> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 62 files with indirect coverage changes

@merlimat
Copy link
Contributor

merlimat commented Nov 17, 2022

Apart from the fact that sync() is a very ZK specific concept, my main problem with this is what is this supposed to achieve and where is this supposed to be used.

Is this to fix the problem of a partitioned topic created and immediately used by a client and seeing error?

The part that is important to highlight is that sync() is essentially forcing a write in the ZK log, in order to guarantee read-after-write in ZK. Imposing a sync each time a client connects, will open a direct write path from clients to ZK.

That would mean that even when using a fully-consistent data store, we would still have the problem that we cannot cache this information.

I would rather find a better way, where we only force the sync on very specific occasions. eg:

  1. try to get partitioned-topic-metadata
  2. It's not in local cache
  3. Try to read from metadata store
  4. We get NotFound error
  5. Force sync() and redo the read
  6. Cache the result of the read, (including the NotFound result)

Copy link
Contributor

@poorbarcode poorbarcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. By the way, should we need a test for this?

@poorbarcode
Copy link
Contributor

poorbarcode commented Nov 18, 2022

Hi @merlimat

Is this to fix the problem of a partitioned topic created and immediately used by a client and seeing error?

No, What this PR is trying to fix is that when the metadata attribute deleted of a partitioned topic has changed to true, this value is still cached in other brokers. This PR makes sync and read in other brokers. The first discussion is https://github.com/apache/pulsar/pull/18193/files#r1005921385

I would rather find a better way.

I agree. +1

Copy link
Contributor

@poorbarcode poorbarcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we need a test for this?

@eolivelli
Copy link
Contributor Author

Thank you @merlimat and @poorbarcode for your comments.

@merlimat this proposal is only to cover very specific cases, as you say we are already doing well for most of the cases.

The case in which you need "sync()" is when there is this combo happening:

  • you have 2 brokers
  • broker A mutates the metadata (op1)
  • another action triggers a "read" (op2) followed by a write (op3) on broker B

In this case when broker B executes op2 we must guarantee that the value that observes includes the effects generated by op1, otherwise op3 may be executed using stale data (any compare-and-set operation in op3 is not on the same znode as op2, so we not protection, no BadVersion).

In the partitioned topic deletion example we have:

  • op1 -> PartitionedTopicMetadata.deleted = true on Broker A
  • op2 -> Read (from cache or from the ZK follower node in case of cache miss) PartitionedTopicMetadata on Broker B
  • op3 -> Do something (like creating again a partition in case of autocreation enabled) on Broker B

You need sync() (on broker B) before op2 because the action that causes op2 is out-of-band in respect to ZooKeeper, and so Broker B may use a stale version of the Z-Node (either because it is connected to a Follower that is not up-to-date or because it has a stale value in the local cache and the watch notification still hasn't arrived, mostly because the Follower is still not up-to-date).

So we have to use this "consistent refreshAndGet" only when this pattern happens.

No need to change Pulsar everywhere, but only on SOME of the Write paths that touch Metadata.

I would really reject any proposal to call sync() before every read as it would kill the existence of the localcache and kill also the performances (as you say, because sync() is a dymmy write)or every write as CAS already covers 99% of the operations

@eolivelli
Copy link
Contributor Author

@poorbarcode I will add tests only when we reach consensus on the approach.

@eolivelli
Copy link
Contributor Author

@merlimat do you have more comments please ?

@github-actions
Copy link

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Dec 23, 2022
@eolivelli
Copy link
Contributor Author

@merlimat @codelipenghui @hangc0276 @rdhabalia I believe that this PR adds value and that we need to complete this work.

I can't find any other way to achieve the goals of this PR

@github-actions github-actions bot removed the Stale label Dec 24, 2022
@poorbarcode
Copy link
Contributor

@eolivelli I want to cherry-pick this PR to 2.10 and 2.11 to solve the issue, which #21063 fixed. Do you feel okay?

@poorbarcode
Copy link
Contributor

@eolivelli I want to cherry-pick this PR to 2.10 and 2.11 to solve the issue, which #21063 fixed. Do you feel okay?

I sent a discuss to cherry-pick this PR to branch-2.10 and branch-2.11

poorbarcode pushed a commit that referenced this pull request Aug 31, 2023
…ing critical metadata operation paths (#18518)

(cherry picked from commit 492a9c3)
@poorbarcode
Copy link
Contributor

I sent a discuss to cherry-pick this PR to branch-2.10 and branch-2.11

Cherry-picked.

Since the method isPartitionedTopicBeingDeletedAsync only exists in branch-3.0, the change https://github.com/apache/pulsar/pull/18518/files#diff-38fac527320b93f0e9ea2459420f56638299627f7f29c01484f7e4aae0c5f0b7R330 was not be cherry-picked into branch-2.10 and branch-2.11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants