Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZOOKEEPER-4655: Communicate the Zxid that triggered a WatchEvent to fire #1950

Merged
merged 3 commits into from
Jun 16, 2023

Conversation

PapaCharlie
Copy link
Contributor

With the recent addition of persistent watches, many doors have opened up to significantly more performant and intuitive local caches of remote state, but the actual implementation can be difficult because to cache data locally, one needs to execute the following steps:

  1. Set the watch
  2. Bootstrap the watched subtree
  3. Catch up on the events that fired during the bootstrap

The issue is it's now very difficult to deduplicate and sanely resolve the remote state during step 3 because it's unknown whether an event arrived during the bootstrap or after. For example, imagine that between steps 1 and 2, a node /a was deleted then re-created. By the time step 3 is executed, there will be a NodeDeleted event queued up followed by a NodeCreated, causing at best a double read (one from the bootstrap, one from the NodeCreated) or at worst some data inconsistencies in the local cache.

This change sets the Zxid in the response header whenever the watch event type is NodeCreated, NodeDeleted, NodeDataChanged or NodeChildrenChanged.

@PapaCharlie
Copy link
Contributor Author

Sorry I can't open a ticket for this. I emailed the mailing list already

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for this useful feature!
I created this JIRA, can you please amend you commit and update the commit message?
https://issues.apache.org/jira/browse/ZOOKEEPER-4655

Over LGTM

I have one question about backward compatibility.

What happens if a "new client" connects to a "old version server" ?
the observed ID will be -1

correct ?

We should state it clearly in the javadocs.

Follow up question/work:
if possible would you like to contribute support for this feature in apache/curator ?

@PapaCharlie
Copy link
Contributor Author

Hey @eolivelli, thanks for taking a look at this! Happy to hear this looks good. I'm just lucky there was room in the existing API for me to squeeze this.

Let me update the commit message and poke around the docs and see where I can make it more obvious that this is available.

Regarding the new client/old server issue: yeah old servers will always send -1, so it's a pretty clear way to know whether you're talking to an old server. I'll make sure to state that in the docs.

Let me take a look at curator and see what can be done. It may be feasible to expose this in a sane way

@PapaCharlie PapaCharlie changed the title Communicate the Zxid that triggered a WatchEvent to fire ZOOKEEPER-4655: Communicate the Zxid that triggered a WatchEvent to fire Jan 5, 2023
@PapaCharlie
Copy link
Contributor Author

@eolivelli Done

@PapaCharlie
Copy link
Contributor Author

Hey @eolivelli can you take another look at this? I'd really appreciate it!

@PapaCharlie PapaCharlie force-pushed the master branch 5 times, most recently from 7d938b5 to be904a4 Compare March 8, 2023 00:16
@PapaCharlie
Copy link
Contributor Author

Finally was able to get all the checks to pass. Sorry about that! The branch is looking good now

Copy link
Member

@kezhuw kezhuw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for a nice feature. Looks good generally except WatchedEvent.equals.

Normally, I think local cache for zookeeper data tree can only guarantee eventual consistent. Does this feature affect any correctness in building a local cache ?

For example, imagine that between steps 1 and 2, a node /a was deleted then re-created. By the time step 3 is executed, there will be a NodeDeleted event queued up followed by a NodeCreated, causing at best a double read (one from the bootstrap, one from the NodeCreated) or at worst some data inconsistencies in the local cache.

I saw that it cloud filter dated event by comparing with stat.mzxid. But how inconsistencies could be introduced ? Or it will be tricky to build a eventual consistent local cache for zookeeper data tree without this feature ?

@PapaCharlie
Copy link
Contributor Author

Hey @kezhuw thanks so much for taking a look at this! Regarding your comment, yes without this feature building an eventually consistent cache can be tricky because it requires handling a lot of edge cases when a read does not match what's expected based on the watch events received. Depending on what the local cache does when it receives events, it could end up in a weird state. I banged my head against the wall enough trying to implement a nice consistent local cache that I opened this PR :)
Lucky for me there was room in the API!

@PapaCharlie PapaCharlie force-pushed the master branch 2 times, most recently from 72b9a95 to 9aaead8 Compare March 8, 2023 04:20
@PapaCharlie
Copy link
Contributor Author

I think there's a flaky test with WatcherCleanerTest.testDeadWatcherMetrics. I can repro it locally it and it doesn't seem related to my change. I will try to fix it

@PapaCharlie
Copy link
Contributor Author

#1989 @kezhuw I fixed the flaky test, can you take a look?

@PapaCharlie PapaCharlie force-pushed the master branch 2 times, most recently from 3b81a47 to 13bd711 Compare March 8, 2023 20:35
Because the metrics were updated _after_ the listener is invoked, the listener does not always see
the fresh metric value. This fixes it so that the test waits for the value to become what we expect.
@PapaCharlie PapaCharlie force-pushed the master branch 2 times, most recently from 559312e to 50a032a Compare March 8, 2023 20:37
@kezhuw
Copy link
Member

kezhuw commented May 11, 2023

Just for info, ZOOKEEPER-1289: Multi Op Watch Events is probably related.

@PapaCharlie
Copy link
Contributor Author

Hey @kezhuw, what's the usecase for having a single multi modify the same node more than once? Is there a specific behavior of ZK that is only available via this sort of multi? Based on my understanding of common uses for ZK, it feels like this is a rare usage pattern, so this change is still very useful for anyone that does not do this

@PapaCharlie
Copy link
Contributor Author

Also, based on my experience in building a client-side cache, having the zxid in the header would resolve a lot of internal inconsistencies, so ZOOKEEPER-1289 may be at least partially solved :)

@kezhuw
Copy link
Member

kezhuw commented May 16, 2023

Hi @PapaCharlie, I have opened ZOOKEEPER-4695. I agree your point, I think it is a matter of breaking change.

based on my experience in building a client-side cache, having the zxid in the header would resolve a lot of internal inconsistencies

Seems that you are building something new in another language, I think ZooKeeper Watches could be help.

@PapaCharlie
Copy link
Contributor Author

Yeah I am building something in go (go-zookeeper/zk#89) but it still would help significantly to have the zxid in the watch.

I think, fundamentally, because this fits in the existing API, I don't think that the fact that the multis can have this behavior is a strong enough reason not to have this. This can greatly simplify client-side behavior in many applications and avoid double lookups in many circumstances.

@gu0keno0
Copy link

@kezhuw : agreed that regular ZK watch can achieve the ordering the modification events; however, for maintaining an in-memory cache for a Znode tree, a single recursive persistent watch is easier to use than regular watches, which has to be set on each Znode in the tree to trigger events on Znode content changes.

Regarding the multi-op API, will the order of watch event be different from the order of write operations in the same multi-op tx? If the orders are the same (e.g. write op1, op2 ==> watch event op1 , watch event op2), then I'd say even if they have the same txid, we can still just order the events in the main memory on the watch event consumer side, and deal with the watch events one by one to maintain the cache consistency. FWIW, we only need the cache to be eventually consistent, yet the simplification gain we can get from leveraging persistent watch will be a lot.

kezhuw added a commit to kezhuw/zookeeper that referenced this pull request May 22, 2023
…nt watches

I found this in reply to
apache#1950 (comment).

But it turns out a known issue
apache#1106 (comment).

> An important question to all committers. In DataTree.setWatches
> persistent watchers are not applied. This means that after a
> network partition, no persistent watchers will trigger. I
> don't have a feeling about this one way or another - the current
> implementation works fine for Curator's use cases.
@kezhuw
Copy link
Member

kezhuw commented May 22, 2023

Regarding the multi-op API, will the order of watch event be different from the order of write operations in the same multi-op tx?

@gu0keno0 They are same from my knowledge.

order the events in the main memory on the watch event consumer side, and deal with the watch events one by one to maintain the cache consistency

I think this is crucial. The "deal" should tolerate to disconnection. autoWatchReset is fragile on its own from my observation.

Says, "/foo" and "/bar" are watched in a sub tree. Two changes(setData for example) are made in server to "/foo" and "/bar" in order. The first "deal" succeed and update lastZxid in client side. The updated lastZxid covers change to "/bar". Assumes that the second "deal" fails due to connection issue.

  • In standard watch: No event for "/bar" after reconnected as client's lastZxid is up to date. We have to retry "/bar" here.
  • In persistent watch: There is no autoWatchReset. So all events during disconnection are simple lost. Currently, Curator rebuilt on reconnected.

FWIW, we only need the cache to be eventually consistent, yet the simplification gain we can get from leveraging persistent watch will be a lot.

Though, persistent watch does not work well with autoWatchReset, it is still a great feature.

@gu0keno0
Copy link

In persistent watch: There is #1106 (comment). So all events during disconnection are simple lost. Currently, Curator rebuilt on reconnected.

My understanding of @PapaCharlie 's use case is that his cache would also reduild upon reconnection. And AFAICT, this feature should be able to simplify the rebuilding process, during which we re-read all Znodes from ZK and new persistent watch events may fire at the same time. Then as @PapaCharlie suggested in his initial statement, in this situation this new feature should be able to simplify the cache maintainenace logic, because if the persistent watch's txid is lower than / equal to the txid of the cached data, we know that the newer data is already read.

@kezhuw
Copy link
Member

kezhuw commented May 23, 2023

@PapaCharlie I think you can send an email to dev mailing list to ask for opinions and reviews. pr has less exposure chance if it is not get attentions from its creation. I did once and saw others. But before doing that, I recommend you to filter out #1989 related changes to make this pr more concentrated.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support this patch.
Unfortunately I had to add one blocker comment.
There is a public class in which we are changing the constructor.
This may have side effects on applications especially testing frameworks who mock ZK events.

Can you please address that comment ? then I will be +1

*/
public WatchedEvent(EventType eventType, KeeperState keeperState, String path) {
public WatchedEvent(EventType eventType, KeeperState keeperState, String path, long zxid) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is public API, we should keep the original constructors, maybe you can mark them "@deprecated"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is preserved below, it's just not marked as @deprecated because there are many uses of WatchedEvent that correctly do not have a zxid.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

Copy link
Contributor

@anmolnar anmolnar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.
Nice refactoring in the tests.

@anmolnar
Copy link
Contributor

@PapaCharlie Please assign the jira ticket to yourself and set the status to patch available:
https://issues.apache.org/jira/browse/ZOOKEEPER-4655

@PapaCharlie
Copy link
Contributor Author

Hey @anmolnar and @eolivelli thanks for taking a look! Really appreciate the review. I didn't have an account so I asked @abhilash1in to update the ticket. I have an account now so I'll assign it to myself

@anmolnar anmolnar merged commit 880f606 into apache:master Jun 16, 2023
kezhuw added a commit to kezhuw/zookeeper that referenced this pull request Jun 16, 2023
PR apache#1950(ZOOKEEPER-4655) was created before apache#1859(ZOOKEEPER-4466) merged.
It changes `assertEvent`'s signature which is depended by apache#1859.
anmolnar pushed a commit that referenced this pull request Jun 16, 2023
…2012)

PR #1950(ZOOKEEPER-4655) was created before #1859(ZOOKEEPER-4466) merged.
It changes `assertEvent`'s signature which is depended by #1859.
anmolnar pushed a commit that referenced this pull request Jun 16, 2023
PR #1950(ZOOKEEPER-4655) was created before #1859(ZOOKEEPER-4466) merged.
It changes `assertEvent`'s signature which is depended by #1859.
anurag-harness pushed a commit to anurag-harness/zookeeper that referenced this pull request Aug 31, 2023
…ire (apache#1950)

* Fix a race condition in WatcherCleanerTest.testDeadWatcherMetrics

Because the metrics were updated _after_ the listener is invoked, the listener does not always see
the fresh metric value. This fixes it so that the test waits for the value to become what we expect.

* Leverage an existing method and refactor the rest of the code to match

Since there was an existing waitFor method in ZKTestCase, along with an existing implementation of a
waitForMetric LearnerMetricsTest, this commit moves waitForMetric to ZKTestCase and refactors the
metric-related usages of waitFor.

* Communicate the Zxid that triggered a WatchEvent to fire

With the recent addition of persistent watches, many doors have opened up to significantly more
performant and intuitive local caches of remote state, but the actual implementation can be
difficult because to cache data locally, one needs to execute the following steps:

1. Set the watch
2. Bootstrap the watched subtree
3. Catch up on the events that fired during the bootstrap

The issue is it's now very difficult to deduplicate and sanely resolve the remote state during step
3 because it's unknown whether an event arrived during the bootstrap or after. For example,
imagine that between steps 1 and 2, a node /a was deleted then re-created. By the time step 3 is
executed, there will be a NodeDeleted event queued up followed by a NodeCreated, causing at best a
double read (one from the bootstrap, one from the NodeCreated) or at worst some data inconsistencies
in the local cache.

This change sets the Zxid in the response header whenever the watch event type is NodeCreated,
NodeDeleted, NodeDataChanged or NodeChildrenChanged.
anurag-harness added a commit to anurag-harness/zookeeper that referenced this pull request Aug 31, 2023
…ire (apache#1950) (#55)

* Fix a race condition in WatcherCleanerTest.testDeadWatcherMetrics

Because the metrics were updated _after_ the listener is invoked, the listener does not always see
the fresh metric value. This fixes it so that the test waits for the value to become what we expect.

* Leverage an existing method and refactor the rest of the code to match

Since there was an existing waitFor method in ZKTestCase, along with an existing implementation of a
waitForMetric LearnerMetricsTest, this commit moves waitForMetric to ZKTestCase and refactors the
metric-related usages of waitFor.

* Communicate the Zxid that triggered a WatchEvent to fire

With the recent addition of persistent watches, many doors have opened up to significantly more
performant and intuitive local caches of remote state, but the actual implementation can be
difficult because to cache data locally, one needs to execute the following steps:

1. Set the watch
2. Bootstrap the watched subtree
3. Catch up on the events that fired during the bootstrap

The issue is it's now very difficult to deduplicate and sanely resolve the remote state during step
3 because it's unknown whether an event arrived during the bootstrap or after. For example,
imagine that between steps 1 and 2, a node /a was deleted then re-created. By the time step 3 is
executed, there will be a NodeDeleted event queued up followed by a NodeCreated, causing at best a
double read (one from the bootstrap, one from the NodeCreated) or at worst some data inconsistencies
in the local cache.

This change sets the Zxid in the response header whenever the watch event type is NodeCreated,
NodeDeleted, NodeDataChanged or NodeChildrenChanged.

Co-authored-by: Paul Chesnais <paul.chesnais@gmail.com>
anurag-harness pushed a commit to anurag-harness/zookeeper that referenced this pull request Aug 31, 2023
PR apache#1950(ZOOKEEPER-4655) was created before apache#1859(ZOOKEEPER-4466) merged.
It changes `assertEvent`'s signature which is depended by apache#1859.
anurag-harness added a commit to anurag-harness/zookeeper that referenced this pull request Aug 31, 2023
PR apache#1950(ZOOKEEPER-4655) was created before apache#1859(ZOOKEEPER-4466) merged.
It changes `assertEvent`'s signature which is depended by apache#1859.

Co-authored-by: Kezhu Wang <kezhuw@gmail.com>
rahulrane50 pushed a commit to rahulrane50/zookeeper that referenced this pull request Sep 15, 2023
…ire (apache#1950)

* Fix a race condition in WatcherCleanerTest.testDeadWatcherMetrics

Because the metrics were updated _after_ the listener is invoked, the listener does not always see
the fresh metric value. This fixes it so that the test waits for the value to become what we expect.

* Leverage an existing method and refactor the rest of the code to match

Since there was an existing waitFor method in ZKTestCase, along with an existing implementation of a
waitForMetric LearnerMetricsTest, this commit moves waitForMetric to ZKTestCase and refactors the
metric-related usages of waitFor.

* Communicate the Zxid that triggered a WatchEvent to fire

With the recent addition of persistent watches, many doors have opened up to significantly more
performant and intuitive local caches of remote state, but the actual implementation can be
difficult because to cache data locally, one needs to execute the following steps:

1. Set the watch
2. Bootstrap the watched subtree
3. Catch up on the events that fired during the bootstrap

The issue is it's now very difficult to deduplicate and sanely resolve the remote state during step
3 because it's unknown whether an event arrived during the bootstrap or after. For example,
imagine that between steps 1 and 2, a node /a was deleted then re-created. By the time step 3 is
executed, there will be a NodeDeleted event queued up followed by a NodeCreated, causing at best a
double read (one from the bootstrap, one from the NodeCreated) or at worst some data inconsistencies
in the local cache.

This change sets the Zxid in the response header whenever the watch event type is NodeCreated,
NodeDeleted, NodeDataChanged or NodeChildrenChanged.
rahulrane50 pushed a commit to rahulrane50/zookeeper that referenced this pull request Sep 15, 2023
PR apache#1950(ZOOKEEPER-4655) was created before apache#1859(ZOOKEEPER-4466) merged.
It changes `assertEvent`'s signature which is depended by apache#1859.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants