Add a Partition Store snapshot restore policy #1999
Conversation
stack-info: PR: #1999, branch: pcholakov/stack/2
Force-pushed from 5da9e65 to 3c9b7b9
Force-pushed from 6b5ac97 to 7da0dc9
Test Results: 5 files ±0, 5 suites ±0, 2m 32s ⏱️ (-20s). Results for commit 5c6fc5f; comparison against base commit d2ca091. This pull request removes 2 tests and adds 2 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
Force-pushed from 3c9b7b9 to 6c68207
Testing

With this change we introduce the option to restore from a snapshot when the partition store is empty. We can test this by dropping the partition column family and restarting restate-server with restore enabled.

Create a snapshot:
Server writes snapshot on-demand:
Sample metadata file:
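For illustration only (every field name below is an assumption, not the metadata format actually produced by this change), the metadata might record which partition the snapshot belongs to and the log position it covers:

```bash
# Illustration only: all field names and values here are assumptions, not the
# metadata format written by this change.
cat <<'EOF'
{
  "partition_id": 0,
  "cluster_name": "my-cluster",
  "min_applied_lsn": 1234,
  "created_at": "2024-09-01T12:00:00Z",
  "files": ["000012.sst", "000013.sst"]
}
EOF
```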
Optionally, we can also trim the log to prevent replay from Bifrost.
With Restate stopped, we drop the partition store:
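As a rough sketch (the paths and column family name below are assumptions about a local setup, not values defined by this change), dropping the partition store while the server is stopped could look like this:

```bash
# Rough sketch: paths and the column family name are assumptions about the
# local setup, not values defined by this change.
set -euo pipefail

DB_DIR="./restate-data/db"   # assumed location of the partition store RocksDB instance

# Option 1: drop only the partition's column family with RocksDB's ldb tool
# (the column family name here is a placeholder).
ldb --db="${DB_DIR}" drop_column_family data-0

# Option 2 (bluntest form): remove the whole store directory so it is
# recreated empty on the next start.
# rm -rf "${DB_DIR}"
```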
Using this config:
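Purely as a hypothetical sketch of what such a configuration could look like (the section and key names below are assumptions, not the exact options introduced by this PR):

```bash
# Hypothetical sketch: the [worker.snapshots] section and both key names are
# assumptions, not the exact configuration options added by this PR.
cat > restate.toml <<'EOF'
[worker.snapshots]
# Where snapshots are written to and restored from (assumed key name).
destination = "file:///var/lib/restate/snapshots"
# Restore policy: only restore when the local partition store is empty
# (assumed key name and value).
restore-policy = "on-empty-store"
EOF

# Start the server with the config above (path is illustrative).
restate-server --config-file ./restate.toml
```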
When the Restate server comes up, we can see that it successfully restores from the latest snapshot:
Force-pushed from 6c68207 to 7ea7be6
I'm not sure how this ties into the bigger picture for partition store recovery, so maybe we should hide the configuration option until we have an end-to-end design specced out. The primary unanswered question is who makes the decision and where the knowledge about the snapshot comes from: one option is the cluster controller passing this information down through the attachment plan; another is that it's self-decided, as you are proposing here. I can see one fallback strategy that follows your proposal, i.e. if we don't have a local partition store and we didn't get information about a snapshot to restore, then try to fetch one. But I guess we'll need to check the trim point of the log to figure out whether the snapshot we have is good enough before we commit to being a follower or leader.
Force-pushed from 7ea7be6 to 7d32af4
Force-pushed from 7d32af4 to 6ab3aa6
Force-pushed from 6ab3aa6 to 66c9f49
In chatting with @tillrohrmann this morning, we figured that it's probably better to park this PR for now until we have a better idea about how the bootstrap process will fit in with the cluster control plane overall. This was useful to demo that restoring partition stores works, but it's likely not the long-term experience we want.
Force-pushed from 66c9f49 to 5e9785e
Force-pushed from 5e9785e to 5c6fc5f
Closing this for now; will reopen with a clearer picture of how the CP will manage these once we get back to worker bootstrap.
Stacked PRs:
- Add a Partition Store snapshot restore policy #1999