Expose a partition worker CreateSnapshot RPC #1998
Conversation
Testing notes

With the new admin RPC exposed, we can now request a snapshot of the worker's partition store on demand. The server writes the snapshot to db-snapshots relative to the base dir. The accompanying snapshot metadata can later be used to restore the column family to the same state.
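For context, here is a minimal sketch of the underlying mechanism, assuming plain rust-rocksdb; the actual partition-store code in this PR is structured differently, and write_snapshot, snapshot_dir and the checkpoint approach shown here are illustrative only:

use rocksdb::{checkpoint::Checkpoint, DB};

// Illustrative sketch only: materialize a consistent on-disk snapshot of a
// RocksDB instance. SST files are hard-linked where possible, so this is
// cheap even for large stores; separate metadata then records which state
// (e.g. the applied LSN) the snapshot captures.
fn write_snapshot(db: &DB, snapshot_dir: &str) -> Result<(), rocksdb::Error> {
    let checkpoint = Checkpoint::new(db)?;
    checkpoint.create_checkpoint(snapshot_dir)
}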
type MessageType = CreateSnapshotRequest;

async fn on_message(&self, msg: Incoming<Self::MessageType>) {
    info!("Received '{:?}' from {}", msg.body(), msg.peer());
Intentional? If so, does this need to be at INFO level?
It was, but I'm happy to turn it down or eliminate it altogether. It seemed like something that happens infrequently enough that it wouldn't hurt to log, but perhaps debug is more appropriate here.
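For illustration, this is roughly what turning it down would look like; the handler shape mirrors the snippet quoted above, and everything else about the surrounding type is assumed:

use tracing::debug;

// Sketch only: same handler as above, but logging at DEBUG instead of INFO,
// since snapshot requests are infrequent and mostly useful for diagnostics.
async fn on_message(&self, msg: Incoming<Self::MessageType>) {
    debug!("Received '{:?}' from {}", msg.body(), msg.peer());
    // ... create the snapshot and reply to the caller
}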
Thanks for creating this PR @pcholakov. The changes look good to me. I think if we remove the general-purpose RPC timeout from the scope of this PR, it is good to get merged.
    node_id: GenerationalNodeId,
    partition_id: PartitionId,
) -> anyhow::Result<SnapshotId> {
    let snapshot_timeout = self.networking_options.rpc_call_timeout.as_ref();
Wondering whether the generic rpc call timeout is a good fit for the snapshot timeout. If the snapshot is large, then uploading it to S3 will probably take a bit of time.
Initially, I had the idea that the scope of the RPC is purely to produce the snapshot and write it to the filesystem, with a completely separate feedback mechanism for communicating the "offloaded" snapshot LSN back to (possibly) a metadata store entry. But I haven't given this a lot of thought and we might well prefer for the upload to happen as part of CreateSnapshot. In any event, I have already made the change to eliminate the generic timeout - will push it shortly.
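As a rough sketch of that direction, a snapshot-specific deadline (or none at all) could be applied at the call site instead of inheriting the general-purpose rpc_call_timeout. The send_create_snapshot_rpc helper below is hypothetical, standing in for the real networking call:

use std::time::Duration;
use tokio::time::timeout;

// Hypothetical stand-in for the actual networking call in the codebase.
async fn send_create_snapshot_rpc() -> anyhow::Result<()> {
    Ok(())
}

// Sketch only: wrap the RPC in a deadline chosen by the caller rather than
// the generic rpc_call_timeout, since producing (and possibly uploading) a
// large snapshot can take far longer than a normal RPC round trip.
async fn request_snapshot(deadline: Duration) -> anyhow::Result<()> {
    timeout(deadline, send_create_snapshot_rpc())
        .await
        .map_err(|_| anyhow::anyhow!("timed out waiting for snapshot"))?
}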
PARTITION_CREATE_SNAPSHOT_REQUEST = 42;
PARTITION_CREATE_SNAPSHOT_RESPONSE = 43;
@AhmedSoliman I see you left some gaps elsewhere - would you like me to start these in a new range, e.g. 50+? I think we'll have more RPCs for managing worker nodes that could be grouped together.
Sounds like a good idea.
stack-info: PR: #1998, branch: pcholakov/stack/1
Stacked PRs:
Expose a partition worker CreateSnapshot RPC