
Add a Partition Store snapshot restore policy #1999

Closed
pcholakov wants to merge 2 commits into main from pcholakov/stack/2

Conversation

pcholakov
Contributor

@pcholakov pcholakov commented Sep 27, 2024

pcholakov added a commit that referenced this pull request Sep 27, 2024
stack-info: PR: #1999, branch: pcholakov/stack/2
@pcholakov pcholakov changed the title Add a Partition Store snapshot restore policy [Snapshots] Add a Partition Store snapshot restore policy Sep 27, 2024

github-actions bot commented Sep 27, 2024

Test Results

  5 files ±0   5 suites ±0   2m 32s ⏱️ -20s
 45 tests ±0   44 ✅ -1   1 💤 +1   0 ❌ ±0
113 runs  -1   111 ✅ -3   2 💤 +2   0 ❌ ±0

Results for commit 5c6fc5f. ± Comparison against base commit d2ca091.

This pull request removes 2 tests and adds 2 tests. Note that renamed tests count towards both.

Removed:
dev.restate.sdktesting.tests.RunRetry ‑ withExhaustedAttempts(Client)
dev.restate.sdktesting.tests.RunRetry ‑ withSuccess(Client)

Added:
dev.restate.sdktesting.tests.AwaitTimeout ‑ timeout(Client)
dev.restate.sdktesting.tests.RawHandler ‑ rawHandler(Client)

♻️ This comment has been updated with latest results.

@pcholakov pcholakov changed the base branch from pcholakov/stack/1 to main September 27, 2024 16:36
pcholakov added a commit that referenced this pull request Sep 27, 2024
stack-info: PR: #1999, branch: pcholakov/stack/2
@pcholakov pcholakov changed the title [Snapshots] Add a Partition Store snapshot restore policy Add a Partition Store snapshot restore policy Sep 27, 2024
@pcholakov pcholakov changed the base branch from main to pcholakov/stack/1 September 27, 2024 16:37
@pcholakov
Contributor Author

Testing

With this change we introduce the ability to restore a snapshot when the partition store is empty. We can test this by creating a snapshot, dropping the partition's column family, and restarting restate-server with restore enabled.
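Start a restate-server with the restore policy enabled; this is the same config snippet shown again further down:

```toml
# Worker config used for this test: restore the partition store
# from a snapshot when it is initialized empty.
[worker]
snapshot-restore-policy = "on-init"
```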

Create snapshot:

> restatectl snapshots
Partition snapshots

Usage: restatectl snapshots [OPTIONS] <COMMAND>

Commands:
  create-snapshot  Create [aliases: create]
  help             Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose...                               Increase logging verbosity
  -q, --quiet...                                 Decrease logging verbosity
      --table-style <TABLE_STYLE>                Which table output style to use [default: compact] [possible values: compact, borders]
      --time-format <TIME_FORMAT>                [default: human] [possible values: human, iso8601, rfc2822]
  -y, --yes                                      Auto answer "yes" to confirmation prompts
      --connect-timeout <CONNECT_TIMEOUT>        Connection timeout for network calls, in milliseconds [default: 5000]
      --request-timeout <REQUEST_TIMEOUT>        Overall request timeout for network calls, in milliseconds [default: 13000]
      --cluster-controller <CLUSTER_CONTROLLER>  Cluster Controller host:port (e.g. http://localhost:5122/) [default: http://localhost:5122/]
  -h, --help                                     Print help (see more with '--help')
> restatectl snapshots create -p 1
Snapshot created: snap_12PclG04SN8eVSKYXCFgXx7

Server writes snapshot on-demand:

2024-09-26T07:31:49.261080Z INFO restate_admin::cluster_controller::service
  Create snapshot command received
    partition_id: PartitionId(1)
on rs:worker-0
2024-09-26T07:31:49.261133Z INFO restate_admin::cluster_controller::service
  Asking node to snapshot partition
    node_id: GenerationalNodeId(PlainNodeId(0), 3)
    partition_id: PartitionId(1)
on rs:worker-0
2024-09-26T07:31:49.261330Z INFO restate_worker::partition_processor_manager
  Received 'CreateSnapshotRequest { partition_id: PartitionId(1) }' from N0:3
on rs:worker-9
  in restate_core::network::connection_manager::network-reactor
    peer_node_id: N0:3
    protocol_version: 1
    task_id: 32
2024-09-26T07:31:49.264763Z INFO restate_worker::partition::snapshot_producer
  Partition snapshot written
    lsn: 3
    metadata: "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7/metadata.json"
on rt:pp-1

Sample metadata file: snap_12PclG04SN8eVSKYXCFgXx7/metadata.json

{
  "version": "V1",
  "cluster_name": "snap-test",
  "partition_id": 1,
  "node_name": "n1",
  "created_at": "2024-09-26T07:31:49.264522000Z",
  "snapshot_id": "snap_12PclG04SN8eVSKYXCFgXx7",
  "key_range": {
    "start": 9223372036854775808,
    "end": 18446744073709551615
  },
  "min_applied_lsn": 3,
  "db_comparator_name": "leveldb.BytewiseComparator",
  "files": [
    {
      "column_family_name": "",
      "name": "/000030.sst",
      "directory": "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7",
      "size": 1267,
      "level": 0,
      "start_key": "64650000000000000001010453454c46",
      "end_key": "667300000000000000010000000000000002",
      "smallest_seqno": 11,
      "largest_seqno": 12,
      "num_entries": 0,
      "num_deletions": 0
    },
    {
      "column_family_name": "",
      "name": "/000029.sst",
      "directory": "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7",
      "size": 1142,
      "level": 6,
      "start_key": "64650000000000000001010453454c46",
      "end_key": "667300000000000000010000000000000002",
      "smallest_seqno": 0,
      "largest_seqno": 0,
      "num_entries": 0,
      "num_deletions": 0
    }
  ]
}
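As an aside, the metadata file is plain JSON and easy to inspect programmatically. Here is a minimal sketch for reading it, with illustrative struct names that mirror the sample file above (these are not the actual restate types):

```rust
// Hypothetical structs mirroring the snapshot metadata JSON above;
// field names come from the sample file, not from restate's code.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct SnapshotMetadata {
    version: String,
    cluster_name: String,
    partition_id: u16,
    node_name: String,
    created_at: String,
    snapshot_id: String,
    key_range: KeyRange,
    min_applied_lsn: u64,
    db_comparator_name: String,
    files: Vec<SstFile>,
}

#[derive(Debug, Deserialize)]
struct KeyRange {
    start: u64, // the sample contains values above i64::MAX, hence u64
    end: u64,
}

#[derive(Debug, Deserialize)]
struct SstFile {
    column_family_name: String,
    name: String,
    directory: String,
    size: u64,
    level: u32,
    start_key: String,
    end_key: String,
    smallest_seqno: u64,
    largest_seqno: u64,
    num_entries: u64,
    num_deletions: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("metadata.json")?;
    let meta: SnapshotMetadata = serde_json::from_str(&raw)?;
    println!("snapshot {} at LSN {}", meta.snapshot_id, meta.min_applied_lsn);
    Ok(())
}
```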
> restatectl snapshots create -p 0

Optionally, we can also trim the log to prevent replay from Bifrost.

> restatectl logs trim -l 0 -t 1000

With Restate stopped, we drop the partition store:

> rocksdb_ldb drop_column_family --db=../test/n1/db data-0

Using this config:

[worker]
snapshot-restore-policy = "on-init"

When the Restate server comes back up, we can see that it successfully restores from the latest snapshot:

2024-09-27T15:39:27.704350Z INFO restate_partition_store::partition_store_manager
  Restoring partition from snapshot
    partition_id: PartitionId(0)
    snapshot_id: snap_16mzxFw4Ve8MPbfVRKOwBON
    lsn: Lsn(9636)
on rt:pp-0
2024-09-27T15:39:27.704415Z INFO restate_partition_store::partition_store_manager
  Initializing partition store from snapshot
    partition_id: PartitionId(0)
    min_applied_lsn: Lsn(9636)
on rt:pp-0
2024-09-27T15:39:27.717951Z INFO restate_worker::partition
  PartitionProcessor starting up.
on rt:pp-0
  in restate_worker::partition::run
    partition_id: 0
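As a quick sanity check, the same ldb tool used earlier to drop the column family can confirm that it was recreated by the restore (a hypothetical invocation, assuming the paths from above):

```
> rocksdb_ldb list_column_families --db=../test/n1/db
```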

@pcholakov pcholakov changed the base branch from pcholakov/stack/1 to main September 30, 2024 17:07
pcholakov added a commit that referenced this pull request Sep 30, 2024
stack-info: PR: #1999, branch: pcholakov/stack/2
@pcholakov pcholakov changed the base branch from main to pcholakov/stack/1 September 30, 2024 17:07
@AhmedSoliman
Contributor

I'm not sure how this ties into the bigger picture for partition store recovery, so maybe we should hide the configuration option until we have an end-to-end design specced out. The primary unanswered question is who makes the decision, and where the knowledge about the snapshot comes from: one option is for the cluster controller to pass this information down through the attachment plan; another is for the node to decide on its own, as you're proposing here.

I can see one fallback strategy which follows your proposal: if we don't have a local partition store, and we didn't get information about a snapshot to restore, then try to fetch one. But I guess we'll need to check the trim point of the log to figure out whether the snapshot we have is good enough before we commit to being a follower or leader.
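To make the decision order concrete, here is a rough sketch of that fallback; every name in it is hypothetical, not an actual restate API:

```rust
// Rough sketch of the fallback strategy described above.
// All names are hypothetical; the point is the decision order, not the types.

struct Snapshot {
    min_applied_lsn: u64,
}

enum StoreInit {
    OpenedLocal,                    // local partition store exists; no restore needed
    RestoredFromSnapshot(Snapshot), // snapshot is recent enough to restore from
    NeedsFullReplay,                // no usable snapshot; replay the log from scratch
}

fn init_partition_store(
    has_local_store: bool,
    attachment_snapshot: Option<Snapshot>, // named by the cluster controller, if any
    latest_snapshot: Option<Snapshot>,     // self-discovered latest snapshot, if any
    log_trim_point: u64,
) -> StoreInit {
    if has_local_store {
        return StoreInit::OpenedLocal;
    }
    // Prefer a snapshot handed down via the attachment plan; otherwise
    // fall back to whatever the node can discover itself.
    match attachment_snapshot.or(latest_snapshot) {
        // Usable only if the log still covers everything after the snapshot:
        // a trim point beyond min_applied_lsn would leave an unreplayable gap.
        Some(s) if s.min_applied_lsn >= log_trim_point => StoreInit::RestoredFromSnapshot(s),
        _ => StoreInit::NeedsFullReplay,
    }
}

fn main() {
    // Example: empty store, no attachment info, a snapshot at LSN 9636, log trimmed to 1000.
    let init = init_partition_store(false, None, Some(Snapshot { min_applied_lsn: 9636 }), 1000);
    assert!(matches!(init, StoreInit::RestoredFromSnapshot(_)));
}
```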

@pcholakov pcholakov changed the base branch from pcholakov/stack/1 to main October 4, 2024 08:00
pcholakov added a commit that referenced this pull request Oct 4, 2024
stack-info: PR: #1999, branch: pcholakov/stack/2
@pcholakov pcholakov changed the base branch from main to pcholakov/stack/1 October 4, 2024 08:00
@pcholakov pcholakov changed the base branch from pcholakov/stack/1 to main October 4, 2024 08:02
pcholakov added a commit that referenced this pull request Oct 4, 2024
stack-info: PR: #1999, branch: pcholakov/stack/2
@pcholakov pcholakov changed the base branch from main to pcholakov/stack/1 October 4, 2024 08:02
@pcholakov pcholakov changed the base branch from pcholakov/stack/1 to main October 4, 2024 08:07
pcholakov added a commit that referenced this pull request Oct 4, 2024
stack-info: PR: #1999, branch: pcholakov/stack/2
@pcholakov pcholakov changed the base branch from main to pcholakov/stack/1 October 4, 2024 08:07
@pcholakov
Contributor Author

After chatting with @tillrohrmann this morning, we figured it's probably better to park this PR for now, until we have a better idea of how the bootstrap process will fit in with the cluster control plane overall. This was useful to demo that restoring partition stores works, but it's likely not the long-term experience we want.

@pcholakov pcholakov marked this pull request as draft October 4, 2024 08:46
@pcholakov pcholakov changed the base branch from pcholakov/stack/1 to main October 11, 2024 09:11
pcholakov added a commit that referenced this pull request Oct 11, 2024
stack-info: PR: #1999, branch: pcholakov/stack/2
@pcholakov pcholakov changed the base branch from main to pcholakov/stack/1 October 11, 2024 09:12
pcholakov added a commit that referenced this pull request Oct 11, 2024
stack-info: PR: #1998, branch: pcholakov/stack/1
stack-info: PR: #1999, branch: pcholakov/stack/2
@pcholakov pcholakov changed the base branch from pcholakov/stack/1 to main October 11, 2024 09:17
@pcholakov pcholakov changed the base branch from main to pcholakov/stack/1 October 11, 2024 09:17
@pcholakov pcholakov changed the base branch from pcholakov/stack/1 to main October 11, 2024 09:18
@pcholakov pcholakov changed the base branch from main to pcholakov/stack/1 October 11, 2024 09:18
Base automatically changed from pcholakov/stack/1 to main October 11, 2024 09:41
@muhamadazmy muhamadazmy removed their request for review October 15, 2024 06:58
@pcholakov
Contributor Author

Closing this for now, will reopen with a clearer picture of how the CP will manage these once we get back to worker bootstrap.

@pcholakov pcholakov closed this Oct 15, 2024