Implement partition store restore-from-snapshot #2353
Conversation
Thanks for creating this PR @pcholakov. The changes look really nice! I had a few minor questions. It would be great to add the streaming write before merging. Once this is resolved, +1 for merging :-)
));
let file_path = snapshot_dir.path().join(filename);
let file_data = self.object_store.get(&key).await?;
tokio::fs::write(&file_path, file_data.bytes().await?).await?;
Yes, it would indeed be great to write the file in streaming fashion to disk. Especially once our SSTs grow.
Maybe something like
let mut file_data = self.object_store.get(&key).await?.into_stream();
let mut snapshot_file = tokio::fs::File::create_new(&file_path).await?;
while let Some(data) = file_data.next().await {
    snapshot_file.write_all(&data?).await?;
}
can already be enough. Do you know how large the chunks of the stream returned by self.object_store.get(&key).await?.into_stream() will be?
let partition_store = if !partition_store_manager
    .has_partition_store(pp_builder.partition_id)
    .await
Out of scope for this PR: what is the plan for handling a PP that has some data but is lagging too far behind, so that starting the PP would result in a trim gap? Would we then drop the respective column family and restart it?
I haven't tracked down all the places that would need to be updated yet but I believe we can shut down the PP, drop the existing CF (or just move it out of the way), and follow the bootstrap path from there.
/// Discover and download the latest snapshot available. Dropping the returned
/// `LocalPartitionSnapshot` will delete the local snapshot data files.
Is it because the files are stored in a temp directory? On LocalPartitionSnapshot itself I couldn't find how the files are deleted when dropping it.
Is the temp dir also the mechanism to clean things up if downloading it failed?
It seems that TempDir::with_prefix_in takes care of it, since it deletes the files when it gets dropped. This is a nice solution!
This is exactly right! I'm not 100% in love with it - it works, but it's a bit magical, as deletion happens implicitly when LocalPartitionSnapshot is dropped, and that could be quite far removed from the find_latest call. Something that's hard to do with this approach is to retain the snapshot files if importing them fails, which could be useful for troubleshooting.
The description seems off though. If I understand the code correctly, then dropping LocalPartitionSnapshot won't delete the files. What will happen via TempDir is that if an error occurs before we call snapshot_dir.into_path(), then the snapshot_dir will be deleted. Afterwards it's the responsibility of whoever owns LocalPartitionSnapshot to clean things up.
I believe the description is accurate :-) Ownership of the TempDir moves into the LocalPartitionSnapshot instance when we construct it with base_dir: snapshot_dir.into_path(). Dropping the LocalPartitionSnapshot thus drops the TempDir.
Did you verify that this is really how things work? According to the Rust docs of TempDir::into_path, it reads a bit differently:

Persist the temporary directory to disk, returning the PathBuf where it is located. This consumes the TempDir without deleting directory on the filesystem, meaning that the directory will no longer be automatically deleted.

What into_path() returns is a PathBuf, and PathBuf does not have the functionality to delete the file it points to once it gets dropped.
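For reference, the ownership semantics under discussion can be mimicked with a small std-only guard. This is a hypothetical stand-in for tempfile's TempDir, not the crate's actual code: the directory is removed on drop unless into_path() is called, which disarms the cleanup exactly as the docs quoted above describe.

```rust
use std::fs;
use std::path::PathBuf;

/// Minimal stand-in for tempfile::TempDir: removes the directory on drop
/// unless ownership of the path is taken via into_path().
struct DirGuard {
    path: Option<PathBuf>,
}

impl DirGuard {
    fn new(path: PathBuf) -> std::io::Result<Self> {
        fs::create_dir_all(&path)?;
        Ok(Self { path: Some(path) })
    }

    /// Disarms the guard: the directory is persisted and the caller
    /// becomes responsible for cleanup, mirroring TempDir::into_path().
    fn into_path(mut self) -> PathBuf {
        self.path.take().expect("path present until consumed")
    }
}

impl Drop for DirGuard {
    fn drop(&mut self) {
        if let Some(path) = self.path.take() {
            let _ = fs::remove_dir_all(path);
        }
    }
}

fn main() -> std::io::Result<()> {
    let base = std::env::temp_dir();

    // Dropped guard: the directory is cleaned up automatically.
    let dropped = base.join("guard-demo-dropped");
    drop(DirGuard::new(dropped.clone())?);
    assert!(!dropped.exists());

    // Consumed guard: the directory survives; the caller must clean up.
    let kept = base.join("guard-demo-kept");
    let path = DirGuard::new(kept.clone())?.into_path();
    assert!(path.exists());
    fs::remove_dir_all(path)?;
    Ok(())
}
```

Once into_path() runs, nothing deletes the directory automatically - which matches the point made about who owns cleanup after LocalPartitionSnapshot is constructed.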
"Found snapshot to bootstrap partition, restoring it",
);
partition_store_manager
    .open_partition_store_from_snapshot(
In restate/crates/rocksdb/src/rock_access.rs, line 156 (at 531b987): fn import_cf(
RocksDB will create hard links if possible - so no real I/O happens as long as the snapshot files and the DB are on the same filesystem. I kept it this way because it was useful for my initial testing, but I think it would be better to set move_files = true now. In the future, we may add a config option to retain the snapshot files on import/export, and even then maybe only if an error occurs.
Yeah, moving by default and copying if explicitly configured sounds like the right setting.
Force-pushed defc6ee to 7291ede
Force-pushed 7291ede to ff6d9ce
Force-pushed 40e74fd to aceb786
Next revision is up and ready for review:
Thanks @pcholakov for creating this PR. I left a few comments below. I hope they help!
let _permit = concurrency_limiter.acquire().await?;
let mut file_data = StreamReader::new(object_store.get(&key).await?.into_stream());
let mut snapshot_file = tokio::fs::File::create_new(&file_path).await?;
let size = io::copy(&mut file_data, &mut snapshot_file).await?;
Maybe add a sanity check that the downloaded file size matches what is expected from the snapshot metadata?
I want to do another pass and add some kind of checksum to the metadata file, but a size sanity check is a good and cheap start! Thanks!
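Both checks can come from a single pass over the stream: hash and count bytes as each chunk arrives. A std-only sketch under stated assumptions - the hasher here is DefaultHasher purely for self-containment; a real checksum would use something like CRC32C or SHA-256 from an external crate, and the function name is invented:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Folds streamed chunks into a running size and a digest in one pass.
/// Returns the digest if the total size matches the metadata's expectation.
fn verify_stream<'a, I: IntoIterator<Item = &'a [u8]>>(
    chunks: I,
    expected_size: u64,
) -> Result<u64, String> {
    let mut hasher = DefaultHasher::new();
    let mut size = 0u64;
    for chunk in chunks {
        chunk.hash(&mut hasher);
        size += chunk.len() as u64;
    }
    if size != expected_size {
        return Err(format!("size mismatch: got {size}, expected {expected_size}"));
    }
    Ok(hasher.finish())
}

fn main() {
    let chunks: Vec<&[u8]> = vec![b"hello ", b"world"];
    // 11 bytes total: the size check passes.
    assert!(verify_stream(chunks.clone(), 11).is_ok());
    // A truncated download is caught by the size comparison.
    assert!(verify_stream(chunks, 10).is_err());
}
```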
downloads.abort_all();
return Err(e.into());
}
Some(Ok(_)) => {}
Is it intentional to not handle errors returned by the download routine? The _ here is actually an anyhow::Result, which itself can be an error. Check the suggestion below.
I think you are right and we need to handle the inner error case as well. Right now, we might accept an incomplete snapshot as complete if any of the file downloads fails with an error and not a panic.
We probably also want to include in the error message which file failed the download process.
This is a pretty big miss, thanks for catching it!
Updated, let me know how you like it! I might have overdone the error handling a bit 😅
loop {
    match downloads.join_next().await {
        None => {
            break;
        }
        Some(Err(e)) => {
            downloads.abort_all();
            return Err(e.into());
        }
        Some(Ok(_)) => {}
    }
}
Suggested change:

while let Some(result) = downloads.join_next().await {
    match result {
        Err(err) => {
            // join error
        }
        Ok(Err(err)) => {
            // anyhow error
        }
        Ok(Ok(_)) => {
            // download succeeded
        }
    }
}
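The two error layers being distinguished here (join failure vs. the task's own Result) exist with plain std threads as well, where JoinHandle::join() wraps the task's return value in an outer Result. A self-contained, synchronous sketch of draining all handles and surfacing both layers (the payloads are illustrative):

```rust
use std::thread;

fn main() {
    // Each "download" returns Result<usize, String>; join() adds an outer
    // layer that fails only if the task itself panicked.
    let handles: Vec<thread::JoinHandle<Result<usize, String>>> = vec![
        thread::spawn(|| Ok(1024)),
        thread::spawn(|| Err("sst-0007.sst: connection reset".to_string())),
    ];

    let mut failures = Vec::new();
    for handle in handles {
        match handle.join() {
            Err(_) => failures.push("task panicked".to_string()), // outer: join error
            Ok(Err(e)) => failures.push(e),                       // inner: download error
            Ok(Ok(_bytes)) => {}                                  // download succeeded
        }
    }

    // Only if no failures were collected can the snapshot be considered
    // fully downloaded; matching Ok(_) alone would mask the inner error.
    assert_eq!(failures, vec!["sst-0007.sst: connection reset".to_string()]);
}
```

Collecting the failing file names (as suggested below) also gives the error message the context of which download failed.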
Thanks for updating this PR @pcholakov. The changes look good to me. The one thing we need to fix is the handling of failed downloads, as pointed out by Azmy. Otherwise we might consider a snapshot completely downloaded while some SSTs are missing.
if snapshot.key_range.start() > partition_key_range.start()
    || snapshot.key_range.end() < partition_key_range.end()
{
    warn!(
        %partition_id,
        snapshot_range = ?snapshot.key_range,
        partition_range = ?partition_key_range,
        "The snapshot key range does not fully cover the partition key range"
    );
    return Err(RocksError::SnapshotKeyRangeMismatch);
}
That's a good check :-)
};

let latest: LatestSnapshot =
    serde_json::from_slice(latest.bytes().await?.iter().as_slice())?;
What does iter().as_slice() do? Would &latest.bytes().await? work instead?
trace!("Latest snapshot metadata: {:?}", latest);

let snapshot_metadata_path = object_store::path::Path::from(format!(
    "{prefix}{partition_id}/{path}/metadata.json",
These paths are probably used on both the write and read paths. Should we share them through a function? That would make it easier to keep them in sync between the two paths.
Are you gonna unify these paths once #2310 gets merged?
I didn't get to this unfortunately - let me track this separately as #2389. I don't want to do any last minute fixes right now for fear of breaking things, and it would be great to merge this into main so it doesn't rot sitting in PR for another week 😅
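The shared-function factoring could be as small as a single helper that both the writer and the reader call; a hypothetical sketch (function and parameter names are invented, not what #2389 will necessarily use):

```rust
/// Builds the object-store key for a snapshot's metadata document, so the
/// write and read paths can never drift apart.
fn snapshot_metadata_key(prefix: &str, partition_id: u64, snapshot_path: &str) -> String {
    format!("{prefix}{partition_id}/{snapshot_path}/metadata.json")
}

fn main() {
    assert_eq!(
        snapshot_metadata_key("snapshots/", 0, "snap_0001"),
        "snapshots/0/snap_0001/metadata.json"
    );
}
```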
info!("Latest snapshot points to a snapshot that was not found in the repository!");
return Ok(None); // arguably this could also be an error
I am wondering whether this does not denote a "corruption" of our snapshots and therefore might even warrant a panic? I mean, we might still be lucky and not encounter a trim gap, because a) we haven't trimmed yet, or b) our applied index is still after the trim point. So I guess this might have been the motivation to return None here? This is actually more resilient than panicking in some cases. The downside is that we might be stuck in a retry loop if we are encountering a trim gap. Maybe raise the log level to warn so that this becomes more visible?
Thanks for flagging this! I haven't thought about this path extensively; let me make it WARN for now, and I'll revisit what's the best way to behave here when I implement trim gap handling. I would be inclined to use some combination of retry-with-backoff while posting a metric that we can't make progress with this partition.
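The retry-with-backoff idea could look roughly like this: a std-only, synchronous sketch with invented names, where a real version would be async and would also emit a "partition stuck" metric on each failed attempt:

```rust
use std::thread;
use std::time::Duration;

/// Retries `op` with exponential backoff, capping the per-attempt delay.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(10);
    let mut last_err = None;
    for attempt in 0..max_attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                // A real implementation would record a metric here so a
                // partition that cannot make progress becomes visible.
                last_err = Some(e);
                if attempt + 1 < max_attempts {
                    thread::sleep(delay);
                    delay = (delay * 2).min(Duration::from_secs(1));
                }
            }
        }
    }
    Err(last_err.expect("at least one attempt was made"))
}

fn main() {
    let mut calls = 0;
    // Succeeds on the third attempt, simulating a transiently missing snapshot.
    let result = retry_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("snapshot not found") } else { Ok(calls) }
        },
        5,
    );
    assert_eq!(result, Ok(3));
}
```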
pub(crate) async fn get_latest(
    &self,
    partition_id: PartitionId,
) -> anyhow::Result<Option<LocalPartitionSnapshot>> {
Maybe something for the future: it feels as if callers might be interested in why get_latest failed. I could imagine that different errors are handled differently (e.g. retried because the connection to S3 failed vs. an unsupported snapshot format). So anyhow::Result (while convenient) might not be the perfect fit here.
Definitely, as soon as there is any difference in how they're handled, this should become a properly typed error.
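Once callers need to distinguish cases, the anyhow::Result could give way to an enum along these lines. This is a std-only sketch with invented variant names (a real version would likely derive the boilerplate via thiserror):

```rust
use std::fmt;

/// Distinguishes retryable transport failures from permanent ones -
/// exactly the distinction that anyhow::Error erases.
#[derive(Debug)]
enum GetLatestError {
    RepositoryUnreachable(String),
    UnsupportedSnapshotFormat(u32),
}

impl fmt::Display for GetLatestError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::RepositoryUnreachable(cause) => write!(f, "repository unreachable: {cause}"),
            Self::UnsupportedSnapshotFormat(v) => {
                write!(f, "unsupported snapshot format version {v}")
            }
        }
    }
}

impl std::error::Error for GetLatestError {}

impl GetLatestError {
    /// Lets the caller decide between backing off and failing fast.
    fn is_retryable(&self) -> bool {
        matches!(self, Self::RepositoryUnreachable(_))
    }
}

fn main() {
    let transient = GetLatestError::RepositoryUnreachable("connection refused".into());
    assert!(transient.is_retryable());
    assert!(!GetLatestError::UnsupportedSnapshotFormat(9).is_retryable());
}
```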
let _permit = concurrency_limiter.acquire().await?;
let mut file_data = StreamReader::new(object_store.get(&key).await?.into_stream());
let mut snapshot_file = tokio::fs::File::create_new(&file_path).await?;
let size = io::copy(&mut file_data, &mut snapshot_file).await?;
I like this solution for copying the file in streaming fashion :-)
downloads.abort_all();
return Err(e.into());
While I think it does not make a difference right now for correctness, I would still recommend draining downloads after aborting all, because abort_all does not guarantee that tasks have completely finished (e.g. if something calls spawn_blocking).
Force-pushed ff6d9ce to 7a0242b
Force-pushed 05cf7ea to 1b817bb
Hey folks! I haven't addressed the snapshot key factoring suggestion (which I strongly agree with, but I ran out of time!) but everything else should be covered. PTAL when you have a chance 😊 @tillrohrmann @muhamadazmy
Thanks for creating this PR @pcholakov. The changes look good to me :-) I left a few minor comments which you could resolve before merging.
@@ -163,7 +165,7 @@ impl RocksAccess for rocksdb::DB {
     let options = prepare_cf_options(&cf_patterns, default_cf_options, &name)?;

     let mut import_opts = ImportColumnFamilyOptions::default();
-    import_opts.set_move_files(false); // keep the snapshot files intact
+    import_opts.set_move_files(true);
Nice :-)
panic!("Snapshot does not match the cluster name of latest snapshot at destination in snapshot id {}! Expected: cluster name=\"{}\", found: \"{}\"",
    snapshot_metadata.snapshot_id,
    self.cluster_name,
    snapshot_metadata.cluster_name);
This might become a problem in the future once we want to start a new cluster from a snapshot of another cluster.
Yeah, definitely! But then we should really be explicit about which cluster we are importing from, once we add that path.
let size = io::copy(&mut file_data, &mut snapshot_file)
    .await
    .map_err(|e| anyhow!("Failed to download snapshot file {:?}: {}", key, e))?;
if size != expected_size as u64 {
nit: I don't think that this is a problem here, but a good practice is imo to use u64::try_from(expected_size).expect("usize to fit into u64"). That way we do observe overflows.
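For reference, the lossless-conversion shape being recommended, contrasted with the as cast:

```rust
fn main() {
    let expected_size: usize = 4096;

    // `as u64` would silently wrap if the conversion were ever lossy;
    // try_from fails loudly instead (on 64-bit targets usize -> u64
    // always fits, so the expect documents the assumption).
    let expected = u64::try_from(expected_size).expect("usize to fit into u64");
    assert_eq!(expected, 4096u64);

    // The same pattern in the narrowing direction shows why it matters:
    // here the conversion genuinely can fail.
    assert!(u8::try_from(300usize).is_err());
}
```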
let snapshot = if snapshot_repository.is_none() {
    debug!(
        partition_id = %partition_id,
        "No snapshot repository configured",
    );
    None
} else {
    debug!(
        partition_id = %partition_id,
        "Looking for partition snapshot from which to bootstrap partition store",
    );
    snapshot_repository.expect("is some").get_latest(partition_id).await?
};
nit: You could use if let Some(snapshot_repository) = snapshot_repository { } else { } instead of is_none and then expect.
I tried to get it to work this way initially but the code ended up being deeply nested and looked less readable to my eyes. I'll give this another pass when I come back to this.
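A self-contained, synchronous sketch of the if let shape being suggested, with Option<&str> standing in for the repository handle and the debug! lines reduced to comments:

```rust
fn main() {
    let snapshot_repository: Option<&str> = Some("s3://snapshots");

    // `if let Some(..)` binds the value directly, avoiding the
    // is_none() check followed by expect("is some").
    let snapshot = if let Some(repo) = snapshot_repository {
        // "Looking for partition snapshot from which to bootstrap partition store"
        Some(format!("latest snapshot from {repo}"))
    } else {
        // "No snapshot repository configured"
        None
    };

    assert_eq!(
        snapshot,
        Some("latest snapshot from s3://snapshots".to_string())
    );
}
```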
Force-pushed 3a87e07 to aee1a12
Force-pushed 1b817bb to 69b4a9a
Thanks, Till! Punting on a couple of the comments so I can get this into main - rather not let it sit in PR for another week. I really appreciate your feedback throughout this cycle!
With this change, Partition Processor startup now checks the snapshot repository
for a partition snapshot before creating a blank store database. If a recent
snapshot is available, we will restore that instead of replaying the log from
the beginning.
This PR builds on #2310
Closes: #2000
Testing
Created a snapshot by running restatectl create-snapshot -p 0, then dropped the partition CF with rocksdb_ldb drop_column_family --db=./restate-data/.../db data-0.
Running restate-server correctly restores the most recent available snapshot: