[internal] Add a benchmark for Snapshot capture #12569

Merged: 3 commits, Aug 14, 2021
2 changes: 1 addition & 1 deletion .github/workflows/test-cron.yaml
@@ -153,7 +153,7 @@ jobs:

./cargo test --all --tests -- --nocapture

-./cargo build --benches
+./cargo check --benches

'
strategy:
2 changes: 1 addition & 1 deletion .github/workflows/test.yaml
@@ -153,7 +153,7 @@ jobs:

./cargo test --all --tests -- --nocapture

-./cargo build --benches
+./cargo check --benches

'
strategy:
2 changes: 1 addition & 1 deletion build-support/bin/generate_github_workflows.py
@@ -303,7 +303,7 @@ def test_workflow_jobs(python_versions: list[str], *, cron: bool) -> Jobs:
sudo apt-get install -y pkg-config fuse libfuse-dev
./build-support/bin/check_rust_pre_commit.sh
./cargo test --all --tests -- --nocapture
-./cargo build --benches
+./cargo check --benches
"""
),
"if": DONT_SKIP_RUST,
166 changes: 121 additions & 45 deletions src/rust/engine/fs/store/benches/store.rs
@@ -30,21 +30,22 @@
use criterion::{criterion_group, criterion_main, Criterion};

use std::collections::HashSet;
-use std::fs::File;
-use std::io::{BufRead, BufReader};
+use std::io::{BufRead, BufReader, BufWriter, Write};
use std::os::unix::ffi::OsStrExt;
use std::path::PathBuf;
use std::sync::Arc;
use std::time::Duration;

use bazel_protos::gen::build::bazel::remote::execution::v2 as remexec;
use bytes::Bytes;
-use fs::{GlobExpansionConjunction, PreparedPathGlobs, RelativePath, StrictGlobMatching};
-use futures::future;
+use fs::{
+File, GitignoreStyleExcludes, GlobExpansionConjunction, PathStat, PosixFS, PreparedPathGlobs,
+StrictGlobMatching,
+};
use hashing::{Digest, EMPTY_DIGEST};
use task_executor::Executor;
use tempfile::TempDir;

-use store::{SnapshotOps, Store, SubsetParams};
+use store::{OneOffStoreFileByDigest, Snapshot, SnapshotOps, Store, SubsetParams};

pub fn criterion_benchmark_materialize(c: &mut Criterion) {
// Create an executor, store containing the stuff to materialize, and a digest for the stuff.
@@ -60,7 +60,7 @@ pub fn criterion_benchmark_materialize(c: &mut Criterion) {
let parent_dest_path = parent_dest.path();
cgroup
.sample_size(10)
-.measurement_time(Duration::from_secs(60))
+.measurement_time(Duration::from_secs(30))
.bench_function(format!("materialize_directory({}, {})", count, size), |b| {
b.iter(|| {
// NB: We forget this child tempdir to avoid deleting things during the run.
@@ -75,6 +76,54 @@ pub fn criterion_benchmark_materialize(c: &mut Criterion) {
}
}

///
/// NB: More accurately, this benchmarks `Snapshot::digest_from_path_stats`, which avoids
/// filesystem traversal overheads and focuses on digesting/capturing.
///
pub fn criterion_benchmark_snapshot_capture(c: &mut Criterion) {
let executor = Executor::global(num_cpus::get(), num_cpus::get() * 4).unwrap();

let mut cgroup = c.benchmark_group("snapshot_capture");

// The number of files, file size, whether the inputs should be assumed to be immutable, and the
// number of times to capture (only the first capture actually stores anything: the rest should
// ignore the duplicated data.)
for params in vec![
(100, 100, false, 100),
(20, 10_000_000, true, 10),
(1, 200_000_000, true, 10),
] {
let (count, size, _immutable, captures) = params;
[Member Author]

The _immutable flag becomes relevant in #12563.

[Contributor]

I'm not sure I understand the point of captures > 1? If we expect them to be no-ops, why are we benchmarking them?

[Member Author]

Anytime we Snapshot the filesystem, if we discover that we've already captured a file in the LMDB store, we'll skip re-storing it. But that still involves hashing it to determine that we don't need to store it again. So it's a noop in terms of storage, but not in terms of hashing: LMDB::put will skip storing the value, and in #12563, LMDB::reserve will fail with an "exists" error and prevent the second pass over the data.

The captures ratio here is supposed to align with use cases (I'll add a comment in #12563):

  • (100, 100, false, 100) is supposed to represent Snapshotting the workspace: lots of small hand-written files that change relatively infrequently (100 captures is 1 write and 99 noops)
  • (20, 10_000_000, true, 10) and (1, 200_000_000, true, 10) both represent capturing out of local process execution sandboxes: immutable data that is more unique, so more writes and fewer noops.
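The "noop in storage, but not in hashing" behavior described above can be illustrated with a toy sketch (std-only; `ToyStore`, the `DefaultHasher`-based digest, and the counters are all illustrative stand-ins for the real LMDB-backed store, not its API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// A toy content-addressed store: every `put` hashes its payload, but only
/// the first `put` for a given digest actually performs a write.
struct ToyStore {
    by_digest: HashMap<u64, Vec<u8>>,
    writes: usize,
    hashes: usize,
}

impl ToyStore {
    fn new() -> Self {
        ToyStore { by_digest: HashMap::new(), writes: 0, hashes: 0 }
    }

    fn put(&mut self, content: &[u8]) -> u64 {
        // Hashing happens on every capture, whether or not we end up storing.
        let mut hasher = DefaultHasher::new();
        content.hash(&mut hasher);
        let digest = hasher.finish();
        self.hashes += 1;
        // The write itself is skipped when the digest is already present.
        if !self.by_digest.contains_key(&digest) {
            self.by_digest.insert(digest, content.to_vec());
            self.writes += 1;
        }
        digest
    }
}

fn main() {
    // 100 small "workspace" files captured 100 times: 100 writes total,
    // but 100 * 100 hashing passes over the data.
    let files: Vec<Vec<u8>> = (0..100).map(|i| format!("file-{}", i).into_bytes()).collect();
    let mut store = ToyStore::new();
    for _capture in 0..100 {
        for content in &files {
            store.put(content);
        }
    }
    println!("writes={} hashes={}", store.writes, store.hashes);
    // prints: writes=100 hashes=10000
}
```

This is why the (100, 100, false, 100) case still measures meaningful work even though 99 of the 100 captures store nothing.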

let storedir = TempDir::new().unwrap();
let store = Store::local_only(executor.clone(), storedir.path()).unwrap();
let (tempdir, path_stats) = tempdir_containing(count, size);
let posix_fs = Arc::new(
PosixFS::new(
tempdir.path(),
GitignoreStyleExcludes::empty(),
executor.clone(),
)
.unwrap(),
);
cgroup
.sample_size(10)
.measurement_time(Duration::from_secs(30))
.bench_function(format!("snapshot_capture({:?})", params), |b| {
b.iter(|| {
for _ in 0..captures {
let _ = executor
.block_on(Snapshot::digest_from_path_stats(
store.clone(),
OneOffStoreFileByDigest::new(store.clone(), posix_fs.clone()),
path_stats.clone(),
))
.unwrap();
}
})
});
}
}

pub fn criterion_benchmark_subset_wildcard(c: &mut Criterion) {
let executor = Executor::global(num_cpus::get(), num_cpus::get() * 4).unwrap();
// NB: We use a much larger snapshot size compared to the materialize benchmark!
@@ -178,22 +227,19 @@ pub fn criterion_benchmark_merge(c: &mut Criterion) {
criterion_group!(
benches,
criterion_benchmark_materialize,
criterion_benchmark_snapshot_capture,
criterion_benchmark_subset_wildcard,
criterion_benchmark_merge
);
criterion_main!(benches);

///
-/// Returns a Store (and the TempDir it is stored in) and a Digest for a nested directory
-/// containing the given number of files, each with roughly the given size.
+/// Creates and returns a TempDir containing the given number of files, each approximately of size
+/// file_target_size.
///
-pub fn snapshot(
-executor: &Executor,
-max_files: usize,
-file_target_size: usize,
-) -> (Store, TempDir, Digest) {
+fn tempdir_containing(max_files: usize, file_target_size: usize) -> (TempDir, Vec<PathStat>) {
let henries_lines = {
-let f = File::open(
+let f = std::fs::File::open(
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("testdata")
.join("all_the_henries"),
@@ -202,7 +248,8 @@ pub fn snapshot(
BufReader::new(f).lines()
};

-let henries_paths = henries_lines
+let mut produced = HashSet::new();
+let paths = henries_lines
.filter_map(|line| {
// Clean up to lowercase ascii.
let clean_line = line
@@ -222,49 +269,78 @@

// NB: Split the line by whitespace, then accumulate a PathBuf using each word as a path
// component!
-let path_buf = clean_line.split_whitespace().collect::<PathBuf>();
+let mut path_buf = clean_line.split_whitespace().collect::<PathBuf>();
// Drop empty or too-long candidates.
let components_too_long = path_buf.components().any(|c| c.as_os_str().len() > 255);
if components_too_long || path_buf.as_os_str().is_empty() || path_buf.as_os_str().len() > 512
{
None
} else {
-Some(path_buf)
+// We add an extension to files to avoid collisions with directories (which are created
+// implicitly based on leading components).
+path_buf.set_extension("txt");
+Some(PathStat::file(
+path_buf.clone(),
+File {
+path: path_buf,
+is_executable: false,
+},
+))
}
})
-.take(max_files);
+.filter(move |path| produced.insert(path.clone()))
+.take(max_files)
+.collect::<Vec<_>>();

let tempdir = TempDir::new().unwrap();
for path in &paths {
// We use the (repeated) path as the content as well.
let abs_path = tempdir.path().join(path.path());
if let Some(parent) = abs_path.parent() {
fs::safe_create_dir_all(parent).unwrap();
}
let mut f = BufWriter::new(std::fs::File::create(abs_path).unwrap());
let bytes = path.path().as_os_str().as_bytes();
let lines_to_write = file_target_size / bytes.len();
for _ in 0..lines_to_write {
f.write_all(bytes).unwrap();
f.write_all(b"\n").unwrap();
}
}
(tempdir, paths)
}
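The line-to-path sanitization that `tempdir_containing` applies to the Henry VI text can be exercised in isolation. The sketch below is a loose, std-only approximation: the `line_to_path` helper is hypothetical, plain `PathBuf`s stand in for the `fs` crate's `PathStat`/`File` types, and the exact character-cleaning step is partially elided in the diff above, so a simple "non-alphanumeric becomes a separator" rule is assumed here.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Turn a line of prose into a relative file path: lowercased ASCII words become
/// path components, and a `.txt` extension is appended so file paths cannot
/// collide with the directories implied by leading components.
fn line_to_path(line: &str) -> Option<PathBuf> {
    // Assumed cleanup rule: lowercase, and map anything non-alphanumeric to a
    // word separator.
    let clean: String = line
        .chars()
        .map(|c| c.to_ascii_lowercase())
        .map(|c| if c.is_ascii_alphanumeric() { c } else { ' ' })
        .collect();
    let mut path: PathBuf = clean.split_whitespace().collect();
    // Drop empty or too-long candidates, mirroring the benchmark's filters.
    let components_too_long = path.components().any(|c| c.as_os_str().len() > 255);
    if components_too_long || path.as_os_str().is_empty() || path.as_os_str().len() > 512 {
        None
    } else {
        path.set_extension("txt");
        Some(path)
    }
}

fn main() {
    let lines = [
        "To be, or not to be",
        "To be, or not to be!", // cleans to a duplicate path; deduplicated below
        "",                     // empty: filtered out
        "All the world's a stage",
    ];
    // Like the `produced` set above, keep only the first occurrence of a path.
    let mut seen = HashSet::new();
    let paths: Vec<PathBuf> = lines
        .iter()
        .filter_map(|line| line_to_path(line))
        .filter(|p| seen.insert(p.clone()))
        .collect();
    for p in &paths {
        println!("{}", p.display());
    }
}
```

Deduplication matters because two distinct lines can sanitize to the same path, and creating the same file twice would skew both the file count and the captured digest.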

///
/// Returns a Store (and the TempDir it is stored in) and a Digest for a nested directory
/// containing the given number of files, each with roughly the given size.
///
fn snapshot(
executor: &Executor,
max_files: usize,
file_target_size: usize,
) -> (Store, TempDir, Digest) {
// NB: We create the files in a tempdir rather than in memory in order to allow for more
// realistic benchmarking involving large files. The tempdir is dropped at the end of this method
// (after everything has been captured out of it).
let (tempdir, path_stats) = tempdir_containing(max_files, file_target_size);
let storedir = TempDir::new().unwrap();
let store = Store::local_only(executor.clone(), storedir.path()).unwrap();

let store2 = store.clone();
-let digests = henries_paths
-.map(|mut path| {
-let store = store2.clone();
-async move {
-// We use the (repeated) path as the content as well.
-let content: Bytes = {
-let base = Bytes::copy_from_slice(path.as_os_str().as_bytes());
-base.repeat(file_target_size / base.len()).into()
-};
-// We add an extension to files to avoid collisions with directories (which are created
-// implicitly based on leading components).
-path.set_extension("txt");
-let relpath = RelativePath::new(path)?;
-let digest = store.store_file_bytes(content, true).await?;
-let snapshot = store.snapshot_of_one_file(relpath, digest, false).await?;
-let res: Result<_, String> = Ok(snapshot.digest);
-res
-}
-})
-.collect::<Vec<_>>();

let digest = executor
-.block_on({
-async move {
-let digests = future::try_join_all(digests).await?;
-store2.merge(digests).await
-}
+.block_on(async move {
+let posix_fs = PosixFS::new(
+tempdir.path(),
+GitignoreStyleExcludes::empty(),
+executor.clone(),
+)
+.unwrap();
+Snapshot::digest_from_path_stats(
+store2.clone(),
+OneOffStoreFileByDigest::new(store2, Arc::new(posix_fs)),
+path_stats,
+)
+.await
})
.unwrap();
