Add -o option to all e2e benches #5658

Merged Mar 22, 2023 (7 commits)

@jaylmiller jaylmiller commented Mar 20, 2023

Which issue does this PR close?

Part of #5561

Rationale for this change

For e2e benchmarks, the TPCH bin has an option to output a machine-readable file, which can then be consumed by the script from PR #5655. It would be nice to be able to reuse this script for all bins in the e2e benches.

What changes are included in this PR?

This PR pulls out the existing logic from tpch.rs that (optionally) writes the run data to a machine-readable JSON file. That logic is then used in all the other benchmarks, adding a -o option to every bin in the e2e benchmarks dir.
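As a rough illustration of the "optional output file" behaviour described above, here is a dependency-free sketch. The names `output_path` and `maybe_write_json` are hypothetical, and the hand-rolled `-o` parsing is an assumption made to keep the example self-contained; none of it is taken from the PR's code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Looks for `-o <path>` in an argument list.
/// Hypothetical hand-rolled parser, used only to keep this sketch std-only.
fn output_path(args: &[String]) -> Option<PathBuf> {
    args.iter()
        .position(|a| a == "-o")
        .and_then(|i| args.get(i + 1))
        .map(PathBuf::from)
}

/// Writes already-serialized run data only when an output path was given.
/// Returns `Ok(true)` if a file was written and `Ok(false)` if it was skipped.
fn maybe_write_json(path: Option<&Path>, json: &str) -> io::Result<bool> {
    match path {
        Some(p) => {
            fs::write(p, json)?;
            Ok(true)
        }
        None => Ok(false),
    }
}
```

A benchmark bin could call `output_path` once at startup and pass the result to `maybe_write_json` after the last iteration, so runs without `-o` behave exactly as before.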

Are these changes tested?

Are there any user-facing changes?

let elapsed = start.elapsed().as_millis();

let elapsed = start.elapsed().as_secs_f64() * 1000.0;
let numrows = batches.iter().map(|b| b.num_rows()).sum::<usize>();
Contributor

Thanks @jaylmiller for looking into this.

I noticed that for other test cases you calculate numrows before elapsed, perhaps to prevent the numrows runtime from being part of the benchmark runtime.

Contributor Author

Yes! Good catch, thank you... that was a mistake on my part.

Contributor

Another thing I'm wondering: could calculating num rows trigger some system cache so that the benchmark runs faster? Although that would be unexpected.

Contributor

I think num_rows is pretty fast (it doesn't actually do any work, it just returns a field's value): https://docs.rs/arrow-array/35.0.0/src/arrow_array/record_batch.rs.html#278
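To make the ordering question in this thread concrete, here is a minimal sketch of one way to keep row counting out of the timed region: stop the clock first, then count. `Batch` is a hypothetical stand-in for an Arrow `RecordBatch`, and `time_query` is an illustrative helper, not a function from the PR.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for an Arrow `RecordBatch`; only the row count matters here.
struct Batch {
    rows: usize,
}

impl Batch {
    /// Returns a stored value, similar in spirit to `RecordBatch::num_rows`.
    fn num_rows(&self) -> usize {
        self.rows
    }
}

/// Runs `query`, captures the elapsed time, and only then counts rows.
fn time_query<F>(query: F) -> (Duration, usize)
where
    F: FnOnce() -> Vec<Batch>,
{
    let start = Instant::now();
    let batches = query();
    // Capture elapsed first so that row counting is not part of the measurement.
    let elapsed = start.elapsed();
    let row_count = batches.iter().map(|b| b.num_rows()).sum::<usize>();
    (elapsed, row_count)
}
```

Since reading a stored count is cheap, the ordering should rarely change the numbers much, but capturing `elapsed` first removes the question entirely.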

Contributor

@alamb alamb left a comment

Looks like a definite improvement to me 🚀 -- thank you @jaylmiller

I had some suggestions about improving code ergonomics but I don't think they are required to merge this PR if you would prefer not to do them.

disjunction([
("Selective-ish filter", col("request_method").eq(lit("GET"))),
(
"Non-selective filter",
Contributor

It is nice to add these details to the output file.

benchmarks/src/bin/h2o.rs (outdated, resolved)
benchmarks/src/lib.rs (outdated, resolved)
/// A single iteration of a benchmark query
#[derive(Debug, Serialize)]
struct QueryIter {
elapsed: f64,
Contributor

Can you please add some documentation about what unit this is in (I think it is milliseconds?)

Relatedly I wonder if we could make this API easier to use by storing a Duration https://doc.rust-lang.org/std/time/struct.Duration.html, calculated with SystemTime::now() - start

Contributor Author

@jaylmiller jaylmiller Mar 21, 2023

I've changed elapsed to be a Duration object and am using a custom serializer to make it appear as unix secs in the output json
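A sketch of the idea discussed here: store a `Duration` and convert it to a number only at the output boundary. The comment above mentions a custom serializer; to stay dependency-free, this example hand-formats the JSON instead, so it is a swapped-in technique rather than the PR's approach. The `row_count` field, the method names, and the choice of fractional seconds as the output unit are assumptions.

```rust
use std::time::Duration;

/// Hypothetical record for a single benchmark iteration.
struct QueryIter {
    elapsed: Duration,
    row_count: usize,
}

impl QueryIter {
    /// Elapsed time in milliseconds, for human-readable logging.
    fn elapsed_ms(&self) -> f64 {
        self.elapsed.as_secs_f64() * 1000.0
    }

    /// Hand-rolled JSON with `elapsed` emitted as fractional seconds (assumed unit).
    fn to_json(&self) -> String {
        format!(
            "{{\"elapsed\":{},\"row_count\":{}}}",
            self.elapsed.as_secs_f64(),
            self.row_count
        )
    }
}
```

Keeping the `Duration` in the struct means the unit is decided in exactly one place per output format, which addresses the "what unit is this in" question for readers of the code.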

Contributor

@alamb alamb left a comment

This looks really great -- thank you @jaylmiller

println!(
"h2o groupby query {} took {} ms",
opt.query,
elapsed.as_secs_f64() * 1000.0
Contributor

👍

@alamb alamb merged commit b9964d6 into apache:main Mar 22, 2023