
Run resolve/install benchmarks in ci #3281

Merged

merged 17 commits into astral-sh:main on Apr 30, 2024
Conversation

ibraheemdev
Member

@ibraheemdev ibraheemdev commented Apr 26, 2024

Summary

Runs resolver benchmarks in CI with codspeed.


codspeed-hq bot commented Apr 26, 2024

CodSpeed Performance Report

Congrats! CodSpeed is installed 🎉

🆕 12 new benchmarks were detected.

You will start to see performance impacts in the reports once the benchmarks are run from your default branch.

Detected benchmarks

  • build_platform_tags[burntsushi-archlinux] (6.3 ms)
  • wheelname_parsing[flyte-long-compatible] (21 µs)
  • wheelname_parsing[flyte-long-incompatible] (26.3 µs)
  • wheelname_parsing[flyte-short-compatible] (11.9 µs)
  • wheelname_parsing[flyte-short-incompatible] (12.2 µs)
  • wheelname_parsing_failure[flyte-long-extension] (2.6 µs)
  • wheelname_parsing_failure[flyte-short-extension] (2.6 µs)
  • wheelname_tag_compatibility[flyte-long-compatible] (2.6 µs)
  • wheelname_tag_compatibility[flyte-long-incompatible] (1.8 µs)
  • wheelname_tag_compatibility[flyte-short-compatible] (2.5 µs)
  • wheelname_tag_compatibility[flyte-short-incompatible] (1.1 µs)
  • resolve_warm_jupyter (366.7 ms)

@ibraheemdev ibraheemdev marked this pull request as ready for review April 26, 2024 20:41
@ibraheemdev
Member Author

Hmm, it doesn't look like the benchmarks are running correctly under CodSpeed; the performance report shows the resolve/install benchmarks completing in microseconds.

@adriencaccia

adriencaccia commented Apr 26, 2024

Hey @ibraheemdev, I am a co-founder at @CodSpeedHQ!

> Hmm, it doesn't look like the benchmarks are running correctly under CodSpeed; the performance report shows the resolve/install benchmarks completing in microseconds.

Yes, running arbitrary executables in a benchmark with CodSpeed will not give relevant results, as most of the compute happens in a new process that is not instrumented.
It would be best to call the underlying library functions directly, without relying on the built executable.

For example, call the library entry point `pub(crate) async fn pip_compile(...)` directly instead of spawning the binary as is done here:
https://github.com/ibraheemdev/uv/blob/4ebdc40f60562c05559ac6331abe1a56275e2c8b/crates/bench/benches/uv.rs#L41-L42.

Hope that helps you a bit 😃
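The distinction above can be sketched in plain Rust (no CodSpeed involved). `resolve_in_process` here is a hypothetical stand-in for resolver work, not uv's actual `pip_compile`: work done in the calling process is visible to a valgrind-style instrumenter, while work spawned into a child binary is not.

```rust
use std::time::Instant;

// Hypothetical stand-in for resolver work; in uv this would be the
// library entry point (e.g. `pip_compile`), not the built binary.
fn resolve_in_process(n: u64) -> u64 {
    (0..n).fold(0u64, |acc, x| acc.wrapping_add(x * x))
}

fn main() {
    // Instrument-friendly: all the work happens in this process, so a
    // valgrind-based tool like CodSpeed can count its instructions.
    let start = Instant::now();
    let result = resolve_in_process(1_000_000);
    println!("in-process result {result} in {:?}", start.elapsed());

    // NOT instrument-friendly: spawning a separate executable moves the
    // work into a child process that the profiler never sees, so the
    // benchmark would measure only the cost of the spawn itself.
    // let _status = std::process::Command::new("uv")
    //     .args(["pip", "compile", "requirements.in"])
    //     .status();
}
```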

@ibraheemdev
Member Author

@adriencaccia Thanks! I suspected we would have to do this eventually, but didn't realize CodSpeed doesn't support instrumenting external commands at all.

@ibraheemdev ibraheemdev marked this pull request as draft April 26, 2024 21:06
@ibraheemdev ibraheemdev added the benchmarks (Related to benchmarking) and internal (A refactor or improvement that is not user-facing) labels Apr 29, 2024
@ibraheemdev ibraheemdev marked this pull request as ready for review April 29, 2024 20:32
Member

@charliermarsh charliermarsh left a comment


Nice, thank you! Open to giving this a shot. Do we have any sense of what the variance/noise will look like?

@adriencaccia

> Nice, thank you! Open to giving this a shot. Do we have any sense of what the variance/noise will look like?

I tested it out on my fork at adriencaccia#1, and I have the following variance results for 101 runs on the same commit:

Found 101 runs for adriencaccia/uv (fca26cde1b54f7467267ca4dff7a9b9cb6f10d29)
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────┬───────────────────┬─────────────────────┬───────────┬──────────────────┐
│                                                                              (index)                                                                               │  average  │ standardDeviation │ varianceCoefficient │   range   │ rangeCoefficient │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────┼───────────────────┼─────────────────────┼───────────┼──────────────────┤
│ crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-short-incompatible] │ '1.1 µs'  │     '27.3 ns'     │       '2.5%'        │ '55.6 ns' │      '5.1%'      │
│  crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-short-compatible]  │ '2.5 µs'  │     '27.3 ns'     │       '1.1%'        │ '55.6 ns' │      '2.2%'      │
│  crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-long-compatible]   │ '2.6 µs'  │     '27.3 ns'     │       '1.0%'        │ '55.6 ns' │      '2.1%'      │
│ crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-long-incompatible]  │ '1.8 µs'  │     '13.6 ns'     │       '0.7%'        │ '27.8 ns' │      '1.5%'      │
│    crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing_failure::wheelname_parsing_failure[flyte-short-extension]     │ '2.6 µs'  │     '13.6 ns'     │       '0.5%'        │ '27.8 ns' │      '1.1%'      │
│     crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing_failure::wheelname_parsing_failure[flyte-long-extension]     │ '2.6 µs'  │     '13.6 ns'     │       '0.5%'        │ '27.8 ns' │      '1.1%'      │
│            crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-short-compatible]            │  '12 µs'  │     '13.6 ns'     │       '0.1%'        │ '27.8 ns' │      '0.2%'      │
│           crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-short-incompatible]           │ '12.2 µs' │     '13.6 ns'     │       '0.1%'        │ '27.8 ns' │      '0.2%'      │
│           crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_build_platform_tags::build_platform_tags[burntsushi-archlinux]           │ '6.3 ms'  │     '13.6 ns'     │       '0.0%'        │ '27.8 ns' │      '0.0%'      │
│           crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-long-incompatible]            │ '26.3 µs' │      '0 ns'       │       '0.0%'        │   '0 s'   │      '0.0%'      │
│            crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-long-compatible]             │  '21 µs'  │      '0 ns'       │       '0.0%'        │   '0 s'   │      '0.0%'      │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────┴───────────────────┴─────────────────────┴───────────┴──────────────────┘

It is fairly stable, so you should be able to set a low regression threshold of around 5% 🙂
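The threshold advice above boils down to keeping the regression cutoff comfortably above the noise floor, i.e. above the coefficient of variation (standard deviation over mean) of repeated runs. A minimal sketch, using illustrative timings rather than the actual CodSpeed numbers:

```rust
// Coefficient of variation (std dev / mean) over repeated benchmark runs.
fn coefficient_of_variation(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt() / mean
}

fn main() {
    // Illustrative warm-resolve timings in ms (not real CodSpeed data).
    let runs_ms = [366.2, 366.9, 370.1, 365.5, 367.0];
    let cv = coefficient_of_variation(&runs_ms);
    // With a CV around 1%, a ~5% regression threshold sits well clear
    // of run-to-run noise.
    println!("CV = {:.2}%", cv * 100.0);
}
```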

@charliermarsh
Member

Awesome, thanks so much Adrien!

@zanieb
Member

zanieb commented Apr 30, 2024

Hmm, I don't see the resolver benchmarks there; I'd expect the distribution filename benches to be very stable, but the resolver ones are probably less so.

@zanieb
Member

zanieb commented Apr 30, 2024

Looks like something is wrong; the resolver benches are missing from the latest commit.

@ibraheemdev
Member Author

I forgot to use the codspeed-criterion-compat shim in the uv benchmarks, and it looks like that crate doesn't support async runs.

@adriencaccia

@ibraheemdev let me know when you want me to run variance checks again on the new benchmarks

@ibraheemdev
Member Author

@adriencaccia can you run them now? I'm also curious why the benchmarks seem to run ~15x slower on CodSpeed than locally; is it reporting an aggregate time instead of per-run?

@adriencaccia

adriencaccia commented Apr 30, 2024

> @adriencaccia can you run them now?

Alright, I started them. Will post the results once they are done 😉

> I'm also curious why the benchmarks seem to run ~15x slower on CodSpeed than locally; is it reporting an aggregate time instead of per-run?

This is because we run the code under Valgrind, which adds 4x to 10x overhead, sometimes more. But that is how we get those consistent measurements and flamegraphs 😉

@adriencaccia

Results with the new benchmarks:

Found 101 runs for adriencaccia/uv (7bbc18a361ba078e21186db90b98d6b88b3a8a7c)
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┬───────────────────┬─────────────────────┬───────────┬──────────────────┐
│                                                                              (index)                                                                               │  average   │ standardDeviation │ varianceCoefficient │   range   │ rangeCoefficient │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┼───────────────────┼─────────────────────┼───────────┼──────────────────┤
│                                               crates/bench/benches/uv.rs::uv::resolve_warm_black::resolve_warm_black                                               │ '15.3 ms'  │    '527.5 µs'     │       '3.4%'        │ '2.3 ms'  │     '15.0%'      │
│ crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-short-incompatible] │  '1.1 µs'  │     '27.5 ns'     │       '2.5%'        │ '55.6 ns' │      '5.1%'      │
│                                             crates/bench/benches/uv.rs::uv::resolve_warm_jupyter::resolve_warm_jupyter                                             │ '366.5 ms' │     '4.2 ms'      │       '1.1%'        │ '22.8 ms' │      '6.2%'      │
│  crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-short-compatible]  │  '2.5 µs'  │     '27.5 ns'     │       '1.1%'        │ '55.6 ns' │      '2.2%'      │
│  crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-long-compatible]   │  '2.6 µs'  │     '27.5 ns'     │       '1.0%'        │ '55.6 ns' │      '2.1%'      │
│ crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-long-incompatible]  │  '1.9 µs'  │     '13.7 ns'     │       '0.7%'        │ '27.8 ns' │      '1.5%'      │
│    crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing_failure::wheelname_parsing_failure[flyte-short-extension]     │  '2.6 µs'  │     '13.7 ns'     │       '0.5%'        │ '27.8 ns' │      '1.1%'      │
│     crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing_failure::wheelname_parsing_failure[flyte-long-extension]     │  '2.6 µs'  │     '13.7 ns'     │       '0.5%'        │ '27.8 ns' │      '1.1%'      │
│            crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-short-compatible]            │  '12 µs'   │     '13.7 ns'     │       '0.1%'        │ '27.8 ns' │      '0.2%'      │
│           crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-short-incompatible]           │ '12.2 µs'  │     '13.7 ns'     │       '0.1%'        │ '27.8 ns' │      '0.2%'      │
│           crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_build_platform_tags::build_platform_tags[burntsushi-archlinux]           │  '6.3 ms'  │     '13.7 ns'     │       '0.0%'        │ '27.8 ns' │      '0.0%'      │
│           crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-long-incompatible]            │ '26.3 µs'  │      '0 ns'       │       '0.0%'        │   '0 s'   │      '0.0%'      │
│            crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-long-compatible]             │  '21 µs'   │      '0 ns'       │       '0.0%'        │   '0 s'   │      '0.0%'      │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┴───────────────────┴─────────────────────┴───────────┴──────────────────┘

Indeed, it seems that crates/bench/benches/uv.rs::uv::resolve_warm_black::resolve_warm_black is a bit more inconsistent.

@charliermarsh
Member

Maybe we remove the Black test? It seems like the variance is way higher than for the Jupyter test.

@zanieb
Copy link
Member

zanieb commented Apr 30, 2024

I wonder why that is. It shouldn't be that different (as far as variance goes)?

@ibraheemdev
Member Author

@zanieb It's probably that the actual resolve step is faster, so the benchmark is more influenced by other factors (file I/O, etc.)

@ibraheemdev
Member Author

I'm going to go ahead and merge this with just the jupyter benchmark. We'll see how consistent/useful the reports are.

@ibraheemdev ibraheemdev merged commit 1d2c57a into astral-sh:main Apr 30, 2024
43 checks passed