Report and compare benchmark runs against two branches #5561

Closed
alamb opened this issue Mar 12, 2023 · 16 comments · Fixed by #5655
Labels
enhancement (New feature or request) · good first issue (Good for newcomers) · help wanted (Extra attention is needed)

Comments

alamb (Contributor) commented Mar 12, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When we make PRs like @jaylmiller's #5292 or #3463, we often want to know "does this make the existing benchmarks faster or slower?" To answer this question we would like to:

  1. Run benchmarks on main
  2. Run benchmarks on the PR
  3. Compare the results

This workflow is well supported for the criterion-based microbenchmarks in https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/benches (by using criterion directly or by using https://github.com/BurntSushi/critcmp).
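For reference, that criterion/critcmp workflow looks roughly like the following (a sketch only; the bench and branch names are placeholders and the exact invocation may differ):

# on main: run a criterion bench and save a baseline named "main"
cargo bench --bench <bench_name> -- --save-baseline main

# switch to the PR branch and save a second baseline
git checkout <pr_branch>
cargo bench --bench <bench_name> -- --save-baseline pr

# compare the two saved baselines
critcmp main pr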

However, for the "end to end" benchmarks in https://github.com/apache/arrow-datafusion/tree/main/benchmarks there is no easy way I know of to do two runs and compare results.

Describe the solution you'd like
There is a "machine readable" output format generated with the -o parameter (as shown below).

  1. I would like a script that compares the output of two benchmark runs, ideally written in either bash or Python.
  2. Instructions for running the script, added to https://github.com/apache/arrow-datafusion/tree/main/benchmarks

So the workflow would be:

Step 1: create two or more output files using -o:

alamb@aal-dev:~/arrow-datafusion2/benchmarks$ cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ~/tpch_data/parquet_data_SF1 --format parquet -o main

This produces files like those in benchmarks.zip. Here is an example:

{
  "context": {
    "benchmark_version": "19.0.0",
    "datafusion_version": "19.0.0",
    "num_cpus": 8,
    "start_time": 1678622986,
    "arguments": [
      "benchmark",
      "datafusion",
      "--iterations",
      "5",
      "--path",
      "/home/alamb/tpch_data/parquet_data_SF1",
      "--format",
      "parquet",
      "-o",
      "main"
    ]
  },
  "queries": [
    {
      "query": 1,
      "iterations": [
        {
          "elapsed": 1555.030709,
          "row_count": 4
        },
        {
          "elapsed": 1533.61753,
          "row_count": 4
        },
        {
          "elapsed": 1551.0951309999998,
          "row_count": 4
        },
        {
          "elapsed": 1539.953467,
          "row_count": 4
        },
        {
          "elapsed": 1541.992357,
          "row_count": 4
        }
      ],
      "start_time": 1678622986
    },
    ...

Step 2: Compare the two files and prepare a report

benchmarks/compare_results branch.json main.json

This would produce a report of some type. Here is an example of such a report (from @korowa on #5490 (comment)); maybe they have a script they could share. A rough sketch of one possible approach follows the example report.

Query               branch         main
----------------------------------------------
Query 1 avg time:   1047.93 ms     1135.36 ms
Query 2 avg time:   280.91 ms      286.69 ms
Query 3 avg time:   323.87 ms      351.31 ms
Query 4 avg time:   146.87 ms      146.58 ms
Query 5 avg time:   482.85 ms      463.07 ms
Query 6 avg time:   274.73 ms      342.29 ms
Query 7 avg time:   750.73 ms      762.43 ms
Query 8 avg time:   443.34 ms      426.89 ms
Query 9 avg time:   821.48 ms      775.03 ms
Query 10 avg time:  585.21 ms      584.16 ms
Query 11 avg time:  247.56 ms      232.90 ms
Query 12 avg time:  258.51 ms      231.19 ms
Query 13 avg time:  899.16 ms      885.56 ms
Query 14 avg time:  300.63 ms      282.56 ms
Query 15 avg time:  346.36 ms      318.97 ms
Query 16 avg time:  198.33 ms      184.26 ms
Query 17 avg time:  4197.54 ms     4101.92 ms
Query 18 avg time:  2726.41 ms     2548.96 ms
Query 19 avg time:  566.67 ms      535.74 ms
Query 20 avg time:  1193.82 ms     1319.49 ms
Query 21 avg time:  1027.00 ms     1050.08 ms
Query 22 avg time:  120.03 ms      111.32 ms
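As a starting point, here is a rough sketch in Python of what such a comparison script could look like. This is only an illustration that assumes the JSON structure shown above (elapsed times in milliseconds); it is not a final implementation, and the compare.py name and output layout are just placeholders.

#!/usr/bin/env python3
"""Sketch only: compare two benchmark result files produced with -o."""
import json
import sys


def avg_times_ms(path):
    """Return {query number: average elapsed time in ms} for one result file."""
    with open(path) as f:
        data = json.load(f)
    averages = {}
    for entry in data["queries"]:
        elapsed = [it["elapsed"] for it in entry["iterations"]]
        if elapsed:
            averages[entry["query"]] = sum(elapsed) / len(elapsed)
    return averages


def main():
    if len(sys.argv) != 3:
        print("usage: compare.py <branch.json> <main.json>", file=sys.stderr)
        sys.exit(1)

    branch = avg_times_ms(sys.argv[1])
    base = avg_times_ms(sys.argv[2])
    print(f"{'Query':<20}{'branch':>14}{'main':>14}")
    print("-" * 48)
    for query in sorted(set(branch) | set(base)):
        label = f"Query {query} avg time:"
        b = f"{branch[query]:.2f} ms" if query in branch else "n/a"
        m = f"{base[query]:.2f} ms" if query in base else "n/a"
        print(f"{label:<20}{b:>14}{m:>14}")


if __name__ == "__main__":
    main()

Running python compare.py branch.json main.json against the files from step 1 would print the per-query averages side by side, similar to the table above.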

Describe alternatives you've considered
Another possibility might be to move the specialized benchmark binaries into criterion (so they look like "microbenches"), but I think this is non-ideal because of the number of parameters supported by the benchmarks.

Additional context

alamb added the enhancement label on Mar 12, 2023
alamb added the good first issue and help wanted labels on Mar 12, 2023
jaylmiller (Contributor)

I could take this one up in the next few days if a new contributor does not end up picking this up as their first issue.

korowa (Contributor) commented Mar 12, 2023

> Here is an example of such a report (from @korowa on #5490 (comment)); maybe they have a script they could share.

Unfortunately it was just

cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ./parquet --format parquet | grep "avg time"

and the "multiline cursor" feature of my IDE 🥲

Taza53 commented Mar 12, 2023

Hi, I would like to take a crack at it. I will try to do it in Python.

alamb (Contributor, Author) commented Mar 13, 2023

That would be great @Taza53 -- thank you.

I spent some time gathering data (into benchmarks.zip) so hopefully you don't have to actually make the datasets or run the benchmarks to write this script.

alamb (Contributor, Author) commented Mar 13, 2023

BTW the first thing I hope/plan to do with this script is gather enough data to do #4085

alamb (Contributor, Author) commented Mar 13, 2023

@isidentical had a script they shared here: https://gist.github.com/isidentical/4e3fff1350e9d49672e15d54d9e8299f

Taza53 commented Mar 13, 2023

> I spent some time gathering data (into benchmarks.zip) so hopefully you don't have to actually make the datasets or run the benchmarks to write this script.

Thank you for gathering data, it's very helpful.

> @isidentical had a script they shared here: https://gist.github.com/isidentical/4e3fff1350e9d49672e15d54d9e8299f

I will take a look at it.

> BTW the first thing I hope/plan to do with this script is gather enough data to do #4085

I am a bit unsure; can you elaborate on this?

alamb (Contributor, Author) commented Mar 13, 2023

> BTW the first thing I hope/plan to do with this script is gather enough data to do #4085
>
> I am a bit unsure; can you elaborate on this?

Yes -- sorry -- all I was trying to say is that I am excited to use the script and will likely try it as soon as you have it available, for a "real" use case (basically to test #4085).

jaylmiller (Contributor) commented Mar 15, 2023

I think this should support all the benches in benchmarks/src/bin; right now only tpch has the -o option.

This means the other benches would be modified to accept the -o option and to output JSON in the same structure as tpch (so the same script that @Taza53 is working on can be used). I can work on a PR for this if you think that would make sense, @alamb?

alamb (Contributor, Author) commented Mar 15, 2023

> This means the other benches would be modified to accept the -o option and to output JSON in the same structure as tpch (so the same script that @Taza53 is working on can be used). I can work on a PR for this if you think that would make sense, @alamb?

I think that sounds like a great idea -- thank you.

Taza53 commented Mar 19, 2023

python compare.py path1 path2

[screenshot of the comparison output]

https://gist.github.com/Taza53/beb42c5918d352f9b760befaa87baef9

I have made some changes to the original script. Do you have any recommendations for improvements?

jaylmiller (Contributor)

That output looks awesome! 🚀

alamb (Contributor, Author) commented Mar 20, 2023

Thanks @Taza53 -- I am testing it out now.

alamb (Contributor, Author) commented Mar 20, 2023

I tried it out and it worked great (see #5099 (comment)). I will prepare a PR with the script and some instructions.

alamb (Contributor, Author) commented Mar 20, 2023

Created #5655 -- thanks @Taza53 ❤️

jaylmiller (Contributor)

I've added -o functionality to all the e2e benches in #5658, so this new script should be usable with every bench.
