
Automate running benchmarks #27

Open
imiric opened this issue Feb 6, 2023 · 4 comments

@imiric (Contributor) commented Feb 6, 2023

In PRs #21–#25 (specifically #23), we added scripts that automate running k6, collecting metrics, and graphing the results.

We should set up a CI pipeline that automates the entire process:

  1. Starts the required load generator instances.
    For the v0.42.0 results we manually started EC2 instances, and while we could automate this using the AWS API from GitHub Actions, we might want to give large GHA runners a try instead. They only go up to 64 cores and 256 GB RAM, whereas the m5.24xlarge instance we tested with had 96 cores and 384 GB RAM, so we'd have to tweak our tests slightly (possibly making the v0.42.0 results useless for comparison purposes), but it would potentially be much simpler to set up and use. I wouldn't be surprised if the performance of these Azure VMs is as unreliable and bursty as the regular runners, but it's worth testing.
    Note that this would require financial approval from whoever is responsible for our corporate GitHub plan.

  2. Runs k6bench.sh, k6bench.gnuplot, and whatever other scripts are needed to generate the results. We're still missing a script to generate the result Markdown document, which could be done with a template (see the sketch at the end of this comment).

  3. Compares the results with the previous k6 version, to generate performance deltas. This would be useful for tracking performance over time.

  4. Commits and pushes all of this under the results/<k6 version>/ directory of this repo.

  5. Maybe notifies on Slack, for bonus points. :)

The workflow should run automatically on new k6 versions, though we should also be able to run it manually (e.g. to confirm the performance before a k6 release, or after a PR we suspect might impact performance), and selectively, running only a subset of tests on a subset of machines.
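
To make steps 2–4 more concrete, the core of the CI job could look roughly like the sketch below. Everything besides k6bench.sh and k6bench.gnuplot is a placeholder: generate-report.sh and compare-results.sh don't exist yet, and the exact invocations and result layout are assumptions.

```sh
#!/usr/bin/env bash
# Rough sketch of steps 2-4, run by the CI job for a given k6 version.
# Script names other than k6bench.sh/k6bench.gnuplot are hypothetical.
set -euo pipefail

K6_VERSION="${1:?usage: $0 <k6 version>, e.g. v0.43.0}"
# Latest already-published version, used as the comparison baseline.
PREV_VERSION="$(basename "$(ls -d results/v* | sort -V | tail -n1)")"
OUT_DIR="results/${K6_VERSION}"
mkdir -p "$OUT_DIR"

# Step 2: run the benchmarks and render the plots (assuming k6bench.sh
# writes its output where k6bench.gnuplot expects to find it).
./k6bench.sh
gnuplot k6bench.gnuplot

# Step 2 (cont.): generate the result Markdown from a template
# (hypothetical script).
./generate-report.sh "$K6_VERSION" > "${OUT_DIR}/README.md"

# Step 3: compare against the previous version to produce performance
# deltas (hypothetical script).
./compare-results.sh "results/${PREV_VERSION}" "$OUT_DIR" > "${OUT_DIR}/deltas.md"

# Step 4: commit and push the new results.
git add "$OUT_DIR"
git commit -m "Add benchmark results for k6 ${K6_VERSION}"
git push origin main
```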

@imiric added the ci label Feb 6, 2023
@imiric (Contributor, Author) commented Feb 6, 2023

I forgot to mention that the availability of the SUT is an open question.

Currently the scripts in the v0.42.0 update use test.staging.k6.io. This is a manually scaled setup that @vkarhaltsev is reluctant to keep online permanently, for obvious reasons (costs and abuse). I would prefer not to need a special step in this automated process that prepares and scales the SUT specifically for the test. I also think it would be a benefit if users themselves could run the same script we use and verify our results.

I think the cost could be kept down by just using autoscaling, which is what it's for. And I doubt this would be abused more than test.k6.io currently is, particularly since the load generator would need substantial specs to push the kind of traffic we saw in the benchmark.

If it's not possible to keep an autoscaled public instance running, then we need to trigger the deployment and scaling specifically before each benchmark, which somewhat complicates things.
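
For illustration, if we do have to scale it per run, the pre-benchmark step could be as simple as the following (assuming the SUT runs as a Kubernetes Deployment; the namespace, names and replica counts are placeholders):

```sh
# Hypothetical CI-managed scaling of the SUT around a benchmark run.
kubectl -n staging scale deployment/test-app --replicas=20
kubectl -n staging rollout status deployment/test-app --timeout=10m

./k6bench.sh    # run only once the SUT is fully scaled

# Scale back down afterwards to keep costs low.
kubectl -n staging scale deployment/test-app --replicas=2
```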

@olegbespalov commented

Sharing my thoughts.

For me, the question here is which trade-off is better: more complexity in CI, or more difficulty interpreting the test results?

Autoscaling sounds good, but it needs some time (and a chance) to do its work. Keeping that in mind, we would probably have to use ramping load profiles and wait until the SUT has scaled up enough to serve the load (at least the ramping is under our control) 😅 All of that introduces moving parts which I believe could affect the test results (maybe I'm wrong here, since we're mostly interested in the maximums and averages).

The CI-scaling approach sounds a bit more complicated from the automation standpoint, but it affects the test results less: since we only run the tests once scaling is complete, the results are more straightforward to interpret.

So currently I'm leaning slightly towards the second (CI scaling) approach, or maybe a hybrid solution where we at least have a chance to scale the SUT up ahead of time, and the autoscaler later cleans up the state if CI gets stuck or whatever.
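
A rough sketch of that hybrid idea, assuming the SUT is a Kubernetes Deployment with an HPA attached (all names are placeholders): CI pre-scales the workload so the benchmark doesn't wait for the autoscaler to react, and the autoscaler handles the scale-down on its own once traffic stops, even if the CI job was cancelled or got stuck.

```sh
# Hypothetical hybrid: explicit pre-scale by CI, cleanup left to the HPA.
# The HPA's scale-down stabilization window would need to be long enough to
# keep the pre-scaled replicas alive until the load ramp-up actually starts.
kubectl -n staging scale deployment/test-app --replicas=20
kubectl -n staging rollout status deployment/test-app --timeout=10m

./k6bench.sh

# No explicit scale-down step: the HPA shrinks the deployment back to its
# baseline once the traffic stops.
```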

@vkarhaltsev commented

@imiric I think the current SUT is not a good-quality service: I scaled it in multiple ways and the results were not consistent. I'm not sure what exactly the cause is, though; I have a feeling it's the PHP application itself. So I would rather reconsider the application we test in favor of one that is more stable and makes better use of the hardware in the first place. As a second step, we could create a Kubernetes package that customers can deploy in their own infrastructure to reproduce the test results. I don't like the idea of exposing such a service to the world: being available to everyone, it will not show consistent results anyway, on top of the concerns you already mentioned.
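
A rough sketch of what that second step could look like for users (the chart name, values, and benchmark invocation are placeholders; no such chart exists yet):

```sh
# Hypothetical reproduction flow: deploy the SUT into your own cluster via a
# published Helm chart, then run the same benchmark scripts against it.
helm install sut ./charts/k6-benchmark-sut --set replicaCount=20
kubectl rollout status deployment/sut --timeout=10m

# Point the benchmark scripts at the in-cluster service instead of
# test.staging.k6.io (exact mechanism TBD).
./k6bench.sh
```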

@epompeii commented

Would you also be interested in detecting performance regressions on PRs?

I've been working on a tool for continuous benchmarking called Bencher: https://github.com/bencherdev/bencher
It seems to accomplish a lot of what you are going for here, and it also allows you to set statistical thresholds to detect performance regressions.
As for scaling, Bencher doesn't (yet) handle anything on the infra side of things.
