Add benchmark tooling #691
Conversation
Thanks for the PR.
Yes, it's desired!
/ok-to-test
Force-pushed from 4b5aed1 to 7f4a38c.
Please take a look at the workflow failure.
Makefile (outdated)
@@ -1,5 +1,6 @@
 BRANCH=`git rev-parse --abbrev-ref HEAD`
 COMMIT=`git rev-parse --short HEAD`
+LATEST_RELEASE_BRANCH := release-$(shell grep -ohE "[0-9]+.[0-9]+" version/version.go)
The issue with the current workflow failure is that it's trying to fetch the release-1.4 branch. However, because the main branch has the next release, this should return 1.3 rather than 1.4.
Suggested change:
-LATEST_RELEASE_BRANCH := release-$(shell grep -ohE "[0-9]+.[0-9]+" version/version.go)
+LATEST_RELEASE_BRANCH := release-$(shell grep -ohE "[0-9]+.[0-9]+" version/version.go | awk -F. '{print $$1 "." $$2-1}')
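As a sanity check, the suggested grep | awk pipeline can be exercised in plain shell (the sample version string below is hypothetical; note that inside a Makefile's $(shell ...) the awk `$`s must be doubled, while in plain shell they are single):

```shell
# Feed a hypothetical version line (as might appear in version/version.go)
# through the suggested pipeline to derive the previous minor release.
out=$(echo 'Version = "1.4.0"' | grep -ohE "[0-9]+.[0-9]+" | awk -F. '{print $1 "." $2-1}')
echo "release-$out"  # → release-1.3
```

The grep extracts the first major.minor match ("1.4"), and awk splits on "." and decrements the minor version, yielding the latest released branch when main already carries the next release's version.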
I think we should compare the performance between a PR and main.
If comparing a PR and main is fine, I can remove the comparison to the previous release.
Yes, please.
This adds a way to compare benchmarks that we use in kube-state-metrics already, hopefully allowing for better comparisons when applying changes. Signed-off-by: Manuel Rüger <manuel@rueg.eu>
Hi @mrueg, it looks like it's failing because the benchmark job is timing out. We could either just increase the timeout or split the benchmark into two jobs. I think the latter option would be better because I feel it would optimize for runtime and generate the output faster. What do you think? I can do a PoC or continue the implementation if you don't have the capacity. Edit: added bold for emphasis on what I initially meant 😅
@ivanvc yes, please feel free to take over. I'm currently lacking time to work on this.
I have a branch with the functional code for this. However,
In that case, can we add it as a nightly check?
What would it compare against? This would be running against a PR
Right, it's unrealistic if we need to compare PR vs main.
Also, it might make more sense to compare performance using the bbolt bench tool. I think we can add a separate workflow check for that.
Here's some discussion and progress regarding these last comments: I currently managed to do three runs using
However, I wonder if my original idea of splitting the benchmarks into two runners makes sense, because if we have a job scheduled on a degraded host, it may impact the benchmark's results. Doing both runs on the same runner would mean that instead of 30 minutes, the run time would increase to one hour.
That was my intention with #739, which makes sense. But we'll need to define a set of parameters to run the benchmarks. I'm not sure who would be the best person to define them (I'm guessing you, @ahrtr 😅), but I suggest keeping that conversation in that issue. Another idea would be to run these benchmarks as Prow jobs on the Kubernetes infra. In that case, I understand that the max CPU we can request is 7 CPUs. So, I'm unsure if they will be any better than GitHub's
Looks great, thanks.
It seems that you ran the tests in parallel. If you run them sequentially, it will take longer. #750 (comment)
I think we should follow this approach to ensure both tests have exactly the same environment.
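A rough sketch of what such a sequential job could look like as a GitHub Actions workflow (this is an assumption, not the PR's actual implementation; the `make bench` target, file paths, and the benchstat comparison step are all hypothetical):

```yaml
# Hypothetical workflow: run both benchmark passes sequentially on the same
# runner so they share the exact same environment, then compare the results.
name: benchmarks
on: [pull_request]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history so both refs are available
      - uses: actions/setup-go@v4
      - name: Benchmark main
        run: |
          git checkout origin/main
          make bench > /tmp/bench-main.txt   # target name is an assumption
      - name: Benchmark PR
        run: |
          git checkout ${{ github.event.pull_request.head.sha }}
          make bench > /tmp/bench-pr.txt
      - name: Compare
        run: |
          go install golang.org/x/perf/cmd/benchstat@latest
          benchstat /tmp/bench-main.txt /tmp/bench-pr.txt
```

Running the two passes as sequential steps in one job trades wall-clock time (roughly doubling it) for a like-for-like comparison on identical hardware.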
Yes, if we run them sequentially, it will take about one hour to complete the job. Is that acceptable?
+1 for running it sequentially.
I remember that CNCF provides 16 cores for us. You can try it with ubuntu-latest-16-cores.
That's good to know. I'll try it out with them.
Uh, it looks like we don't have 16-core runners. The job is stalled @ https://github.com/etcd-io/bbolt/actions/runs/9067796547/job/24913688424. I could try tuning the run parameters to make it finish faster.
Hmm, I don't have permission to check this in the etcd-io org: Org → Settings → Actions → Runner groups → Default Large Runners.
Maybe @ahrtr can help check it 😆
Hey Team - I believe this will require raising an issue to kubernetes as
@jmhbnz thanks for the quick update. Last time, I remember that we had 3 large runners: 4 cores, 8 cores, and 16 cores. Reading the issue mentioned, the runners were shut down during the migration, so I believe they have been deleted. If we still need 16 cores, we have to file a CNCF ticket to re-enable them.
We only have
Have you confirmed that (running them sequentially)? It takes about 50 minutes for the existing tests/coverage job, so one hour might be acceptable.
But I have no objection if anyone wants to request a 16-core runner :)
It takes about one hour to run the benchmarks sequentially: https://github.com/etcd-io/bbolt/actions/runs/9087376452 If this is acceptable, we can roll it out with 8-core runners. However, I still wonder whether the Prow infra would be a better fit for this.
Thanks. That's acceptable to me. We can add it to the GitHub workflow first, and consider migrating to Prow later.
I'll clean up the code and mark the other PR ready for review.
Thanks @ivanvc for getting the change in!
This adds a way to compare benchmarks, hopefully allowing for better comparisons when applying changes.
We use this in https://github.com/kubernetes/kube-state-metrics and it can be integrated via GitHub Actions as well if desired.