Migrate merge-blocking jobs to dedicated cluster: pull-kubernetes-bazel-test #19070
Comments
/assign

Starting with a canary job to explore if/how differently the presubmit runs off of RBE: #19069

For the release-blocking / CI variant of this job, we found the job took longer to run: #18652 (comment)

There are also a number of release-branch-specific bug fixes that need to be cherry-picked back to support each release-branch variant running off of RBE: #18652 (comment)
It's also worth keeping an eye on progress on kubernetes/kubernetes#93605
kubernetes/kubernetes#93605 has merged, and we started to see 3 consecutive failures in the CI jobs that don't run in RBE. Those flakes have since been addressed, but I think that's the cue to start moving on this again.
Test durations look roughly equivalent between these two jobs that run
#19170 merged 2020-09-09 11:30pm PT; since then the job has started failing a lot more, and duration is up near 1h. So, let's bump CPU and see if it's more of the same. If so, I think we're encountering many more flakes now that kubernetes/kubernetes#93605 has merged.
kubernetes/kubernetes#93605 bumped to run each test 3x. That didn't increase runtime in the RBE config, but apparently that was because the work was being done off-machine in parallel. The post-submit that is not running in RBE takes ~3x as long (~45 minutes vs. ~15 minutes), so the 3x run seems to affect local runtime linearly (which makes sense).
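For context, `--runs_per_test` is the Bazel flag that controls this kind of repetition. A minimal sketch of the local behavior described above (the `//...` target pattern is illustrative, not the job's actual target set):

```shell
# Running each test 3x serially roughly triples local wall-clock time,
# while RBE can fan the repeated runs out in parallel off-machine.
# The //... target pattern here is illustrative.
bazel test --runs_per_test=3 //...
```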
Duration going up linearly makes sense; it's the volume of flakes that bothers me. Let's see if raising CPU via #19179 helps with that.
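A CPU bump like #19179 lands in the job's pod spec. A minimal sketch, assuming a typical decorated Prow job; the image and the resource values here are hypothetical, not the actual numbers from that PR:

```yaml
# Hypothetical sketch of raising CPU for a Prow job's container.
# Jobs on k8s-infra-prow-build set explicit requests/limits;
# the image and values below are illustrative only.
spec:
  containers:
  - image: gcr.io/k8s-testimages/bazelbuild:latest  # illustrative image
    resources:
      requests:
        cpu: "6"
        memory: "8Gi"
      limits:
        cpu: "6"
        memory: "8Gi"
```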
I'd also be ok with dropping the number of runs to 2. I thought we had some unit test caching in place that would let us skip running unit tests that weren't affected by a particular merge, so we wouldn't actually be running all the tests on every job run.
I'm not actually sure whether we're being smart about which tests get run or not. #19179 merged at 8:30am PT today, and it looks like it's making a difference. I'll tee up dropping the runs to 2, but I'd like to wait a bit more to see what the failure/flake rate looks like as-is.
Opened kubernetes/kubernetes#94699 for running 2 instead of 3 times.
/close
@spiffxp: Closing this issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What should be cleaned up or changed:
This is part of #18550
To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you're not a member.
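For reference, membership in that file is declared per group; a sketch of the kind of entry to edit (the member address is a placeholder, and the exact schema should be checked against groups.yaml itself):

```yaml
# Sketch of a group entry in kubernetes/k8s.io groups.yaml.
# Only the members list needs editing; the address below is a placeholder.
- email-id: k8s-infra-prow-viewers@kubernetes.io
  name: k8s-infra-prow-viewers
  members:
    - your-address@example.com  # add yourself here
```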
Migrate pull-kubernetes-bazel-test to k8s-infra-prow-build by adding a

cluster: k8s-infra-prow-build

field to the job.

NOTE: migrating this job is not as straightforward as some of the other #18550 issues, because:
Once the PR has merged, note the date/time it merged. This will allow you to compare before/after behavior.
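Concretely, the change above is a one-line addition to the presubmit's definition in test-infra. A minimal sketch; every field other than `cluster` is abbreviated or illustrative, not the job's full config:

```yaml
# Sketch: adding the cluster field to the presubmit definition.
# Fields other than cluster are abbreviated/illustrative.
- name: pull-kubernetes-bazel-test
  cluster: k8s-infra-prow-build  # the one-line migration change
  always_run: true
  decorate: true
  spec:
    containers:
    - image: ...
```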
Things to watch for the job:
- pull-kubernetes-bazel-test for 6h

Things to watch for the build cluster:
Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.
/wg k8s-infra
/sig testing
/area jobs