Migrate merge-blocking jobs to dedicated cluster: pull-kubernetes-bazel-test #19070

Closed
spiffxp opened this issue Aug 31, 2020 · 12 comments
Labels: area/jobs, kind/cleanup, sig/testing

spiffxp commented Aug 31, 2020

What should be cleaned up or changed:

This is part of #18550

To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you're not a member.

Migrate pull-kubernetes-bazel-test to k8s-infra-prow-build by adding a cluster: k8s-infra-prow-build field to the job:

NOTE: migrating this job is not as straightforward as some of the other #18550 issues, because:

  • the following flags also need to be removed to migrate it off of RBE (see the sketch below):
    - --config=remote
    - --remote_instance_name=projects/k8s-prow-builds/instances/default_instance
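
A minimal sketch of the shape of the change (not the exact kubernetes/test-infra config; the image and surrounding fields are placeholders or elided):

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-bazel-test
    cluster: k8s-infra-prow-build        # new: schedule the job on the dedicated build cluster
    spec:
      containers:
      - image: gcr.io/k8s-testimages/...  # placeholder; unchanged by this migration
        args:
        - "..."                           # existing args stay as-is, except the RBE flags:
        # removed, since the dedicated cluster has no RBE access:
        # - --config=remote
        # - --remote_instance_name=projects/k8s-prow-builds/instances/default_instance
```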

Once the PR has merged, note the date/time it merged. This will allow you to compare before/after behavior.

Things to watch for the job

Things to watch for the build cluster

  • prow-build dashboard 1w
    • is the build cluster scaling as needed? (e.g. maybe it can't scale because we've hit some kind of quota)
    • (it will probably be helpful to look at different time resolutions like 1h, 6h, 1d, 1w)
  • prowjobs-experiment 1w
    • (shows resource consumption of all job runs, pretty noisy but putting this here for completeness)

Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.

/wg k8s-infra
/sig testing
/area jobs

spiffxp added the kind/cleanup label Aug 31, 2020
k8s-ci-robot added the wg/k8s-infra, sig/testing, and area/jobs labels Aug 31, 2020
spiffxp commented Aug 31, 2020

/assign

Starting with a canary job to explore if/how the presubmit behaves differently running off of RBE: #19069

For the release-blocking / CI variant of this job, we found the job took longer to run: #18652 (comment)

There are also a number of release-branch-specific bug fixes that need to be cherry-picked back to support each release-branch variant running off of RBE: #18652 (comment)

spiffxp commented Aug 31, 2020

It's also worth keeping an eye on progress on kubernetes/kubernetes#93605

spiffxp commented Sep 10, 2020

kubernetes/kubernetes#93605 has merged, and we started to see 3 consecutive failures in the CI jobs which don't run in RBE. Those flakes have since been addressed, but I think that's the cue to start moving on this again

spiffxp commented Sep 10, 2020

Test durations look roughly equivalent between these two jobs that run bazel test without RBE, but the CI job is requesting 7 cpu while the PR job is requesting 4. I'd like to stick with 4 for the non-canary job and see how it behaves with real PR traffic.
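
For reference, a hedged sketch of what that request looks like in the job's pod spec (only the cpu values come from this thread; everything else here is placeholder or assumption):

```yaml
spec:
  containers:
  - image: "..."            # placeholder; unchanged
    resources:
      requests:
        cpu: "4"            # the PR job's current request (the CI variant asks for 7)
      limits:
        cpu: "4"            # assumption: limits set equal to requests; not stated in this thread
```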

spiffxp commented Sep 10, 2020

#19170 merged 2020-09-09 11:30pm PT; the job has started failing a lot more and duration is up near 1h.

So, let's bump CPU and see if it's more of the same. If it is, I think we're just encountering many more flakes now that kubernetes/kubernetes#93605 has merged.

liggitt commented Sep 10, 2020

kubernetes/kubernetes#93605 bumped the config to run each test 3x. That didn't increase runtime in the RBE config, but apparently only because the work was being done off-machine in parallel.

The post-submit that is not running on RBE takes ~3x as long (~45 minutes vs ~15 minutes), so the 3x runs seem to affect local runtime linearly (which makes sense).
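
For context, a hedged sketch of the knob being discussed, assuming the repetition is driven by Bazel's --runs_per_test flag and shown here as it would sit among the job's Bazel args (how #93605 actually wires it up is not confirmed in this thread):

```yaml
args:
- --runs_per_test=3   # each test target runs 3 times; with RBE the repeats fan out to
                      # remote workers, but without RBE they compete for the pod's CPUs,
                      # so wall-clock time grows roughly linearly (matching the ~3x above)
```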

spiffxp commented Sep 10, 2020

Duration going up linearly makes sense; it's the volume of flakes that bothers me. Let's see if raising CPU via #19179 helps with that.

liggitt commented Sep 10, 2020

I'd also be ok with dropping the number of runs to 2... I thought we had some unit test caching in place that would let us skip running unit tests that weren't affected by a particular merge, so we wouldn't actually be running all the tests on every job run.
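
A hedged sketch of the kind of caching being referred to, assuming it is Bazel's built-in test-result caching, possibly backed by a shared remote cache (the thread does not confirm what is actually configured for this job):

```yaml
args:
- --cache_test_results=auto   # skip re-running test targets whose inputs haven't changed
- --remote_cache=...          # placeholder: a shared cache would let results carry over
                              # between job runs instead of only within a single pod
```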

spiffxp commented Sep 10, 2020

I'm not actually sure whether we're being smart about which tests get run or not

#19179 merged at 8:30am PT today, looks like it's making a difference
[screenshot: Screen Shot 2020-09-10 at 10.59.26 AM]

I'll tee up dropping the runs to 2 but would like to wait a bit more to see what the failure/flake rate looks like as-is.

spiffxp commented Sep 10, 2020

Opened kubernetes/kubernetes#94699 for running 2 instead of 3 times

spiffxp commented Oct 7, 2020

/close
Calling this done

k8s-ci-robot commented Oct 7, 2020

@spiffxp: Closing this issue.

In response to this:

/close
Calling this done

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
