Migrate merge-blocking jobs to dedicated cluster: pull-kubernetes-bazel-test #19070

Closed
spiffxp opened this issue Aug 31, 2020 · 12 comments
Labels: area/jobs, kind/cleanup, sig/testing

spiffxp commented Aug 31, 2020

What should be cleaned up or changed:

This is part of #18550

To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you're not a member.

Migrate pull-kubernetes-bazel-test to k8s-infra-prow-build by adding a cluster: k8s-infra-prow-build field to the job:

NOTE: migrating this job is not as straightforward as some of the other #18550 issues, because:

  • the following flags also need to be removed to migrate it off of RBE (see the sketch below):
    - --config=remote
    - --remote_instance_name=projects/k8s-prow-builds/instances/default_instance
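
A minimal sketch of the shape of the change (not the exact kubernetes/test-infra config; the image and surrounding fields are placeholders or elided):

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-bazel-test
    cluster: k8s-infra-prow-build        # new: schedule the job on the dedicated build cluster
    spec:
      containers:
      - image: gcr.io/k8s-testimages/...  # placeholder; unchanged by this migration
        args:
        - "..."                           # existing args stay as-is, except the RBE flags:
        # removed, since the dedicated cluster has no RBE access:
        # - --config=remote
        # - --remote_instance_name=projects/k8s-prow-builds/instances/default_instance
```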

Once the PR has merged, note the date/time it merged. This will allow you to compare before/after behavior.

Things to watch for the job

Things to watch for the build cluster

  • prow-build dashboard 1w
    • is the build cluster scaling as needed? (e.g. maybe it can't scale because we've hit some kind of quota)
    • (it will probably be helpful to look at different time resolutions like 1h, 6h, 1d, 1w)
  • prowjobs-experiment 1w
    • (shows resource consumption of all job runs, pretty noisy but putting this here for completeness)

Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.

/wg k8s-infra
/sig testing
/area jobs

spiffxp added the kind/cleanup label Aug 31, 2020
k8s-ci-robot added the wg/k8s-infra, sig/testing, and area/jobs labels Aug 31, 2020
spiffxp commented Aug 31, 2020

/assign

Starting with a canary job to explore if/how the presubmit behaves differently running off of RBE: #19069

For the release-blocking / CI variant of this job, we found the job took longer to run: #18652 (comment)

There are also a number of release-branch-specific bug fixes that need to be cherry-picked back to support each release-branch variant running off of RBE: #18652 (comment)

spiffxp commented Aug 31, 2020

It's also worth keeping an eye on progress on kubernetes/kubernetes#93605

spiffxp commented Sep 10, 2020

kubernetes/kubernetes#93605 has merged, and we started to see 3 consecutive failures in the CI jobs which don't run in RBE. Those flakes have since been addressed, but I think that's the cue to start moving on this again

spiffxp commented Sep 10, 2020

Test durations look roughly equivalent between these two jobs that run bazel test without RBE, but the CI job is requesting 7 cpu while the PR job is requesting 4. I'd like to stick with 4 for the non-canary job and see how it behaves with real PR traffic.
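
For reference, a hedged sketch of what that request looks like in the job's pod spec (only the cpu values come from this thread; everything else here is placeholder or assumption):

```yaml
spec:
  containers:
  - image: "..."            # placeholder; unchanged
    resources:
      requests:
        cpu: "4"            # the PR job's current request (the CI variant asks for 7)
      limits:
        cpu: "4"            # assumption: limits set equal to requests; not stated in this thread
```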

spiffxp commented Sep 10, 2020

#19170 merged 2020-09-09 11:30pm PT; the job has started failing a lot more and duration is up near 1h.

So, let's bump CPU and see if it's more of the same. If it is, I think we're just encountering many more flakes now that kubernetes/kubernetes#93605 has merged.

liggitt commented Sep 10, 2020

kubernetes/kubernetes#93605 bumped the config to run each test 3x. That didn't increase runtime in the RBE config, but apparently only because the work was being done off-machine in parallel.

The post-submit that is not running on RBE takes ~3x as long (~45 minutes vs ~15 minutes), so the 3x runs seem to affect local runtime linearly (which makes sense).
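
For context, a hedged sketch of the knob being discussed, assuming the repetition is driven by Bazel's --runs_per_test flag and shown here as it would sit among the job's Bazel args (how #93605 actually wires it up is not confirmed in this thread):

```yaml
args:
- --runs_per_test=3   # each test target runs 3 times; with RBE the repeats fan out to
                      # remote workers, but without RBE they compete for the pod's CPUs,
                      # so wall-clock time grows roughly linearly (matching the ~3x above)
```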

spiffxp commented Sep 10, 2020

Duration going up linearly makes sense; it's the volume of flakes that bothers me. Let's see if raising CPU via #19179 helps with that.

liggitt commented Sep 10, 2020

I'd also be ok with dropping the number of runs to 2... I thought we had some unit test caching in place that would let us skip running unit tests that weren't affected by a particular merge, so we wouldn't actually be running all the tests on every job run.
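
A hedged sketch of the kind of caching being referred to, assuming it is Bazel's built-in test-result caching, possibly backed by a shared remote cache (the thread does not confirm what is actually configured for this job):

```yaml
args:
- --cache_test_results=auto   # skip re-running test targets whose inputs haven't changed
- --remote_cache=...          # placeholder: a shared cache would let results carry over
                              # between job runs instead of only within a single pod
```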

spiffxp commented Sep 10, 2020

I'm not actually sure whether we're being smart about which tests get run or not

#19179 merged at 8:30am PT today, looks like it's making a difference
[screenshot: Screen Shot 2020-09-10 at 10.59.26 AM]

I'll tee up dropping the runs to 2 but would like to wait a bit more to see what the failure/flake rate looks like as-is.

spiffxp commented Sep 10, 2020

Opened kubernetes/kubernetes#94699 for running 2 instead of 3 times

spiffxp commented Oct 7, 2020

/close
Calling this done

k8s-ci-robot commented Oct 7, 2020

@spiffxp: Closing this issue.

In response to this:

/close
Calling this done

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
