
release-blocking jobs must run in dedicated cluster: periodic-kubernetes-bazel-test #18652

Closed
spiffxp opened this issue Aug 4, 2020 · 10 comments
Assignees: ameukam
Labels: area/jobs, sig/release, sig/testing

Comments

spiffxp (Member) commented Aug 4, 2020

Part of #18549

k8s-infra-prow-build doesn't have RBE, so this job needs to be configured not to use RBE (I proved this works and determined the resource limits that should be used in #18607).

Per #18613 (comment), it is important that we keep some variant of "bazel test" running periodically in the k8s-prow-builds cluster.

So, per #18613 (comment), I suggested we rename the jobs (see the sketch after the list below):

  • swap the names of the bazel-test jobs such that
    • periodic-kubernetes-bazel-test-canary = the job that runs on k8s-prow-builds, uses RBE, and is on the sig-testing-canaries dashboard
    • periodic-kubernetes-bazel-test-master = the job that runs on k8s-infra-prow-build and is release-blocking
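
A rough sketch of what the swap could look like in the periodic job config; the intervals, cluster names, dashboard names, and bazel args below are placeholders rather than values copied from the real config:

periodics:
- name: periodic-kubernetes-bazel-test-canary
  cluster: default                     # placeholder for the k8s-prow-builds build cluster
  interval: 6h                         # placeholder interval
  annotations:
    testgrid-dashboards: sig-testing-canaries
  spec:
    containers:
    - args:
      - test
      - //...                          # stand-in for the job's existing bazel args
      - --config=remote                # RBE stays enabled on this variant only
      - --remote_instance_name=...     # value elided here
- name: periodic-kubernetes-bazel-test-master
  cluster: k8s-infra-prow-build
  interval: 6h                         # placeholder interval
  annotations:
    testgrid-dashboards: sig-release-master-blocking   # placeholder for the release-blocking dashboard
  spec:
    containers:
    - args:
      - test
      - //...                          # same args, minus the two RBE flags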

/sig testing
/sig release
/wg k8s-infra
/area jobs

/assign @ameukam
Assigning since @ameukam started down this path with #18613.

k8s-ci-robot added the sig/testing, sig/release, wg/k8s-infra, and area/jobs labels on Aug 4, 2020
spiffxp (Member, Author) commented Aug 8, 2020

#18613 landed 2020-08-05 ~5pm PT
[Screenshot: "Screen Shot 2020-08-08 at 1 18 02 PM"]

It's unclear how much compute the RBE variant of the job ended up consuming, but it looks like the non-RBE job runs slower. That's still well within release-blocking bounds, but good to know.

I missed that we changed the interval when we made this change; we should change that back. The release-branch variants also need to be changed over.

ameukam (Member) commented Aug 25, 2020

It's unclear to me what the correct interval for the release-branch variants should be. Is this change still necessary?

spiffxp (Member, Author) commented Aug 25, 2020

It's unclear to me what the correct interval for the release-branch variants should be. Is this change still necessary?

Both jobs have a fork-per-release-periodic-interval: 6h annotation set, so I think the intervals for the release-branch variants should be left as-is.
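
As I understand it, that annotation sits on the master-branch periodic and is what the config forker reads when generating the release-branch copies, which is why their intervals can be left alone. A minimal sketch, with the job name and interval values as placeholders:

periodics:
- name: periodic-kubernetes-bazel-test-master
  interval: 6h                               # interval of the master-branch job itself (placeholder)
  annotations:
    fork-per-release: "true"                 # assumption: the job is marked for forking
    fork-per-release-periodic-interval: 6h   # interval applied to the forked release-branch variants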

spiffxp (Member, Author) commented Aug 25, 2020

What needs to be done for the release-branch variants is to remove the --config=remote and --remote_instance_name=... flags so that their args match those of periodic-kubernetes-bazel-test-master.

(And their resources should probably be updated to match as well.)
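
In config terms, the change to each release-branch variant amounts to roughly the sketch below; the job name, remaining args, and resource numbers are placeholders (the real requests/limits being whatever #18607 settled on for the master job):

periodics:
- name: ci-kubernetes-bazel-test-1-17        # placeholder name for a forked release-branch variant
  cluster: k8s-infra-prow-build
  spec:
    containers:
    - args:
      - test
      - //...                                # stand-in for the job's existing bazel args
      # removed: --config=remote
      # removed: --remote_instance_name=...  (value elided)
      resources:                             # copied to match periodic-kubernetes-bazel-test-master
        requests:
          cpu: "4"                           # placeholder numbers; use the values from #18607
          memory: 32Gi
        limits:
          cpu: "4"
          memory: 32Gi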

ameukam added a commit to ameukam/test-infra that referenced this issue Aug 26, 2020
k8s-infra-prow-build

Ref: kubernetes#18652

Adjust resources of the job variants based on the master job.

Signed-off-by: Arnaud Meukam <ameukam@gmail.com>
spiffxp (Member, Author) commented Aug 28, 2020

So right after #18997 merged, the release-1.18 through release-1.16 versions of this job all started failing with

FAILED: //pkg/kubelet/oom:go_default_test (Summary)
      /bazel-scratch/.cache/bazel/_bazel_root/7989b31489f31aee54f32688da2f0120/execroot/io_k8s_kubernetes/bazel-out/k8-fastbuild/testlogs/pkg/kubelet/oom/go_default_test/test.log
      /bazel-scratch/.cache/bazel/_bazel_root/7989b31489f31aee54f32688da2f0120/execroot/io_k8s_kubernetes/bazel-out/k8-fastbuild/testlogs/pkg/kubelet/oom/go_default_test/test_attempts/attempt_1.log
      /bazel-scratch/.cache/bazel/_bazel_root/7989b31489f31aee54f32688da2f0120/execroot/io_k8s_kubernetes/bazel-out/k8-fastbuild/testlogs/pkg/kubelet/oom/go_default_test/test_attempts/attempt_2.log
INFO: From Testing //pkg/kubelet/oom:go_default_test:
==================== Test output for //pkg/kubelet/oom:go_default_test:
=== RUN   TestStartingWatcher
--- FAIL: TestStartingWatcher (0.00s)
    oom_watcher_linux_test.go:48: 
        	Error Trace:	oom_watcher_linux_test.go:48
        	Error:      	Received unexpected error:
        	            	open /dev/kmsg: no such file or directory
        	Test:       	TestStartingWatcher
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x10c0bd5]
goroutine 20 [running]:
testing.tRunner.func1(0xc0001c2100)
	GOROOT/src/testing/testing.go:874 +0x69f
panic(0x11b04e0, 0x1b0f170)
	GOROOT/src/runtime/panic.go:679 +0x1b2
k8s.io/kubernetes/pkg/kubelet/oom.TestStartingWatcher(0xc0001c2100)
	pkg/kubelet/oom/oom_watcher_linux_test.go:49 +0x135
testing.tRunner(0xc0001c2100, 0x12eebf0)
	GOROOT/src/testing/testing.go:909 +0x19a
created by testing.(*T).Run
	GOROOT/src/testing/testing.go:960 +0x652
================================================================================

The release-1.19 and main branches are fine.

The release-1.17 and release-1.16 branches also fail with

--- FAIL: TestTar (0.01s)
    --- FAIL: TestTar/Contents_preserved_and_no_self-reference (0.01s)
        tar_test.go:85: Expected data map[file1:file1 data file2:file2 data subdir/file4:file4 data] but got map[]
    --- PASS: TestTar/Errors_if_directory_does_not_exist (0.00s)
FAIL

The release-1.18 through main branches are fine.

Wonder if we need to cherry-pick something back?

spiffxp (Member, Author) commented Sep 1, 2020

https://testgrid.k8s.io/sig-release-1.16-blocking#bazel-test-1.16 - the cherry-picks for this branch have merged and all tests are green; the tab will still appear red / "failing" in the summary until enough runs have happened to age out the oom_watcher failure as stale (10 runs, IIRC).

ameukam (Member) commented Sep 17, 2020

@spiffxp Is there still something to do here?
Also, thanks for the help!

spiffxp (Member, Author) commented Oct 7, 2020

/close
Yeah this is done

k8s-ci-robot (Contributor) commented:
@spiffxp: Closing this issue.

In response to this:

/close
Yeah this is done

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
