
Investigate controller tests hanging with golang 1.21.0 and unpin golang version #2768

Closed
georgethebeatle opened this issue Aug 11, 2023 · 9 comments

Comments

georgethebeatle (Member) commented Aug 11, 2023

After bumping Go to 1.21.0 we started observing hangs in the controllers tests that went away when we pinned Go back to 1.20.7. The same hang also manifested in Concourse, blocking the pipeline so that no PR could get through.

We need to investigate this further, since it blocks us from running on the latest Go version. Once a proper solution is found we should revert the following commits

danail-branekov (Member) commented Aug 14, 2023

A list of hopefully related findings:

  • It seems to be relatively easy to reproduce the tests getting stuck by bumping Go to 1.21.0 and then running UNTIL_IT_FAILS=true GINKGO_NODES=8 make -C controllers test

  • It always (local occurrences, Concourse, GitHub Actions) seems to be the Workloads Controllers Integration Suite. FWIW, we saw similar occurrences for the kpack image builder tests, but let's keep this issue focused

  • The Ginkgo-related process tree:

danails+  336333  0.0  0.0   6136  2748 pts/7    S+   15:41   0:00  |   \_ make -C controllers test
danails+  336907  0.0  0.0   8164  4708 pts/7    S+   15:41   0:00  |       \_ bash ../scripts/run-tests.sh
danails+  337013  0.1  0.1 1240636 21396 pts/7   Sl+  15:41   0:00  |           \_ go run github.com/onsi/ginkgo/v2/ginkgo -p --randomize-all --randomize-suites --procs=8 --poll-progress-after=60s --skip-package=e2e --coverp
danails+  337115  0.8  0.1 3010176 21988 pts/7   Sl+  15:41   0:02  |               \_ /tmp/go-build1603050358/b001/exe/ginkgo -p --randomize-all --randomize-suites --procs=8 --poll-progress-after=60s --skip-package=e2e --co
danails+  344765  7.8  1.1 2696980 180308 pts/7  Sl+  15:43   0:11  |                   \_ /home/danailster/workspace/korifi/controllers/controllers/workloads/workloads.test --test.timeout=0 --ginkgo.seed=1692027698 --ginkgo
danails+  344881  1.4  0.3 11222724 55656 pts/7  Sl+  15:43   0:02  |                   |   \_ /home/danailster/workspace/korifi/testbin/k8s/1.27.1-linux-amd64/etcd --advertise-client-urls=http://127.0.0.1:43027 --data-dir=/
danails+  345073  8.9  2.0 1034608 343004 pts/7  Sl+  15:43   0:13  |                   |   \_ /home/danailster/workspace/korifi/testbin/k8s/1.27.1-linux-amd64/kube-apiserver --allow-privileged=true --authorization-mode=RBAC
danails+  344769  8.4  1.1 2627680 186516 pts/7  Sl+  15:43   0:12  |                   \_ /home/danailster/workspace/korifi/controllers/controllers/workloads/workloads.test --test.timeout=0 --ginkgo.seed=1692027698 --ginkgo
danails+  344932  1.6  0.3 11222724 56480 pts/7  Sl+  15:43   0:02  |                       \_ /home/danailster/workspace/korifi/testbin/k8s/1.27.1-linux-amd64/etcd --advertise-client-urls=http://127.0.0.1:35395 --data-dir=/
danails+  345083  9.6  2.0 1102000 335640 pts/7  Sl+  15:43   0:14  |                       \_ /home/danailster/workspace/korifi/testbin/k8s/1.27.1-linux-amd64/kube-apiserver --allow-privileged=true --authorization-mode=RBAC

  • kill -SIGQUIT 344765 manages to kill the first test process; Ginkgo produces no output at this point

  • Kindly asking the second process (PID 344769) to stop has no effect. kill -9 344769 as a last resort made it stop; at this point Ginkgo spilled quite a long goroutine dump. As I do not know what to look for, I am not pasting 8K lines here. Hopefully anyone can reproduce it (the commands involved are sketched below).
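
A rough sketch of the reproduction and diagnosis steps above. The <pid> placeholders stand for whichever workloads.test processes are stuck; the pgrep pattern and exact invocations are taken from the findings above and may need adjusting locally:

# bump Go to 1.21.0, then loop the controllers suite until it gets stuck
UNTIL_IT_FAILS=true GINKGO_NODES=8 make -C controllers test

# once it hangs, locate the stuck test processes
pgrep -af workloads.test

# SIGQUIT makes the Go runtime dump all goroutine stacks and exit
kill -SIGQUIT <pid>

# last resort, if the process ignores SIGQUIT
kill -9 <pid>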

danail-branekov added a commit to eirini-forks/eirini-home that referenced this issue Sep 8, 2023
Golang 1.21 causes some of our tests to hang:
cloudfoundry/korifi#2768

Co-authored-by: Danail Branekov <danailster@gmail.com>
Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
danail-branekov (Member) commented:

Reproducible with golang 1.21.1 as well

danail-branekov added a commit to eirini-forks/eirini-station that referenced this issue Sep 8, 2023
Golang 1.21 causes some of our tests to hang:
cloudfoundry/korifi#2768

Co-authored-by: Danail Branekov <danailster@gmail.com>
danail-branekov (Member) commented:

There is some evidence on the internet that race detection can cause test time regressions with Go 1.21: golang/go#61852

It might be completely unrelated, but we could try turning race detection off and see whether the tests are still getting stuck.
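
A minimal way to check this locally, assuming run-tests.sh is what adds --race to the Ginkgo invocation seen in the process tree above (the exact flags in the script may differ), would be to run the suspect suite directly with and without the flag:

cd controllers/controllers/workloads

# without the race detector
ginkgo --procs=8 --randomize-all --until-it-fails .

# with the race detector, as the pipeline presumably runs it
ginkgo --procs=8 --randomize-all --race --until-it-fails .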

davewalter (Member) commented:

I can run the tests with Go 1.21.4 locally. Are we able to use that version in our GitHub Actions?

julian-hj (Member) commented:

It looks to me like, if we just revert the change that pins the version, we will get 1.21.5, since that is what golang:latest currently points to.

julian-hj (Member) commented:

...but I was able to reproduce the hanging issue with Go 1.21.4, even on a Mac. I am trying it again without the --race argument to see if it behaves better.

julian-hj (Member) commented:

It does appear that --race is involved. I ran

ginkgo --procs=4 --randomize-all --until-it-fails .

from the controllers/controllers/workloads directory a couple of times. The first time, it finally crashed after 83 passes. The second time I gave up and stopped it after 75 attempts. I didn't see any evidence that it was leaking memory.

After adding --race, it hung on the second pass. I ran it again and it hung on the 8th pass.

I am retesting it now with --race turned on, but single-threaded.
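
For reference, the three variants boil down to roughly the following, assuming "single-threaded" here means --procs=1 (flags other than --race and --procs kept as in the runs above):

cd controllers/controllers/workloads

# no race detector, 4 parallel procs: survived 75+ passes
ginkgo --procs=4 --randomize-all --until-it-fails .

# race detector on, 4 parallel procs: hung on the 2nd and 8th pass
ginkgo --procs=4 --randomize-all --race --until-it-fails .

# race detector on, single process: the run currently being retested
ginkgo --procs=1 --randomize-all --race --until-it-fails .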

Birdrock (Member) commented:

Running the test that @julian-hj ran with Go 1.21.5 on Ubuntu. So far, so good, but I will let it continue for a bit.

danail-branekov (Member) commented:

The tests are no longer hanging and we bumped Go quite some time ago, so I am closing this issue.
