
Nightly unit test failure due to race condition when initializing config stores #2815

Closed
dibyom opened this issue Jun 12, 2020 · 17 comments

Labels
area/testing Issues or PRs related to testing · kind/bug Categorizes issue or PR as related to a bug. · kind/flake Categorizes issue or PR as related to a flakey test

Comments

@dibyom
Member

dibyom commented Jun 12, 2020

Expected Behavior

Nightly build unit test passes

Actual Behavior

Nightly build unit test failed:

TaskRun is: pipeline-release-nightly-gkd24-unit-tests-l2pdb

Failure:

  --- FAIL: TestReconcile_ExplicitDefaultSA/success (0.08s)

Additional Info

  • Tekton Pipeline version: v0.13.2

  • I tried to reproduce this locally and failed:

 go test -race -cover -count=10 -run="TestReconcile_ExplicitDefaultSA" ./pkg/reconciler/taskrun
@dibyom dibyom added the kind/flake Categorizes issue or PR as related to a flakey test label Jun 12, 2020
@vdemeester
Member

@dibyom any logs below the FAIL:? (to see where it failed)
/area testing

@tekton-robot tekton-robot added the area/testing Issues or PRs related to testing label Jun 15, 2020
@bobcatfish bobcatfish added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 15, 2020
@dibyom
Member Author

dibyom commented Jun 15, 2020

Nothing unfortunately:

[unit-test] --- FAIL: TestReconcile_ExplicitDefaultSA (0.14s)
[unit-test]     --- FAIL: TestReconcile_ExplicitDefaultSA/success (0.08s)
[unit-test] FAIL
[unit-test] FAIL        github.com/tektoncd/pipeline/pkg/reconciler/taskrun     22.590s
[unit-test] FAIL

The command I used (you have to be connected to the dogfooding cluster) is:

tkn tr logs pipeline-release-nightly-gkd24-unit-tests-l2pdb

@dlorenc
Contributor

dlorenc commented Jul 20, 2020

This looks like it's passing now: https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/pipeline-release-nightly-snb9k

Should this be closed/dropped in priority? It's still marked critical-urgent.

@bobcatfish
Collaborator

I think the point was that this is non-deterministic; let's leave it open until we feel we understand why the race occurred.

@ghost

ghost commented Jul 24, 2020

This test has failed during nightlies on two of my build cop rotations in a row. I'll assign this to myself and see if I can figure out what's triggering the random behaviour.

@ghost ghost self-assigned this Jul 24, 2020
@ghost

ghost commented Jul 24, 2020

Error capture from latest nightly: https://gist.github.com/sbwsg/ce24e70e389e205dcb9d0890afc4193c

* failed to create task run pod "test-taskrun-run-success": translating TaskSpec to Pod: serviceaccounts "default" not found. Maybe missing or invalid Task foo/test-task

@afrittoli afrittoli added this to the Pipelines v0.16 milestone Aug 10, 2020
@ghost

ghost commented Aug 18, 2020

I'm not able to reproduce this. Tried with various combinations of -cpu, -race and -count with no failures. I'm going to unassign myself because I'm not sure how to make progress.

@ghost ghost removed their assignment Aug 18, 2020
@dibyom dibyom removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Aug 24, 2020
@jerop
Member

jerop commented Aug 28, 2020

@ghost

ghost commented Oct 5, 2020

Closing for now. We've merged a couple of fixes towards this, so we should reopen this issue or start a new one if we see the problem surface again.

@ghost ghost closed this as completed Oct 5, 2020
@bobcatfish
Collaborator

Nice!!! Thanks @sbwsg and @jerop :D

@bobcatfish
Collaborator

Looks like this failed last night (https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/pipeline-release-nightly-qnsg6?pipelineTask=unit-tests&step=unit-test):

[unit-test] 2020-10-09T02:04:12.139Z    ERROR   TestReconcile_ExplicitDefaultSA/success taskrun/reconciler.go:294       Returned an error       {"targetMethod": "ReconcileKind", "targetMethod": "ReconcileKind", "error": "1 error occurred:\n\t* failed to create task run pod \"test-taskrun-run-success\": translating TaskSpec to Pod: serviceaccounts \"default\" not found. Maybe missing or invalid Task foo/test-task\n\n"}
[unit-test] github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile
[unit-test]     /workspace/src/github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:294
[unit-test] github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestReconcile_ExplicitDefaultSA.func1
[unit-test]     /workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:532
[unit-test] testing.tRunner
[unit-test]     /usr/local/go/src/testing/testing.go:1127
[unit-test]     taskrun_test.go:533: expected no error. Got error 1 error occurred:
[unit-test]             * failed to create task run pod "test-taskrun-run-success": translating TaskSpec to Pod: serviceaccounts "default" not found. Maybe missing or invalid Task foo/test-task
...
[unit-test] --- FAIL: TestReconcile_ExplicitDefaultSA (0.17s)
[unit-test]     --- FAIL: TestReconcile_ExplicitDefaultSA/success (0.03s)
[unit-test]     --- PASS: TestReconcile_ExplicitDefaultSA/serviceaccount (0.14s)

@ghost

ghost commented Nov 2, 2020

Failed again last night (https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/pipeline-release-nightly-wpbpz?pipelineTask=unit-tests&step=unit-test):

=== RUN   TestReconcile_ExplicitDefaultSA/success
2020-11-02T02:05:13.149Z	ERROR	TestReconcile_ExplicitDefaultSA/success	taskrun/taskrun.go:394	Failed to create task run pod for taskrun "test-taskrun-run-success": failed to create task run pod "test-taskrun-run-success": translating TaskSpec to Pod: serviceaccounts "default" not found. Maybe missing or invalid Task foo/test-task
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).reconcile
	/workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:394
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).ReconcileKind
	/workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:192
github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile
	/workspace/src/github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:240
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestReconcile_ExplicitDefaultSA.func1
	/workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:536
testing.tRunner
	/usr/local/go/src/testing/testing.go:1123
2020-11-02T02:05:13.149Z	ERROR	TestReconcile_ExplicitDefaultSA/success	taskrun/taskrun.go:193	Reconcile: failed to create task run pod "test-taskrun-run-success": translating TaskSpec to Pod: serviceaccounts "default" not found. Maybe missing or invalid Task foo/test-task
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).ReconcileKind
	/workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:193
github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile
	/workspace/src/github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:240
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestReconcile_ExplicitDefaultSA.func1
	/workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:536
testing.tRunner
	/usr/local/go/src/testing/testing.go:1123

@bobcatfish bobcatfish removed their assignment Nov 2, 2020
@dibyom
Member Author

dibyom commented Nov 2, 2020

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 2, 2020
@dibyom dibyom removed this from the Pipelines v0.18 milestone Nov 2, 2020
@bobcatfish bobcatfish self-assigned this Nov 3, 2020
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Nov 24, 2020
This commit hacks in a test that tries to reproduce the issue in tektoncd#2815.
@bobcatfish
Collaborator

Okay so I think I know what's going on here and I think it's a bug (though not one that would be hit very often, or maybe ever).

🐛 🐛 🐛
TL;DR: There is a race where the config store sometimes hasn't loaded default values before the test code executes, because it loads them in separate goroutines. This could manifest in actual problems if TaskRuns are created immediately on controller startup (though it's quite unlikely).
🐛 🐛 🐛

Reproducing

Because of the way the reconciler code is generated I can't easily reproduce it in a test that we can commit back, but I hacked together a test that usually reproduced it in bobcatfish@1fbc388

Even then, it only reproduced the issue every once in a while - running doit.sh will usually cause the test to fail eventually (but it's running a test that spawns 200 subtests 1000 times!)

(The other way to more easily make this failure happen is to introduce a sleep into InformedWatcher.addConfigMapEvent, i.e. delay processing of the config map by the config store such that the test code always executes first.)

What's happening?

I believe that what is happening is:

  1. When the config store informed watcher starts, it starts goroutines to handle incoming events.
  2. When this all starts for the first time, the shared informer will trigger "on changed" (added) events for the config maps so that the config store will process their values.
  3. There is a race: if the goroutines started in (1) run after the assertions in the test, then the config store will not have been updated with the config map values and will fall back to the defaults (i.e. not the CONFIGURED default in the configmap, but the hard-coded fallback default, literally default in the case of the service account).

What this means is that there is a (very small) chance that if a TaskRun is created immediately after a controller starts up, the config store might not yet have loaded the default values and will create the TaskRun using the fallback hard-coded default values rather than whatever is configured in the config maps (this would apply to all values configurable via config maps).
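To make the ordering concrete, here is a minimal, self-contained sketch of this kind of race. The configStore type and applyConfigMap function are made up for illustration; this is not the actual Tekton or knative/pkg code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// configStore stands in for the reconciler's config store: it serves a
// hard-coded fallback until a configmap update has been applied.
type configStore struct {
	mu        sync.RWMutex
	defaultSA string
}

func (c *configStore) ServiceAccount() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.defaultSA == "" {
		return "default" // hard-coded fallback
	}
	return c.defaultSA
}

func (c *configStore) applyConfigMap(sa string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.defaultSA = sa
}

func main() {
	store := &configStore{}

	// Steps (1)/(2): the watcher delivers the initial "add" event for the
	// configmap on its own goroutine, some time after startup.
	go func() {
		time.Sleep(10 * time.Millisecond) // simulated informer latency
		store.applyConfigMap("configured-sa")
	}()

	// Step (3): if the reconciler (or the test's assertions) runs before
	// that goroutine, it observes the fallback rather than the configured
	// value.
	fmt.Println(store.ServiceAccount()) // likely prints "default"
}
```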

What SHOULD be happening?

I think this might be the same problem that is causing the knative/pkg folks to see a flakey test knative/pkg#1907

If you look at the code that fails in that case:
https://github.com/knative/pkg/blob/c5296cd61e9f34e183f8010a40dc866f895252b9/configmap/informed_watcher_test.go#L86-L92

The comment says:

> When Start returns the callbacks should have been called with the version of the objects that is available.

So it seems the intention is that when Start returns, the callbacks have been called (and in the config store case, this means the callbacks that process the config map), but the flakes in this test and the knative test are showing that this isn't the case.

I think the fix is to introduce that guarantee; i.e. that after Start has returned, the callbacks have been called with the available version of the objects, so I'm gonna update that bug with my findings and see if the knative folks agree.
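As a rough illustration of that guarantee, a Start that blocks until every watched configmap has been delivered at least once could look something like the sketch below. The watcher type, onUpdate callback, and hard-coded delay are all illustrative assumptions, not the real knative/pkg API.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// watcher is a stand-in for an informed watcher that knows up front which
// configmaps it is watching.
type watcher struct {
	watched  []string
	onUpdate func(name, value string)
}

// Start kicks off event delivery and only returns once every watched
// configmap has been handed to the callback at least once, so callers can
// rely on configured (not fallback) values afterwards.
func (w *watcher) Start() {
	var initial sync.WaitGroup
	initial.Add(len(w.watched))
	for _, name := range w.watched {
		name := name
		go func() {
			time.Sleep(5 * time.Millisecond) // simulated initial "add" event
			w.onUpdate(name, "configured-value")
			initial.Done() // in real code, only the first event per name would count
		}()
	}
	initial.Wait()
}

func main() {
	var mu sync.Mutex
	values := map[string]string{}
	w := &watcher{
		watched: []string{"config-defaults", "feature-flags"},
		onUpdate: func(name, value string) {
			mu.Lock()
			defer mu.Unlock()
			values[name] = value
		},
	}
	w.Start()
	// By the time Start returns, the callback has run for every configmap.
	fmt.Println(values["config-defaults"]) // "configured-value"
}
```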

@bobcatfish
Collaborator

(Oh yeah, one last piece of trivia around this bug that I thought was interesting: this was reported June 12, and #2637, which made us start using the config map watcher to hot-reload these values, was merged on June 8.)

@bobcatfish bobcatfish changed the title Nightly build failure due to unit test failure Nightly build failure due to race condition when initializing config stores Dec 2, 2020
@bobcatfish bobcatfish changed the title Nightly build failure due to race condition when initializing config stores Nightly unit test failure due to race condition when initializing config stores Dec 2, 2020
@bobcatfish
Collaborator

Being able to fix this will depend on how the knative folks want to handle the underlying issue: knative/pkg#1960

@bobcatfish
Collaborator

Created #3720 to describe what I think is the root cause of this and some of our other flakey tests.

bobcatfish added a commit to bobcatfish/pkg that referenced this issue Feb 23, 2021
When using the informed watcher to watch a config map, add events were
previously processed in a goroutine with no synchronization, so code
could try to access the values backed by the configmaps before they
were initialized.

This commit makes it so that the Start method of the informer will wait
for the add event to occur at least once for all config maps it is
watching.

This commit also undoes the workaround added in knative#1929 which was working
around the race condition identified in knative#1907 (and in
tektoncd/pipeline#3720). This means that if
the synchronization was removed, the impacted test would start flaking
again. If we wanted it to reliably fail in that case, we could introduce
a sleep in the callback but that doesn't seem worth it.

I also tested this change by manually patching the changes into my
clone of tektoncd/pipeline and following the repro steps at
tektoncd/pipeline#2815 (comment)
Before the change I can reproduce the issue, and after the change, I
can't! :D
knative-prow-robot pushed a commit to knative/pkg that referenced this issue Mar 2, 2021
* Fix race: Make informed watcher start wait for Add event 🏎️

When using the informed watcher to watch a config map, add events were
previously processed in a goroutine with no synchronization, so code
could try to access the values backed by the configmaps before they
were initialized.

This commit makes it so that the Start method of the informer will wait
for the add event to occur at least once for all config maps it is
watching.

This commit also undoes the workaround added in #1929 which was working
around the race condition identified in #1907 (and in
tektoncd/pipeline#3720). This means that if
the synchronization was removed, the impacted test would start flaking
again. If we wanted it to reliably fail in that case, we could introduce
a sleep in the callback but that doesn't seem worth it.

I also tested this change by manually patching the changes into my
clone of tektoncd/pipeline and following the repro steps at
tektoncd/pipeline#2815 (comment)
Before the change I can reproduce the issue, and after the change, I
can't! :D

Fixes #1960

* Make synced callback and named wait group private 🕵️

These utility objects don't really make sense to expose as part of the
informed watcher package and are only used by the informed watcher.
Writing tests for unexported code makes me a bit :( but hopefully these
will get moved to some other package one day. And there's really no
reason to expose these to users of knative/pkg at the moment.
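For context on the "named wait group" mentioned above, a helper along these lines would let Start wait until each watched configmap has been seen once while ignoring repeat events. This sketch is only an assumption about the shape of that private helper, not the actual knative/pkg code.

```go
package main

import (
	"fmt"
	"sync"
)

// namedWaitGroup waits until Done has been called at least once for every
// registered name; repeated events for the same name are ignored.
type namedWaitGroup struct {
	mu      sync.Mutex
	pending map[string]struct{}
	wg      sync.WaitGroup
}

func newNamedWaitGroup() *namedWaitGroup {
	return &namedWaitGroup{pending: map[string]struct{}{}}
}

// Add registers a name to wait for; duplicates are only counted once.
func (n *namedWaitGroup) Add(name string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if _, ok := n.pending[name]; !ok {
		n.pending[name] = struct{}{}
		n.wg.Add(1)
	}
}

// Done marks a name as seen; later events for the same name are no-ops.
func (n *namedWaitGroup) Done(name string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if _, ok := n.pending[name]; ok {
		delete(n.pending, name)
		n.wg.Done()
	}
}

// Wait blocks until every registered name has been marked Done at least once.
func (n *namedWaitGroup) Wait() { n.wg.Wait() }

func main() {
	nwg := newNamedWaitGroup()
	nwg.Add("config-defaults")
	nwg.Add("feature-flags")
	go nwg.Done("config-defaults")
	go nwg.Done("feature-flags")
	nwg.Wait()
	fmt.Println("initial add events observed for all watched configmaps")
}
```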