
TestInformedWatcher flaky #1907

Closed
runzexia opened this issue Nov 14, 2020 · 13 comments · Fixed by #1929

Labels: area/test-and-release, kind/bug

Comments

@runzexia (Contributor)

/area test-and-release
/kind bug

Expected Behavior

The test always succeeds.

Actual Behavior

The test failed in https://github.com/knative/pkg/runs/1399755328; after a rerun, it succeeded.

Steps to Reproduce the Problem

Not easy to reproduce, but my guess is that it's because informer.HasSynced does not mean that all of the EventHandlers have been called.
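
To make that guess concrete, here is a minimal sketch of the suspected race using client-go's fake client (my own illustration, not code from this repo): WaitForCacheSync returns once the informer's cache is populated, but the Add handler is dispatched from the shared informer's own notification queue and may run later.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// A ConfigMap that exists before the informer starts, so the
	// initial list produces an Add event for it.
	client := fake.NewSimpleClientset(&corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "config-defaults", Namespace: "default"},
	})

	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Core().V1().ConfigMaps().Informer()

	added := make(chan string, 1)
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			added <- obj.(*corev1.ConfigMap).Name
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// WaitForCacheSync only promises the informer's *cache* is up to date...
	cache.WaitForCacheSync(stopCh, informer.HasSynced)

	// ...the Add callback runs on a separate notification goroutine,
	// so it may or may not have fired by this point.
	select {
	case name := <-added:
		fmt.Println("handler already ran for", name)
	case <-time.After(10 * time.Millisecond):
		fmt.Println("cache synced, but handler has not been called yet")
	}
}
```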

@knative-prow-robot added the area/test-and-release and kind/bug labels Nov 14, 2020

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Nov 24, 2020
This commit hacks in a test that tries to reproduce the issue in tektoncd#2815.
Even then, it only reproduced the issue every once in a while - running
doit.sh will usually cause the test to fail eventually (but it's running
a test that spawns 200 subtests 1000 times!)

(The other way to more easily make this failure happen is to introduce
a sleep into InformedWatcher.addConfigMapEvent, i.e. delay processing of
the config map by the config store such that the test code always
executes first.)

I believe that what is happening is:
1. When the config store informed watcher starts, it starts go routines
   to handle incoming events
2. When this all starts for the first time, the shared informer will
   trigger "on changed" (added) events for the config maps so that
   the config store will process their values.
3. There is a race: if the goroutines started in (1) happen after the
   assertions in the test, then the config store will not be updated
   with the config map values and will resort to the defaults
   (i.e. not the CONFIGURED default in the configmap, but the hard-coded
   fallback default, literally `default` in the case of the service
   account).

What this means is that there is a (very small) chance that if a TaskRun
is created immediately after a controller starts up, the config store
might not have yet loaded the default values and will create the TaskRun
using the fall back hard coded default values and not whatever is
configured in the config maps (this would apply to all values
configurable via config maps).

I think this might be the same problem that is causing the knative/pkg
folks to see a flaky test: knative/pkg#1907

If you look at the code that fails in that case:
https://github.com/knative/pkg/blob/c5296cd61e9f34e183f8010a40dc866f895252b9/configmap/informed_watcher_test.go#L86-L92

The comment says:

> When Start returns the callbacks should have been called with the version of the objects that is available.

So it seems the intention is that when Start returns, the callbacks have
been called (and in the config store case, this means the callbacks that
process the config map), but the flakes in this test and the knative
test are showing that this isn't the case.

I think the fix is to introduce that guarantee; i.e. that after Start
has returned, the callbacks have been called with the available version
of the objects.
bobcatfish added a commit to bobcatfish/pkg that referenced this issue Nov 24, 2020
By narrowing down the test and introducing some looping (maybe too much
looping? I dunno!) I can usually reproduce this issue eventually by
running `doit.sh`.

For example:

```
+ go test ./configmap/... -run TestInformedWatcher -count=1
--- FAIL: TestInformedWatcher (0.00s)
    --- FAIL: TestInformedWatcher/AGAIN-175 (0.00s)
        informed_watcher_test.go:94: foo1.count = 0, want 1
        informed_watcher_test.go:94: foo2.count = 0, want 1
        informed_watcher_test.go:94: bar.count = 0, want 1
FAIL
```
@bobcatfish (Contributor)

Hi @runzexia! I think this might be an actual bug; we are running into this in Tekton as well, causing tektoncd/pipeline#2815 (longer explanation in tektoncd/pipeline#2815 (comment)).

Looking at the assertion that's failing:

```go
// When Start returns the callbacks should have been called with the
// version of the objects that is available.
for _, obj := range []*counter{foo1, foo2, bar} {
	if got, want := obj.count(), 1; got != want {
		t.Errorf("%v.count = %d, want %d", obj.name, got, want)
	}
}
```

It looks like the intention is that after calling cmw.Start, all of the callbacks have been called, but these flakes are showing that that isn't always the case (which I think makes sense b/c the callbacks are called in separate go routines).

I hacked the failing test to run in a loop and added a script to run it repeatedly in bobcatfish@32da803 and every time I run doit.sh I eventually see a failure such as:

```
+ for i in '`seq 1 1000`'
+ go test ./configmap/... -run TestInformedWatcher -count=1
--- FAIL: TestInformedWatcher (0.00s)
    --- FAIL: TestInformedWatcher/AGAIN-140 (0.00s)
        informed_watcher_test.go:94: foo1.count = 0, want 1
FAIL
FAIL	knative.dev/pkg/configmap	2.319s
```

In Tekton this means that there is a chance that when a controller starts, the values returned by the config store using this watcher might not return the right values (if the go routines that update the config store have not completed).

So I think that InformedWatcher.Start isn't functioning as intended and this is a bug. The fix would be to make InformedWatcher.Start actually guarantee that those callbacks have been invoked before returning. What do you think?

(@mattmoor, this might be relevant to your interests!)

@mattmoor (Member)

That'd be a good find. cc @dprotaso

> which I think makes sense b/c the callbacks are called in separate go routines

Where do you mean? Within the informer itself?

@bobcatfish (Contributor)

> which I think makes sense b/c the callbacks are called in separate go routines
>
> Where do you mean? Within the informer itself?

  1. InformedWatcher.Start calls
  2. InformedWatcher.registerCallbackAndStartInformer, which sets up all the event handlers and then calls
  3. sharedInformerFactory.Start, which calls
  4. go informer.Run, which (hand-waving a bit) I think pops events from a queue and ultimately invokes a Process function that calls the callbacks from within that goroutine?? (See the sketch below.)
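
A highly simplified model of that mechanism (my own sketch; the real client-go processorListener is more elaborate): each registered handler gets a listener with its own channel, drained by a dedicated goroutine, which is why callback delivery is decoupled from the cache sync.

```go
// listener is a simplified stand-in for client-go's processorListener:
// each handler drains its own channel in a separate goroutine.
type listener struct {
	nextCh chan interface{}
	onAdd  func(obj interface{})
}

// distribute delivers an event to the listener's queue (the real code
// buffers with an internal ring buffer; simplified here).
func (l *listener) distribute(obj interface{}) {
	l.nextCh <- obj
}

// run is started as `go l.run()` by the informer's processor: callbacks
// fire here, possibly well after the cache already reports HasSynced.
func (l *listener) run() {
	for obj := range l.nextCh {
		l.onAdd(obj)
	}
}
```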

@mattmoor (Member)

I would expect the wait for sync to synchronize things, if the go is just the informer.Run 🤔

```go
if ok := cache.WaitForCacheSync(stopCh, i.informer.Informer().HasSynced); !ok {
```

@runzexia (Contributor Author)

> I would expect the wait for sync to synchronize things, if the go is just the informer.Run 🤔
>
> `if ok := cache.WaitForCacheSync(stopCh, i.informer.Informer().HasSynced); !ok {`

I think i.informer.Informer().HasSynced does not guarantee that the callbacks have been invoked; that's the problem.

@mattmoor (Member)

That is definitely plausible, but it would be good to confirm whether the informer events are done asynchronously, and where that happens. Just want to make sure we understand what's going on. 👍

@bobcatfish (Contributor)

> would be good to confirm whether the informer events are done asynchronously, and where that happens

It seems like the only place the event handlers (OnAdd, OnUpdate, OnDelete) are called is the Process function (or the OnUpdate function, which is itself ultimately called from Process), and Process seems to be invoked only via informer.Run (via processLoop). So I think the event handler callbacks are only ever invoked asynchronously? (But maybe there is some other code in play?)

@dprotaso (Member)

> I think i.informer.Informer().HasSynced does not guarantee that the callbacks have been invoked; that's the problem.

@runzexia is right.

> so I think the event handler callbacks are only invoked asynchronously

@bobcatfish is right.

@dprotaso (Member)

dprotaso commented Nov 26, 2020

Digging into the code myself: HasSynced is true once the informer's backing cache is up to date.

Notifying handlers is async and can be delayed (it has its own queue).

@bobcatfish (Contributor)

Hey @dprotaso - it looks like #1929 resolved this issue, but it only changes test code. Does this mean we're okay with the potential race condition?

I.e. if you use the InformedWatcher too quickly (specifically, before the event-handling goroutines have run), you won't get the value you wanted?

Over in Tekton Pipelines, where we use this to read configuration from config maps, I wouldn't want to risk using the wrong value (however unlikely that is).

If we don't feel like InformedWatcher.Start should wait for the first sync to happen, I'd like to request that we expose some way to wait for this (one possible shape sketched below). Let me know if I need to open another issue.
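
For illustration, one shape such a hook could take (a hypothetical helper built on knative/pkg's configmap.Watcher interface; firstValueSeen is my name, not an existing API): register an extra observer whose first invocation closes a channel, so the caller can block on that channel after starting the watcher.

```go
import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/configmap"
)

// firstValueSeen registers an extra observer for the named ConfigMap and
// returns a channel that is closed the first time the watcher delivers it.
// Register before Start, then wait on the channel after Start returns.
func firstValueSeen(w configmap.Watcher, name string) <-chan struct{} {
	seen := make(chan struct{})
	var once sync.Once
	w.Watch(name, func(*corev1.ConfigMap) {
		once.Do(func() { close(seen) })
	})
	return seen
}
```

A caller could then select on the returned channel with a timeout before serving traffic, failing fast if the initial value never arrives.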

@dprotaso (Member)

dprotaso commented Dec 1, 2020

Yeah, it's not guaranteed. I guess the severity would depend; i.e. sharedmain seems to wait for configmap informers first by calling Start on the watcher. Then it waits again for another set of informers that have been set up via injection.

I think it'd be hard to provide guarantees that a set of event handlers was invoked. You may want to just fail reconciliation if some configuration you expect isn't present.

It's worth discussing, so feel free to open a new issue.

@bobcatfish (Contributor)

Okay @dprotaso I've opened #1960 !

knative-prow-robot pushed a commit that referenced this issue Mar 2, 2021
* Fix race: Make informed watcher start wait for Add event 🏎️

When using the informed watcher to watch a config map, previously add
events were processed in a goroutine with no synchronization, making it
possible for code to try to access the values backed by the configmaps
before they are initialized.

This commit makes it so that the Start method of the informer will wait
for the add event to occur at least once for all config maps it is
watching.

This commit also undoes the workaround added in #1929 which was working
around the race condition identified in #1907 (and in
tektoncd/pipeline#3720). This means that if
the synchronization was removed, the impacted test would start flaking
again. If we wanted it to reliably fail in that case, we could introduce
a sleep in the callback but that doesn't seem worth it.

I also tested this change by manually patching the changes into my
clone of tektoncd/pipeline and following the repro steps at
tektoncd/pipeline#2815 (comment)
Before the change I can reproduce the issue, and after the change, I
can't! :D

Fixes #1960

* Make synced callback and named wait group private 🕵️

These utility objects don't really make sense to expose as part of the
informed watcher package and are only used by the informed watcher.
Writing tests for unexported code makes me a bit :( but hopefully these
will get moved to some other package one day. And there's really no
reason to expose these to users of knative/pkg at the moment.
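
For reference, a minimal sketch of the "named wait group" idea described above (reconstructed from the commit message; the actual knative/pkg code may differ): Start adds every watched ConfigMap's key, the synced callback marks each key Done on its first Add event, and Start blocks on Wait before returning.

```go
import "sync"

// namedWaitGroup waits until Done has been called once for every key
// that was Added; duplicate Adds or Dones for a key are ignored.
type namedWaitGroup struct {
	mu   sync.Mutex
	keys map[string]struct{}
	wg   sync.WaitGroup
}

func newNamedWaitGroup() *namedWaitGroup {
	return &namedWaitGroup{keys: map[string]struct{}{}}
}

// Add registers a key to wait for, e.g. each watched ConfigMap name.
func (n *namedWaitGroup) Add(key string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if _, ok := n.keys[key]; !ok {
		n.keys[key] = struct{}{}
		n.wg.Add(1)
	}
}

// Done marks a key as seen, e.g. called from the Add event handler.
func (n *namedWaitGroup) Done(key string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if _, ok := n.keys[key]; ok {
		delete(n.keys, key)
		n.wg.Done()
	}
}

// Wait blocks until every added key has been marked Done; Start would
// call this after starting the informer factory, before returning.
func (n *namedWaitGroup) Wait() {
	n.wg.Wait()
}
```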