
TestInformedWatcher flaky #1907

Closed
runzexia opened this issue Nov 14, 2020 · 13 comments · Fixed by #1929

Labels: area/test-and-release, kind/bug

Comments

@runzexia (Contributor)

/area test-and-release
/kind bug

Expected Behavior

The test always succeeds.

Actual Behavior

The test failed in https://github.com/knative/pkg/runs/1399755328; after a rerun, it succeeded.

Steps to Reproduce the Problem

Not easy to reproduce, but my guess is that it's because informer.HasSynced does not mean that all of the EventHandlers have been called.
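
To make that guess concrete, here is a minimal sketch of the suspected race using client-go's fake client (my own illustration, not code from this repo): WaitForCacheSync returns once the informer's cache is populated, but the Add handler is dispatched from the shared informer's own notification queue and may run later.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// A ConfigMap that exists before the informer starts, so the
	// initial list produces an Add event for it.
	client := fake.NewSimpleClientset(&corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "config-defaults", Namespace: "default"},
	})

	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Core().V1().ConfigMaps().Informer()

	added := make(chan string, 1)
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			added <- obj.(*corev1.ConfigMap).Name
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// WaitForCacheSync only promises the informer's *cache* is up to date...
	cache.WaitForCacheSync(stopCh, informer.HasSynced)

	// ...the Add callback runs on a separate notification goroutine,
	// so it may or may not have fired by this point.
	select {
	case name := <-added:
		fmt.Println("handler already ran for", name)
	case <-time.After(10 * time.Millisecond):
		fmt.Println("cache synced, but handler has not been called yet")
	}
}
```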

@knative-prow-robot added the area/test-and-release and kind/bug labels Nov 14, 2020

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Nov 24, 2020
This commit hacks in a test that tries to reproduce the issue in tektoncd#2815.
Even then, it only reproduced the issue every once in a while - running
doit.sh will usually cause the test to fail eventually (but it's running
a test that spawns 200 subtests 1000 times!)

(The other way to more easily make this failure happen is to introduce
a sleep into InformedWatcher.addConfigMapEvent, i.e. delay processing of
the config map by the config store such that the test code always
executes first.)

I believe that what is happening is:
1. When the config store informed watcher starts, it starts go routines
   to handle incoming events
2. When this all starts for the first time, the shared informer will
   trigger "on changed" (added) events for the config maps so that
   the config store will process their values.
3. There is a race: if the goroutines started in (1) happen after the
   assertions in the test, then the config store will not be updated
   with the config map values and will resort to the defaults
   (i.e. not the CONFIGURED default in the configmap, but the hard-coded
   fallback default, literally `default` in the case of the service
   account).

What this means is that there is a (very small) chance that if a TaskRun
is created immediately after a controller starts up, the config store
might not have yet loaded the default values and will create the TaskRun
using the fall back hard coded default values and not whatever is
configured in the config maps (this would apply to all values
configurable via config maps).

I think this might be the same problem that is causing the knative/pkg
folks to see a flaky test: knative/pkg#1907

If you look at the code that fails in that case:
https://github.com/knative/pkg/blob/c5296cd61e9f34e183f8010a40dc866f895252b9/configmap/informed_watcher_test.go#L86-L92

The comment says:

> When Start returns the callbacks should have been called with the version of the objects that is available.

So it seems the intention is that when Start returns, the callbacks have
been called (and in the config store case, this means the callbacks that
process the config map), but the flakes in this test and the knative
test are showing that this isn't the case.

I think the fix is to introduce that guarantee; i.e. that after Start
has returned, the callbacks have been called with the available version
of the objects.
bobcatfish added a commit to bobcatfish/pkg that referenced this issue Nov 24, 2020
By narrowing down the test and introducing some looping (maybe too much
looping? I dunno!) I can usually reproduce this issue eventually by
running `doit.sh`.

For example:

```
+ go test ./configmap/... -run TestInformedWatcher -count=1
--- FAIL: TestInformedWatcher (0.00s)
    --- FAIL: TestInformedWatcher/AGAIN-175 (0.00s)
        informed_watcher_test.go:94: foo1.count = 0, want 1
        informed_watcher_test.go:94: foo2.count = 0, want 1
        informed_watcher_test.go:94: bar.count = 0, want 1
FAIL
```
@bobcatfish (Contributor)

Hi @runzexia! I think this might be an actual bug; we are running into this in Tekton as well, causing tektoncd/pipeline#2815 (longer explanation in tektoncd/pipeline#2815 (comment)).

Looking at the assertion that's failing:

```go
// When Start returns the callbacks should have been called with the
// version of the objects that is available.
for _, obj := range []*counter{foo1, foo2, bar} {
	if got, want := obj.count(), 1; got != want {
		t.Errorf("%v.count = %d, want %d", obj.name, got, want)
	}
}
```

It looks like the intention is that after calling cmw.Start, all of the callbacks have been called, but these flakes are showing that that isn't always the case (which I think makes sense b/c the callbacks are called in separate go routines).

I hacked the failing test to run in a loop and added a script to run it repeatedly in bobcatfish@32da803 and every time I run doit.sh I eventually see a failure such as:

```
+ for i in '`seq 1 1000`'
+ go test ./configmap/... -run TestInformedWatcher -count=1
--- FAIL: TestInformedWatcher (0.00s)
    --- FAIL: TestInformedWatcher/AGAIN-140 (0.00s)
        informed_watcher_test.go:94: foo1.count = 0, want 1
FAIL
FAIL	knative.dev/pkg/configmap	2.319s
```

In Tekton this means that there is a chance that when a controller starts, the values returned by the config store using this watcher might not return the right values (if the go routines that update the config store have not completed).

So I think that InformedWatcher.Start isn't functioning as intended and this is a bug. The fix would be to make InformedWatcher.Start actually guarantee that those callbacks have been invoked before returning. What do you think?

(@mattmoor, this might be relevant to your interests!)

@mattmoor (Member)

That'd be a good find. cc @dprotaso

> which I think makes sense b/c the callbacks are called in separate go routines

Where do you mean? Within the informer itself?

@bobcatfish (Contributor)

> which I think makes sense b/c the callbacks are called in separate go routines
>
> Where do you mean? Within the informer itself?

  1. InformedWatcher.Start calls
  2. InformedWatcher.registerCallbackAndStartInformer, which sets up all the event handlers and then calls
  3. sharedInformerFactory.Start, which calls
  4. go informer.Run, which (hand-waving a bit) I think pops events from a queue and ultimately invokes a Process function that calls the callbacks from within that goroutine?? (See the sketch below.)
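
A highly simplified model of that mechanism (my own sketch; the real client-go processorListener is more elaborate): each registered handler gets a listener with its own channel, drained by a dedicated goroutine, which is why callback delivery is decoupled from the cache sync.

```go
// listener is a simplified stand-in for client-go's processorListener:
// each handler drains its own channel in a separate goroutine.
type listener struct {
	nextCh chan interface{}
	onAdd  func(obj interface{})
}

// distribute delivers an event to the listener's queue (the real code
// buffers with an internal ring buffer; simplified here).
func (l *listener) distribute(obj interface{}) {
	l.nextCh <- obj
}

// run is started as `go l.run()` by the informer's processor: callbacks
// fire here, possibly well after the cache already reports HasSynced.
func (l *listener) run() {
	for obj := range l.nextCh {
		l.onAdd(obj)
	}
}
```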

@mattmoor (Member)

I would expect the wait for sync to synchronize things, if the go is just the informer.Run 🤔

```go
if ok := cache.WaitForCacheSync(stopCh, i.informer.Informer().HasSynced); !ok {
```

@runzexia (Contributor Author)

> I would expect the wait for sync to synchronize things, if the go is just the informer.Run 🤔
>
> `if ok := cache.WaitForCacheSync(stopCh, i.informer.Informer().HasSynced); !ok {`

I think i.informer.Informer().HasSynced does not guarantee that the callbacks have been invoked; that's the problem.

@mattmoor (Member)

That is definitely plausible, but it would be good to confirm whether the informer events are done asynchronously, and where that happens. Just want to make sure we understand what's going on. 👍

@bobcatfish (Contributor)

> would be good to confirm whether the informer events are done asynchronously, and where that happens

It seems like the only place the event handlers (OnAdd, OnUpdate, OnDelete) are called is the Process function (or the OnUpdate function, which is itself ultimately called from Process), and Process seems to be invoked only via informer.Run (via processLoop). So I think the event handler callbacks are only ever invoked asynchronously? (But maybe there is some other code in play?)

@dprotaso (Member)

> I think i.informer.Informer().HasSynced does not guarantee that the callbacks have been invoked; that's the problem.

@runzexia is right.

> so I think the event handler callbacks are only invoked asynchronously

@bobcatfish is right.

@dprotaso (Member)

dprotaso commented Nov 26, 2020

Digging into the code myself: HasSynced is true once the informer's backing cache is up to date.

Notifying handlers is async and can be delayed (it has its own queue).

@bobcatfish (Contributor)

Hey @dprotaso - it looks like #1929 resolved this issue, but it only changes test code. Does this mean we're okay with the potential race condition?

I.e. if you use the InformedWatcher too quickly (specifically, before the event-handling goroutines have run), you won't get the value you wanted?

Over in Tekton Pipelines, where we use this to read configuration from config maps, I wouldn't want to risk using the wrong value (however unlikely that is).

If we don't feel like InformedWatcher.Start should wait for the first sync to happen, I'd like to request that we expose some way to wait for this (one possible shape sketched below). Let me know if I need to open another issue.
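
For illustration, one shape such a hook could take (a hypothetical helper built on knative/pkg's configmap.Watcher interface; firstValueSeen is my name, not an existing API): register an extra observer whose first invocation closes a channel, so the caller can block on that channel after starting the watcher.

```go
import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/configmap"
)

// firstValueSeen registers an extra observer for the named ConfigMap and
// returns a channel that is closed the first time the watcher delivers it.
// Register before Start, then wait on the channel after Start returns.
func firstValueSeen(w configmap.Watcher, name string) <-chan struct{} {
	seen := make(chan struct{})
	var once sync.Once
	w.Watch(name, func(*corev1.ConfigMap) {
		once.Do(func() { close(seen) })
	})
	return seen
}
```

A caller could then select on the returned channel with a timeout before serving traffic, failing fast if the initial value never arrives.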

@dprotaso (Member)

dprotaso commented Dec 1, 2020

Yeah, it's not guaranteed. I guess the severity would depend; i.e. sharedmain seems to wait for configmap informers first by calling Start on the watcher. Then it waits again for another set of informers that have been set up via injection.

I think it'd be hard to provide guarantees that a set of event handlers was invoked. You may want to just fail reconciliation if some configuration you expect isn't present.

It's worth discussing, so feel free to open a new issue.

@bobcatfish (Contributor)

Okay @dprotaso I've opened #1960 !

knative-prow-robot pushed a commit that referenced this issue Mar 2, 2021
* Fix race: Make informed watcher start wait for Add event 🏎️

When using the informed watcher to watch a config map, previously add
events were processed in a goroutine with no synchronization, making it
possible for code to try to access the values backed by the configmaps
before they are initialized.

This commit makes it so that the Start method of the informer will wait
for the add event to occur at least once for all config maps it is
watching.

This commit also undoes the workaround added in #1929 which was working
around the race condition identified in #1907 (and in
tektoncd/pipeline#3720). This means that if
the synchronization was removed, the impacted test would start flaking
again. If we wanted it to reliably fail in that case, we could introduce
a sleep in the callback but that doesn't seem worth it.

I also tested this change by manually patching the changes into my
clone of tektoncd/pipeline and following the repro steps at
tektoncd/pipeline#2815 (comment)
Before the change I can reproduce the issue, and after the change, I
can't! :D

Fixes #1960

* Make synced callback and named wait group private 🕵️

These utility objects don't really make sense to expose as part of the
informed watcher package and are only used by the informed watcher.
Writing tests for unexported code makes me a bit :( but hopefully these
will get moved to some other package one day. And there's really no
reason to expose these to users of knative/pkg at the moment.
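
For reference, a minimal sketch of the "named wait group" idea described above (reconstructed from the commit message; the actual knative/pkg code may differ): Start adds every watched ConfigMap's key, the synced callback marks each key Done on its first Add event, and Start blocks on Wait before returning.

```go
import "sync"

// namedWaitGroup waits until Done has been called once for every key
// that was Added; duplicate Adds or Dones for a key are ignored.
type namedWaitGroup struct {
	mu   sync.Mutex
	keys map[string]struct{}
	wg   sync.WaitGroup
}

func newNamedWaitGroup() *namedWaitGroup {
	return &namedWaitGroup{keys: map[string]struct{}{}}
}

// Add registers a key to wait for, e.g. each watched ConfigMap name.
func (n *namedWaitGroup) Add(key string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if _, ok := n.keys[key]; !ok {
		n.keys[key] = struct{}{}
		n.wg.Add(1)
	}
}

// Done marks a key as seen, e.g. called from the Add event handler.
func (n *namedWaitGroup) Done(key string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if _, ok := n.keys[key]; ok {
		delete(n.keys, key)
		n.wg.Done()
	}
}

// Wait blocks until every added key has been marked Done; Start would
// call this after starting the informer factory, before returning.
func (n *namedWaitGroup) Wait() {
	n.wg.Wait()
}
```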