TestInformedWatcher flaky #1907
This commit hacks in a test that tries to reproduce the issue in tektoncd#2815. Even then, it only reproduces the issue every once in a while: running doit.sh will usually cause the test to fail eventually (but it's running a test that spawns 200 subtests 1000 times!). (The other way to make this failure happen more easily is to introduce a sleep into InformedWatcher.addConfigMapEvent, i.e. delay processing of the config map by the config store such that the test code always executes first.)

I believe that what is happening is:

1. When the config store informed watcher starts, it starts goroutines to handle incoming events.
2. When this all starts for the first time, the shared informer will trigger "on changed" (added) events for the config maps so that the config store will process their values.
3. There is a race: if the goroutines started in (1) run after the assertions in the test, then the config store will not be updated with the config map values and will resort to the defaults (i.e. not the CONFIGURED default in the configmap, but the hard-coded fallback default, literally `default` in the case of the service account).

What this means is that there is a (very small) chance that if a TaskRun is created immediately after a controller starts up, the config store might not have loaded the default values yet and will create the TaskRun using the hard-coded fallback values and not whatever is configured in the config maps (this would apply to all values configurable via config maps).

I think this might be the same problem that is causing the knative/pkg folks to see a flaky test, knative/pkg#1907. If you look at the code that fails in that case: https://github.com/knative/pkg/blob/c5296cd61e9f34e183f8010a40dc866f895252b9/configmap/informed_watcher_test.go#L86-L92

The comment says:

> When Start returns the callbacks should have been called with the version of the objects that is available.

So it seems the intention is that when Start returns, the callbacks have been called (and in the config store case, this means the callbacks that process the config map), but the flakes in this test and the knative test are showing that this isn't the case. I think the fix is to introduce that guarantee; i.e. that after Start has returned, the callbacks have been called with the available version of the objects.
By narrowing down the test and introducing some looping (maybe too much looping? I dunno!) I can usually reproduce this issue eventually by running `doit.sh`. For example:

```
+ go test ./configmap/... -run TestInformedWatcher -count=1
--- FAIL: TestInformedWatcher (0.00s)
    --- FAIL: TestInformedWatcher/AGAIN-175 (0.00s)
        informed_watcher_test.go:94: foo1.count = 0, want 1
        informed_watcher_test.go:94: foo2.count = 0, want 1
        informed_watcher_test.go:94: bar.count = 0, want 1
FAIL
```
Hi @runzexia !! I think this might be an actual bug and we are running into this in Tekton as well, causing tektoncd/pipeline#2815 (longer explanation in tektoncd/pipeline#2815 (comment)) Looking at the assertion that's failing: pkg/configmap/informed_watcher_test.go Lines 86 to 92 in c5296cd
It looks like the intention is that after calling I hacked the failing test to run in a loop and added a script to run it repeatedly in bobcatfish@32da803 and every time I run
In Tekton this means that there is a chance that when a controller starts, the values returned by the config store using this watcher might not return the right values (if the go routines that update the config store have not completed). So I think that ( @mattmoor might be relevant to your interests! ) |
That'd be a good find. cc @dprotaso
Where do you mean? Within the informer itself? |
I would expect the wait for sync to synchronize things, if the pkg/configmap/informed_watcher.go Line 144 in c5296cd
I think |
That is definitely plausible, but it would be good to confirm whether the informer events are done asynchronously, and where that happens. Just want to make sure we understand what's going on. 👍 |
It seems like the only place the event handlers (OnAdd, OnUpdate, OnDelete) are called is in the Process function (or by the OnUpdate function, also called ultimately in the Process function), and it seems like the Process function is only invoked via informer.Run (via processloop) - so I think the event handler callbacks are only invoked asynchronously? (but maybe there is some other code in play?) |
@runzexia is right.
@bobcatfish is right. |
Digging into the code myself: notifying handlers is async and can be delayed (it has its own queue).
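To make that observation concrete, here is a minimal toy model, not client-go code. The `toyInformer` type and its `populate`, `HasSynced`, and `deliver` methods are hypothetical names invented for this sketch. It only illustrates the shape of the problem: the "synced" signal flips once the store is filled, while handler notifications sit in a separate queue until something drains it.

```go
package main

import "fmt"

// toyInformer is a hypothetical stand-in for a shared informer. It reports
// itself as synced once its store is populated, while handler notifications
// wait in a separate queue that is drained later.
type toyInformer struct {
	synced  bool
	pending []string // queued add notifications, one per object key
}

// populate fills the store, queues one add notification per key, and marks
// the informer as synced. No handler has been invoked at this point.
func (i *toyInformer) populate(keys ...string) {
	i.pending = append(i.pending, keys...)
	i.synced = true
}

// HasSynced mirrors the idea of a sync check: it says nothing about handlers.
func (i *toyInformer) HasSynced() bool { return i.synced }

// deliver models the listener goroutine finally getting scheduled: it drains
// the queue and invokes onAdd once per queued notification.
func (i *toyInformer) deliver(onAdd func(key string)) {
	for _, k := range i.pending {
		onAdd(k)
	}
	i.pending = nil
}

// observe returns how many handler calls had happened when the informer
// reported synced, and how many had happened after the queue was drained.
func observe() (atSync, afterDrain int) {
	inf := &toyInformer{}
	count := 0
	inf.populate("foo1", "foo2", "bar")
	if inf.HasSynced() {
		atSync = count
	}
	inf.deliver(func(string) { count++ })
	afterDrain = count
	return atSync, afterDrain
}

func main() {
	atSync, afterDrain := observe()
	fmt.Printf("handler calls when synced: %d, after drain: %d\n", atSync, afterDrain)
}
```

In this model a caller that checks only `HasSynced` sees zero handler calls, which matches the `count = 0, want 1` failures reported in the flaky test.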
Hey @dprotaso - it looks like #1929 resolved this issue but it's only changing test code. Does this mean we're okay with the potential race condition? I.e. if you use the InformedWatcher too quickly (specifically before the event handling goroutines have been invoked), you won't get the value you wanted? Over in Tekton Pipelines, where we use this to read configuration from config maps, I wouldn't want to risk using the wrong value (however unlikely it is). If we don't feel like InformedWatcher.Start should wait for the first sync to happen, I'd like to request that we expose some way to wait for this. Let me know if I need to open another issue.
Yeah, it's not guaranteed. I guess the severity would depend, i.e. sharedmain seems to wait for configmap informers first by calling Start on the watcher. Then it waits again for another set of informers that have been set up via injection. I think it'd be hard to provide guarantees that a set of event handlers were invoked. You may want to just fail reconciliation if there's some configuration that's expected but not present. It's worth discussing, so feel free to open a new issue.
When using the informed watcher to watch a config map, add events were previously processed in a goroutine with no synchronization, so code could try to access the values backed by the configmaps before they were initialized. This commit makes the Start method of the informer wait for the add event to occur at least once for all config maps it is watching. This commit also undoes the workaround added in knative#1929, which was working around the race condition identified in knative#1907 (and in tektoncd/pipeline#3720). This means that if the synchronization were removed, the impacted test would start flaking again. If we wanted it to reliably fail in that case, we could introduce a sleep in the callback, but that doesn't seem worth it. I also tested this change by manually patching the changes into my clone of tektoncd/pipeline and following the repro steps at tektoncd/pipeline#2815 (comment). Before the change I can reproduce the issue, and after the change, I can't! :D
When using the informed watcher to watch a config map, add events were previously processed in a goroutine with no synchronization, so code could try to access the values backed by the configmaps before they were initialized. This commit makes the Start method of the informer wait for the add event to occur at least once for all config maps it is watching. This commit also undoes the workaround added in knative#1929, which was working around the race condition identified in knative#1907 (and in tektoncd/pipeline#3720). This means that if the synchronization were removed, the impacted test would start flaking again. If we wanted it to reliably fail in that case, we could introduce a sleep in the callback, but that doesn't seem worth it. I also tested this change by manually patching the changes into my clone of tektoncd/pipeline and following the repro steps at tektoncd/pipeline#2815 (comment). Before the change I can reproduce the issue, and after the change, I can't! :D Fixes knative#1960
* Fix race: Make informed watcher start wait for Add event 🏎️

  When using the informed watcher to watch a config map, add events were previously processed in a goroutine with no synchronization, so code could try to access the values backed by the configmaps before they were initialized. This commit makes the Start method of the informer wait for the add event to occur at least once for all config maps it is watching.

  This commit also undoes the workaround added in #1929, which was working around the race condition identified in #1907 (and in tektoncd/pipeline#3720). This means that if the synchronization were removed, the impacted test would start flaking again. If we wanted it to reliably fail in that case, we could introduce a sleep in the callback, but that doesn't seem worth it.

  I also tested this change by manually patching the changes into my clone of tektoncd/pipeline and following the repro steps at tektoncd/pipeline#2815 (comment). Before the change I can reproduce the issue, and after the change, I can't! :D

  Fixes #1960

* Make synced callback and named wait group private 🕵️

  These utility objects don't really make sense to expose as part of the informed watcher package and are only used by the informed watcher. Writing tests for unexported code makes me a bit :( but hopefully these will get moved to some other package one day. And there's really no reason to expose these to users of knative/pkg at the moment.
/area test-and-release
/kind bug
Expected Behavior
The test always succeeds.
Actual Behavior
The test failed in https://github.com/knative/pkg/runs/1399755328; after a rerun, the test succeeded.
Steps to Reproduce the Problem
Not easy to reproduce, but I guess it is because `informer.HasSynced` returning true does not mean that all event handlers have been called.