
Race condition when using config store backed by config maps #1960

Closed
bobcatfish opened this issue Dec 14, 2020 · 6 comments · Fixed by #2036
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@bobcatfish
Contributor

Expected Behavior

When a config map exists before the process starts and I use a config store backed by that config map immediately after start-up, I'd like a guarantee that I am reading the configured value.

I think we could do this if one of the following was true:

  1. When the reconciler starts, it could wait for the config store to be initialized, i.e. wait for the goroutines that handle events to fire at least once (see the discussion in #1907 ("TestInformedWatcher flaky") for a description of this code) before returning
  2. Provide a way for code using this reconciler to wait for this state (e.g. don't do (1) by default, but expose a way for a caller to opt in, e.g. via the Reconciler object)
  3. Export the config store so that I can look at the state

/kind bug

Actual Behavior

When I use a generated reconciler, its config store is unexported. If I back that config store with config maps, there is a race condition: immediately after start-up, the informed watcher's goroutines (which read from the backing config maps and update the config store) may not have fired yet, so I might read something from the config store other than what I've configured.

Steps to Reproduce the Problem

Additional Info

@bobcatfish
Contributor Author

@dprotaso @mattmoor @runzexia do you have any thoughts on what the best approach is to take here? Are you open to the knative/pkg controller code doing something to wait and make sure the config stores are initialized correctly? (This seems like the best solution to me because this bug would impact all users of this lib.)

(some other ideas listed above)

@imjasonh
Member

Friendly ping, I also noticed this in https://github.com/imjasonh/tektoncd-concurrency

@mattmoor
Member

Based on the discussion in the linked issue, I'd change this check to not be based on the informer cache, but based on a positive hand-off signal from the callback:

if _, err := i.informer.Lister().ConfigMaps(i.Namespace).Get(k); err != nil {
	if _, ok := i.defaults[k]; ok && k8serrors.IsNotFound(err) {

I suspect we could do this just by instrumenting add here:

AddFunc: i.addConfigMapEvent,

@bobcatfish
Contributor Author

@mattmoor sounds like it makes sense to resolve this in knative/pkg 👍

I'd be happy to make PR but I need a bit more detail about what you're envisioning (sounds like we'd be creating an interface where the callbacks can signal, maybe via a channel, that they're ready?) - lemme know if it makes more sense to schedule some time to discuss or if it makes sense for me to bring up in a knative WG

@mattmoor
Member

IDK that we need to change the interface, since we are responsible for calling each callback. I was just thinking about some internal bookkeeping to check that all of the callbacks had been called...? 🤷

I'm happy to chat, or we can throw it on an API WG agenda.

bobcatfish added a commit to bobcatfish/pkg that referenced this issue Feb 23, 2021
When using the informed watcher to watch a config map, add events were previously processed in a goroutine with no synchronization, so code could try to access the values backed by the config maps before they were initialized.

This commit makes it so that the Start method of the informer will wait
for the add event to occur at least once for all config maps it is
watching.

This commit also undoes the workaround added in knative#1929 which was working
around the race condition identified in knative#1907 (and in
tektoncd/pipeline#3720). This means that if
the synchronization was removed, the impacted test would start flaking
again. If we wanted it to reliably fail in that case, we could introduce
a sleep in the callback but that doesn't seem worth it.

I also tested this change by manually patching the changes into my
clone of tektoncd/pipeline and following the repro steps at
tektoncd/pipeline#2815 (comment)
Before the change I can reproduce the issue, and after the change, I
can't! :D

Fixes knative#1960
@bobcatfish
Contributor Author

@mattmoor I was gonna join a working group but I think I figured out what you meant - lemme know if #2036 is the kind of thing you had in mind :D ❤️

bobcatfish added a commit to bobcatfish/pkg that referenced this issue Mar 2, 2021
knative-prow-robot pushed a commit that referenced this issue Mar 2, 2021
* Fix race: Make informed watcher start wait for Add event 🏎️

Fixes #1960

* Make synced callback and named wait group private 🕵️

These utility objects don't really make sense to expose as part of the
informed watcher package and are only used by the informed watcher.
Writing tests for unexported code makes me a bit :( but hopefully these
will get moved to some other package one day. And there's really no
reason to expose these to users of knative/pkg at the moment.