Race condition on startup when configuring the controller via configmaps #3720

Closed
bobcatfish opened this issue Jan 26, 2021 · 3 comments · Fixed by #3883
@bobcatfish
Collaborator

Expected Behavior

When a user configures the Pipeline controller's behavior via configmaps, e.g. disabling the affinity assistant or enabling custom tasks, the controller should always respect these configured values.

Whenever the controller reads these values, it should see the settings from the config maps if they exist.
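
For context, the read on the controller side looks roughly like this (a sketch only; the helper function is hypothetical and the exact config package field names should be treated as approximate):

```go
package reconciler

import (
	"context"

	"github.com/tektoncd/pipeline/pkg/apis/config"
)

// shouldUseAffinityAssistant is a hypothetical helper showing how a reconciler
// reads the hot-reloaded feature flags: FromContextOrDefaults returns the values
// stored by the config store, or hard-coded defaults if nothing is there yet.
func shouldUseAffinityAssistant(ctx context.Context) bool {
	cfg := config.FromContextOrDefaults(ctx)
	return !cfg.FeatureFlags.DisableAffinityAssistant
}
```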

Actual Behavior

We use an object provided by knative/pkg called configStore to access feature flags (configuration provided by users via config maps). We used to load these values once on startup, but in #2637 we switched over to these config store objects so we could hot-reload changes made to the config maps while the controller is running.

  1. Controller is started via main
  2. main calls knative's sharedmain.MainWithConfig
  3. MainWithConfig starts the config watchers
  4. cmw.Start is InformedWatcher.Start
  5. Start calls registerCallbackAndStartInformer
  6. That calls sharedInformerFactory.Start
  7. Ultimately this is where the interesting bit happens: all the informers are run in separate goroutines:

     for informerType, informer := range f.informers {
         if !f.startedInformers[informerType] {
             go informer.Run(stopCh)
             f.startedInformers[informerType] = true
         }
     }

The goroutine started in (7) (client-go's cache controller Run, I think?) is what (waves hands) ultimately leads to reading the configured values from the config maps and storing them, such that when the Tekton controller tries to read these values (via FromContextOrDefaults) the idea is that it will see the values populated by that goroutine.

The problem is that there is no synchronization, so sometimes the controller tries to read those values before the goroutines have populated them, as in the sketch below.
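
To make the shape of the race concrete, here is a minimal, self-contained sketch (illustrative only; the store type and its field are invented, and this is not the actual knative/pkg or Tekton code):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// store stands in for the config store backing the feature flags.
type store struct {
	disableAffinityAssistant atomic.Value
}

// Start launches the "informer" goroutine and returns immediately, the same
// shape as sharedInformerFactory.Start above: the caller has nothing to wait on.
func (s *store) Start() {
	go func() {
		time.Sleep(10 * time.Millisecond)      // stand-in for the informer's first Add event
		s.disableAffinityAssistant.Store(true) // value read from the config map
	}()
}

// Load mimics the FromContextOrDefaults behaviour: fall back to the
// hard-coded default if nothing has been stored yet.
func (s *store) Load() bool {
	if v, ok := s.disableAffinityAssistant.Load().(bool); ok {
		return v
	}
	return false // hard-coded default
}

func main() {
	s := &store{}
	s.Start()
	// Usually prints "false": the controller reads before the Add event lands.
	fmt.Println(s.Load())
}
```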

This causes tests of this functionality to sometimes fail, but more importantly, there is a chance (a very small chance) this could happen to users.

Steps to Reproduce the Problem

  1. Configure a feature via config map
  2. Start the controller
  3. Immediately try to start a taskrun that uses the feature you have configured
  4. Some tiny eensy teensy percentage of the time, you might see that the configured value isn't used and the controller uses its hard-coded default instead

The outcome in (4) would happen SO rarely that you might never be able to reproduce it - see #2815 (comment) for how to reproduce this via the integration tests

Additional Info

This is the bug in knative/pkg describing the problem and requesting it to be fixed or a way to work around it: knative/pkg#1960

I think this is the root cause of flaky failures of:

What to do??

While we definitely want to fix this issue, letting our tests continue to flake because of it reduces the quality of the signal we get from them. Now that we know this is a problem, we don't need the tests to keep failing the way they do.

But because the details of this are hidden in unexported knative/pkg code, I'm not sure how to work around it. Some ideas, all terrible:

  1. Disable the impacted tests <-- not great b/c then if something ELSE around this functionality breaks, we won't know
  2. Retry the impacted tests <-- probably better than (1) but still not a guarantee they won't flake
  3. Add an arbitrary sleep into the impacted tests between when they instantiate the controller and when they use the value <-- not guaranteed to fix the problem but since chances of it happening are so small, it might be enough (a polling variant of this idea is sketched after this list)
  4. Do the same as (3) but do it in the actual controller <-- pretty awful, and would make controller start up take longer, but might "fix" the actual problem for users too
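
For (2)/(3), instead of a blind sleep or a plain retry, the test could poll until the configured value actually shows up (or give up after a timeout). A rough sketch of that idea; eventuallyTrue is a hypothetical helper, not something that exists in our test suite:

```go
package test

import (
	"testing"
	"time"
)

// eventuallyTrue polls cond until it returns true or the timeout expires,
// instead of sleeping for an arbitrary fixed duration.
func eventuallyTrue(t *testing.T, timeout time.Duration, cond func() bool) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if cond() {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
	t.Fatalf("condition not met within %v", timeout)
}
```

A test would call this between instantiating the controller and exercising the feature, with a condition that reads the flag the same way the controller does. It is still a workaround, not a fix.
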
@bobcatfish bobcatfish added kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flakey test labels Jan 26, 2021
@bobcatfish bobcatfish self-assigned this Feb 23, 2021
bobcatfish added a commit to bobcatfish/pkg that referenced this issue Feb 23, 2021
When using the informed watcher to watch a config map, previously Add
events were processed in a goroutine with no synchronization, making it
possible for code to try to access the values backed by the configmaps
before they are initialized.

This commit makes it so that the Start method of the informer will wait
for the add event to occur at least once for all config maps it is
watching.

This commit also undoes the workaround added in knative#1929 which was working
around the race condition identified in knative#1907 (and in
tektoncd/pipeline#3720). This means that if
the synchronization was removed, the impacted test would start flaking
again. If we wanted it to reliably fail in that case, we could introduce
a sleep in the callback but that doesn't seem worth it.

I also tested this change by manually patching the changes into my
clone of tektoncd/pipeline and following the repro steps at
tektoncd/pipeline#2815 (comment).
Before the change I can reproduce the issue, and after the change, I
can't! :D
@bobcatfish
Collaborator Author

Okay knative/pkg#2036 is an attempt to fix this! If it's accepted, we should see these races fixed when we next update knative/pkg :D 🙏

knative-prow-robot pushed a commit to knative/pkg that referenced this issue Mar 2, 2021
* Fix race: Make informed watcher start wait for Add event 🏎️

When using the informed watcher to watch a config map, previously Add
events were processed in a goroutine with no synchronization, making it
possible for code to try to access the values backed by the configmaps
before they are initialized.

This commit makes it so that the Start method of the informer will wait
for the add event to occur at least once for all config maps it is
watching.

This commit also undoes the workaround added in #1929 which was working
around the race condition identified in #1907 (and in
tektoncd/pipeline#3720). This means that if
the synchronization was removed, the impacted test would start flaking
again. If we wanted it to reliably fail in that case, we could introduce
a sleep in the callback but that doesn't seem worth it.

I also tested this change by manually patching the changes into my
clone of tektoncd/pipeline and following the repro steps at
tektoncd/pipeline#2815 (comment).
Before the change I can reproduce the issue, and after the change, I
can't! :D

Fixes #1960

* Make synced callback and named wait group private 🕵️

These utility objects don't really make sense to expose as part of the
informed watcher package and are only used by the informed watcher.
Writing tests for unexported code makes me a bit :( but hopefully these
will get moved to some other package one day. And there's really no
reason to expose these to users of knative/pkg at the moment.
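
To picture the synchronization the fix introduces: Start registers each watched config map by name, the Add handler marks its config map as seen, and Start blocks until every entry has been seen at least once (or the stop channel closes). The sketch below paraphrases that idea; the names and signatures are invented and this is not the actual knative/pkg implementation:

```go
package sketch

import "sync"

// namedWaitGroup tracks a set of names and is done once each has been marked Done.
type namedWaitGroup struct {
	mu      sync.Mutex
	pending map[string]struct{}
	done    chan struct{}
}

func newNamedWaitGroup(names ...string) *namedWaitGroup {
	w := &namedWaitGroup{pending: map[string]struct{}{}, done: make(chan struct{})}
	for _, n := range names {
		w.pending[n] = struct{}{}
	}
	if len(w.pending) == 0 {
		close(w.done) // nothing to wait for
	}
	return w
}

// Done marks one named config map as having delivered its first Add event.
func (w *namedWaitGroup) Done(name string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if _, ok := w.pending[name]; ok {
		delete(w.pending, name)
		if len(w.pending) == 0 {
			close(w.done)
		}
	}
}

// Wait blocks until every name has been marked Done, or until stopCh closes.
func (w *namedWaitGroup) Wait(stopCh <-chan struct{}) {
	select {
	case <-w.done:
	case <-stopCh:
	}
}
```
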
@shinji62

shinji62 commented Jul 9, 2021

We are still seeing this issue with 0.24.x: when we restart the controller it creates an affinity assistant even though it is set to false in the configmap, and we hit this at every restart (~150 PipelineRuns running a day).

Wondering why we don't make sure we read the configmap before the controller starts working.

@vdemeester
Member

@shinji62 sounds like a bump indeed 🤔 can you create a new issue for this?
