Race condition on startup when configuring the controller via configmaps #3720

Closed
bobcatfish opened this issue Jan 26, 2021 · 3 comments · Fixed by #3883
@bobcatfish
Collaborator

Expected Behavior

When a user configures the Pipeline controller's behavior via configmaps, e.g. disabling the affinity assistant or enabling custom tasks, the controller should always respect these configured values.

Whenever the controller reads these values, it should see the settings from the config maps if they exist.
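
For context, the read on the controller side looks roughly like this (a sketch only; the helper function is hypothetical and the exact config package field names should be treated as approximate):

```go
package reconciler

import (
	"context"

	"github.com/tektoncd/pipeline/pkg/apis/config"
)

// shouldUseAffinityAssistant is a hypothetical helper showing how a reconciler
// reads the hot-reloaded feature flags: FromContextOrDefaults returns the values
// stored by the config store, or hard-coded defaults if nothing is there yet.
func shouldUseAffinityAssistant(ctx context.Context) bool {
	cfg := config.FromContextOrDefaults(ctx)
	return !cfg.FeatureFlags.DisableAffinityAssistant
}
```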

Actual Behavior

We use an object provided by knative/pkg called configStore to access feature flags (configuration provided by users via config maps). We used to load these values once on startup, but in #2637 we switched over to these config store objects so we could hot-reload changes made to the config maps while the controller is running.

  1. Controller is started via main
  2. main calls knative's sharedmain.MainWithConfig
  3. MainWithConfig starts the config watchers
  4. cmw.Start is InformedWatcher.Start
  5. Start calls registerCallbackAndStartInformer
  6. That calls sharedInformerFactory.Start
  7. Ultimately this is where the interesting bit happens: all the informers are run in separate goroutines:

     for informerType, informer := range f.informers {
         if !f.startedInformers[informerType] {
             go informer.Run(stopCh)
             f.startedInformers[informerType] = true
         }
     }

The goroutine started in (7) (client-go's cache controller Run, I think?) is what (waves hands) ultimately leads to reading the configured values from the config maps and storing them, such that when the Tekton controller tries to read these values (via FromContextOrDefaults) the idea is that it will see the values populated by that goroutine.

The problem is that there is no synchronization, so sometimes the controller tries to read those values before the goroutines have populated them, as in the sketch below.
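
To make the shape of the race concrete, here is a minimal, self-contained sketch (illustrative only; the store type and its field are invented, and this is not the actual knative/pkg or Tekton code):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// store stands in for the config store backing the feature flags.
type store struct {
	disableAffinityAssistant atomic.Value
}

// Start launches the "informer" goroutine and returns immediately, the same
// shape as sharedInformerFactory.Start above: the caller has nothing to wait on.
func (s *store) Start() {
	go func() {
		time.Sleep(10 * time.Millisecond)      // stand-in for the informer's first Add event
		s.disableAffinityAssistant.Store(true) // value read from the config map
	}()
}

// Load mimics the FromContextOrDefaults behaviour: fall back to the
// hard-coded default if nothing has been stored yet.
func (s *store) Load() bool {
	if v, ok := s.disableAffinityAssistant.Load().(bool); ok {
		return v
	}
	return false // hard-coded default
}

func main() {
	s := &store{}
	s.Start()
	// Usually prints "false": the controller reads before the Add event lands.
	fmt.Println(s.Load())
}
```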

This causes tests of this functionality to sometimes fail, but more importantly, there is a chance (a very small chance) this could happen to users.

Steps to Reproduce the Problem

  1. Configure a feature via config map
  2. Start the controller
  3. Immediately try to start a taskrun that uses the feature you have configured
  4. Some tiny eensy teensy percentage of the time, you might see that the configured value isn't used and the controller uses its hard-coded default instead

The outcome in (4) would happen SO rarely that you might never be able to reproduce it - see #2815 (comment) for how to reproduce this via the integration tests

Additional Info

This is the bug in knative/pkg describing the problem and requesting it to be fixed or a way to work around it: knative/pkg#1960

I think this is the root cause of flaky failures of:

What to do??

While we definitely want to fix this issue, letting our tests continue to flake because of it reduces the quality of the signal we get from them. Now that we know this is a problem, we don't need the tests to keep failing the way they do.

But because the details of this are hidden in unexported knative/pkg code, I'm not sure how to work around it. Some ideas, all terrible:

  1. Disable the impacted tests <-- not great b/c then if something ELSE around this functionality breaks, we won't know
  2. Retry the impacted tests <-- probably better than (1) but still not a guarantee they won't flake
  3. Add an arbitrary sleep into the impacted tests between when they instantiate the controller and when they use the value <-- not guaranteed to fix the problem but since chances of it happening are so small, it might be enough (a polling variant of this idea is sketched after this list)
  4. Do the same as (3) but do it in the actual controller <-- pretty awful, and would make controller start up take longer, but might "fix" the actual problem for users too
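
For (2)/(3), instead of a blind sleep or a plain retry, the test could poll until the configured value actually shows up (or give up after a timeout). A rough sketch of that idea; eventuallyTrue is a hypothetical helper, not something that exists in our test suite:

```go
package test

import (
	"testing"
	"time"
)

// eventuallyTrue polls cond until it returns true or the timeout expires,
// instead of sleeping for an arbitrary fixed duration.
func eventuallyTrue(t *testing.T, timeout time.Duration, cond func() bool) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if cond() {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
	t.Fatalf("condition not met within %v", timeout)
}
```

A test would call this between instantiating the controller and exercising the feature, with a condition that reads the flag the same way the controller does. It is still a workaround, not a fix.
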
@bobcatfish bobcatfish added kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flakey test labels Jan 26, 2021
@bobcatfish bobcatfish self-assigned this Feb 23, 2021
bobcatfish added a commit to bobcatfish/pkg that referenced this issue Feb 23, 2021
When using the informed watcher to watch a config map, previously Add
events were processed in a goroutine with no synchronization, making it
possible for code to try to access the values backed by the configmaps
before they are initialized.

This commit makes it so that the Start method of the informer will wait
for the add event to occur at least once for all config maps it is
watching.

This commit also undoes the workaround added in knative#1929 which was working
around the race condition identified in knative#1907 (and in
tektoncd/pipeline#3720). This means that if
the synchronization was removed, the impacted test would start flaking
again. If we wanted it to reliably fail in that case, we could introduce
a sleep in the callback but that doesn't seem worth it.

I also tested this change by manually patching the changes into my
clone of tektoncd/pipeline and following the repro steps at
tektoncd/pipeline#2815 (comment).
Before the change I can reproduce the issue, and after the change, I
can't! :D
@bobcatfish
Collaborator Author

Okay knative/pkg#2036 is an attempt to fix this! If it's accepted, we should see these races fixed when we next update knative/pkg :D 🙏

knative-prow-robot pushed a commit to knative/pkg that referenced this issue Mar 2, 2021
* Fix race: Make informed watcher start wait for Add event 🏎️

When using the informed watcher to watch a config map, previously Add
events were processed in a goroutine with no synchronization, making it
possible for code to try to access the values backed by the configmaps
before they are initialized.

This commit makes it so that the Start method of the informer will wait
for the add event to occur at least once for all config maps it is
watching.

This commit also undoes the workaround added in #1929 which was working
around the race condition identified in #1907 (and in
tektoncd/pipeline#3720). This means that if
the synchronization was removed, the impacted test would start flaking
again. If we wanted it to reliably fail in that case, we could introduce
a sleep in the callback but that doesn't seem worth it.

I also tested this change by manually patching the changes into my
clone of tektoncd/pipeline and following the repro steps at
tektoncd/pipeline#2815 (comment).
Before the change I can reproduce the issue, and after the change, I
can't! :D

Fixes #1960

* Make synced callback and named wait group private 🕵️

These utility objects don't really make sense to expose as part of the
informed watcher package and are only used by the informed watcher.
Writing tests for unexported code makes me a bit :( but hopefully these
will get moved to some other package one day. And there's really no
reason to expose these to users of knative/pkg at the moment.
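
To picture the synchronization the fix introduces: Start registers each watched config map by name, the Add handler marks its config map as seen, and Start blocks until every entry has been seen at least once (or the stop channel closes). The sketch below paraphrases that idea; the names and signatures are invented and this is not the actual knative/pkg implementation:

```go
package sketch

import "sync"

// namedWaitGroup tracks a set of names and is done once each has been marked Done.
type namedWaitGroup struct {
	mu      sync.Mutex
	pending map[string]struct{}
	done    chan struct{}
}

func newNamedWaitGroup(names ...string) *namedWaitGroup {
	w := &namedWaitGroup{pending: map[string]struct{}{}, done: make(chan struct{})}
	for _, n := range names {
		w.pending[n] = struct{}{}
	}
	if len(w.pending) == 0 {
		close(w.done) // nothing to wait for
	}
	return w
}

// Done marks one named config map as having delivered its first Add event.
func (w *namedWaitGroup) Done(name string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if _, ok := w.pending[name]; ok {
		delete(w.pending, name)
		if len(w.pending) == 0 {
			close(w.done)
		}
	}
}

// Wait blocks until every name has been marked Done, or until stopCh closes.
func (w *namedWaitGroup) Wait(stopCh <-chan struct{}) {
	select {
	case <-w.done:
	case <-stopCh:
	}
}
```
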
@shinji62

shinji62 commented Jul 9, 2021

We are still seeing this issue with 0.24.x: when we restart the controller it creates an affinity assistant even though it is set to false in the configmap, and we hit this at every restart (~150 PipelineRuns running a day).

Wondering why we don't make sure we read the configmap before the controller starts working.

@vdemeester
Member

@shinji62 sounds like a bump indeed 🤔 can you create a new issue for this?
