
Race condition when using config store backed by config maps #1960

Closed
bobcatfish opened this issue Dec 14, 2020 · 6 comments · Fixed by #2036
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@bobcatfish
Contributor

Expected Behavior

When a config map exists before the process starts and I use a config store backed by that config map immediately after start-up, I'd like a guarantee that I am reading the configured value.

I think we could do this if one of the following was true:

  1. When the reconciler starts, it could wait for the config store to be initialized, i.e. wait for the goroutines that handle events to fire at least once (see the discussion in #1907 ("TestInformedWatcher flaky") for a description of this code) before returning
  2. Provide a way for code using this reconciler to wait for this state (e.g. don't do (1) by default, but expose a way for a caller to opt in, e.g. via the Reconciler object)
  3. Export the config store so that I can look at the state

/kind bug

Actual Behavior

When I use a generated reconciler, its config store is unexported. If I back that config store with config maps, there is a race condition: immediately after start-up, the informed watcher's goroutines (which read from the backing config maps and update the config store) may not have fired yet, so I might read something from the config store other than what I've configured.

Steps to Reproduce the Problem

Additional Info

@bobcatfish
Contributor Author

@dprotaso @mattmoor @runzexia do you have any thoughts on what the best approach is to take here? Are you open to the knative/pkg controller code doing something to wait and make sure the config stores are initialized correctly? (This seems like the best solution to me because this bug would impact all users of this lib.)

(some other ideas listed above)

@imjasonh
Member

Friendly ping, I also noticed this in https://github.com/imjasonh/tektoncd-concurrency

@mattmoor
Member

Based on the discussion in the linked issue, I'd change this check to not be based on the informer cache, but based on a positive hand-off signal from the callback:

if _, err := i.informer.Lister().ConfigMaps(i.Namespace).Get(k); err != nil {
	if _, ok := i.defaults[k]; ok && k8serrors.IsNotFound(err) {

I suspect we could do this just by instrumenting add here:

AddFunc: i.addConfigMapEvent,

@bobcatfish
Contributor Author

@mattmoor sounds like it makes sense to resolve this in knative/pkg 👍

I'd be happy to make PR but I need a bit more detail about what you're envisioning (sounds like we'd be creating an interface where the callbacks can signal, maybe via a channel, that they're ready?) - lemme know if it makes more sense to schedule some time to discuss or if it makes sense for me to bring up in a knative WG

@mattmoor
Member

IDK that we need to change the interface, since we are responsible for calling each callback. I was just thinking about some internal bookkeeping to check that all of the callbacks had been called...? 🤷

I'm happy to chat, or we can throw it on an API WG agenda.

bobcatfish added a commit to bobcatfish/pkg that referenced this issue Feb 23, 2021
When using the informed watcher to watch a config map, add events were previously processed in a goroutine with no synchronization, so code could try to access the values backed by the config maps before they were initialized.

This commit makes it so that the Start method of the informer will wait
for the add event to occur at least once for all config maps it is
watching.

This commit also undoes the workaround added in knative#1929 which was working
around the race condition identified in knative#1907 (and in
tektoncd/pipeline#3720). This means that if
the synchronization was removed, the impacted test would start flaking
again. If we wanted it to reliably fail in that case, we could introduce
a sleep in the callback but that doesn't seem worth it.

I also tested this change by manually patching the changes into my
clone of tektoncd/pipeline and following the repro steps at
tektoncd/pipeline#2815 (comment)
Before the change I can reproduce the issue, and after the change, I
can't! :D

Fixes knative#1960
@bobcatfish
Contributor Author

@mattmoor I was gonna join a working group but I think I figured out what you meant - lemme know if #2036 is the kind of thing you had in mind :D ❤️

bobcatfish added a commit to bobcatfish/pkg that referenced this issue Mar 2, 2021
knative-prow-robot pushed a commit that referenced this issue Mar 2, 2021
* Fix race: Make informed watcher start wait for Add event 🏎️

Fixes #1960

* Make synced callback and named wait group private 🕵️

These utility objects don't really make sense to expose as part of the
informed watcher package and are only used by the informed watcher.
Writing tests for unexported code makes me a bit :( but hopefully these
will get moved to some other package one day. And there's really no
reason to expose these to users of knative/pkg at the moment.