Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: refactor runConfigMapWatcher to use Informers. Fixes #11657 #11855

Open
wants to merge 36 commits into
base: main
Choose a base branch
from

Conversation

juranir
Copy link

@juranir juranir commented Sep 21, 2023

Fixes #11657
Fixes #11463

Supersedes / Closes #11799

Motivation

After some time running, when the connection to k8s API is broken (timeout for example), the watcher cannot handle the exception increases the CPU usage, and floods the log with the message level=error msg="invalid config map object received in config watcher. Ignored processing". After reaching the pod limit, it'll be restarted. See #11657 for details.

Modifications

I replaced the function with a new one using informers (many thanks @agilgur5 ).
I tried to use the code pattern used by other functions, but please let me know if something can be improved.
I also changed the namespace used to "managed-namespace" instead of "namespace" (This change will solve #11463 )

Verification

Manual changes in the ConfigMap are being watched and applied;
An instance with this change is running for a while;
We also did some integration and performance tests;

shell

@juranir juranir marked this pull request as ready for review September 22, 2023 09:20
@agilgur5 agilgur5 self-requested a review September 24, 2023 17:00
Copy link
Contributor

@juliev0 juliev0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just one small comment, but otherwise looks good!

@agilgur5
Copy link

Fixes #11463

I also changed the namespace used to "managed-namespace" instead of "namespace" (This change will solve #11463 )

I don't see how this fixes #11463? The RetryWatcher was already listening to managedNamespace.
#11463 also points out the opposite -- the Controller's ConfigMap is not in the managedNamespace, it's in the Controller's own namespace. Using managedNamespace is the bug that #11463 points out.
Per my review comments in #11799 (comment), there's also an issue with the semaphore notification that will have to be moved, as semaphores are in the managedNamespace

Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bit of an issue here with the semaphore configmap notifications. See in-line below

workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
return cm.GetName() == wfc.configController.GetName()
},
Handler: cache.ResourceEventHandlerFuncs{
UpdateFunc: func(_, obj interface{}) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically speaking, an AddFunc and DeleteFunc would be good too, though I don't think the previous RetryWatcher handled those scenarios either. They are also more complex, so I think this PR is probably good as is

@juranir juranir requested a review from agilgur5 October 4, 2023 22:56
Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay on my end, I was sick for a whole week and then playing catch up.

There's two main issues with the below code, per the in-line comments:

  1. There are potentially two different namespaces to watch, not just one
  2. There are three different types of ConfigMaps to watch, not just two

workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
@agilgur5 agilgur5 added the area/controller Controller issues, panics label Oct 29, 2023
@tooptoop4
Copy link
Contributor

is this ok to be merged?

@agilgur5
Copy link

is this ok to be merged?

No, my previous review has yet to be addressed

@agilgur5
Copy link

agilgur5 commented Jan 6, 2024

Hey @juranir do you have any time to continue working on this next week? If not, I'd like to take over the work as #11657 has gotten to be one of the top issues (seemingly due to k8s 1.27 upgrades?).
I'll ofc credit you with the existing work and leave you as co-author etc 🙂

@agilgur5 agilgur5 self-assigned this Jan 6, 2024
@sstarcher
Copy link

@agilgur5 any updates on this it's becoming very problematic.

@juranir
Copy link
Author

juranir commented Apr 2, 2024

The proposed suggestions by @agilgur5 and @juliev0 were implemented

@juranir juranir reopened this Apr 2, 2024
@juranir juranir requested a review from agilgur5 April 2, 2024 09:08
@juranir juranir requested a review from juliev0 April 2, 2024 13:20
Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up again @juranir! I had to get a solid block of time to re-familiarize myself with this code, but I reviewed everything now.

For other reviewers, note that turning on "Hide Whitespace" in the GH diff is very helpful (tbh, I am thinking of using a different reviewing UI entirely given how many complex PRs I review now and how many insufficiencies GH has).

This does look much better now. Summary of my in-line comments:

  • Biggest thing -- this is missing the runConfigMapWatcher removal. That might've gotten lost in the rebase?
  • I think unconditionally creating both Informers would simplify the code. If we have two Informers on the same namespace, so be it. Informer resource usage tends to be more in the cache of data and filtering, rather than the overhead of the Informer itself.
  • there are some missing namespace checks
  • the label selector on the configmap informer needs moving

workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
@agilgur5
Copy link

agilgur5 commented Apr 5, 2024

While I have the context, and given the priority & impact of this, I'm just going to go ahead and make the changes requested above myself and push them to the PR to make things faster (and reduce "telephone game" interpretations). In this review cycle I now have permissions to do that.

Anton Gilgur added 4 commits April 5, 2024 19:47
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have substantially simplified this now (including refactoring out some common plugin handling functions) and fixed all the correctness issues. So this should be "good enough" now.

But:

  1. There's a performance degradation, see in-line comment below. EDIT: I have fixed that now with a third informer for semaphores
  2. I'd like someone else to spot check this since I wrote a decent chunk of it. @juranir and @juliev0 can you both take a look?

workflow/controller/controller.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Show resolved Hide resolved
Anton Gilgur added 3 commits April 8, 2024 15:01
…orking on the managed ns. specify ctrl for controller instead

Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
@agilgur5 agilgur5 self-requested a review April 8, 2024 19:54
UpdateFunc: func(_, obj interface{}) {
cm := obj.(*apiv1.ConfigMap)
log.Infof("Received Workflow Controller config map %s/%s update", cm.GetNamespace(), cm.GetName())
wfc.UpdateConfig(ctx)
Copy link

@agilgur5 agilgur5 Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's an optimization that can be done here as well -- we can just pass the configmap directly. UpdateConfig does a Get, which is not necessary here.

that requires a bit of refactoring, including in the config controller, and is independent of this PR (the optimization probably could have been done before this PR, in the RetryWatcher, as well), so I'm going to do that as a separate follow-up PR

- this way the plugin etc informer is more efficient, only on configmaps with the label
  - and the diff is (much) smaller (once you ignore whitespace from the inverted `if`)

- semaphore informer only needs to watch certain configmaps -- get those names from the index
  - this is pretty hacky, I don't like this, but every way to track this seemed hacky 😕
    - this option had the least changes though
- semaphore informer only needs name and namespace -- remove the rest from its cache

Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Comment on lines +74 to +77
keys := wf.GetSemaphoreKeys()
// store all keys while indexing as a side effect
for _, key := range keys {
semaphoreKeyMap[key] = struct{}{}
Copy link

@agilgur5 agilgur5 Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty hacky, I don't like this, but every way to track this seemed hacky 😕
This option had the least changes though and is the most efficient (as opposed to pulling out semaphore keys another time), which is why I ended up going with it.

More ideally, we might want to require a label for Semaphore ConfigMaps. That would be a breaking change though, so can't do that within this fix.

Comment on lines 1352 to 1367
}).WithTransform(func(obj interface{}) (interface{}, error) {
cm, ok := obj.(*apiv1.ConfigMap)
if !ok {
return obj, nil
}

// only leave name and namespace, remove the rest as we don't use it
cm = cm.DeepCopy()
cm.ObjectMeta = metav1.ObjectMeta{
Name: cm.Name,
Namespace: cm.Namespace,
}
cm.Data = map[string]string{}
cm.BinaryData = map[string][]byte{}
return cm, nil
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's an external example of using a cache.TransformFunc, for reference

Anton Gilgur added 7 commits April 8, 2024 17:58
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
- the errors [only occur](https://github.com/kubernetes/client-go/blob/46588f2726fa3e25b1704d6418190f424f95a990/tools/cache/shared_informer.go#L580) when the informer was either already started or stopped
  - and we just created it, so we know that is not the case

Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
This reverts commit b3e3274.

Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM to me now -- I've pretty much done every significant optimization that is specific and isolated to this change now

  • WATCH_CONTROLLER_SEMAPHORE_CONFIGMAPS is not the most efficiently implemented (as I mentioned in the initial commit) as the Informers wouldn't need to be created at all in those cases (not just their event handlers), but given that's a workaround that could be removed (and env vars are explicitly documented as unstable), I didn't want to add too much code around it

There's also a comment I have above on the semaphore keys hackery in the indexer, but I don't think there's a better way to do that without a breaking change (e.g. requiring labels on semaphore configmaps)

I'm also going to build a test image and push to a test registry for folks to try and test out

@agilgur5
Copy link

agilgur5 commented Apr 9, 2024

I'm also going to build a test image and push to a test registry for folks to try and test out

Pushed agilgur5/argo-workflow-controller:fix-configmap-informer for testing

@juliev0 juliev0 self-assigned this Apr 9, 2024
if _, ok := wfc.executorPlugins[cm.GetNamespace()]; !ok {
wfc.executorPlugins[cm.GetNamespace()] = map[string]*spec.Plugin{}
}
wfc.executorPlugins[cm.GetNamespace()][cm.GetName()] = p
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need thread safety between these functions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(if so, I see it wasn't there before)

Copy link

@agilgur5 agilgur5 Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this part hasn't changed in this PR, it just got refactored into a common func. If you turn on "Hide Whitespace" on the GH diff, this line actually appears as unchanged (I mentioned the setting earlier, although it's not as useful after my refactor)

Technically speaking, they probably should be made thread-safe in case an apply and delete happen simultaneously, but realistically that will basically never happen. Also, this is ran on a single goroutine, so there aren't multiple threads modifying it. Safety is good to have just in case though, but not specific to the changes in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's run on a single goroutine? Oh, I figured the ResourceEventHandlerFuncs spawned new goroutines - they don't?

Copy link

@agilgur5 agilgur5 Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry I missed the notification for this comment; just manually checked and saw it.

I thought it would run on the same goroutine as the Informer's .Run, but no I think you're correct, it does seem to spawn new goroutines. It was a little non-trivial to trace back, but this is the client-go line that spawns goroutines (AddEventHandler calls processor.addListener and wait Group's Start spawns it).

In any case though, as this thread-safety bug had existed beforehand, I don't plan to fix it in this PR. Adding locks would also conventionally entail creating a new, thread-safe data structure for the executor plugins which becomes a sizeable change in and of itself that detracts a bit from the purpose of this PR.
There's a couple other follow-ups from this PR stemming from incidental findings as well like #11855 (comment) and #11855 (comment), so we can add this to the list 😅

Have you reviewed the other parts of this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good. Let me go through the rest of the PR tomorrow. :)

Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
@juliev0
Copy link
Contributor

juliev0 commented Apr 15, 2024

Unfortunately, it looks like there are some file conflicts now.

Another question: which of the different configmaps have been effectively tested (either manually, unit testing, or e2e)?

Anton Gilgur added 2 commits April 19, 2024 11:53
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
Signed-off-by: Anton Gilgur <agilgur5@gmail.com>
@juliev0
Copy link
Contributor

juliev0 commented Jun 28, 2024

I'm going through old PRs now, looking for stale ones to add "stale" label to. What's the status of this one @agilgur5 @juranir ?

@agilgur5
Copy link

agilgur5 commented Jun 29, 2024

It's still very necessary, although the upstream issue may have been resolved in k8s 1.30 if our hypotheses are correct, but even if that is accurate it may break again as it was due to a breaking bugfix in etcd that is now feature flagged due to the unexpected breakage. This change is still good to have and fixes other issues either way as well.

But I never got around to writing the tests as you previously asked for (been busy with plenty of other RCAs, reviews, etc 🫠), and, as far as I could tell, there were no tests for this previously, which would explain why it wasn't caught during k8s upgrades, as well as why other ConfigMap regressions in our own code weren't caught previously

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics
Projects
None yet
5 participants