Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ignoreResourceUpdates to reduce controller CPU usage (#13534) #13912

Merged
merged 38 commits into from
Jun 25, 2023

Conversation

agaudreault
Copy link
Member

@agaudreault agaudreault commented Jun 5, 2023

Closes #13534 #6108 #13614 #8471 #8100 #7406 #9014 #9819

Changes:

  • Adding ignoreResourceUpdates global configuration to ignore fields before to hash resources.
  • Adding ignoreDifferencesOnResourceUpdates config to use ignoreDifferences automatically to ignoreResourceUpdates.
  • Hashing resources directly part of an application.
  • Filtering out resource updates when hash is the same
  • Changed Refreshing app %s for change in cluster of object %s of type %s/%s debug log to info to help get statistics and configure ignoreResourceUpdates.
    • For Splunk, msg="Refreshing app*for change*" | rex field=msg "Refreshing app (?<application>\S+) for change in cluster of object (?<resource>\S+) of type (?<type>\S+)" | stats count by application resource type | sort -count can be used.
    • EDIT: Now using structured fields.
image image

This was a result of adding the following config

      resource.customizations.ignoreResourceUpdates.all: |
        jsonPointers:
          - /status

During business hours, after optimization.
image
image

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • Optional. My organization is added to USERS.md.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.

Please see Contribution FAQs if you have questions about your pull-request.

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
@agaudreault agaudreault changed the title feat: ignore watched resource update feat: ignore watched resource updates Jun 5, 2023
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
@agaudreault agaudreault changed the title feat: ignore watched resource updates feat: ignore watched resource updates (#13534) Jun 6, 2023
@agaudreault agaudreault marked this pull request as ready for review June 6, 2023 19:39
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
@codecov
Copy link

codecov bot commented Jun 6, 2023

Codecov Report

Patch coverage: 36.32% and project coverage change: -0.01 ⚠️

Comparison is base (4b06175) 49.61% compared to head (214cc53) 49.61%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #13912      +/-   ##
==========================================
- Coverage   49.61%   49.61%   -0.01%     
==========================================
  Files         256      257       +1     
  Lines       43829    44146     +317     
==========================================
+ Hits        21744    21901     +157     
- Misses      19948    20091     +143     
- Partials     2137     2154      +17     
Impacted Files Coverage Δ
pkg/apis/application/v1alpha1/types.go 59.11% <0.00%> (-0.57%) ⬇️
controller/cache/cache.go 25.10% <13.46%> (-1.20%) ⬇️
cmd/argocd/commands/admin/settings.go 61.87% <14.75%> (-5.88%) ⬇️
controller/cache/info.go 65.47% <50.00%> (-0.97%) ⬇️
util/settings/settings.go 51.09% <72.72%> (+1.09%) ⬆️
controller/appcontroller.go 53.83% <75.00%> (+0.47%) ⬆️
util/argo/normalizers/diff_normalizer.go 76.92% <100.00%> (+1.92%) ⬆️

... and 3 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
@agaudreault agaudreault changed the title feat: ignore watched resource updates (#13534) feat: add ignoreResourceUpdates to reduce controller CPU usage (#13534) Jun 7, 2023
@jenting
Copy link
Contributor

jenting commented Jun 8, 2023

This PR looks fantastic. It looks like could fix the app controller CPU high issues.

@jaideepr97
Copy link
Contributor

Hi @agaudreault-jive
I'm not sure I fully understand what problem ignoreResourceUpdates is solving that can't be achieved using the existing ignoreDifferences, would you mind elaborating on that?

@agaudreault
Copy link
Member Author

agaudreault commented Jun 8, 2023

@jaideepr97
If your question is related to what is the difference between both: The ignoreDifferences are used to evaluate during the reconcile/refresh of the application if the app should be synchronized or not. Reconcile is the operation consuming the CPU, not sync. So we need to avoid performing the reconcile on watched resources. Take a look at #13534 for more details.

If your question is related to why there are 2 different settings and ignoreDifferences could not be reused to skip the reconcile as well: In our case, ignore difference has more configuration than what is necessary for the reconcile optimization. It is also hard/impossible to know what everyone has configured. Having 2 configurations prevents the possibility of conflicts. However, ignoreDifferencesOnResourceUpdates is there to simplify the configuration and prevent duplicates, especially for people that do not already ignore differences in the /status field of all objects. Another thing is that ignoreDifferences is only applied to resources defined by the user in git, while ignoreResourceUpdates could/should ideally be applied on any resources affecting the application.

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
@donkeyx
Copy link

donkeyx commented Jun 15, 2023

Agreed, this pr amazing. We are seeing 100% cpu endlessly, with the application-controller monitoring out 2k pods. This seems to be due metric changes on the HPA's.

@crenshaw-dev
Copy link
Member

@donkeyx are you in a position to run this branch internally and monitor the effects? I'd be happy to help cherry pick these changes to whatever version you're running now.

@donkeyx
Copy link

donkeyx commented Jun 15, 2023

@donkeyx are you in a position to run this branch internally and monitor the effects? I'd be happy to help cherry pick these changes to whatever version you're running now.

@crenshaw-dev we are currently running quay.io/argoproj/argocd:v2.5.8, and yes would be happy to give it a test in our cluster for verification, if you can cherry-pick to the same release for us.

@agaudreault
Copy link
Member Author

agaudreault commented Jun 15, 2023

@donkeyx for HPA, we had the same issue and I used the following configs to make sure it works with v1 too. You might still see a couple of reconciles due to the ReplicaSet/Pods/Deployments updates, but no more from the HPA.

      resource.customizations.ignoreResourceUpdates.all: |
        jsonPointers:
          - /status
      resource.customizations.ignoreResourceUpdates.autoscaling_HorizontalPodAutoscaler:
        |
        jsonPointers:
          - /metadata/annotations/autoscaling.alpha.kubernetes.io~1behavior
          - /metadata/annotations/autoscaling.alpha.kubernetes.io~1conditions
          - /metadata/annotations/autoscaling.alpha.kubernetes.io~1metrics
          - /metadata/annotations/autoscaling.alpha.kubernetes.io~1current-metrics
      resource.customizations.ignoreResourceUpdates.apps_ReplicaSet: |
        jsonPointers:
          - /metadata/annotations/deployment.kubernetes.io~desired-replicas
          - /metadata/annotations/deployment.kubernetes.io~max-replicas
          - /metadata/annotations/deployment.kubernetes.io~revision

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
@agaudreault
Copy link
Member Author

agaudreault commented Jun 19, 2023

One "glitch" that I am seeing is when a ReplicaSet is scaled down (by HPA). When a pod is set to terminating, its health turns to "progressing", the Application health also changes to "progressing".

However, when the pods "disappear" from the UI, the Application status is not updated and stays "Progressing".

I expect the OnResourceUpdated callback to be called when a resource is deleted based on the code with newRes as nil and the reconcile to always happens, so I am unsure where the problem can be.

The app status is the following, but no resources in status.resources are progressing.

status:
  health:
    status: Progressing

Copy link
Member

@crenshaw-dev crenshaw-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one remaining substantial thought.

util/settings/settings.go Outdated Show resolved Hide resolved
util/settings/settings.go Outdated Show resolved Hide resolved
util/settings/settings.go Show resolved Hide resolved
crenshaw-dev and others added 3 commits June 22, 2023 21:09
…cile

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
crenshaw-dev and others added 9 commits June 23, 2023 10:36
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
@crenshaw-dev crenshaw-dev enabled auto-merge (squash) June 25, 2023 00:27
@csantanapr
Copy link
Member

csantanapr commented Jun 25, 2023

@agaudreault-jive thanks to you and your company for this significant contribution to improving performance
@crenshaw-dev, as always, the exemplar maintainer taking time to review and discuss the PR
🎉

@crenshaw-dev crenshaw-dev merged commit 88994ea into argoproj:master Jun 25, 2023
@everythings-gonna-be-alright
Copy link

everythings-gonna-be-alright commented Jul 1, 2023

I Added to my config an additional section:

resource.customizations.ignoreResourceUpdates.all: |
        jsonPointers:
          - /status

Then I restarted argo-cd-argocd-application-controller but still see in the logs records about ExternalSecrerts and CronJobs:

time="2023-07-01T13:16:03Z" level=debug msg="Requesting app refresh caused by object update" api-version=kubernetes-client.io/v1 application=argo-cd/2418 fields.level=1 kind=ExternalSecret name=backup namespace=2418
time="2023-07-01T13:20:16Z" level=debug msg="Requesting app refresh caused by object update" api-version=batch/v1 application=argo-cd/8517 fields.level=1 kind=CronJob name=dynamic-cron-finish-stream namespace=8517

image: quay.io/argoproj/argocd:v2.8.0-rc1

@everythings-gonna-be-alright

Actually, after setting resource.ignoreResourceUpdatesEnabled: "true", the CPU usage returns to normal. However, ArgoCD doesn't work properly. All syncs get stuck at this phase:

time="2023-07-01T18:17:09Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: 2331)" application=argo-cd/2331

On the other hand, when resource.ignoreResourceUpdatesEnabled: "false", adding resource.customizations.ignoreResourceUpdates.all doesn't change anything, and the CPU usage remains very high as before.

@agaudreault
Copy link
Member Author

agaudreault commented Jul 1, 2023

@everythings-gonna-be-alright #14304

until this is merged and cherry-picked, you can use "debug" logs while resource.ignoreResourceUpdatesEnabled: "false", but when you set resource.ignoreResourceUpdatesEnabled: "true", make sure to not have "debug logs.

yyzxw pushed a commit to yyzxw/argo-cd that referenced this pull request Aug 9, 2023
…oproj#13534) (argoproj#13912)

* feat: ignore watched resource update

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add doc and CLI

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update doc index

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add command

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* codegen

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* revert formatting

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* do not skip on health change

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update doc

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update logging to use context

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* fix typos. local build broken...

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* change after review

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* manifestHash to string

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* more doc

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* example for argoproj Application

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add unit test for ignored logs

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* codegen

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update docs/operator-manual/reconcile.md

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* move hash and set log to debug

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update util/settings/settings.go

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update util/settings/settings.go

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* feature flag

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* fix

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* less aggressive managedFields ignore rule

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* Update docs/operator-manual/reconcile.md

Co-authored-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* use local settings

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* latest settings

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* safety first

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* since it's behind a feature flag, go aggressive on overrides

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
tesla59 pushed a commit to tesla59/argo-cd that referenced this pull request Dec 16, 2023
…oproj#13534) (argoproj#13912)

* feat: ignore watched resource update

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add doc and CLI

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update doc index

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add command

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* codegen

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* revert formatting

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* do not skip on health change

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update doc

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* update logging to use context

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* fix typos. local build broken...

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* change after review

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* manifestHash to string

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* more doc

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* example for argoproj Application

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* add unit test for ignored logs

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* codegen

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update docs/operator-manual/reconcile.md

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* move hash and set log to debug

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update util/settings/settings.go

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* Update util/settings/settings.go

Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* feature flag

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* fix

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* less aggressive managedFields ignore rule

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* Update docs/operator-manual/reconcile.md

Co-authored-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>

* use local settings

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* latest settings

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* safety first

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* since it's behind a feature flag, go aggressive on overrides

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Alexandre Gaudreault <alexandre.gaudreault@logmein.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
@ebuildy
Copy link
Contributor

ebuildy commented Jan 23, 2024

Just to be sure, argocd is taking /status field in account ? And we must configure to avoid it.

@agaudreault
Copy link
Member Author

@ebuildy correct. You have a few example in https://argo-cd.readthedocs.io/en/stable/operator-manual/reconcile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Completed
Development

Successfully merging this pull request may close these issues.

Unnecessary application refresh on update event cause high CPU usage on controller
8 participants