
Issue with spec.syncPolicy.automated.selfHeal Option Ignoring Timeouts (Probably) #19289

Open
tebaly opened this issue Jul 29, 2024 · 5 comments
Labels
bug (Something isn't working) · bug/in-triage (This issue needs further triage to be correctly classified) · component:argo-cd · type:bug

Comments

tebaly commented Jul 29, 2024

I believe there is a problem with the spec.syncPolicy.automated.selfHeal option in Argo CD. The current implementation seems to ignore the configured timeouts and triggers checks more frequently than expected, leading to unnecessary resource consumption.

Details:

I have configured the following timeouts:

timeout.reconciliation: 900s
controller.self.heal.timeout.seconds: 300
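For context, these two settings normally live in different places: `timeout.reconciliation` is a key in the `argocd-cm` ConfigMap, while `controller.self.heal.timeout.seconds` is set in `argocd-cmd-params-cm` (or as a controller flag/environment variable). A minimal sketch of where each goes (namespace and exact values are from this report; the split is an assumption based on the standard install layout):

```yaml
# argocd-cm: how often apps are re-reconciled against the repo
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: 900s
---
# argocd-cmd-params-cm: minimum interval between self-heal sync attempts
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.self.heal.timeout.seconds: "300"
```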

My expectation is that the controller will not attempt to change or check the cluster state within these timeout periods. Specifically, I expect to see no synchronization attempts in the controller logs for at least five minutes. However, this is not happening. Instead, the controller constantly attempts to check and synchronize the state, as shown in the logs below:

info msg="Refreshing app status (controller refresh requested), level (1)" application=
info msg="Comparing app state (cluster: 
info msg="Normalized app spec:
info msg="Skipping auto-sync: application status is Synced"
info msg="Update successful" application=
info msg="Reconciliation completed" application=
info msg="Refreshing app status (comparison expired, requesting refresh. reconciledAt: 2024-07-29 09:23:11 +0000 UTC, expiry: 15m0s), level (2)" application=

Hypothesis:

It appears that the controller might be responding to events in the cluster and ignoring the configured timeouts. One possible trigger for these frequent checks could be automatic changes in the number of replicas managed by a HorizontalPodAutoscaler, which, in this case, seems unnecessary.
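If the HPA hypothesis is right, one mitigation worth trying (a sketch under that assumption, not something attempted in this report) is to tell Argo CD to ignore HPA-managed replica counts when diffing, via `ignoreDifferences` in the Application spec. The app name below is a placeholder:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app        # placeholder name
  namespace: argocd
spec:
  syncPolicy:
    automated:
      selfHeal: true
  # Ignore replica counts that the HPA changes on its own,
  # so they don't register as drift from the desired state.
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
```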

Request:

I would like the ability to disable all checks and events except for the regular timeout checks. The controller should respect the configured timeouts and not attempt to check or synchronize the cluster state during these intervals unless explicitly required.

Thank you for looking into this issue.

@tebaly tebaly added the bug Something isn't working label Jul 29, 2024
@alexmt alexmt added bug/in-triage This issue needs further triage to be correctly classified component:argo-cd type:bug labels Jul 29, 2024
@juwon8891 (Member)

I can't reproduce this issue. Can you tell me how to reproduce it in detail?


tebaly commented Jul 31, 2024

This is my guess. Please configure a HorizontalPodAutoscaler so that it changes the number of replicas automatically. If Argo CD does not react to these automatic changes in the cluster, then there is no problem. A little later I will set these timeouts to 86400 in my cluster and show the logs.
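A minimal HPA for reproducing this (assuming a metrics server is running and a CPU-requesting Deployment exists; the name `demo` is a placeholder) might look like:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo          # placeholder Deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```

With this in place, replica counts will fluctuate with load, which should exercise whatever watch events the controller reacts to.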


tebaly commented Sep 3, 2024

argocd-application-controller-0

ARGOCD_RECONCILIATION_TIMEOUT : 30000s
ARGOCD_APPLICATION_CONTROLLER_SELF_HEAL_TIMEOUT_SECONDS : 30000

I originally installed using Helm. Now I changed the timeouts in the values file and ran:

helm upgrade argocd .

The controller restarted with the new settings. It still does something endlessly and consumes as much CPU time as it can get.

level=info msg="Update successful" application=
level=info msg="Updated health status: Healthy -> Progressing"
level=info msg="Refreshing app status (controller refresh requested), level (1)
level=info msg="Comparing app state ...
level=info msg="GetRepoObjs stats" application=

There are many such lines, and they appear in batches with pauses of a couple of minutes. I think the CPU can't go any faster; it's busy with something.

With the above settings I expect the controller to be idle, or at least for the CPU load to be lower, but nothing has changed.

Screenshot_20240903_215109

Screenshot_20240903_215408

Screenshot_20240903_215548

Screenshot_20240903_215758


tebaly commented Sep 3, 2024

It could be that the controller has been accumulating tasks in a queue for months and is now trying to complete them all. That is, there may be a queue of tens or hundreds of thousands of tasks. Will it keep going until it has completed them all?

It continues to do what I described above, growing the message log without stopping:

  • it is interrupted only when it physically runs out of CPU time


tebaly commented Sep 3, 2024

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/component: server
    app.kubernetes.io/instance: argocd
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  ...
  resource.customizations.ignoreResourceUpdates.all: |
    jsonPointers:
    - /status
    - /metadata

This didn't change anything either.

How can I debug this message: Refreshing app status (controller refresh requested)?
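One way to get more detail on what is requesting those refreshes (a sketch, assuming the standard `argocd-cmd-params-cm` key; verify the key name against your Argo CD version's docs) is to raise the application controller's log level to debug, which should log more about each refresh trigger:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Equivalent to the controller's --loglevel flag
  controller.log.level: debug
```

The application controller must be restarted for the change to take effect.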
