
Issue with spec.syncPolicy.automated.selfHeal Option Ignoring Timeouts (Probably) #19289

Open
tebaly opened this issue Jul 29, 2024 · 5 comments
Labels
bug (Something isn't working) · bug/in-triage (This issue needs further triage to be correctly classified) · component:argo-cd · type:bug

Comments

tebaly commented Jul 29, 2024

I believe there is a problem with the spec.syncPolicy.automated.selfHeal option in Argo CD. The current implementation seems to ignore the configured timeouts and triggers checks more frequently than expected, leading to unnecessary resource consumption.

Details:

I have configured the following timeouts:

timeout.reconciliation: 900s
controller.self.heal.timeout.seconds: 300
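For context, these two settings normally live in different places: `timeout.reconciliation` is a key in the `argocd-cm` ConfigMap, while `controller.self.heal.timeout.seconds` is set in `argocd-cmd-params-cm` (or as a controller flag/environment variable). A minimal sketch of where each goes (namespace and exact values are from this report; the split is an assumption based on the standard install layout):

```yaml
# argocd-cm: how often apps are re-reconciled against the repo
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: 900s
---
# argocd-cmd-params-cm: minimum interval between self-heal sync attempts
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.self.heal.timeout.seconds: "300"
```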

My expectation is that the controller will not attempt to change or check the cluster state within these timeout periods. Specifically, I expect to see no synchronization attempts in the controller logs for at least five minutes. However, this is not happening. Instead, the controller constantly attempts to check and synchronize the state, as shown in the logs below:

info msg="Refreshing app status (controller refresh requested), level (1)" application=
info msg="Comparing app state (cluster: 
info msg="Normalized app spec:
info msg="Skipping auto-sync: application status is Synced"
info msg="Update successful" application=
info msg="Reconciliation completed" application=
info msg="Refreshing app status (comparison expired, requesting refresh. reconciledAt: 2024-07-29 09:23:11 +0000 UTC, expiry: 15m0s), level (2)" application=

Hypothesis:

It appears that the controller might be responding to events in the cluster and ignoring the configured timeouts. One possible trigger for these frequent checks could be automatic changes in the number of replicas managed by a HorizontalPodAutoscaler, which, in this case, seems unnecessary.
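If the HPA hypothesis is right, one mitigation worth trying (a sketch under that assumption, not something attempted in this report) is to tell Argo CD to ignore HPA-managed replica counts when diffing, via `ignoreDifferences` in the Application spec. The app name below is a placeholder:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app        # placeholder name
  namespace: argocd
spec:
  syncPolicy:
    automated:
      selfHeal: true
  # Ignore replica counts that the HPA changes on its own,
  # so they don't register as drift from the desired state.
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
```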

Request:

I would like the ability to disable all checks and events except for the regular timeout checks. The controller should respect the configured timeouts and not attempt to check or synchronize the cluster state during these intervals unless explicitly required.

Thank you for looking into this issue.

@tebaly tebaly added the bug Something isn't working label Jul 29, 2024
@alexmt alexmt added bug/in-triage This issue needs further triage to be correctly classified component:argo-cd type:bug labels Jul 29, 2024
@juwon8891 (Member)

I can't reproduce this issue. Can you tell me how to reproduce it in detail?


tebaly commented Jul 31, 2024

This is my guess. Please configure a HorizontalPodAutoscaler so that it changes the number of replicas automatically. If Argo CD does not react to these automatic changes in the cluster, then there is no problem. A little later I will set these timeouts to 86400 in my cluster and show the logs.
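A minimal HPA for reproducing this (assuming a metrics server is running and a CPU-requesting Deployment exists; the name `demo` is a placeholder) might look like:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo          # placeholder Deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```

With this in place, replica counts will fluctuate with load, which should exercise whatever watch events the controller reacts to.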


tebaly commented Sep 3, 2024

argocd-application-controller-0

ARGOCD_RECONCILIATION_TIMEOUT : 30000s
ARGOCD_APPLICATION_CONTROLLER_SELF_HEAL_TIMEOUT_SECONDS : 30000

I originally installed using Helm. Now I changed the timeouts in the values file and ran:

helm upgrade argocd .

The controller restarted with the new settings. It still does something endlessly and consumes as much CPU time as it can get.

level=info msg="Update successful" application=
level=info msg="Updated health status: Healthy -> Progressing"
level=info msg="Refreshing app status (controller refresh requested), level (1)
level=info msg="Comparing app state ...
level=info msg="GetRepoObjs stats" application=

There are many such lines, and they appear in batches with pauses of a couple of minutes. I think the CPU can't go any faster; it's busy with something.

With the above settings I expect the controller to be idle, or at least for the CPU load to be lower, but nothing has changed.

Screenshot_20240903_215109

Screenshot_20240903_215408

Screenshot_20240903_215548

Screenshot_20240903_215758


tebaly commented Sep 3, 2024

It could be that the controller has been accumulating tasks in a queue for months and is now trying to complete them all. That is, there may be a queue of tens or hundreds of thousands of tasks. Will it keep going until it has completed them all?

It continues to do what I described above, growing the message log without stopping:

  • it is interrupted only when it physically runs out of CPU time


tebaly commented Sep 3, 2024

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/component: server
    app.kubernetes.io/instance: argocd
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  ...
  resource.customizations.ignoreResourceUpdates.all: |
    jsonPointers:
    - /status
    - /metadata

This didn't change anything either.

How can I debug this message: Refreshing app status (controller refresh requested)?
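One way to get more detail on what is requesting those refreshes (a sketch, assuming the standard `argocd-cmd-params-cm` key; verify the key name against your Argo CD version's docs) is to raise the application controller's log level to debug, which should log more about each refresh trigger:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Equivalent to the controller's --loglevel flag
  controller.log.level: debug
```

The application controller must be restarted for the change to take effect.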
