
Reconciliation loop #8100

Closed
2 of 3 tasks
Funk66 opened this issue Jan 5, 2022 · 30 comments
Labels
bug Something isn't working

Comments

@Funk66

Funk66 commented Jan 5, 2022

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Upon upgrading from v2.1.7 to v2.2.1, the ArgoCD application controller started performing continuous reconciliations for every app (about one per second, which is as much as CPU capacity allows).
Issues #3262 and #6108 sound similar but didn't help.
I haven't been able to figure out why a refresh keeps being requested. The log below shows the block that keeps repeating for each app every second.

Expected behavior

The number of reconciliations should be two orders of magnitude lower.

Version

v2.2.2+03b17e0

Logs

time="2022-01-05T12:34:33Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=kube-proxy
time="2022-01-05T12:34:33Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2022-01-05T12:34:33Z" level=info msg="Ignore '/spec/preserveUnknownFields' for CustomResourceDefinitions"
time="2022-01-05T12:34:33Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: kube-system)" application=kube-proxy
time="2022-01-05T12:34:33Z" level=info msg="getRepoObjs stats" application=prometheus-adapter build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=178 unmarshal_ms=163 version_ms=14
time="2022-01-05T12:34:33Z" level=info msg="Skipping auto-sync: application status is Synced" application=monitoring-common
time="2022-01-05T12:34:33Z" level=info msg="No status changes. Skipping patch" application=monitoring-common
time="2022-01-05T12:34:33Z" level=info msg="Reconciliation completed" application=monitoring-common dedup_ms=0 dest-name= dest-namespace=services dest-server="https://kubernetes.default.svc" diff_ms=0 fields.level=1 git_ms=391 health_ms=0 live_ms=119 settings_ms=0 sync_ms=0 time_ms=1098
Funk66 added the bug label Jan 5, 2022
@patrickjahns
Contributor

@Funk66
We are experiencing a similar situation, but on ArgoCD v1.8.4+28aea3d. Something strange we noticed is that we currently encounter this issue only on EKS clusters. We have identical clusters in Azure, and the problem does not occur there.

Mind me asking if your ArgoCD setup is in AWS? Did the issue start around the 4th of January? At least that's what happened with our identical environments.

@Funk66
Author

Funk66 commented Jan 10, 2022

@patrickjahns, yes, this is on AWS. It started in December, on the day we upgraded to v2.2.1, as explained in the description. I have no reason to think that this is related to the underlying infrastructure. If you have any indication to the contrary, please let me know and I'll try reaching out to the AWS support team.

@cnfatal

cnfatal commented Jan 11, 2022

I'm hitting the same problem, but on version v1.7.14+92b0237.

@patrickjahns
Contributor

@Funk66
It might be a red herring, but let me elaborate briefly on what led me down this path:

We have several Kubernetes environments in AWS and Azure, with ArgoCD installed locally on each cluster. Of all our clusters, three are EKS clusters in the same region, running versions 1.18-eks.8 and 1.19-eks.6. We are seeing the issue on those three clusters, and it started to surface on the same day (January 4th) at roughly the same time (within half an hour of each other).

We increased the logging verbosity to debug/trace but haven't found any further indicators so far, so this is really mind-boggling right now.

@FatalC
Any chance this is happening on EKS? If not, at least I am a bit more sure that EKS is not the right direction to investigate ;-)

@Funk66
Author

Funk66 commented Jan 11, 2022

@patrickjahns, did the issue by any chance start after an application controller pod restart? We're on EKS 1.20 and see this happening on every cluster in every region. The only change around the time it started was the ArgoCD upgrade, which is why I'm inclined to think that this problem is caused by ArgoCD being unable to properly keep track of the apps it has already refreshed. That said, I haven't taken the time to look into the code, so that's just an uninformed guess.

@patrickjahns
Contributor

We didn't perform any operations on the controllers. By chance, all three controllers must have been restarted around the same time (same day, within one hour of each other).

@MrSaints

We are seeing this on our k3s cluster (v1.22.4+k3s1) with ArgoCD v2.1.8. CPU usage is generally high too.

alexmt added this to the v2.3 milestone Jan 11, 2022
@patrickjahns
Contributor

patrickjahns commented Jan 12, 2022

Further digging in our environments revealed that the external-secrets controller was constantly updating the status field of the ExternalSecret resources. In our environments this was triggered by expired certificates (mTLS authentication of external-secrets), which we hadn't caught.

We've resolved the underlying certificate issues and the reconciliation loop stopped. We also noticed in the ArgoCD documentation that one can disable status-field diffing, so that status changes no longer trigger reconciliation loops:

data:
  resource.compareoptions: |
    # disables status field diffing in specified resource types
    # 'crd' - CustomResourceDefinition-s (default)
    # 'all' - all resources
    # 'none' - disabled
    ignoreResourceStatusField: all

https://argo-cd.readthedocs.io/en/stable/user-guide/diffing/#system-level-configuration

Maybe this is something people can try to see if that is the trigger in their environments.
Besides that, another option would be to iterate over the resources and watch for changes. I haven't found a nice command to do a watch on all resources (i.e. something along the lines of kubectl watch *) - if anyone has an idea, it would be highly appreciated.

Something like corneliusweig/ketall#29 would be good for catching this, I suppose.
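In the meantime, something along these lines might work as a rough, untested sketch: it lists every namespaced resource type that supports list/watch and then watches all of them for changes (a few types may still error out, and the argument list can get long on clusters with many CRDs):

kubectl get "$(kubectl api-resources --namespaced --verbs=list,watch -o name | paste -sd, -)" -A --watch-only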

Within the team we also discussed how we could have caught these changes more easily, and we came to the conclusion that it would be great if ArgoCD's DEBUG/TRACE logging could include more information on which changes/events triggered the reconciliation.

Maybe this is something the ArgoCD maintainers would consider (cc @alexmt - pinging you since this was added to a milestone for investigation).

@jannfis
Member

jannfis commented Jan 16, 2022

Within the team we also discussed how we could have caught these changes more easily, and we came to the conclusion that it would be great if ArgoCD's DEBUG/TRACE logging could include more information on which changes/events triggered the reconciliation.

I agree, this information would be really useful. We had reconciliation loop bugs in the past where it wasn't clear which resource(s) actually triggered the reconciliation, and it took tremendous effort to troubleshoot.

@Funk66
Author

Funk66 commented Jan 17, 2022

The issue about changing secrets was mentioned in #6108. I have checked all resources being tracked by the corresponding applications and none of them seems to change, or at least not at that rate. The ignoreResourceStatusField parameter didn't help in my case. I will have to dig deeper to ferret out what is going on. I agree that more comprehensive logging would make this much easier.

rbreeze removed this from the v2.3 milestone Jan 20, 2022
@Funk66
Author

Funk66 commented Feb 9, 2022

So I've finally taken some time to have another look at this and here's what I found. First, I can confirm that the issue started with v2.2.0. Reverting the application-controller image to an earlier version makes the problem go away. Furthermore, I think the issue was introduced with commit 05935a9, where an 'if' statement to exclude orphaned resources was removed.
The problem itself is that ArgoCD detects changes to config-maps used for leader election purposes. These can be easily identified with kubectl get cm -A -w, since the leader election process requires updating the config-map every few seconds. Now, even though these resources are listed in spec.orphanedResources.ignore of the AppProject manifest, the ApplicationController.handleObjectUpdated method flags them as being managed by every App in that namespace, hence calling requestAppRefresh for each one of them roughly every second.
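For illustration, the kind of ignore entry described above might look like this in the AppProject (a sketch, not our exact manifest; the ConfigMap name is a hypothetical example of a leader-election lock):

spec:
  orphanedResources:
    warn: false
    ignore:
    - kind: ConfigMap
      name: ingress-controller-leader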
I could submit a PR reverting the conflicting change, but I'd appreciate having other opinions on how best to fix this.

@nilsbillo

Running ArgoCD 2.1.3 in EKS and having problems with high CPU usage and throttling of the application controller as well, so I don't think 2.2 is the only issue.

@albgus

albgus commented Mar 9, 2022

For what it's worth, I tried the solution suggested by @patrickjahns above and our ArgoCD went from consuming ~1000-1500m to ~20m CPU.

i.e. setting this in argocd-cm and restarting the argocd-application-controller deployment:

data:
  resource.compareoptions: |
    ignoreResourceStatusField: all

Running ArgoCD 2.2.5 in EKS 1.21.
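For reference, the restart can be done with something like the following (untested sketch; in the default install the controller runs as a StatefulSet named argocd-application-controller in the argocd namespace, though older or custom installs may use a Deployment):

kubectl -n argocd rollout restart statefulset argocd-application-controller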

@pyromaniac3010

I'm also hit by the high CPU usage caused by the reconciliation loop. Thanks to @Funk66 I verified that it is caused by the leader-election ConfigMaps.
Is there any workaround available or a fix in progress?
The problem exists for me in Argo CD 2.3.1 and 2.3.2 with the following ConfigMaps:

  • aws-load-balancer-controller-leader
  • karpenter-leader-election
  • ingress-controller-leader (ingress-nginx)
  • cert-manager-controller
  • cert-manager-cainjector-leader-election
  • cp-vpc-resource-controller
  • fargate-scheduler
  • eks-certificates-controller

@pyromaniac3010

FYI: if you remove spec.orphanedResources completely from your AppProject, the reconciliation loop stops and the high CPU usage goes away.
I had it set to warn: false to be able to see orphaned resources in the web UI:

spec:
  description: Argocd Project
  orphanedResources:
    warn: false

Removing it led to a complete stop of the reconciliation loop and a significant drop in CPU usage.
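If you prefer to patch the field out rather than edit the manifest, something along these lines should work (untested sketch; assumes the project is called default and ArgoCD runs in the argocd namespace):

kubectl -n argocd patch appproject default --type=json -p='[{"op": "remove", "path": "/spec/orphanedResources"}]'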

@ybialik

ybialik commented Mar 30, 2022

Using the command suggested by @Funk66 I was also able to see several ConfigMaps that keep popping up in the list; one of them is in a namespace we see many reconciliations for.

Is there a workaround?

@Vladyslav-Miletskyi
Contributor

Vladyslav-Miletskyi commented Apr 27, 2022

  1. Delete orphanedResources from the spec (even if it is empty, its mere presence keeps the issue going). See Reconciliation loop #8100 (comment).
  2. Restart the application controller(s).
  3. Enjoy.

Tested with version v2.3.3.

@bakkerpeter

bakkerpeter commented May 18, 2022

@Vladyslav-Miletskyi thanks! That did the trick. We were having the exact same problem and now the load is normal.

@agaudreault
Member

Is there something other than a debug log that we could use to detect this in a production deployment? Enabling debug logging in production is not an option for us.

I am mainly looking for a way to find resources that are continuously regenerated.
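One low-effort starting point (a sketch, assuming the default install where the application controller exposes Prometheus metrics on port 8082): watch the reconciliation counters and see which values grow abnormally fast. This won't name the triggering resource, but it can flag the problem without debug logging.

kubectl -n argocd port-forward statefulset/argocd-application-controller 8082:8082 &
curl -s localhost:8082/metrics | grep argocd_app_reconcile_count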

@prein

prein commented Nov 3, 2022

Disabling orphanedResources didn't do the trick for me. I am observing around 2k/min of "Refreshing app status (controller refresh requested)" messages in the logs with only 170 apps. ArgoCD v2.4.11.

@roeizavida

The issue is still present in v2.5.1, and orphanedResources is not in the spec.

@jamesalucas

jamesalucas commented Feb 8, 2023

We are having the same issue with Keda ScaledObjects. Keda appears to update the status.lastActiveTime field every few seconds, which in turn appears to trigger a reconciliation. Setting ignoreResourceStatusField to crd or all doesn't appear to make a difference.
Is there any way to ignore reconciliation on specific resources or fields?

#8100, #8914 and #6108 all appear to be pretty similar and I can't see a workaround in any of those, so I would appreciate it if anyone can suggest one!

@jamesalucas

In case it helps anyone else, increasing the ScaledObject pollingInterval made a massive difference to the ArgoCD CPU usage.
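For example, something along these lines (an illustrative sketch; the name is made up and 300 seconds is simply a value well above Keda's 30-second default):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sample-app-scaler
spec:
  pollingInterval: 300
  # scaleTargetRef, triggers, etc. unchanged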

@BongoEADGC6

BongoEADGC6 commented Feb 24, 2023

I'm still seeing this a lot on v2.6.2 with two different MetalLB deployments. It constantly loops over them, and orphanedResources is not in the project spec for default.

@roeizavida

In v2.6.1 with ignoreAggregatedRoles: true, ignoreResourceStatusField: all, timeout.reconciliation: 300s and increased polling interval for Keda, the issue is still present. The application controller (4 replicas) is using 16 CPUs with ~280 applications.

@neiljain

neiljain commented Mar 2, 2023

ArgoCD version:

{
    "Version": "v2.5.7+e0ee345",
    "BuildDate": "2023-01-18T02:23:39Z",
    "GitCommit": "e0ee3458d0921ad636c5977d96873d18590ecf1a",
    "GitTreeState": "clean",
    "GoVersion": "go1.18.10",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KustomizeVersion": "v4.5.7 2022-08-02T16:35:54Z",
    "HelmVersion": "v3.10.3+g835b733",
    "KubectlVersion": "v0.24.2",
    "JsonnetVersion": "v0.18.0"
}

we even bumped timeout.reconciliation from 30m to 2h, but that didn't help.

we ran into this issue when using custom plugins for our applications:

      plugin:
        env: []
        name: custom-plugin
      repoURL: ssh://git@<your-repo-server>/argo/deploy-sample-app.git
      targetRevision: main

and noticed the following log line in the application controller:
{"application":"argocd/deploy-sample-app","level":"info","msg":"Refreshing app status (spec.source differs), level (3)","time":"2023-03-02T06:16:35Z"}

With multiple test environments configured to use ArgoCD and hundreds of Argo apps per environment, this crashed our Git servers every couple of days.

So we had to add the following dummy env var to stop the constant refreshes of the app:

      plugin:
        env:
        - name: DUMMY_VAR_TO_STOP_ARGO_REFRESH
          value: "true"

@nferro

nferro commented Mar 10, 2023

I'm also seeing this issue with AzureKeyVaultSecret

argocd-application-controller-8] time="2023-03-10T00:39:15Z" level=debug msg="Refreshing app argocd/application for change in cluster of object namespace/avk of type spv.no/v1/AzureKeyVaultSecret"

this then triggers a level (1) refresh that takes a long time:

[argocd-application-controller-8] time="2023-03-10T00:39:14Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application= argocd/application

@agaudreault
Member

The behavior can be configured with ignoreResourceUpdates to resolve this issue.
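For example, for the Keda case mentioned above, something along these lines in argocd-cm should stop the status churn from triggering refreshes (a sketch based on the ignoreResourceUpdates documentation; adjust the group/kind and paths to whatever is churning in your cluster):

data:
  resource.customizations.ignoreResourceUpdates.keda.sh_ScaledObject: |
    jsonPointers:
    - /status/lastActiveTime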

@tooptoop4

@Funk66 did you submit a PR for #8100 (comment)?

@Funk66
Author

Funk66 commented May 13, 2024

I tried implementing a fix but couldn't make it work fully. I may try again in the coming weeks, if nobody else does.
