Sync operation getting stuck with high number of resources in Application #14224

Status: Open
bartoszbryk opened this issue Jun 27, 2023 · 7 comments
Labels: bug (Something isn't working), type:scalability (Issues related to scalability and performance)

bartoszbryk commented Jun 27, 2023

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Argo CD doesn't finish syncing the application; the sync appears to get stuck when the Application contains a high number of resources (around 4000). The sync also cannot be terminated in this state; only deleting the application-controller pod helps. However, all resources in the application appear to be synced, and the logs don't indicate any reason for the sync being stuck.

To Reproduce

Create an application with a high number of resources (in our case, around 4000 Kafka topics and users) and let it sync automatically.
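
For illustration only, a minimal sketch of such an Application (the repo URL, chart path, and names are hypothetical placeholders; the real chart renders ~4000 Strimzi KafkaTopic/KafkaUser resources):

```yaml
# Hypothetical example only: an auto-synced Application whose source chart
# renders a very large number of KafkaTopic/KafkaUser resources.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: appname
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/org/kafka-resources.git   # placeholder
    targetRevision: main
    path: charts/kafka-resources                               # placeholder chart path
  destination:
    server: https://kubernetes.default.svc
    namespace: appname
  syncPolicy:
    automated: {}   # automatic sync, as in the report
```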

Expected behavior

The sync finishes successfully.

Screenshots
(two screenshots attached)

Version

argocd: v2.6.1+3f143c9.dirty
  BuildDate: 2023-02-08T22:21:48Z
  GitCommit: 3f143c9307f99a61bf7049a2b1c7194699a7c21b
  GitTreeState: dirty
  GoVersion: go1.19.5
  Compiler: gc
  Platform: darwin/arm64
argocd-server: v2.6.3+e05298b
  BuildDate: 2023-02-27T14:40:19Z
  GitCommit: e05298b9c6ab8610104271fa8491f019fee3c587
  GitTreeState: clean
  GoVersion: go1.18.10
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.5.7 2022-08-02T16:35:54Z
  Helm Version: v3.10.3+g835b733
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.19.1

Logs

time="2023-06-27T11:53:46Z" level=info msg="Refreshing app status (comparison expired, requesting refresh. reconciledAt: 2023-06-27 11:46:28 +0000 UTC, expiry: 6m0s), level (2)" application=argocd/appname
time="2023-06-27T11:53:46Z" level=info msg="Comparing app state (cluster: <url>, namespace: appname)" application=argocd/appname
time="2023-06-27T11:53:46Z" level=info msg="Resuming in-progress operation. phase: Running, message: " application=argocd/appname
time="2023-06-27T11:53:46Z" level=info msg="Comparing app state (cluster: <url>, namespace: appname)" application=argocd/appname
time="2023-06-27T11:54:08Z" level=info msg="getRepoObjs stats" application=argocd/appname build_options_ms=0 helm_ms=21842 plugins_ms=0 repo_ms=0 time_ms=21962 unmarshal_ms=120 version_ms=0
time="2023-06-27T11:54:11Z" level=info msg="getRepoObjs stats" application=argocd/appname build_options_ms=0 helm_ms=22072 plugins_ms=0 repo_ms=0 time_ms=24971 unmarshal_ms=2898 version_ms=0
time="2023-06-27T11:54:14Z" level=info msg=Syncing application=argocd/appname skipHooks=false started=false syncId=00002-Jebev
time="2023-06-27T11:54:15Z" level=info msg="Tasks (dry-run)" application=argocd/appname syncId=00002-Jebev tasks="[Sync/0 resource kafka.strimzi.io/KafkaUser:appname/argouser obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-0 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-10 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-100 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1000 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1001 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1002 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1003 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1004 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1005 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1006 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1007 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1008 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1009 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-101 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1010 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1011 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1012 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1013 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1014 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1015 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1016 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1017 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1018 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1019 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-102 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1020 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1021 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1022 obj->obj (,,), Sync/0 resource (....)
time="2023-06-27T12:09:26Z" level=info msg="Skipping auto-sync: another operation is in progress" application=argocd/appname
time="2023-06-27T12:09:26Z" level=info msg="Updated sync status: OutOfSync -> Synced" application=appname dest-namespace=appname dest-server="<url>" reason=ResourceUpdated type=Normal
time="2023-06-27T12:09:26Z" level=info msg="Updated health status: Progressing -> Healthy" application=appname dest-namespace=appname dest-server="<url>" reason=ResourceUpdated type=Normal
time="2023-06-27T12:09:27Z" level=info msg="Update successful" application=argocd/appname
time="2023-06-27T12:09:27Z" level=info msg="Reconciliation completed" application=argocd/appname dedup_ms=36 dest-name= dest-namespace=appname dest-server="<url>" diff_ms=2478 fields.level=1 git_ms=134 health_ms=1745 live_ms=197 settings_ms=0 sync_ms=21 time_ms=8011
bartoszbryk added the bug label on Jun 27, 2023
blakepettersson (Member) commented:

There were some deadlock fixes that went into 2.6.8; could you try that version and see if the issue persists?


deadlysyn commented Jun 28, 2023

We're running Argo CD + Rollouts + Image Updater, all on the latest release, doing blue/green deployments with pre- and post-analysis. We routinely see similar behavior in a large, though significantly smaller, app (a few hundred resources). Providing context in case it helps, since this is the closest bug report I've found to what we see:

    "Version": "v2.7.6+00c914a.dirty",
    "BuildDate": "2023-06-20T20:51:13Z",
    "GitCommit": "00c914a948d9e8ad99be8bd82a368fbdeba12f88",
    "GitTreeState": "dirty",
    "GoVersion": "go1.19.10",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KustomizeVersion": "v5.0.1 2023-03-14T01:32:48Z",
    "HelmVersion": "v3.11.2+g912ebc1",
    "KubectlVersion": "v0.24.2",
    "JsonnetVersion": "v0.19.1"
  • app enters a perpetual progressing state (usually run with auto-sync)
  • rollout appears to think it's done, so can't abort
  • to try and force an update, disable auto-sync and manually override with a newer image
  • rollout scales new replica set, orphans the previous (needs manual deletion)
  • appears stuck in loop, new analysis run does not start
  • repeat for as many images as you like (seems stuck here)
  • did some housekeeping (manually deleted all unused ReplicaSets and AnalysisRuns); I know this shouldn't be necessary or affect anything, but I was trying things out of desperation
  • hit "Sync", which scaled another ReplicaSet with the latest image and created an AnalysisRun; behavior appears to have returned to normal

We are still root-causing, but the main log line we see during this "stuck" state is "another operation is in progress", which brought me here.

bartoszbryk (Author) commented:

Updating to 2.6.8 didn't help.
@deadlysyn We are observing the same log message in our case: "another operation is in progress".

crenshaw-dev (Member) commented Jul 14, 2023

@deadlysyn I think your issue may be different from @bartoszbryk's. It's just a theory, but I have recent experience that makes me think so.

@bartoszbryk I think what's happening is that, when Argo CD is trying to sync your ~4000 resources, it's trying to patch status.operationState with the current state of the sync operation. But because the resource has gotten so big, the Kubernetes API is rejecting the patch. You can validate this theory by searching your k8s API logs for patches against that resource. You should see error responses.

You will not see corresponding Argo CD error logs because, currently, we don't log errors encountered when updating the operation state; we just retry indefinitely. I've put up a PR to fix this. Incidentally, the retries spam the k8s API with requests that are doomed to fail. Intuit's k8s team was the first to notice the issue, due to an elevated number of error responses.

We bought ourselves a little time by setting controller.resource.health.persist to "false". This offloaded some of the .status field to Redis, which got us back under the k8s resource size limit.
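
For reference, a sketch of how that setting is applied; it normally goes into the argocd-cmd-params-cm ConfigMap (verify the key against the docs for your Argo CD version, and note that the application-controller has to be restarted to pick it up):

```yaml
# Sketch: disabling persistence of resource health in the Application CR,
# as described above. Check your Argo CD version's argocd-cmd-params-cm
# documentation and restart the application-controller afterwards.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  controller.resource.health.persist: "false"
```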

But we quickly hit the limit again as the number of resources increased. We ended up splitting the app into two apps to get back under the limit. But it's just a band-aid.
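
Purely as an illustration of that workaround (all names and paths are hypothetical, and how you split depends on how the chart is parameterized), the single large Application can be replaced by two Applications that each render a subset of the resources, for example via separate value files:

```yaml
# Hypothetical sketch: two Applications, each rendering roughly half of the
# KafkaTopic/KafkaUser resources via its own values file.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: appname-part-1
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/org/kafka-resources.git   # placeholder
    targetRevision: main
    path: charts/kafka-resources
    helm:
      valueFiles:
        - values-part-1.yaml   # first half of the topics/users
  destination:
    server: https://kubernetes.default.svc
    namespace: appname
  syncPolicy:
    automated: {}
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: appname-part-2
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/org/kafka-resources.git   # placeholder
    targetRevision: main
    path: charts/kafka-resources
    helm:
      valueFiles:
        - values-part-2.yaml   # second half of the topics/users
  destination:
    server: https://kubernetes.default.svc
    namespace: appname
  syncPolicy:
    automated: {}
```

The point is simply that each Application's status then stays under the resource size limit described above, which is why it works as a band-aid rather than a real fix.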

I've opened an issue to brainstorm ways to get Argo CD to gracefully handle large apps. I've scheduled time at the next SIG Scalability meeting to discuss as well.

Please let me know if this theory matches what you're seeing. I'd love to help work out a solution.

crenshaw-dev added the type:scalability label on Jul 14, 2023
PavelPikat commented:

@crenshaw-dev Were there any findings/actions from the SIG meeting?

crenshaw-dev (Member) commented:

@PavelPikat only that the idea of compressing the status seems like a reasonable way to counteract the problem.

karanrn commented Aug 29, 2024

Any updates or solutions to this issue? We are facing it too.
