
ArgoCD Application Controller replica hangs/stuck during initialization of the cluster cache (v2.4.12) #10842

Closed
3 tasks done
rahul-mourya opened this issue Oct 7, 2022 · 5 comments
Labels
bug (Something isn't working), component:core (Syncing, diffing, cluster state cache)

Comments

rahul-mourya commented Oct 7, 2022

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Some of the applications get stuck at the refresh operation indefinitely until the application controller restarts. Our Argo CD deployment runs 3 replicas of the application controller and 2 replicas of argocd-server.
After some debugging, it looks like one of the application controller replicas does not complete the invalidation and reinitialization of the live state cache. Once a live state cache invalidation is triggered, the problematic replica stops automatic reconciliation of the applications it is responsible for, logging from that replica becomes very minimal, its memory consumption stays constant from then on, and its CPU usage drops to almost zero. The symptoms point to a possible deadlock during reinitialization of the cluster cache.
The problem is only seen with applications targeting the clusters handled by the problematic application controller replica.

Restarting the application controller statefulset seems to resolve the issue.
A similar issue was reported in #8116.
To Reproduce

Restarting/deleting one of the two argocd-server pods seems to trigger the invalidation of the cluster cache in all application controller replicas, after which one of the replicas shows the hang symptoms described above.
Expected behavior
Invalidation of the cluster cache and its reinitialization should complete without issues in all application controller replicas. No application should be stuck in a refresh operation, and automatic reconciliation of applications should run without issues.
Screenshots

Version
v2.4.12

argocd-server: v2.4.12+28b8fea.dirty
  BuildDate: 2022-09-19T03:28:31Z
  GitCommit: 28b8fea2e68a931543b05e988e78241eb9487058
  GitTreeState: dirty
  GoVersion: go1.18.6
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: unknown 1970-01-01T00:00:00Z
  Helm Version: v3.9.0+g7ceeda6
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0

Logs
Here are the logs from the time of the cache invalidation, with cluster URLs redacted for privacy.
Logs from another replica showing proper invalidation and reinitialization of the cluster cache:
Note: to validate the reinitialization of the cache, look for the "live state cache invalidated" log followed by "Start syncing cluster" logs for all the clusters assigned to this replica.


time="2022-09-21T03:27:38Z" level=info msg="Start syncing cluster" server="https://redacted"
time="2022-09-21T03:27:38Z" level=info msg="live state cache invalidated"
time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://kubernetes.default.svc"
time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted 3"
time="2022-09-21T03:27:38Z" level=info msg="invalidating live state cache"

Logs from the problematic replica:
Note: no "live state cache invalidated" or "Start syncing cluster" logs appear after the live state cache invalidation event is triggered.


time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
time="2022-09-21T03:27:38Z" level=info msg="invalidating live state cache"

rahul-mourya added the bug label on Oct 7, 2022
jgwest added the component:core label on Oct 7, 2022

rahul-mourya commented Nov 9, 2022

I tried to debug this further, and it looks like a recent commit introduced an RLock() which seems to cause a deadlock when the cache invalidation flow is triggered.

Below is the goroutine responsible for invalidating the liveStateCache and clusterCache:

goroutine 160 [semacquire, 67 minutes]:
sync.runtime_SemacquireMutex(0xc03ff94750?, 0x30?, 0xc03c6efbb0?)
	/usr/local/go/src/runtime/sema.go:71 +0x25
sync.(*Mutex).lockSlow(0xc0019aefa0)
	/usr/local/go/src/sync/mutex.go:162 +0x165
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0x20300f?)
	/usr/local/go/src/sync/rwmutex.go:139 +0x36
github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).Invalidate(0xc0019aef00, {0xc0718169b0, 0x1, 0x1?})
	/go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.3/pkg/cache/cluster.go:409 +0x68
github.com/argoproj/argo-cd/v2/controller/cache.(*liveStateCache).invalidate(0xc000220a80, {{{0x44a26e0, 0xc0523aea50}, {0x449b1e0, 0xc0523aea20}}, {0x33214a0, 0x1a}, {0x32e3233, 0x5}})
	/go/src/github.com/argoproj/argo-cd/controller/cache/cache.go:477 +0x284
github.com/argoproj/argo-cd/v2/controller/cache.(*liveStateCache).watchSettings(0xc000220a80, {0x44c4140, 0xc00032d080})
	/go/src/github.com/argoproj/argo-cd/controller/cache/cache.go:588 +0x371
created by github.com/argoproj/argo-cd/v2/controller/cache.(*liveStateCache).Run
	/go/src/github.com/argoproj/argo-cd/controller/cache/cache.go:611 +0xfb

This goroutine is waiting to acquire a write lock on the respective *clusterCache object (line 409 in cluster.go).

Now, looking at the goroutines I shared above in my previous comments, especially goroutine 16723:

goroutine 16723 [semacquire, 13 minutes]:
sync.runtime_SemacquireMutex(0xc065ce17d0?, 0x80?, 0x4?)
  /usr/local/go/src/runtime/sema.go:71 +0x25
sync.(*RWMutex).RLock(...)
  /usr/local/go/src/sync/rwmutex.go:63
github.com/argoproj/argo-cd/v2/controller/cache.(*liveStateCache).getCluster.func1(0xc0019aef00?, 0x1)
  /go/src/github.com/argoproj/argo-cd/controller/cache/cache.go:404 +0x85
github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).newResource(0xc0019aef00, 0xc065ce17d0)
  /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.3/pkg/cache/cluster.go:361 +0x90
github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).processEvent(0xc0019aef00, {0xc034da1008, 0x8}, 0x0?)
  /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.3/pkg/cache/cluster.go:1014 +0x3d0
github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents.func1()
  /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.3/pkg/cache/cluster.go:618 +0x633
github.com/argoproj/gitops-engine/pkg/utils/kube.RetryUntilSucceed.func1()
  /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.3/pkg/utils/kube/kube.go:404 +0xec
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x18, 0xc000508000})
  /go/pkg/mod/k8s.io/apimachinery@v0.23.1/pkg/util/wait/wait.go:220 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x44c4140?, 0xc00e34cec0?}, 0xc0454e7ca0?)
  /go/pkg/mod/k8s.io/apimachinery@v0.23.1/pkg/util/wait/wait.go:233 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x44c4140, 0xc00e34cec0}, 0xe0?, 0x1271765?, 0xc04c941500?)
  /go/pkg/mod/k8s.io/apimachinery@v0.23.1/pkg/util/wait/wait.go:580 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext({0x44c4140, 0xc00e34cec0}, 0x40?, 0xc000508000?)
  /go/pkg/mod/k8s.io/apimachinery@v0.23.1/pkg/util/wait/wait.go:545 +0x49
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil(0xc028da4bc0?, 0x7f08f634eba0?, 0x48?)
  /go/pkg/mod/k8s.io/apimachinery@v0.23.1/pkg/util/wait/wait.go:536 +0x7c
github.com/argoproj/gitops-engine/pkg/utils/kube.RetryUntilSucceed({0x44c4140?, 0xc028da4bc0?}, 0x3b9aca00, {0xc054787360, 0x48}, {{0x44c9e58?, 0xc002188400?}, 0x0?}, 0xc023518820)
  /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.3/pkg/utils/kube/kube.go:402 +0x132
github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents(_, {_, _}, {{{0x0, 0x0}, {0xc026b724d0, 0x9}}, {{0xc026b724c0, 0xa}, {0x0, ...}, ...}, ...}, ...)
  /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.3/pkg/cache/cluster.go:534 +0x2df
created by github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).sync.func1.1
  /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.3/pkg/cache/cluster.go:779 +0x30a

This goroutine got into a deadlock state when trying to acquire a read lock while already holding multiple write locks:

  1. It acquired a write lock on the *liveStateCache object (line 375 in cache.go).
  2. It acquired a write lock on the *clusterCache object (line 1006 in cluster.go).
  3. It is waiting to acquire an RLock on the same *liveStateCache object (line 404 in cache.go) ---> this caused the deadlock, since it already holds the write lock on that same cache object.

Since this goroutine holds the write lock on a *clusterCache object and is deadlocked, while the other goroutine (responsible for invalidating the liveStateCache/clusterCache) is waiting to acquire a write lock on that same *clusterCache object, the lock on the *clusterCache object is never released. As a result, the goroutine that invalidates and reinitializes the cache is also stuck forever waiting for the lock.
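For readers less familiar with Go's locking semantics, here is a minimal, self-contained sketch (hypothetical names, not the actual Argo CD types) of the pattern described in step 3: sync.RWMutex is not reentrant, so a goroutine that already holds the write lock blocks forever if it tries to take the read lock on the same mutex.

package main

import (
	"fmt"
	"sync"
)

// liveStateCache here is a hypothetical stand-in used only to illustrate the
// locking pattern; it is not the real Argo CD struct.
type liveStateCache struct {
	lock sync.RWMutex
}

func (c *liveStateCache) doWork() {
	c.lock.Lock() // write lock held for the duration of doWork
	defer c.lock.Unlock()

	// A callback further down the call chain takes the read lock on the same
	// mutex. Because sync.RWMutex is not reentrant, RLock blocks until the
	// write lock is released, which never happens on this goroutine.
	callback := func() {
		c.lock.RLock()
		defer c.lock.RUnlock()
		fmt.Println("never reached")
	}
	callback()
}

func main() {
	c := &liveStateCache{}
	// In this toy program the runtime aborts with "all goroutines are asleep -
	// deadlock!"; in a real process with other live goroutines the goroutine
	// simply hangs, as seen in the traces above.
	c.doWork()
}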

I tried removing/commenting out the RLock() (which initially caused the deadlock) in a test patch, and it seems to resolve the issue: the logs show that the invalidation and reinitialization of the caches complete fine and no longer hang.

I wanted to verify that this lock contention is happening on the same *clusterCache object, so I looked at the addresses in the goroutine stack frames, for example:

github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).processEvent(0xc0019aef00, {0xc034da1008, 0x8}, 0x0?)

My assumption is that in the above stack frame the first address (0xc0019aef00) refers to the receiver object, followed by the addresses of the function arguments, but I am not sure about that. If the assumption is correct, then it is the same *clusterCache object, with address 0xc0019aef00, in both of the goroutines above, and that is the object on which the lock contention is happening.
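A small illustration of why that assumption is reasonable: in Go, a method with a pointer receiver is effectively a function whose first parameter is the receiver, and the traceback prints the argument values in that order, so the first address in a frame such as processEvent(0xc0019aef00, ...) corresponds to the *clusterCache receiver (values printed with a trailing "?" may be inaccurate on newer Go versions). The snippet below uses a hypothetical counter type, not Argo CD code.

package main

import "fmt"

type counter struct{ n int }

func (c *counter) add(delta int) { c.n += delta }

func main() {
	// A method expression exposes the receiver as an explicit first
	// parameter, which is why stack traces list the receiver's address
	// before the method's own arguments.
	var f func(*counter, int) = (*counter).add
	c := &counter{}
	f(c, 2)
	fmt.Println(c.n) // prints 2
}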

Do we really need the RLock() introduced with the commit in cache.go here, or can we get rid of it, given that the same goroutine already acquires a write lock, and thereby avoid the deadlock?
I understand that these code changes were included as part of the fix for another issue in that commit. But I do not completely understand why, since we already read and initialize the cacheSettings object at the start of the getCluster func, we need to read it again in the callback func. Should we avoid the re-read and reuse the same cacheSettings object in the callback function (see the sketch below)?
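A rough sketch of what that alternative could look like, using hypothetical stand-in types rather than the real Argo CD code: the settings are read once under the lock at the top of getCluster, and the resource-info callback closes over that snapshot, so it never needs to take the liveStateCache lock again while the cluster cache holds its own lock.

package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for illustration only; not the real Argo CD types.
type cacheSettings struct {
	resourceOverridesVersion string
}

type liveStateCache struct {
	lock          sync.RWMutex
	cacheSettings cacheSettings
}

// getCluster reads the settings once while holding the read lock and returns
// a resource-info callback that closes over that snapshot, so the callback
// never has to re-acquire c.lock.
func (c *liveStateCache) getCluster() func(resourceName string) {
	c.lock.RLock()
	settings := c.cacheSettings // snapshot captured under the lock
	c.lock.RUnlock()

	return func(resourceName string) {
		// Uses the captured snapshot instead of calling c.lock.RLock() again.
		fmt.Printf("populating info for %s with settings %+v\n", resourceName, settings)
	}
}

func main() {
	c := &liveStateCache{cacheSettings: cacheSettings{resourceOverridesVersion: "v1"}}
	populate := c.getCluster()
	populate("example-deployment")
}

Whether this is acceptable depends on how stale a settings snapshot may be between cache invalidations, which is exactly the trade-off that would need to be confirmed.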

Any feedback or comments in this context are highly appreciated.

@decodingahmed

We are on v2.5.2+148d8da and we have also experienced this. Restarting the argocd-application-controller pod seems to have done the trick.

Please bear in mind that not everyone has access to the underlying infrastructure to restart the pod, so this could be a big issue for some and a smaller one for others.

@aroundthecode

We are experiencing the very same problem on 2.5.10. We added a persistent volume at /home/argocd/.kube so that the discovery cache folder is persisted across restarts, which mitigated the problem, but when we add a new cluster without a cache (or the cache TTL expires) the application-controller hangs in the discovery phase, blocking all other deployments.

This happens only on some clusters and not on others, and it seems related to the CRDs installed on the server, but we have not found any evidence of a specific problematic CRD yet.

Any suggestions on this? Newer Argo versions should include a Kubernetes client library > 1.24, so they should not suffer from discovery throttling, but the evidence is against this assumption.

@aroundthecode

@decodingahmed we may have found a solution to the problem. It turned out we had an HAProxy in front of our target cluster in HTTP/L7 mode, doing SSL offload towards the K8s API server.

We removed the HTTP proxying and moved to TCP/L4; we had to re-generate the cluster API certificate in order to properly handle SSL with the HAProxy name, and provide its CA to the Argo cluster configuration.

With these changes the number of open connections on HAProxy dropped from 300+ to fewer than 100, and all Argo CD deployments got back to work with a significant performance improvement.

My assumption is that the L7 balancer was not properly handling the "watch" connections Argo opens during the discovery phase, leaving them as zombies and saturating the pool. With L4 these connections are handled properly and recycled when needed.

Let me know if this helps (any confirmation of my assumption from the ArgoCD team is more than welcome as well ;) )


alexmt commented Jul 25, 2023

Fixed by #13636

@alexmt alexmt closed this as completed Jul 25, 2023