ArgoCD Application Controller replica hangs/stuck during initialization of the cluster cache (v2.4.12) #10842
Comments
I tried to debug this further, and it looks like a recent commit introduced an RLock() that seems to be causing a deadlock when the cache invalidation flow is triggered. Below is the goroutine responsible for invalidating the liveStateCache and clusterCache.
The goroutine above is waiting to acquire a write lock on the respective *clusterCache (on line 409 of the linked file). Now, if we look at the goroutines I shared in my previous comments, one of them in particular got into a deadlocked state while trying to acquire the read lock when it was still holding multiple write locks.
Since this goroutine is holding the write lock on a *clusterCache object and is deadlocked, while the other goroutine (responsible for invalidating the liveStateCache/clusterCache) is waiting to acquire a write lock on the same *clusterCache object, the lock on the *clusterCache never gets released, and the goroutine that invalidates/reinitializes the cache is stuck waiting for it forever. I tried removing/commenting out the RLock() (which initially caused the deadlock) with a test patch of ours, and that seems to resolve the issue: the logs show that invalidation and reinitialization of the caches complete fine and no longer hang. I also wanted to verify that this is the same *clusterCache object where the lock contention is happening, so I looked at the addresses in the goroutine stack frames, like the one below:
My assumption is that in the above stack frame the first address (0xc0019aef00) refers to the receiver object, followed by the addresses of the function arguments, but I am not sure about that. If that assumption is correct, then it is the same *clusterCache object. Do we really need the RLock() introduced with that commit in cache.go, or can we get rid of it since the same goroutine already acquires a write lock, and thereby avoid the deadlock? Any feedback or comments on this are highly appreciated.
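For context on why this pattern deadlocks: Go's sync.RWMutex is not reentrant, so a goroutine that already holds the write lock and then calls RLock() on the same mutex blocks itself, and every other goroutine waiting on Lock() queues up behind it. Below is a minimal, self-contained sketch of the pattern described above (illustrative only; the clusterCache type and its sync/readSomething/invalidate methods are stand-ins, not the actual gitops-engine code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Illustrative stand-in for a cluster cache guarded by a sync.RWMutex;
// this is not the real ArgoCD/gitops-engine type.
type clusterCache struct {
	lock sync.RWMutex
}

func (c *clusterCache) sync() {
	c.lock.Lock() // write lock held for the whole sync
	defer c.lock.Unlock()

	// Calling a helper that takes a read lock on the same mutex deadlocks:
	// sync.RWMutex is not reentrant, and RLock blocks while the write lock
	// is held, even by the same goroutine.
	c.readSomething()
}

func (c *clusterCache) readSomething() {
	c.lock.RLock() // blocks forever: the caller already holds the write lock
	defer c.lock.RUnlock()
}

func (c *clusterCache) invalidate() {
	c.lock.Lock() // blocks forever behind the stuck sync goroutine
	defer c.lock.Unlock()
	fmt.Println("cache invalidated") // never reached
}

func main() {
	c := &clusterCache{}
	go c.sync() // hangs inside readSomething
	time.Sleep(100 * time.Millisecond)
	go c.invalidate() // hangs waiting for the write lock
	time.Sleep(time.Second)
	fmt.Println("both goroutines are blocked on the same *clusterCache lock")
}
```

Dropping the redundant RLock() while the write lock is already held, as in the test patch mentioned above, breaks this cycle.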
We are affected as well. Please just bear in mind that not everyone has access to the underlying infrastructure to restart the pod, so this could be a big issue for some and a smaller one for others.
We are experiencing the very same problem on 2.5.10. We added a persistent volume at /home/argocd/.kube so that the discovery cache folder is persisted across restarts, and it mitigated the problem, but when we add a new cluster without a cache (or the cache TTL expires), the application-controller hangs in the discovery phase, blocking all other deploys. This happens only on some clusters and not on others, and seems related to the CRDs installed on the server, but we haven't found any evidence of a specific CRD issue yet. Any suggestion on this? The new Argo version should include a Kubernetes client library > 1.24, so it should not suffer from discovery throttling, but the evidence is against this assumption.
@decodingahmed we may have found a solution to the problem. It turned out we had an HAProxy in front of our target cluster in HTTP/L7 mode doing SSL offload towards the K8s API server. We removed the HTTP proxy and moved to TCP/L4; we had to re-generate the cluster API certificate in order to properly handle SSL with the HAProxy name and provide its CA to the Argo cluster configuration. With these changes the number of open connections on HAProxy dropped from 300+ to fewer than 100, and all Argo CD deployments got back to work with a significant performance improvement. My assumption is that the L7 balancer was not properly handling the "watch" connections Argo opens during the discovery phase, leaving them as zombies and saturating the pool. With L4 these connections are handled properly and recycled when needed. Let me know if this helps (any confirmation of my assumption from the ArgoCD team is more than welcome as well ;) )
Fixed by #13636
Checklist:
I've pasted the output of argocd version.
Describe the bug
Some of the applications get stuck in the refresh operation indefinitely until the application controller restarts. We have an Argo CD deployment with 3 replicas of the application controller and 2 replicas of argocd-server.
After some debugging, it looks like one of the application controller replicas doesn't complete the invalidation of the live state cache and its reinitialization. Once a live state cache invalidation is triggered, the problematic replica stops the automatic reconciliation of the applications it was responsible for; there is very minimal logging from that replica, its memory consumption remains constant from then on, and its CPU usage drops almost to zero. The symptoms point to a possible deadlock during the reinitialization of the cluster cache.
This problem is only seen with the applications of specific clusters, namely the clusters handled by the problematic application controller replica.
Restarting the application controller statefulset seems to resolve the issue.
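Since the symptoms suggest a deadlock, one practical way to confirm it is to capture a full goroutine dump from the stuck replica and look for goroutines that have been blocked on sync.RWMutex Lock/RLock for a long time. A minimal sketch of how a Go service can expose pprof for this purpose follows (illustrative only; this is not Argo CD's actual wiring, and the listen address is an assumption):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Illustrative only: expose pprof so goroutine stacks can be pulled from a
	// process that looks hung. The address is an assumption, not Argo CD's
	// actual configuration.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the controller's long-running work
}
```

With that in place, fetching http://localhost:6060/debug/pprof/goroutine?debug=2 prints every goroutine with its full stack; goroutines that have been parked in sync.(*RWMutex).Lock or RLock for minutes show up with their wait duration in the dump and are the ones to inspect.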
Similar issue reported here #8116
To Reproduce
Restarting/deleting one of the two argocd-server pods seems to trigger the invalidation of the cluster cache in all the application controller replicas, after which one of the replicas shows the hang symptoms explained above.
Expected behavior
Invalidation of the cluster cache and its reinitialization should complete without any issues for all application controller replicas. No application should be stuck in the refresh operation, and automatic reconciliation of applications should run without any issues.
Screenshots
Version
v2.4.12
Logs
Here are the logs at the time of cache invalidation, with some cluster URLs redacted for privacy reasons.
Logs from another replica with proper invalidation and reinitialization of the cluster cache
Note: To validate the reinitialization of the cache, look for a "live state cache invalidated" log entry and then "Start syncing cluster" logs for all the clusters assigned to this replica.
Logs from the problematic replica
Note: No "live state cache invalidated" and "Start syncing cluster" logs appear after the live state cache invalidation event is triggered.