cache.WaitForCacheSync may never exit on shutdown #41
I have a suspicion that we are waiting for the Kubernetes provider to exit:
That stack trace leads us into this function in the autodiscover library: elastic-agent-autodiscover/kubernetes/watcher.go Lines 185 to 187 in 4369c6b
if !cache.WaitForCacheSync(w.ctx.Done(), w.informer.HasSynced) {
return fmt.Errorf("kubernetes informer unable to sync cache")
}
Looking at the Kubernetes provider stack trace above:
I think we might lose the context. Specifically, I don't see us selecting on or checking the Done channel of the embedded context. I see this entire section that creates a K8S client and eventually starts a Pod eventer, but I don't see the context used, or anything else that would tie the lifetime of requests or goroutines to the parent context. We eventually call the pod watcher Start method, which is in the stack trace:
I don't see the node watcher being tied to the parent context anywhere. I suspect that when the agent tries to exit and cancels the context of the dynamic variable provider here, it never propagates into the Pod watch object started here. @ChrsMark or @MichaelKatsoulis, I see both your names in the commit log in this area; does this seem like a possible explanation to you for why the K8S provider would deadlock the agent shutdown sequence? Any ideas on the best possible fix?
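To illustrate why that would deadlock shutdown, here is a minimal, self-contained sketch (assuming `k8s.io/client-go` is available; the sync function is a stand-in, not the watcher's real informer). `cache.WaitForCacheSync` only returns when all caches report synced or its stop channel closes, so a stop channel taken from a context nobody ever cancels blocks it forever:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

func main() {
	// Stand-in for an informer that never finishes syncing, e.g. because
	// the API server is unreachable while the agent is shutting down.
	neverSynced := func() bool { return false }

	// Case 1: a cancellable context. Cancelling it closes Done() and
	// unblocks the wait, which then returns false.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(200 * time.Millisecond)
		cancel() // simulate the agent asking the provider to shut down
	}()
	fmt.Println("synced:", cache.WaitForCacheSync(ctx.Done(), neverSynced))

	// Case 2: a context derived from context.TODO() whose cancel func is
	// only reachable from an explicit Stop() that is never called, as in
	// the watcher above. Nothing on the shutdown path can unblock the wait;
	// the timeout below exists only to keep this sketch runnable.
	stuckCtx, stuckCancel := context.WithCancel(context.TODO())
	defer stuckCancel() // runs only when main exits; nothing cancels it during the wait
	done := make(chan bool, 1)
	go func() { done <- cache.WaitForCacheSync(stuckCtx.Done(), neverSynced) }()
	select {
	case r := <-done:
		fmt.Println("synced:", r)
	case <-time.After(2 * time.Second):
		fmt.Println("WaitForCacheSync is still blocked -- the shutdown deadlock")
	}
}
```

Case 2 mirrors the suspicion above: the internal stop channel exists, but nothing in the shutdown path ever closes it.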
It seems like the Pod watcher has a context internally and we just don't set it: elastic-agent-autodiscover/kubernetes/watcher.go Lines 181 to 187 in 4369c6b
// Start watching pods
func (w *watcher) Start() error {
go w.informer.Run(w.ctx.Done())
if !cache.WaitForCacheSync(w.ctx.Done(), w.informer.HasSynced) {
return fmt.Errorf("kubernetes informer unable to sync cache")
}
Yep, we are using a placeholder context here: elastic-agent-autodiscover/kubernetes/watcher.go Lines 128 to 134 in 4369c6b
ctx, cancel := context.WithCancel(context.TODO())
w := &watcher{
client: client,
informer: informer,
store: store,
queue: queue,
ctx: ctx,
I believe we need to expose the ability to pass in the parent context.
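One possible shape of that fix, sketched below purely for illustration (the type and constructor names are hypothetical, not the library's actual API), is to derive the watcher's internal context from a caller-supplied parent instead of context.TODO(), so cancelling the agent's context also unblocks informer.Run and WaitForCacheSync:

```go
package kubernetes // illustrative package name, not the real one

import "context"

// watcher is a pared-down stand-in; the real struct also holds the client,
// informer, store, and queue.
type watcher struct {
	ctx    context.Context
	cancel context.CancelFunc
}

// NewWatcherWithContext is a hypothetical constructor: the only change from
// the current behavior is that the internal context is derived from the
// caller's parent rather than from context.TODO().
func NewWatcherWithContext(parent context.Context) *watcher {
	if parent == nil {
		parent = context.Background()
	}
	// Cancelling the parent (e.g. on agent shutdown) now closes ctx.Done(),
	// which stops the informer and unblocks WaitForCacheSync.
	ctx, cancel := context.WithCancel(parent)
	return &watcher{ctx: ctx, cancel: cancel}
}

// Stop still works for explicit teardown, independent of the parent.
func (w *watcher) Stop() {
	w.cancel()
}
```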
Thanks for all the research here @cmacknz! I had a quick look into this, and what you suggest as a change looks valid in general. However, I was wondering how it would be possible for a goroutine (provider/watcher, etc.) to make the whole program get stuck? The provider in general should not be waited on if the controller/Agent wants to exit. Or is this behavior expected based on the Agent's architecture?
Yes, in that the agent is explicitly waiting for the providers to exit gracefully in this sequence; see in particular that it is waiting on the
The decision will ultimately come down to how much the agent should trust the providers. The agent doesn't trust inputs or integrations to behave reasonably at all, as they are external to the agent, and will take the approach of force terminating them after 30 seconds. Since the providers are built directly into the agent itself, I think we should trust them to behave reasonably and assume they respect cancellation signals. In general it is a best practice (if not an actual necessity) to have the lifetime of each goroutine and network request tied to a cancellation signal, either for teardown or timeouts. We could equivalently fix this deadlock by making the agent give up on waiting for providers after a set period of time, but I would worry that is just hiding some other problem. For example, a provider might have API, session, or file resources it should clean up on shutdown instead of just abandoning them. I currently strongly prefer fixing the Kubernetes provider cancellation mechanism here, instead of just ignoring it, to ensure we aren't hiding another lower-level problem or allowing some future provider to misbehave by skipping a clean shutdown.
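As a generic illustration of that best practice (a sketch only, not the Kubernetes provider's actual code), a provider's main loop should always select on the context's Done channel so the shutdown signal can interrupt it promptly:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runProvider stands in for a provider's main loop: it does periodic work
// but always watches ctx.Done() so cancellation can interrupt it.
func runProvider(ctx context.Context) error {
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			// Clean up API sessions, watches, and files here before returning.
			return ctx.Err()
		case <-ticker.C:
			fmt.Println("provider doing work")
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	errCh := make(chan error, 1)
	go func() { errCh <- runProvider(ctx) }()

	time.Sleep(600 * time.Millisecond)
	cancel()             // the agent's shutdown signal
	fmt.Println(<-errCh) // returns promptly with "context canceled"
}
```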
Thanks for the explanation @cmacknz! I agree we need to ensure the graceful shutdown.
But we still properly close the channel by calling the
@cmacknz did you manage to reproduce it? Could you share the steps? I would like to troubleshoot it further.
The ECK team originally caught this in their integration tests, and it was not 100% reproducible, so I would get in touch with them. The easiest way is probably to just follow up in the issue on the ECK side where we were originally diagnosing this: elastic/cloud-on-k8s#6331 (comment)
We eventually end up stuck in
That looks weird. I cannot see why the
If WaitForCacheSync was stalled, the watcher wouldn't do the watch process at all, which means no discovered Pods. Is this true for their use case? I'm not sure how this result happened tbh, but if WaitForCacheSync is stalled I think autodiscovery is not working at all, and it's not a matter of the close/stop process. @gizas do you have any other ideas/thoughts here?
I tried to follow the flow from the beginning so I might repeat things said above (sorry in advance):
When we want to run agent shutdown, this means that the coordinator should cancel a context and pass this down to composable, doesn't it? But we don't have this common context now.
If this is stalled, I agree that autodiscovery will completely fail. The only case I can think of is that we try to close this thread while it is still running.
@cmacknz I will refine my above answer with the following flow: the Coordinator runs the provider and passes ctx -----> we create another localCtx (based on the previous ctx) which is passed to the dynamic provider -----> for example, for the kubernetes.go provider we will stop it only when comm.Done() (https://github.com/elastic/elastic-agent/blob/main/internal/pkg/composable/providers/kubernetes/kubernetes.go#L92) is received. So I am not sure if we send such a signal from https://github.com/elastic/elastic-agent/blob/fdd1465eb0551eed52130810e52859658b5367f9/internal/pkg/agent/application/coordinator/coordinator.go#L557. We cannot replicate it at the moment. Do you have any ideas how we can replicate such a scenario in Kubernetes? Can we somehow stop the agent so that this might trigger it?
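For reference, this is how that chain is supposed to behave when the contexts are actually linked (a plain standard-library sketch, not the coordinator's or provider's real code): a localCtx derived from the coordinator's ctx is cancelled automatically when the parent is cancelled, so anything selecting on the child's Done channel is released without any extra signal.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	// Stand-in for the coordinator's context.
	parent, cancelParent := context.WithCancel(context.Background())

	// Stand-in for a localCtx derived from it and handed to a provider.
	localCtx, cancelLocal := context.WithCancel(parent)
	defer cancelLocal()

	cancelParent() // coordinator shuts down

	// The derived context is cancelled too; a provider selecting on
	// localCtx.Done() would unblock here.
	<-localCtx.Done()
	fmt.Println("localCtx:", localCtx.Err()) // context canceled
}
```

If any layer in the chain instead starts from context.TODO(), the chain is broken at that point and the cancellation never arrives, which is the suspicion for the watcher.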
The ECK team has integration tests that experience this so often they had to disable the tests, see elastic/cloud-on-k8s#6331 (comment) for example. I would ask the ECK team how to set up their test environment; you can probably just reply to that issue. The stack trace here isn't lying to us about where the agent is stuck, although it might not be exactly obvious why. My suspicion is still the same as in #41: I can see that elastic-agent-autodiscover's watcher.go is using a placeholder context that was not passed in by the caller, and therefore can never be cancelled or time out.
We changed the agent in elastic/elastic-agent#2352 so that it will no longer wait forever for the dynamic providers to exit. I've moved this into the autodiscover repository and reworded the description to specify that this is specifically about cache.WaitForCacheSync not being cancelled on shutdown.
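For illustration of the "stop waiting forever" approach (a generic sketch only; the actual change in elastic/elastic-agent#2352 may be implemented differently), the shutdown path can race the providers' clean exit against a deadline so a stuck provider can no longer block shutdown indefinitely:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitWithTimeout waits for the WaitGroup but gives up after the timeout,
// so one stuck provider cannot block shutdown forever.
func waitWithTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true // all providers exited cleanly
	case <-time.After(timeout):
		return false // timed out; log it and continue shutting down anyway
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1) // a provider that never calls Done(), like the stuck watcher
	if !waitWithTimeout(&wg, 2*time.Second) {
		fmt.Println("providers did not exit in time; continuing shutdown")
	}
}
```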
Closing this as per #41 (comment)
This was discovered as the root cause of intermittent failures in the ECK operator integration tests for Fleet. See elastic/cloud-on-k8s#6331 (comment) for the logs we get when this happens. Diagnostics including a goroutine dump are attached.
Contents of the agent's state.yaml when this happens:
fleet-server-deadlock-diagnostics.tar.gz
The coordinator seems to be stuck at:
This is the relevant code block:
https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/agent/application/coordinator/coordinator.go#L552-L568