Avoid infinite loop in Kubernetes watcher #6353
Conversation
In case of unmarshalling errors the Watcher was ending up in an infinite loop; this code ignores those errors and only restarts the watcher in case of EOF.
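For context, here is a minimal sketch of the behaviour described above. It assumes a hypothetical eventSource interface and watch function for illustration; the actual Beats watcher API differs.

```go
package sketch

import (
	"io"
	"log"
)

// eventSource is a stand-in for the Kubernetes watch stream used in
// libbeat/common/kubernetes/watcher.go; the real type is different.
type eventSource interface {
	Next() (interface{}, error) // next decoded event, or an error
	Close()
}

// watch drains events from src. Unmarshalling (and other non-EOF)
// errors are logged and skipped so the loop cannot spin forever on a
// single bad event; only EOF closes the stream and signals the caller
// to restart the watcher.
func watch(src eventSource, handle func(interface{})) error {
	for {
		obj, err := src.Next()
		if err != nil {
			if err == io.EOF || err == io.ErrUnexpectedEOF {
				src.Close()
				return err
			}
			log.Printf("ignoring watch error: %v", err)
			continue
		}
		handle(obj)
	}
}
```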
3b44e64 to 32debe1 (compare)
libbeat/common/kubernetes/watcher.go
Outdated
// In case of EOF, stop watching and restart the process
if err == io.EOF || err == io.ErrUnexpectedEOF {
	watcher.Close()
	time.Sleep(time.Second)
Can you please make this a binary back-off?
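For example, something along these lines: a rough sketch of a binary (exponential) back-off around the watcher restart, with made-up names and limits rather than the actual Beats implementation.

```go
package sketch

import (
	"io"
	"time"
)

// watchWithBackoff keeps restarting the watch loop after EOF, doubling
// the wait between attempts up to maxWait instead of always sleeping
// one second. start is expected to block until the watch ends and to
// return the terminating error.
func watchWithBackoff(start func() error) {
	const maxWait = 30 * time.Second
	wait := time.Second
	for {
		err := start()
		if err != io.EOF && err != io.ErrUnexpectedEOF {
			return // unexpected failure: give up and let the caller decide
		}
		time.Sleep(wait)
		if wait *= 2; wait > maxWait {
			wait = maxWait
		}
	}
}
```

In practice the wait would probably also be reset after a period of successful watching.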
Failing tests are unrelated to this PR.
We updated to
Uhm, could you paste the log output? Some things should at least have changed.
Sure:
Thanks for looking into it! Edit: I see this addition in this pull request:
However, there doesn't appear to be any wait between failures.
This is weird; it looks like the change is in, but the watcher connection backoff doesn't seem to be in place. Do you know, by any chance, which event is originating this? I haven't reproduced it myself.
So far I haven't been able to find a specific suspicious event right before the loop occurs. Do you have any suggestions for what I could look for?
Hello, we are experiencing this very same issue on Filebeat 6.2.2 when the add_kubernetes_metadata processor is added on our OpenShift Kubernetes nodes. Grtz
@jimsheldon @willemdh Any chance both of you could enable debug logging and share the log output here (or link to a gist)? |
I ran into this before; that's why I thought client-go would help here with better compatibility.
I once opened an issue for this (ericchiang/k8s#68) while using Beats, but didn't get a response. When I tested #6117 with client-go, it didn't happen again.
@ruflin Well, it seems that after restarting the Filebeat agents on our 10 OpenShift nodes, the problem did not reappear. I'm monitoring closely to find a pattern in the hope of reproducing it.
@willemdh Keep us posted if it happens again. A point I only realised now is that you are running OpenShift. There were some subtle differences in the past, so this is probably also a good data point to add to the equation when we see this again.
We are facing the same issue with Filebeat 6.2.2 using the add_kubernetes_metadata processor.
@exekias No. So maybe the updated client library may help here?
That was my expectation. Would anyone suffering from this issue be up for giving a try to
@exekias I would give it a try.
I'm experiencing the same issue in a custom image built from master (5fa9be4).
Also, I'm not sure if it caused the crash, but the last logs of the previous instance of the restarted pods end with this error.
I might not have the entire context about the issue. I am guessing that this is because of an incompatible kube client, as the issue seems to be coming from proto unmarshalling. @wflyer, what version of Kubernetes are you running?
@vjsamuel I'm using Kubernetes version v1.9.2.
For the crash part that I mentioned (#6353 (comment)), it looks like this issue significantly increases the memory usage of the pod and causes the pod to be killed due to OOM.
Thank you for your patience, everyone. I've built a 6.3 snapshot:
@exekias
Seeing the same issue with the ES stack 6.2 on k8s 1.8.5, with multiple lines of the same error, causing the pods to run OOM and crash.
Seeing the same issue with Filebeat 6.2.2 on k8s 1.7.2. It also appears that when the issue occurs, the k8s API server is flooded with WATCH requests, which causes the CPU on the k8s API servers to increase to almost 100% when the load is normally < 10%. Would it be better to create a new issue rather than comment on a merged pull request?
Maybe I can build an image with the latest Beats master but using kubernetes/client-go instead of the library currently in use? I am pretty sure this is a compatibility issue in the k8s library.
@vjsamuel I've not been able to reproduce it. Generally, after restarting Filebeat, everything works as expected. It takes 3-5 days before this issue appears.
Never mind @willemdh, I was able to reproduce it. I have a PR with the fix.
Just an update on this: a fix for #6503 has been merged and should go out in both
In case of unmarshalling errors the Watcher was ending up in an infinite loop; this code ignores those errors and only restarts the watcher in case of EOF.
This was reproduced by @cdahlqvist:
Opening directly against 6.2, as master is not affected after its refactoring and client update.