Filebeat stays disconnected after logstash performance problems (even long after those are resolved) #16335
Comments
Also reported with Auditbeat 7.6.0 and Logstash 7.4.2 in #16864.
Pinging @elastic/integrations-services (Team:Services)
This discuss topic could be related too: https://discuss.elastic.co/t/filebeat-performance-stall-sometimes/222207
While events are actively being processed by Logstash, Logstash sends a 'heartbeat' signal to Filebeat every 10s, I think. Filebeat times out the connection if no signal has been received for the last 30s (see setting: …). After the timeout we're even seeing:
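The timeout described above maps onto settings in Filebeat's Logstash output. A minimal sketch of a `filebeat.yml` excerpt, assuming the documented `timeout` and `ttl` options of `output.logstash` (the host name is a placeholder, and defaults may vary between versions):

```yaml
# Hypothetical excerpt from filebeat.yml; not from the original report.
output.logstash:
  hosts: ["logstash.example.com:5044"]
  timeout: 30s   # network read/write timeout; the connection is considered dead after this
  ttl: 120s      # optionally re-establish the connection periodically (disabled by default)
```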
Same problem with Filebeat and Logstash 7.7.1.
Just hit this very same problem here... this is a silent failure that caused us to miss days of logs, because filebeat neither failed hard (which would have forced the init.d/systemd subsystem to restart it), nor logged or retried the connection. We do have the filebeat
bump, impacts filebeat 6.8 as well.
Also seeing this in filebeat 7.15 deployed as a Task in AWS ECS with an image from the ECR public gallery: https://gallery.ecr.aws/elastic/filebeat
Hi,
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Hi, same issue here
hi,
Hi! We're labeling this issue as
👍
Reminder of the test scenario:
I am not sure whether this is sufficient for a detailed analysis, but I decided to report the problem so somebody could take a look at Filebeat's behaviour with this context in mind.
A somewhat strange situation, but I observed it on many machines (Filebeat 7.5.2, Logstash also 7.5.2):

- Logstash (to which the Filebeats connect) faced big performance problems. It was due to inefficient XML parsing, see "Parsing of bigger XMLs is extremely CPU intensive (because of rexml's extremely inefficient regexps)", logstash#11599, but that is only the context of this issue; if I were to reproduce this behaviour on purpose, I'd probably write some busy loop in a Ruby script triggered from a filter.
- Sooner or later Filebeat started to report timeouts (correctly so: Logstash didn't manage to handle the communication fast enough)…
- … but for some reason Filebeat remained in this state forever. Even long after Logstash was restarted and the problem it faced was resolved, the running Filebeat instance never recovered (the instance I forgot to restart was still disconnected more than 24 hours after the problem was resolved).

Restarting Filebeat helped, but there is something wrong in the fact that it didn't manage to recover by itself (after all, in the normal case of a Logstash restart or temporary inaccessibility I never faced similar problems).
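The busy-loop reproduction idea mentioned above could be sketched as a Logstash pipeline fragment using the ruby filter's `code` option. This is a hypothetical reproduction aid, not part of the original report:

```
filter {
  ruby {
    # Deliberately burn CPU for ~2 seconds per event to starve the
    # pipeline, simulating the pathological XML parsing described above.
    code => "
      deadline = Time.now + 2
      x = 0
      x += 1 while Time.now < deadline
    "
  }
}
```

With enough event volume, this should drive Logstash into the same "too busy to service beats connections" state as the rexml problem did.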
A picture of the logs. Here is where the problem started:
This is all OK: Logstash was hammering the CPU and was likely unable to keep up with the connections.
But then the problem was resolved early the next day: Logstash was restarted, and those Filebeats which were restarted have worked happily since. But the Filebeat instance which remained unrestarted still doesn't push logs, and logs things like this (a random snippet taken more than 24 hours after the problem was resolved):
and so on, and so on, and so on, until restart (after which everything started to work OK).
This is the extreme case, but in general any Filebeat instance which started to report errors like the above had to be restarted.
To my naive eye it looks as if something got desynchronized here, as if newly established connections were closed because of stale error state, a backlog of old errors, or something like that.
The error context may be significant because of the specific behaviour: the upstream connections were not closed by the remote side, they simply were not handled (maybe some buffers filled up, etc.).
PS It may or may not matter that I use a few sections and that there are plenty of log files.