Filebeat stays disconnected after logstash performance problems (even long after those are resolved) #16335

Open
Mekk opened this issue Feb 14, 2020 · 15 comments
Labels: Auditbeat, bug, Filebeat, Team:Elastic-Agent-Data-Plane

Comments

@Mekk

Mekk commented Feb 14, 2020

I am not sure whether this is sufficient for a detailed analysis, but I decided to report the problem so somebody could take a look at Filebeat's behaviour with this context in mind.

It is a somewhat strange situation, but I observed it on many machines (Filebeat 7.5.2, Logstash also 7.5.2):

  • Logstash (to which the Filebeats connect) faced severe performance problems. In my case this was caused by inefficient XML parsing ("Parsing of bigger XML-s is extremely CPU intensive (because of rexml extremely inefficient regexps)", logstash#11599), but that is only the context of this issue; if I were to reproduce this behaviour on purpose, I would probably write a busy loop in a Ruby script triggered from a filter.

  • Sooner or later Filebeat started to report timeouts (correctly so, since Logstash did not manage to handle the communication fast enough).

  • … but for some reason Filebeat remained in this state forever. Even long after Logstash was restarted and its problem resolved, the running Filebeat instance never recovered (the instance I forgot to restart was still disconnected more than 24 hours after the problem was fixed).

Restarting Filebeat helped, but there is something wrong in the fact that it did not manage to recover by itself (after all, I never faced similar problems in the normal case of a Logstash restart or temporary inaccessibility).

A picture from the logs. Here is where the problem started:

2020-02-12T19:44:34.984+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55824->10.92.23.57:5044: read: connection reset by peer
2020-02-12T19:44:34.984+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55824->10.92.23.57:5044: read: connection reset by peer
2020-02-12T19:44:35.025+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected
2020-02-12T19:44:36.909+0100    ERROR   pipeline/output.go:121  Failed to publish events: client is not connected
2020-02-12T19:44:36.909+0100    INFO    pipeline/output.go:95   Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-12T19:44:40.506+0100    ERROR   pipeline/output.go:100  Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused
2020-02-12T19:44:40.506+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 1 reconnect attempt(s)
2020-02-12T19:44:47.796+0100    ERROR   pipeline/output.go:100  Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused
2020-02-12T19:44:47.796+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 2 reconnect attempt(s)
…
2020-02-12T19:46:18.110+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 5 reconnect attempt(s)
2020-02-12T19:46:18.111+0100    INFO    pipeline/output.go:105  Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-12T19:46:50.516+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55834->10.92.23.57:5044: i/o timeout
2020-02-12T19:46:50.556+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected

This is all understandable: Logstash was hammering the CPU and likely was unable to keep up with the connections.

But then the problem was resolved early the next day and Logstash was restarted; the Filebeat instances which were restarted have worked happily since then. The Filebeat instance which was left unrestarted, however, still does not push logs and keeps logging things like this (a random snippet taken more than 24 hours after the problem was resolved):

2020-02-14T16:30:47.871+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:30:47.872+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:31:17.920+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33111->10.92.23.57:5044: i/o timeout
2020-02-14T16:31:17.967+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:31:19.305+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:31:19.306+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:31:19.307+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:31:49.350+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33112->10.92.23.57:5044: i/o timeout
2020-02-14T16:31:49.391+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:31:50.807+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:31:50.807+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:31:50.808+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:32:20.854+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33113->10.92.23.57:5044: i/o timeout
2020-02-14T16:32:20.896+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:32:22.755+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:32:22.755+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:32:22.756+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:32:52.804+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33114->10.92.23.57:5044: i/o timeout
2020-02-14T16:32:52.844+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:32:54.005+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:32:54.005+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:32:54.007+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:33:24.052+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33115->10.92.23.57:5044: i/o timeout
2020-02-14T16:33:24.095+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:33:25.708+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:33:25.708+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:33:25.710+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established

and so on, and so on, until a restart (after which everything started to work again).

This is the extreme case, but in general any Filebeat instance which started to report errors like the above had to be restarted.

To my naive eye it looks as if something got desynchronized here, as if newly established connections were closed because of stale error state, or a backlog of old errors, or something like that.

The error context may be significant because of the specific behaviour – the upstream connections were not closed by the remote side, but simply were not handled (maybe some buffers filled up, etc.).

PS: It may or may not matter that I use a few sections and that there are plenty of log files.

@Mekk changed the title from "Filebeat stays disconnected after logstash performance problems" to "Filebeat stays disconnected after logstash performance problems (even long after those are resolved)" on Feb 14, 2020
@jsoriano
Member

jsoriano commented Mar 6, 2020

Also reported with Auditbeat 7.6.0 and Logstash 7.4.2 in #16864.

@jsoriano added the Auditbeat, Filebeat, Team:Services (Deprecated), and triage_needed labels on Mar 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-services (Team:Services)

@jsoriano
Member

jsoriano commented Mar 6, 2020

This discuss topic could be related too: https://discuss.elastic.co/t/filebeat-performance-stall-sometimes/222207

@urso

urso commented Mar 6, 2020

While events are actively processed by Logstash, Logstash sends a 'heartbeat' signal to Filebeat, every 10s I think. Filebeat times out the connection if no signal was received within the last 30s (see the output.logstash.timeout setting).

After the timeout we are even seeing: Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused. Filebeat can't reconnect. Either there are network problems, or a connection limit was reached in Logstash. Maybe the workers (not sure, but I think Logstash uses a separate worker pool for beats communication) were not given CPU time by the OS.
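For reference, the settings mentioned above live under the Logstash output in filebeat.yml. A minimal sketch with purely illustrative values (not a confirmed workaround for this issue, only where the relevant knobs are):

output.logstash:
  hosts: ["logstash.test.local:5044"]
  # Drop the connection if no response is seen within this window (default 30s).
  timeout: 30s
  # Back-off between reconnect attempts after a failure.
  backoff.init: 1s
  backoff.max: 60s
  # ttl forces periodic reconnects; the docs note it is not honoured by the async
  # client, so pipelining has to be disabled for it to take effect.
  ttl: 5m
  pipelining: 0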

@M0rdecay

M0rdecay commented Aug 3, 2020

Same problem with Filebeat and Logstash 7.7.1.
This is serious trouble if there is periodic maintenance work on the network that makes Logstash unavailable for a while.

@danlsgiga

Just hit this very same problem here... this is a silent failure that caused us to miss days of logs, because Filebeat neither failed hard (which would have forced the initd/systemd subsystem to restart it) nor logged or successfully retried the connection.

We do have the output.logstash.timeout setting, but it seems to have no effect in this specific scenario.

filebeat v7.5.2

@seadub

seadub commented May 6, 2021

Bump – this impacts Filebeat 6.8 as well.

@whyayala

whyayala commented Nov 1, 2021

Also seeing this in Filebeat 7.15, deployed as a task in AWS ECS with an image from the ECR Public Gallery.

https://gallery.ecr.aws/elastic/filebeat

ERROR [logstash] logstash/async.go:280 Failed to publish events caused by: client is not connected

@barionoperators

Hi,
We have the same problem with Filebeat and Winlogbeat 7.16.3.
We tried every combination of firewall rules (port 5044 to and from the Logstash server, a rule with all ports open).
Creating these FW rules seemed to resolve the problem, so we started to narrow the FW rules down.
Now we have a situation where some clients (VLAN1) work perfectly without any additional FW rule (only 5044 allowed from the client to the server), and we have clients (VLAN2) which won't work after a Logstash restart, although all ports are open.
This is a very blocking issue for us.
Can someone help?

@jlind23 added the Team:Elastic-Agent-Data-Plane label and removed the Team:Services (Deprecated) label on Mar 31, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@Carmelly212

Hi, same issue here: ERROR logstash/async.go:256 Failed to publish events caused by: write tcp 10.2....
I am trying to run Filebeat as a sidecar in a Helm chart.
I get the error whenever there is a JSON parse error, but other than that, logs are not showing up in Elasticsearch. I can't tell whether they reach Logstash.

@sebglon

sebglon commented Mar 8, 2023

Hi,
We have the same issue; I describe it here: https://discuss.elastic.co/t/filebeat-connection-error-with-logstash/327208

@botelastic

botelastic bot commented Mar 7, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:
Thank you for your contribution!

@botelastic (bot) added the Stalled label on Mar 7, 2024
@henrikno
Contributor

👍

@botelastic (bot) removed the Stalled label on Jul 11, 2024
@Mekk
Author

Mekk commented Jul 12, 2024

A reminder of the test scenario:

  • Provoke a situation where Logstash is CPU-starved (in my case it was doing something very CPU-intensive during log processing – in the real case it was the XML parser mishap, but anything would likely do; just write a Ruby filter which does some calculations for 30 seconds for ⅓ of the entries, see the sketch after this list).

  • Wait until Filebeat reports connection problems (while Logstash is up and doing that slow, CPU-intensive processing).

  • "Fix" the problem (remove the CPU-intensive filter) and restart Logstash.

  • Observe whether Filebeat manages to reconnect.
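For the first step, this is roughly what I mean – a hypothetical Logstash pipeline snippet purely as a reproduction aid (the beats input on 5044 and the stdout output are assumptions; the busy loop just stands in for the expensive rexml parsing):

input {
  beats {
    port => 5044
  }
}

filter {
  ruby {
    # Burn roughly 30 seconds of CPU for about one event in three,
    # starving the pipeline in the same way the XML parsing did.
    code => "
      if rand(3) == 0
        started = Time.now
        counter = 0
        counter += 1 while Time.now - started < 30
      end
    "
  }
}

output {
  stdout { codec => dots }
}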
