Filebeat stays disconnected after logstash performance problems (even long after those are resolved) #16335

Open
Mekk opened this issue Feb 14, 2020 · 15 comments
Labels: Auditbeat, bug, Filebeat, Team:Elastic-Agent-Data-Plane

Comments

@Mekk

Mekk commented Feb 14, 2020

I am not sure whether this is sufficient for a detailed analysis, but I decided to report the problem so somebody could take a look at Filebeat's behaviour with this context in mind.

It is a somewhat strange situation, but I observed it on many machines (Filebeat 7.5.2, Logstash also 7.5.2):

  • Logstash (to which the Filebeats connect) faced severe performance problems. In my case this was caused by inefficient XML parsing ("Parsing of bigger XML-s is extremely CPU intensive (because of rexml extremely inefficient regexps)", logstash#11599), but that is only the context of this issue; if I were to reproduce this behaviour on purpose, I would probably write a busy loop in a Ruby script triggered from a filter.

  • Sooner or later Filebeat started to report timeouts (correctly so, since Logstash did not manage to handle the communication fast enough).

  • … but for some reason Filebeat remained in this state forever. Even long after Logstash was restarted and its problem resolved, the running Filebeat instance never recovered (the instance I forgot to restart was still disconnected more than 24 hours after the problem was fixed).

Restarting Filebeat helped, but there is something wrong in the fact that it did not manage to recover by itself (after all, I never faced similar problems in the normal case of a Logstash restart or temporary inaccessibility).

A picture from the logs. Here is where the problem started:

2020-02-12T19:44:34.984+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55824->10.92.23.57:5044: read: connection reset by peer
2020-02-12T19:44:34.984+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55824->10.92.23.57:5044: read: connection reset by peer
2020-02-12T19:44:35.025+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected
2020-02-12T19:44:36.909+0100    ERROR   pipeline/output.go:121  Failed to publish events: client is not connected
2020-02-12T19:44:36.909+0100    INFO    pipeline/output.go:95   Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-12T19:44:40.506+0100    ERROR   pipeline/output.go:100  Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused
2020-02-12T19:44:40.506+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 1 reconnect attempt(s)
2020-02-12T19:44:47.796+0100    ERROR   pipeline/output.go:100  Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused
2020-02-12T19:44:47.796+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 2 reconnect attempt(s)
…
2020-02-12T19:46:18.110+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 5 reconnect attempt(s)
2020-02-12T19:46:18.111+0100    INFO    pipeline/output.go:105  Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-12T19:46:50.516+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55834->10.92.23.57:5044: i/o timeout
2020-02-12T19:46:50.556+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected

This is all understandable: Logstash was hammering the CPU and likely was unable to keep up with the connections.

But then the problem was resolved early the next day and Logstash was restarted; the Filebeat instances which were restarted have worked happily since then. The Filebeat instance which was left unrestarted, however, still does not push logs and keeps logging things like this (a random snippet taken more than 24 hours after the problem was resolved):

2020-02-14T16:30:47.871+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:30:47.872+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:31:17.920+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33111->10.92.23.57:5044: i/o timeout
2020-02-14T16:31:17.967+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:31:19.305+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:31:19.306+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:31:19.307+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:31:49.350+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33112->10.92.23.57:5044: i/o timeout
2020-02-14T16:31:49.391+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:31:50.807+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:31:50.807+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:31:50.808+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:32:20.854+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33113->10.92.23.57:5044: i/o timeout
2020-02-14T16:32:20.896+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:32:22.755+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:32:22.755+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:32:22.756+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:32:52.804+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33114->10.92.23.57:5044: i/o timeout
2020-02-14T16:32:52.844+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:32:54.005+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:32:54.005+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:32:54.007+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:33:24.052+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33115->10.92.23.57:5044: i/o timeout
2020-02-14T16:33:24.095+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:33:25.708+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:33:25.708+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:33:25.710+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established

and so on, and so on, until a restart (after which everything started to work again).

This is the extreme case, but in general any Filebeat instance which started to report errors like the above had to be restarted.

To my naive eye it looks as if something got desynchronized here, as if newly established connections were closed because of stale error state, or a backlog of old errors, or something like that.

The error context may be significant because of the specific behaviour – the upstream connections were not closed by the remote side, but simply were not handled (maybe some buffers filled up, etc.).

PS: It may or may not matter that I use a few sections and that there are plenty of log files.

@Mekk changed the title from "Filebeat stays disconnected after logstash performance problems" to "Filebeat stays disconnected after logstash performance problems (even long after those are resolved)" on Feb 14, 2020
@jsoriano
Member

jsoriano commented Mar 6, 2020

Also reported with Auditbeat 7.6.0 and Logstash 7.4.2 in #16864.

@jsoriano added the Auditbeat, Filebeat, Team:Services (Deprecated), and triage_needed labels on Mar 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-services (Team:Services)

@jsoriano
Member

jsoriano commented Mar 6, 2020

This discuss topic could be related too: https://discuss.elastic.co/t/filebeat-performance-stall-sometimes/222207

@urso

urso commented Mar 6, 2020

While events are actively processed by Logstash, Logstash sends a 'heartbeat' signal to Filebeat, every 10s I think. Filebeat times out the connection if no signal was received within the last 30s (see the output.logstash.timeout setting).

After the timeout we are even seeing: Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused. Filebeat can't reconnect. Either there are network problems, or a connection limit was reached in Logstash. Maybe the workers (not sure, but I think Logstash uses a separate worker pool for beats communication) were not given CPU time by the OS.
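For reference, the settings mentioned above live under the Logstash output in filebeat.yml. A minimal sketch with purely illustrative values (not a confirmed workaround for this issue, only where the relevant knobs are):

output.logstash:
  hosts: ["logstash.test.local:5044"]
  # Drop the connection if no response is seen within this window (default 30s).
  timeout: 30s
  # Back-off between reconnect attempts after a failure.
  backoff.init: 1s
  backoff.max: 60s
  # ttl forces periodic reconnects; the docs note it is not honoured by the async
  # client, so pipelining has to be disabled for it to take effect.
  ttl: 5m
  pipelining: 0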

@M0rdecay

M0rdecay commented Aug 3, 2020

Same problem with Filebeat and Logstash 7.7.1.
This is serious trouble if there is periodic maintenance work on the network that makes Logstash unavailable for a while.

@danlsgiga

Just hit this very same problem here... this is a silent failure that caused us to miss days of logs, because Filebeat neither failed hard (which would have forced the initd/systemd subsystem to restart it) nor logged or successfully retried the connection.

We do have the output.logstash.timeout setting, but it seems to have no effect in this specific scenario.

filebeat v7.5.2

@seadub

seadub commented May 6, 2021

Bump – this impacts Filebeat 6.8 as well.

@whyayala

whyayala commented Nov 1, 2021

Also seeing this in Filebeat 7.15, deployed as a task in AWS ECS with an image from the ECR Public Gallery.

https://gallery.ecr.aws/elastic/filebeat

ERROR [logstash] logstash/async.go:280 Failed to publish events caused by: client is not connected

@barionoperators

Hi,
We have the same problem with Filebeat and Winlogbeat 7.16.3.
We tried every combination of firewall rules (port 5044 to and from the Logstash server, a rule with all ports open).
Creating these FW rules seemed to resolve the problem, so we started to narrow the FW rules down.
Now we have a situation where some clients (VLAN1) work perfectly without any additional FW rule (only 5044 allowed from the client to the server), and we have clients (VLAN2) which won't work after a Logstash restart, although all ports are open.
This is a very blocking issue for us.
Can someone help?

@jlind23 added the Team:Elastic-Agent-Data-Plane label and removed the Team:Services (Deprecated) label on Mar 31, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@Carmelly212

Hi, same issue here: ERROR logstash/async.go:256 Failed to publish events caused by: write tcp 10.2....
I am trying to run Filebeat as a sidecar in a Helm chart.
I get the error whenever there is a JSON parse error, but other than that, logs are not showing up in Elasticsearch. I can't tell whether they reach Logstash.

@sebglon

sebglon commented Mar 8, 2023

Hi,
We have the same issue; I describe it here: https://discuss.elastic.co/t/filebeat-connection-error-with-logstash/327208

@botelastic

botelastic bot commented Mar 7, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:
Thank you for your contribution!

@botelastic (bot) added the Stalled label on Mar 7, 2024
@henrikno
Contributor

👍

@botelastic (bot) removed the Stalled label on Jul 11, 2024
@Mekk
Author

Mekk commented Jul 12, 2024

A reminder of the test scenario:

  • Provoke a situation where Logstash is CPU-starved (in my case it was doing something very CPU-intensive during log processing – in the real case it was the XML parser mishap, but anything would likely do; just write a Ruby filter which does some calculations for 30 seconds for ⅓ of the entries, see the sketch after this list).

  • Wait until Filebeat reports connection problems (while Logstash is up and doing that slow, CPU-intensive processing).

  • "Fix" the problem (remove the CPU-intensive filter) and restart Logstash.

  • Observe whether Filebeat manages to reconnect.
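For the first step, this is roughly what I mean – a hypothetical Logstash pipeline snippet purely as a reproduction aid (the beats input on 5044 and the stdout output are assumptions; the busy loop just stands in for the expensive rexml parsing):

input {
  beats {
    port => 5044
  }
}

filter {
  ruby {
    # Burn roughly 30 seconds of CPU for about one event in three,
    # starving the pipeline in the same way the XML parsing did.
    code => "
      if rand(3) == 0
        started = Time.now
        counter = 0
        counter += 1 while Time.now - started < 30
      end
    "
  }
}

output {
  stdout { codec => dots }
}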
