Filebeat stops sending logs when load balancing is enabled #1829

Closed
tokle opened this issue Jun 10, 2016 · 8 comments

tokle commented Jun 10, 2016

I have a basic elastic setup where logs are harvested and published to logstash/elasticsearch using filebeat.
On several occasions I have observed that no logs are sent even though everything looks OK on the elasticsearch and logstash ends.
The log pipeline only returns to a good state after restarting the filebeat process (restarting logstash or elasticsearch has no effect).
We have seen this issue in correlation with a network issue on, or a reboot of, the elasticsearch nodes (logstash logs "Beats input: the pipeline is blocked, temporary refusing new connection.").

After some experimenting I have been able to reproduce the issue:

  • Configure filebeat with the config below.
    • 3 important settings that seem to trigger this (I have not been able to reproduce the issue without them):
      • publish_async: true
      • loadbalance: true (logstash output)
      • 2 or more logstash hosts
  • Filebeat is connected and publishing logs successfully to logstash
  • Elasticsearch goes down (by stopping the service)
  • Logstash starts logging "Beats input: the pipeline is blocked, temporary refusing new connection"
  • After some time elasticsearch is available again. In my test I set the downtime to 2 minutes.
  • Logstash stops logging that the pipeline is blocked, but filebeat does not resume publishing logs.

This seems to be related to, or may even be the same issue as, #878.

I have tested this with filebeat versions 1.2.3 and 5.0.0-alpha3. The output below is from filebeat 1.2.3.

The filebeat config:

filebeat:
  prospectors:
    - paths:
        - /var/log/nginx/access.log
      exclude_files: [".gz$"]
      document_type: nginxproxyaccess
      fields_under_root: true
      close_older: 20m
      force_close_files: true
  publish_async: true
  registry_file: /var/lib/filebeat/registry
output:
  logstash:
    hosts: ["logstash1:5044","logstash2:5044"]
    timeout: 15
    compression_level: 3
    loadbalance: true

Logstash logs:

{:timestamp=>"2016-06-10T06:07:05.362000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2016-06-10T06:07:05.364000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-06-10T06:07:05.364000+0000", :message=>"CircuitBreaker::Open", :name=>"Beats input", :level=>:warn}
{:timestamp=>"2016-06-10T06:07:05.364000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::OpenBreaker, :level=>:warn}
{:timestamp=>"2016-06-10T06:07:05.864000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-06-10T06:07:06.365000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-06-10T06:07:06.865000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}

Filebeat logs:

2016-06-10T06:08:25Z DBG  Start next scan
2016-06-10T06:08:25Z DBG  scan path /var/log/nginx/access.log
2016-06-10T06:08:25Z DBG  Check file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:25Z DBG  Same file as before found. Fetch the state.
2016-06-10T06:08:25Z DBG  Update existing file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:25Z DBG  Not harvesting, file didn't change: /var/log/nginx/access.log
2016-06-10T06:08:26Z DBG  0 events out of 2048 events sent to logstash. Continue sending ...
2016-06-10T06:08:26Z DBG  Try to publish 2048 events to logstash with window size 8
2016-06-10T06:08:30Z DBG  0 events out of 2048 events sent to logstash. Continue sending ...
2016-06-10T06:08:30Z DBG  Try to publish 2048 events to logstash with window size 8
2016-06-10T06:08:35Z DBG  Start next scan
2016-06-10T06:08:35Z DBG  scan path /var/log/nginx/access.log
2016-06-10T06:08:35Z DBG  Check file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:35Z DBG  Same file as before found. Fetch the state.
2016-06-10T06:08:35Z DBG  Update existing file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:35Z DBG  Not harvesting, file didn't change: /var/log/nginx/access.log
2016-06-10T06:08:41Z DBG  close connection
2016-06-10T06:08:41Z DBG  0 events out of 2048 events sent to logstash. Continue sending ...
2016-06-10T06:08:41Z INFO Error publishing events (retrying): read tcp fbeat:56583->logstash1:5044: i/o timeout
2016-06-10T06:08:45Z DBG  close connection
2016-06-10T06:08:45Z DBG  0 events out of 2048 events sent to logstash. Continue sending ...
2016-06-10T06:08:45Z INFO Error publishing events (retrying): read tcp fbeat:34991->logstash2:5044: i/o timeout
2016-06-10T06:08:45Z DBG  Start next scan
2016-06-10T06:08:45Z DBG  scan path /var/log/nginx/access.log
2016-06-10T06:08:45Z DBG  Check file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:45Z DBG  Same file as before found. Fetch the state.
2016-06-10T06:08:45Z DBG  Update existing file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:45Z DBG  Not harvesting, file didn't change: /var/log/nginx/access.log
2016-06-10T06:08:55Z DBG  Start next scan
2016-06-10T06:08:55Z DBG  scan path /var/log/nginx/access.log
2016-06-10T06:08:55Z DBG  Check file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:55Z DBG  Same file as before found. Fetch the state.
2016-06-10T06:08:55Z DBG  Update existing file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:55Z DBG  Not harvesting, file didn't change: /var/log/nginx/access.log

mpolden commented Jun 10, 2016

After some debugging, we have discovered what seems to be a deadlock in load-balancer mode. This occurs when Logstash is rejecting connections due to some error condition, e.g. Elasticsearch being down, as described in the first comment.

At balance.go#L167 each worker reads retry attempts from m.retries and at balance.go#L261 forwardEvent puts a message on ch (m.retries).

When m.retries reaches its maximum capacity (4 in our test case, i.e. 2 * number of clients), forwardEvent blocks because ch is full (cap(ch) = 4, len(ch) = 4 at balance.go#L261). All nodes need to be down until m.retries is filled for this to occur.

At this point everything is blocked: each worker is blocked at balance.go#L170, waiting for onMessage to complete (which it never does, because forwardEvent is blocking).

I noticed that the first select in forwardEvent (balance.go#L260) does not have a timeout case, unlike the one below it. Adding a timeout to the first select fixes the deadlock, but I'm not 100% sure about the implications of doing so.
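
To make the failure mode concrete, here is a minimal, self-contained Go sketch of the blocking pattern described above. It is not the actual libbeat code: the names work, retries, onMessage and forwardEvent are simplified stand-ins, the channel sizes only mirror our test case, and it assumes every publish attempt fails because all hosts are down. Running it ends with Go's "all goroutines are asleep - deadlock!" error.

// deadlock_sketch.go: a simplified, hypothetical model of the blocking
// pattern described above; not the libbeat implementation.
package main

import "fmt"

type batch struct{ id int }

func main() {
	const workers = 2
	work := make(chan batch)               // fresh batches from the publisher
	retries := make(chan batch, 2*workers) // bounded retry queue, cap = 4

	// Earlier failed batches have already filled the retry queue because
	// the workers kept picking up fresh work instead of draining it.
	for i := 0; i < cap(retries); i++ {
		retries <- batch{id: i}
	}

	// forwardEvent re-queues a failed batch; the send has no timeout case,
	// so once the retry queue is full it blocks forever.
	forwardEvent := func(b batch) { retries <- b }

	// onMessage tries to publish a batch; with every host refusing
	// connections, the publish always "fails" and the batch is re-queued.
	onMessage := func(b batch) { forwardEvent(b) }

	for w := 0; w < workers; w++ {
		go func() {
			for b := range work { // worker picked the work channel
				onMessage(b) // never returns: forwardEvent is stuck
			}
		}()
	}

	// Hand one fresh batch to each worker; both block inside forwardEvent,
	// nobody drains retries, and the pipeline is stuck for good.
	work <- batch{id: 100}
	work <- batch{id: 101}
	work <- batch{id: 102} // no worker left: the runtime reports a deadlock
	fmt.Println("never reached")
}

Once both workers are stuck inside forwardEvent, a full retry queue can never drain, which matches the observation that only a filebeat restart recovers the pipeline.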


ruflin commented Jun 10, 2016

I didn't check it in detail yet, but that sounds like it could be related to one of these?


mpolden commented Jun 10, 2016

@ruflin Not sure, I can try reproducing with master.


mpolden commented Jun 10, 2016

@ruflin We can reproduce the same issue on master (c8d2ccb).


ruflin commented Jun 10, 2016

@martinp Thanks a lot for testing. @urso Can you have a look here?


urso commented Jun 10, 2016

The first select has no timeout on purpose, as events/messages are not allowed to be dropped. It is used if the Guaranteed flag is set or max_retries < 0.

One potential problem I see with the code (the receive method) is that beats do not enforce any order of queue reads: the select statement chooses the channel to try/read from at random. A better solution for the load balancer would be to not read any events from the work channel as long as there are any messages in the retry channel (the retry channel is sized to 2 * the number of workers). This should help break the deadlock.
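
As an illustration of that ordering (a hypothetical sketch, not the actual libbeat change), a worker can give the retry channel strict priority by first doing a non-blocking receive on it and only falling back to a select over both channels when no retry is pending:

// priority_sketch.go: one possible way to express "retries before new work".
// Illustration only; the names work/retries and the worker function are
// simplified stand-ins, not the beats implementation.
package main

import "fmt"

type batch struct{ id int }

func worker(work, retries <-chan batch, out chan<- batch) {
	for {
		// Step 1: drain any pending retries without blocking.
		select {
		case b := <-retries:
			out <- b
			continue
		default:
		}
		// Step 2: only when no retry is pending, wait on either channel.
		select {
		case b := <-retries:
			out <- b
		case b := <-work:
			out <- b
		}
	}
}

func main() {
	work := make(chan batch, 4)
	retries := make(chan batch, 4)
	out := make(chan batch)

	// Two pending retries and two fresh batches.
	retries <- batch{id: 1}
	retries <- batch{id: 2}
	work <- batch{id: 3}
	work <- batch{id: 4}

	go worker(work, retries, out)

	// Retries 1 and 2 are always handled before fresh batches 3 and 4.
	for i := 0; i < 4; i++ {
		fmt.Println("processed batch", (<-out).id)
	}
}

Draining retries before accepting fresh work keeps the bounded retry channel from sitting full while every worker is busy with new batches, which is why this ordering should help break the deadlock.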

I will try to reproduce the issue with a unit test.

Thanks for the very detailed debugging information. It helps a ton in identifying these kinds of issues.

The current workaround in filebeat is to disable publish_async and set filebeat.spool_size to workers * logstash endpoints * bulk_max_size (for example, with one worker per host, the two logstash endpoints configured above and a bulk_max_size of 2048, that is a spool_size of 4096). This config will get you at least some load balancing, with all logstash instances operating in lock-step.


mpolden commented Jun 10, 2016

@urso Thanks, we'll try the workaround and see how it goes.


tokle commented Jun 10, 2016

Thanks for the suggested workaround, @urso. I've tried it in our environment and it looks acceptable, both in terms of throughput and in that reconnects now seem to work properly.
