
Duplicate events when filebeat is killed with SIGHUP/SIGINT #2044

Closed
robin13 opened this issue Jul 15, 2016 · 5 comments

@robin13
Contributor

robin13 commented Jul 15, 2016

Please post all questions and issues on https://discuss.elastic.co/c/beats
before opening a Github Issue. Your questions will reach a wider audience there,
and if we confirm that there is a bug, then you can open a new issue.

For confirmed bugs, please report:

  • Version: filebeat 1.2.3 and 5.0.0 alpha4 with logstash 5.0.0 alpha4
  • Operating System: Ubuntu 14.04
  • Steps to Reproduce:

Killing filebeat with SIGHUP (as is done by the standard service restart: service filebeat restart) and then restarting it results in duplicate events. A SIGHUP should be considered a normal restart, and filebeat should wait for an ACK from logstash before exiting.

Attached are sample configurations for logstash and filebeat, and a test script that runs the test multiple times in sequence to reproduce the issue (a rough sketch of the procedure is shown below).

test.tar.gz

Related: #2041

Note: for the script to work you must start logstash with the -r (config reload) option. The line in the script which adds a comment to the logstash configuration causes logstash to reload the configuration and flush the lines. The Logstash file output should flush everything to file after 5 seconds, but does not seem to be honouring this now...
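
For reference, a rough sketch of the reproduction loop; the authoritative version is the script in the attached test.tar.gz, and the paths, line counts and sleep durations below are illustrative assumptions:

```sh
# Illustrative only — the real script and configs are in test.tar.gz.
rm -f /var/lib/filebeat/registry            # assumed registry path; start from a clean state
seq 1 100000 > /var/log/test.log            # write a known number of input lines
service filebeat start
sleep 5                                     # let filebeat ship part of the file
service filebeat restart                    # per the report above, this kills filebeat with SIGHUP
sleep 10                                    # give logstash time to flush its file output
# If the output line count exceeds the input line count, events were duplicated:
wc -l /var/log/test.log /tmp/logstash-output.log
```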

@robin13 robin13 added the Filebeat label Jul 15, 2016
@robin13 robin13 changed the title from "Duplicate events when filebeats is killed with SIGHUP/SIGINT" to "Duplicate events when filebeat is killed with SIGHUP/SIGINT" Jul 15, 2016
@robin13
Contributor Author

robin13 commented Jul 15, 2016

Assuming that filebeat cannot wait indefinitely for an ACK (if logstash is blocked), it should wait some "sane" amount of time (e.g. a default of 5 seconds, but configurable) to fulfil the mandate of "under normal operations, no data loss or duplication should occur".

@ruflin
Contributor

ruflin commented Jul 18, 2016

The current behaviour is somewhat expected, as the connection is "just" closed. I think the best option would be to introduce a config option like shutdown_timeout or something similar where a timeout can be configured. I would keep it set to 0 by default, as the "sane" amount can vary widely depending on how many messages are sent and whether the servers are reachable, so it is up to the user to find the right value for their case.

@tsg
Contributor

tsg commented Jul 25, 2016

We've discussed this a bit in our team meeting; here is a summary. The solution proposed in this ticket (wait for a configurable amount of time before shutting down the publisher and the registrar) would be relatively easy to implement and would help, but it has the following disadvantages:

  • The amount of time to wait that would be "enough" is hard to estimate, because we depend not only on the network RTT but also on the Logstash/Elasticsearch processing time. Logstash persistence might make the timing more reliable, but it will still be dependent on the batch size and the load of the system at the given moment.
  • If the configured value is too large, we risk the operator or the init script/systemd issuing a kill -9 on Filebeat, making the problem worse. So the default value will need to be either 0 or a very short interval (e.g. 0.5s) to only cover the most basic cases.

Because of these issues, we would generally prefer to rely on #1492 to handle duplicates via de-duplication on the Elasticsearch side.

However, in the Filebeat case, the de-duplication would only be effective if spooling is also used, because otherwise new UUIDs would be generated when the log lines are reread. Because spooling will come with an obvious cost (disk usage and IO), it will probably also be off by default in Filebeat. So, after all, it makes sense to implement both this ticket and #1492.
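
For context on why de-duplication on the Elasticsearch side works at all: indexing a document twice under the same _id results in an update rather than a second copy. The sketch below uses a content-derived hash purely as a stand-in (index and type names are made up); #1492 proposes generating an ID per event, and that ID is only stable across a restart if the event, including its ID, has been persisted — hence the spooling dependency.

```sh
# Illustration only: a repeated index request with the same _id does not create a duplicate.
line='2016-07-15 10:00:01 something happened'
id=$(printf '%s:%s:%s' /var/log/test.log 12345 "$line" | sha1sum | awk '{print $1}')
curl -XPUT "localhost:9200/filebeat-test/log/$id" -d "{\"message\":\"$line\"}"
curl -XPUT "localhost:9200/filebeat-test/log/$id" -d "{\"message\":\"$line\"}"   # re-send after a restart
curl "localhost:9200/filebeat-test/_count"   # document count stays at 1
```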

Also, we realized we need to maintain a docs page explaining all the situations in which the Beats can cause duplicates or losses, similar to the Elasticsearch resiliency page.

@ruflin ruflin mentioned this issue Sep 7, 2016
ruflin added a commit to ruflin/beats that referenced this issue Sep 7, 2016
@ruflin
Contributor

ruflin commented Sep 16, 2016

As shutdown_timeout is now in master (#2514), I'm closing this issue. We still have #1492 to track the ID generation.

@robin13 Please let us know if the shutdown_timeout works for you as expected.
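
A minimal sketch of how the option might be set in filebeat.yml, assuming the key name from #2514 (filebeat.shutdown_timeout, which defaults to 0, i.e. disabled); the right value depends on batch size and on network/Logstash latency:

```yaml
filebeat:
  shutdown_timeout: 5s        # wait up to 5s on shutdown for pending events to be ACKed
  prospectors:
    - paths:
        - /var/log/test.log
output:
  logstash:
    hosts: ["localhost:5044"]
```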

@ruflin ruflin closed this as completed Sep 16, 2016
@robin13
Contributor Author

robin13 commented Feb 13, 2017

@ruflin Thank you - the solution works nicely: with shutdown_timeout set to 5 seconds there were still some duplicates, but at 10 seconds it's all good. :)
Thank you!
