
Duplicate events when filebeat is killed with SIGHUP/SIGINT #2044

Closed
robin13 opened this issue Jul 15, 2016 · 5 comments

@robin13
Contributor

robin13 commented Jul 15, 2016

Please post all questions and issues on https://discuss.elastic.co/c/beats
before opening a Github Issue. Your questions will reach a wider audience there,
and if we confirm that there is a bug, then you can open a new issue.

For confirmed bugs, please report:

  • Version: filebeat 1.2.3 and 5.0.0 alpha4 with logstash 5.0.0 alpha4
  • Operating System: Ubuntu 14.04
  • Steps to Reproduce:

Killing filebeat with SIGHUP (as is done by the standard service restart: service filebeat restart) and then restarting it results in duplicate events. A SIGHUP should be considered a normal restart, and filebeat should wait for an ACK from logstash before exiting.

Attached are sample configurations for logstash and filebeat, and a test script that runs the test multiple times in sequence to reproduce the issue (a rough sketch of the procedure is shown below).

test.tar.gz

Related: #2041

Note: for the script to work you must start logstash with the -r (config reload) option. The line in the script which adds a comment to the logstash configuration causes logstash to reload the configuration and flush the lines. The Logstash file output should flush everything to file after 5 seconds, but does not seem to be honouring this now...
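
For reference, a rough sketch of the reproduction loop; the authoritative version is the script in the attached test.tar.gz, and the paths, line counts and sleep durations below are illustrative assumptions:

```sh
# Illustrative only — the real script and configs are in test.tar.gz.
rm -f /var/lib/filebeat/registry            # assumed registry path; start from a clean state
seq 1 100000 > /var/log/test.log            # write a known number of input lines
service filebeat start
sleep 5                                     # let filebeat ship part of the file
service filebeat restart                    # per the report above, this kills filebeat with SIGHUP
sleep 10                                    # give logstash time to flush its file output
# If the output line count exceeds the input line count, events were duplicated:
wc -l /var/log/test.log /tmp/logstash-output.log
```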

@robin13 robin13 added the Filebeat label Jul 15, 2016
@robin13 robin13 changed the title from "Duplicate events when filebeats is killed with SIGHUP/SIGINT" to "Duplicate events when filebeat is killed with SIGHUP/SIGINT" Jul 15, 2016
@robin13
Contributor Author

robin13 commented Jul 15, 2016

Assuming that filebeat cannot wait indefinitely for an ACK (if logstash is blocked), it should wait some "sane" amount of time (e.g. a default of 5 seconds, but configurable) to fulfil the mandate of "under normal operations, no data loss or duplication should occur".

@ruflin
Contributor

ruflin commented Jul 18, 2016

The current behaviour is somewhat expected, as the connection is "just" closed. I think the best option would be to introduce a config option like shutdown_timeout or something similar where a timeout can be configured. I would keep it set to 0 by default, as the "sane" amount can vary widely depending on how many messages are sent and whether the servers are reachable, so it is up to the user to find the right value for their case.

@tsg
Contributor

tsg commented Jul 25, 2016

We've discussed this a bit in our team meeting; here is a summary. The solution proposed in this ticket (wait for a configurable amount of time before shutting down the publisher and the registrar) would be relatively easy to implement and would help, but it has the following disadvantages:

  • The amount of time to wait that would be "enough" is hard to estimate, because we depend not only on the network RTT but also on the Logstash/Elasticsearch processing time. Logstash persistence might make the timing more reliable, but it will still be dependent on the batch size and the load of the system at the given moment.
  • If the configured value is too large, we risk the operator or the init script/systemd issuing a kill -9 on Filebeat, making the problem worse. So the default value will need to be either 0 or a very short interval (e.g. 0.5s) to only cover the most basic cases.

Because of these issues, we would generally prefer to rely on #1492 to handle duplicates via de-duplication on the Elasticsearch side.

However, in the Filebeat case, the de-duplication would only be effective if spooling is also used, because otherwise new UUIDs would be generated when the log lines are reread. Because spooling will come with an obvious cost (disk usage and IO), it will probably also be off by default in Filebeat. So, after all, it makes sense to implement both this ticket and #1492.
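
For context on why de-duplication on the Elasticsearch side works at all: indexing a document twice under the same _id results in an update rather than a second copy. The sketch below uses a content-derived hash purely as a stand-in (index and type names are made up); #1492 proposes generating an ID per event, and that ID is only stable across a restart if the event, including its ID, has been persisted — hence the spooling dependency.

```sh
# Illustration only: a repeated index request with the same _id does not create a duplicate.
line='2016-07-15 10:00:01 something happened'
id=$(printf '%s:%s:%s' /var/log/test.log 12345 "$line" | sha1sum | awk '{print $1}')
curl -XPUT "localhost:9200/filebeat-test/log/$id" -d "{\"message\":\"$line\"}"
curl -XPUT "localhost:9200/filebeat-test/log/$id" -d "{\"message\":\"$line\"}"   # re-send after a restart
curl "localhost:9200/filebeat-test/_count"   # document count stays at 1
```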

Also, we realized we need to maintain a docs page explaining all the situations in which the Beats can cause duplicates or losses, similar to the Elasticsearch resiliency page.

@ruflin ruflin mentioned this issue Sep 7, 2016
ruflin added a commit to ruflin/beats that referenced this issue Sep 7, 2016
@ruflin
Contributor

ruflin commented Sep 16, 2016

As shutdown_timeout is now in master (#2514), I'm closing this issue. We still have #1492 to track the ID generation.

@robin13 Please let us know if the shutdown_timeout works for you as expected.
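
A minimal sketch of how the option might be set in filebeat.yml, assuming the key name from #2514 (filebeat.shutdown_timeout, which defaults to 0, i.e. disabled); the right value depends on batch size and on network/Logstash latency:

```yaml
filebeat:
  shutdown_timeout: 5s        # wait up to 5s on shutdown for pending events to be ACKed
  prospectors:
    - paths:
        - /var/log/test.log
output:
  logstash:
    hosts: ["localhost:5044"]
```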

@ruflin ruflin closed this as completed Sep 16, 2016
@robin13
Contributor Author

robin13 commented Feb 13, 2017

@ruflin Thank you - the solution works nicely: with shutdown_timeout set to 5 seconds there were still some duplicates, but at 10 seconds it's all good. :)
Thank you!
