Exponential backoff on failed Elasticsearch node configuration for X-pack monitoring #7966
Checking the code, indeed there is no backoff. Did x-pack monitoring work before? The init loop is supposed to check x-pack availability every 10s. If the init loop succeeds, the beat starts sending.
Thanks. It worked before; only 1 node became unresponsive on the ES side, so the Beat started doing healthchecks against the failed node. Sure, not all the Beats will fire off the healthcheck at the exact same moment, but if there are enough Beats running on the network, it can still translate to noticeably more requests per second from the network admins' point of view.
@urso Digging around the beats codebase a bit, it sounds like the client implementation in beats monitoring could possibly (re)use something like this: beats/libbeat/outputs/backoff.go, line 37 at 0053aaf.
Just wanted to get your thoughts before I go off and put up a PR. Thanks.
The monitoring reporter has 2 phases. First phase: check that x-pack monitoring is available, every 10s. If the first phase succeeds, the collector phase is started. The WithBackoff wrapper can be used for the collector phase; I'm not sure about the init phase, though, because WithBackoff only acts on errors. Question: do we want to back off in case x-pack monitoring is disabled in ES (which does not generate an error)? Anyway, WithBackoff reuses the backoff helper from libbeat, so as a last resort you can use that one directly.
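For illustration, here is a minimal sketch of an init-phase loop that backs off both on connection errors and on a "monitoring disabled" response, since the latter does not produce an error. The helper names (`checkMonitoring`, `waitForMonitoring`) are hypothetical and this is not the actual libbeat API:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// checkMonitoring stands in for the real availability check: it reports
// whether x-pack monitoring is enabled on the host, or an error if the host
// cannot be reached. Hypothetical placeholder, not a beats function.
func checkMonitoring(host string) (bool, error) {
	return false, errors.New("not implemented in this sketch")
}

// waitForMonitoring polls until monitoring is available, growing the wait
// between attempts instead of probing at a fixed 10s interval.
func waitForMonitoring(host string, init, max time.Duration) {
	wait := init
	for {
		enabled, err := checkMonitoring(host)
		if err == nil && enabled {
			return // phase 1 done; the caller starts the collector phase
		}
		if err != nil {
			log.Printf("monitoring check against %s failed: %v; retrying in %v", host, err, wait)
		} else {
			log.Printf("monitoring disabled on %s; retrying in %v", host, wait)
		}
		time.Sleep(wait)
		wait *= 2 // exponential backoff, capped at max
		if wait > max {
			wait = max
		}
	}
}

func main() {
	waitForMonitoring("localhost:9200", 10*time.Second, 5*time.Minute)
}
```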
If monitoring is disabled on the ES node specified in the hosts array, it seems like it would make sense to skip the backoff routine for that host (until it shows up again with the monitoring feature enabled)?
Cherry-pick of PR #8090 to the 6.4 branch. Original message:

Closes: #7966

Add backoff and failover support to the Elasticsearch monitoring reporter.

The monitoring reporter runs in 2 phases. In the first phase it checks whether monitoring is enabled in Elasticsearch; the check runs every 30s, and if multiple hosts are configured, one host is selected at random. Once phase 1 succeeds, phase 2 (the collection phase) is started.

Before this change, phase 2 was configured to use load-balancing without a timeout if multiple hosts were configured. With events being dropped on error and only one document being generated every 10s, this was ok in most cases. Still, if one output is blocked, failover to another host can only happen after waiting for a long timeout, even if no error has occurred yet. If the failover host also has errors, the reporter might end up in a tight reconnect loop without any backoff behavior. With recent changes to 6.4, beats creates many more documents, which was not taken into account in the original design. Because of this, misbehaving monitoring outputs are much more likely.

Problems with the reporter:
1. Failover was not handled correctly.
2. Creating more than one event, and potentially spurious errors, raise the need for backoff.

This change configures the clients to failover mode only. Whenever the connection to one host fails, another host is selected at random. On failure the reporter's output backs off exponentially; if the second client (after failover) also fails, the backoff waiting times are doubled, and so on.
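A rough sketch of the failover-plus-backoff behavior described above: on every failed publish the reporter waits, doubles its wait time up to a cap, and switches to a randomly chosen host; a successful publish resets the wait. The `publish` helper and the `failoverReporter` type are hypothetical illustrations, not the shipped implementation:

```go
package main

import (
	"errors"
	"math/rand"
	"time"
)

// publish stands in for sending one monitoring document to a host.
func publish(host string, doc []byte) error {
	return errors.New("not implemented in this sketch")
}

type failoverReporter struct {
	hosts              []string
	current            int
	initial, max, wait time.Duration
}

func (r *failoverReporter) send(doc []byte) {
	for {
		if err := publish(r.hosts[r.current], doc); err == nil {
			r.wait = r.initial // success: reset the backoff window
			return
		}
		// Failure: wait, double the backoff (capped), and fail over to a
		// randomly selected host before retrying.
		time.Sleep(r.wait)
		r.wait *= 2
		if r.wait > r.max {
			r.wait = r.max
		}
		r.current = rand.Intn(len(r.hosts))
	}
}

func main() {
	r := &failoverReporter{
		hosts:   []string{"es1:9200", "es2:9200"},
		initial: time.Second,
		max:     time.Minute,
		wait:    time.Second,
	}
	r.send([]byte(`{"type":"beats_stats"}`))
}
```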
@ppf2 fyi: the PRs have been backported and merged. The fix will be available in 6.4.1 and 6.5.
Thank you sir! |
Currently, if multiple hosts (https://www.elastic.co/guide/en/beats/filebeat/6.3/configuration-monitor.html#_literal_hosts_literal_3) are specified for X-pack monitoring and 1 host fails, the Beat will continue to round-robin requests to the other host(s). Say we have 2 hosts in the array: if 1 host goes down, it will send all the monitoring requests to the 2nd host until the 1st host returns. During this time, we continue to perform healthchecks against the problem host:
This occurs every 10s until the host returns. For a network with many Beat clients, a failure of 1 host in the monitoring hosts array can result in a lot of additional packets, proportional to the number of Beats running in the network. Given this, we may want to consider implementing an exponential backoff for these healthchecks against a failed host.
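To put rough (purely illustrative) numbers on it: with 1,000 Beats each probing the failed host every 10 seconds, that is about 100 extra requests per second aimed at a node that is already in trouble; backing off exponentially to, say, a 5-minute cap would bring the steady-state probe rate down to roughly 3 requests per second.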