Exponential backoff on failed Elasticsearch node configuration for X-pack monitoring #7966

Closed
ppf2 opened this issue Aug 14, 2018 · 7 comments

@ppf2
Member

ppf2 commented Aug 14, 2018

Currently, if multiple hosts (https://www.elastic.co/guide/en/beats/filebeat/6.3/configuration-monitor.html#_literal_hosts_literal_3) are specified for X-Pack monitoring and one host fails, the beat continues to round-robin requests to the remaining host(s). Say we have 2 hosts in the array: if one host goes down, all monitoring requests are sent to the 2nd host until the 1st host returns. During this time, we continue to perform health checks against the problem host:

2018-08-14T14:27:44.022-0700	ERROR	pipeline/output.go:74	Failed to connect: X-Pack capabilities query failed with: Get http://localhost:9201/_xpack?filter_path=features.monitoring.enabled: dial tcp [::1]:9201: getsockopt: connection refused

This occurs every 10s until the host returns. For a network with many Beats clients, a failure of one host in the monitoring hosts array can result in a lot of additional packets, proportional to the number of Beats running in the network. Given this, we may want to consider implementing an exponential backoff for these health checks against a failed host.
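
For illustration, an exponential backoff on the health-check retry interval could look roughly like the sketch below. This is not the Beats implementation; the checkXPack helper is a hypothetical stand-in for the X-Pack capabilities query, and the init/max values are arbitrary.

package example

import (
	"errors"
	"time"
)

// checkXPack stands in for the X-Pack capabilities query against a
// single monitoring host (hypothetical helper, for illustration only).
func checkXPack(host string) error {
	return errors.New("dial tcp: connection refused") // simulate a down host
}

// waitForHost retries the health check with an exponentially growing
// delay instead of a fixed 10s interval, capped at max.
func waitForHost(host string, init, max time.Duration) {
	wait := init
	for checkXPack(host) != nil {
		time.Sleep(wait)
		wait *= 2
		if wait > max {
			wait = max
		}
	}
}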

@urso

urso commented Aug 14, 2018

Checking the code, there is indeed no backoff. Did x-pack monitoring work before? The init loop is supposed to check x-pack availability every 10s. If the init loop succeeds, the beat starts sending.
Assuming a number of beats are started at about the same time, we should also introduce some random jitter, so that not all beats retry at about the same time, no matter the backoff strategy.
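
As a sketch of the jitter idea (not the libbeat implementation), the chosen backoff delay could be randomized within a window so that Beats started together spread out their retries:

package example

import (
	"math/rand"
	"time"
)

// withJitter spreads the next retry over [delay/2, delay) so that many
// Beats started at the same time do not all retry at the same instant.
func withJitter(delay time.Duration) time.Duration {
	half := delay / 2
	if half <= 0 {
		return delay
	}
	return half + time.Duration(rand.Int63n(int64(half)))
}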

@ppf2
Member Author

ppf2 commented Aug 14, 2018

Thanks. It worked before; only one node became unresponsive on the ES side, so the beat started doing health checks against the failed node. Certainly, not all the Beats will fire off their health checks at the exact same moment, but if there are enough Beats running on the network, it can still translate to noticeably more requests per second as seen by the network admins.

@ycombinator
Contributor

ycombinator commented Aug 15, 2018

@urso Digging around the beats codebase a bit, it sounds like the client implementation in beats monitoring could possibly (re)use something like this?

func WithBackoff(client NetworkClient, init, max time.Duration) NetworkClient {

Just wanted to get your thoughts before I go off and put up a PR. Thanks.
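
Based on that signature alone, wiring it in would presumably look something like this (the init/max durations below are made-up values for illustration):

// Hypothetical wiring: wrap the monitoring client so that reconnect
// attempts back off between 1s and 60s instead of retrying immediately.
client = WithBackoff(client, 1*time.Second, 60*time.Second)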

@ycombinator ycombinator self-assigned this Aug 15, 2018
@urso

urso commented Aug 15, 2018

The monitoring reporter has 2 phases. First phase: check that x-pack monitoring is available, every 10s. If the first phase succeeds, the collector phase is started. WithBackoff can be used for the collector phase.

Not sure about the init phase, though. The WithBackoff wrapper only acts on errors. Question: do we want to back off in case x-pack monitoring is disabled in ES (which does not generate an error)?

Anyways, WithBackoff reuses the backoff from libbeat; as a last resort you can use that one.

@ppf2
Member Author

ppf2 commented Aug 17, 2018

do we want to back off in case x-pack monitoring is disabled in ES (which does not generate an error)?

If monitoring is disabled on the ES node specified in the hosts array, it seems like it would make sense to skip the backoff routine for that host (until it shows up again with the monitoring feature enabled)?

@ppf2 ppf2 added bug and removed enhancement labels Aug 20, 2018
@urso urso self-assigned this Aug 21, 2018
urso pushed a commit that referenced this issue Aug 30, 2018
Cherry-pick of PR #8090 to 6.4 branch. Original message: 

Closes: #7966 

Add backoff and failover support to the Elasticsearch monitoring
reporter.
The monitoring reporter runs in 2 phases. In the first phase it checks
whether monitoring is enabled in Elasticsearch. The check runs every 30s.
If multiple hosts are configured, one host is selected at random.
Once phase 1 succeeds, phase 2 (the collection phase) is started.

Before this change, phase 2 was configured to use load balancing without
a timeout if multiple hosts are configured. With events being dropped on
error and only one document being generated every 10s, this was OK in
most cases. Still, if one output is blocked, failover to another host can
happen after waiting for a long timeout, even if no error has occurred yet.
If the failover host has errors, it might end up in a tight
reconnect loop without any backoff behavior.
With recent changes in 6.4, Beats creates many more documents, which
was not taken into account in the original design. Due to this, misbehaving
monitoring outputs are much more likely:
=> Problems with the reporter:
1. Failover was not handled correctly
2. Creating more than one event, plus potentially spurious errors, raises the need for backoff

This change configures the clients to failover mode only. Whenever the
connection to one host fails, another host is selected at random.
On failure, the reporter's output backs off exponentially. If the second client
(after failover) also fails, the backoff waiting times are doubled.
And so on.
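
To make the described behavior concrete, here is a rough sketch under assumptions (not the actual reporter code): a host is picked at random on every attempt, and the wait time doubles after each consecutive failure, regardless of which host failed. The connect helper is hypothetical.

package example

import (
	"errors"
	"math/rand"
	"time"
)

// connect stands in for establishing the monitoring connection to one
// host (hypothetical helper, for illustration only).
func connect(host string) error {
	return errors.New("dial failed")
}

// connectWithFailover picks a host at random on every attempt and
// doubles the backoff after each consecutive failure, capped at max.
func connectWithFailover(hosts []string, init, max time.Duration) string {
	wait := init
	for {
		host := hosts[rand.Intn(len(hosts))]
		if err := connect(host); err == nil {
			return host
		}
		time.Sleep(wait)
		wait *= 2
		if wait > max {
			wait = max
		}
	}
}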
@urso

urso commented Aug 30, 2018

@ppf2 fyi: PRs have been backported and merged. Fix will be available in 6.4.1 and 6.5.

@ppf2
Member Author

ppf2 commented Aug 30, 2018

Thank you sir!

leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023
…#8144)
