Exponential backoff on failed Elasticsearch node configuration for X-pack monitoring #7966

Closed
ppf2 opened this issue Aug 14, 2018 · 7 comments

@ppf2
Member

ppf2 commented Aug 14, 2018

Currently, if multiple hosts (https://www.elastic.co/guide/en/beats/filebeat/6.3/configuration-monitor.html#_literal_hosts_literal_3) are specified for X-Pack monitoring and one host fails, the beat continues to round-robin requests to the remaining host(s). Say we have 2 hosts in the array: if one host goes down, all monitoring requests are sent to the 2nd host until the 1st host returns. During this time, we continue to perform health checks against the problem host:

2018-08-14T14:27:44.022-0700	ERROR	pipeline/output.go:74	Failed to connect: X-Pack capabilities query failed with: Get http://localhost:9201/_xpack?filter_path=features.monitoring.enabled: dial tcp [::1]:9201: getsockopt: connection refused

This occurs every 10s until the host returns. For a network with many Beats clients, a failure of one host in the monitoring hosts array can result in a lot of additional packets, proportional to the number of Beats running in the network. Given this, we may want to consider implementing an exponential backoff for these health checks against a failed host.
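
For illustration, an exponential backoff on the health-check retry interval could look roughly like the sketch below. This is not the Beats implementation; the checkXPack helper is a hypothetical stand-in for the X-Pack capabilities query, and the init/max values are arbitrary.

package example

import (
	"errors"
	"time"
)

// checkXPack stands in for the X-Pack capabilities query against a
// single monitoring host (hypothetical helper, for illustration only).
func checkXPack(host string) error {
	return errors.New("dial tcp: connection refused") // simulate a down host
}

// waitForHost retries the health check with an exponentially growing
// delay instead of a fixed 10s interval, capped at max.
func waitForHost(host string, init, max time.Duration) {
	wait := init
	for checkXPack(host) != nil {
		time.Sleep(wait)
		wait *= 2
		if wait > max {
			wait = max
		}
	}
}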

@urso

urso commented Aug 14, 2018

Checking the code, there is indeed no backoff. Did x-pack monitoring work before? The init loop is supposed to check x-pack availability every 10s. If the init loop succeeds, the beat starts sending.
Assuming a number of beats are started at about the same time, we should also introduce some random jitter, so that not all beats retry at about the same time, no matter the backoff strategy.
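
As a sketch of the jitter idea (not the libbeat implementation), the chosen backoff delay could be randomized within a window so that Beats started together spread out their retries:

package example

import (
	"math/rand"
	"time"
)

// withJitter spreads the next retry over [delay/2, delay) so that many
// Beats started at the same time do not all retry at the same instant.
func withJitter(delay time.Duration) time.Duration {
	half := delay / 2
	if half <= 0 {
		return delay
	}
	return half + time.Duration(rand.Int63n(int64(half)))
}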

@ppf2
Member Author

ppf2 commented Aug 14, 2018

Thanks. It worked before; only one node became unresponsive on the ES side, so the beat started doing health checks against the failed node. Certainly, not all the Beats will fire off their health checks at the exact same moment, but if there are enough Beats running on the network, it can still translate to noticeably more requests per second as seen by the network admins.

@ycombinator
Contributor

ycombinator commented Aug 15, 2018

@urso Digging around the beats codebase a bit, it sounds like the client implementation in beats monitoring could possibly (re)use something like this?

func WithBackoff(client NetworkClient, init, max time.Duration) NetworkClient {

Just wanted to get your thoughts before I go off and put up a PR. Thanks.
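
Based on that signature alone, wiring it in would presumably look something like this (the init/max durations below are made-up values for illustration):

// Hypothetical wiring: wrap the monitoring client so that reconnect
// attempts back off between 1s and 60s instead of retrying immediately.
client = WithBackoff(client, 1*time.Second, 60*time.Second)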

@ycombinator ycombinator self-assigned this Aug 15, 2018
@urso

urso commented Aug 15, 2018

The monitoring reporter has 2 phases. First phase: check that x-pack monitoring is available, every 10s. If the first phase succeeds, the collector phase is started. WithBackoff can be used for the collector phase.

Not sure about the init phase, though. The WithBackoff wrapper only acts on errors. Question: do we want to back off in case x-pack monitoring is disabled in ES (which does not generate an error)?

Anyways, WithBackoff reuses the backoff from libbeat; as a last resort you can use that one.

@ppf2
Member Author

ppf2 commented Aug 17, 2018

do we want to back off in case x-pack monitoring is disabled in ES (which does not generate an error)?

If monitoring is disabled on the ES node specified in the hosts array, it seems like it would make sense to skip the backoff routine for that host (until it shows up again with the monitoring feature enabled)?

@ppf2 ppf2 added bug and removed enhancement labels Aug 20, 2018
@urso urso self-assigned this Aug 21, 2018
urso pushed a commit that referenced this issue Aug 30, 2018
Cherry-pick of PR #8090 to 6.4 branch. Original message: 

Closes: #7966 

Add backoff and failover support to the Elasticsearch monitoring
reporter.
The monitoring reporter runs in 2 phases. In the first phase it checks
whether monitoring is enabled in Elasticsearch. The check runs every 30s.
If multiple hosts are configured, one host is selected at random.
Once phase 1 succeeds, phase 2 (the collection phase) is started.

Before this change, phase 2 was configured to use load balancing without
a timeout if multiple hosts are configured. With events being dropped on
error and only one document being generated every 10s, this was OK in
most cases. Still, if one output is blocked, failover to another host can
happen after waiting for a long timeout, even if no error has occurred yet.
If the failover host has errors, it might end up in a tight
reconnect loop without any backoff behavior.
With recent changes in 6.4, Beats creates many more documents, which
was not taken into account in the original design. Due to this, misbehaving
monitoring outputs are much more likely:
=> Problems with the reporter:
1. Failover was not handled correctly
2. Creating more than one event, plus potentially spurious errors, raises the need for backoff

This change configures the clients to failover mode only. Whenever the
connection to one host fails, another host is selected at random.
On failure, the reporter's output backs off exponentially. If the second client
(after failover) also fails, the backoff waiting times are doubled.
And so on.
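
To make the described behavior concrete, here is a rough sketch under assumptions (not the actual reporter code): a host is picked at random on every attempt, and the wait time doubles after each consecutive failure, regardless of which host failed. The connect helper is hypothetical.

package example

import (
	"errors"
	"math/rand"
	"time"
)

// connect stands in for establishing the monitoring connection to one
// host (hypothetical helper, for illustration only).
func connect(host string) error {
	return errors.New("dial failed")
}

// connectWithFailover picks a host at random on every attempt and
// doubles the backoff after each consecutive failure, capped at max.
func connectWithFailover(hosts []string, init, max time.Duration) string {
	wait := init
	for {
		host := hosts[rand.Intn(len(hosts))]
		if err := connect(host); err == nil {
			return host
		}
		time.Sleep(wait)
		wait *= 2
		if wait > max {
			wait = max
		}
	}
}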
@urso

urso commented Aug 30, 2018

@ppf2 fyi: PRs have been backported and merged. Fix will be available in 6.4.1 and 6.5.

@ppf2
Member Author

ppf2 commented Aug 30, 2018

Thank you sir!

leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023
…#8144)
