Cherry-pick #8090 to 6.4: Improve monitoring reporter #8144

urso · 2018-08-29T18:34:42Z

Cherry-pick of PR #8090 to 6.4 branch. Original message:

Closes: #7966

Add backoff and failover support to the Elasticsearch monitoring
reporter.

The monitoring reporter runs in two phases. In the first phase it checks
whether monitoring is enabled in Elasticsearch. The check runs every 30s.
If multiple hosts are configured, one host is selected at random.
Once phase 1 succeeds, phase 2 (the collection phase) starts.

Before this change, phase 2 was configured to use load balancing without
a timeout if multiple hosts are configured. With events being dropped on
error and only one document being generated every 10s, this was OK in
most cases. Still, if one output is blocked waiting for a long timeout,
failover to another host can happen even if no error has occurred yet.
If the failover host has errors, it might end up in a tight
reconnect loop without any backoff behavior.

With recent changes in 6.4, Beats creates many more documents, which
was not taken into account in the original design. Because of this,
misbehaving monitoring outputs are much more likely.

Problems with the reporter:

1. Failover was not handled correctly.
2. Creating more than one event, together with potentially spurious errors, raises the need for backoff.

This change configures the clients for failover mode only. Whenever the
connection to one host fails, another host is selected at random.
On failure, the reporter's output backs off exponentially. If the second client
(after failover) also fails, the backoff wait time is doubled,
and so on.

(cherry picked from commit 43ee7d7)

ycombinator

LGTM. WFG.

urso added backport review labels Aug 29, 2018

urso force-pushed the backport_8090_6.4 branch from 5254e89 to aff5b55 Compare August 29, 2018 18:35

urso requested review from ycombinator and ruflin August 29, 2018 18:35

ycombinator approved these changes Aug 29, 2018

ruflin approved these changes Aug 30, 2018

graphaelli mentioned this pull request Aug 30, 2018

Update beats framework to 43ee7d7 elastic/apm-server#1338

Merged

urso merged commit 2828c75 into elastic:6.4 Aug 30, 2018

urso deleted the backport_8090_6.4 branch February 19, 2019 18:37
