Supporting UnreachableIntermediateMasterWithLaggingReplicas #1005

shlomi-noach · 2019-11-05T09:10:35Z

Fixes #999

This PR introduces the UnreachableIntermediateMasterWithLaggingReplicas analysis. As the name suggests, orchestrator makes this analysis when it cannot reach an intermediate master and, in addition, all of that intermediate master's replicas are lagging.
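To make the condition concrete, here is a minimal sketch of the decision described above. It is not the code in this PR: the `Snapshot` type, the `Classify` function, the `NoMatch` value and the lag threshold parameter are all hypothetical names chosen for illustration, and treating an intermediate master with zero replicas as a non-match is an assumption.

```go
package main

// AnalysisCode names a failure analysis. Hypothetical stand-in type;
// only the analysis name itself comes from this PR.
type AnalysisCode string

const (
	// NoMatch means "this particular analysis does not apply"; other
	// analyses may still apply. Hypothetical value for this sketch.
	NoMatch AnalysisCode = "NoMatch"

	UnreachableIntermediateMasterWithLaggingReplicas AnalysisCode = "UnreachableIntermediateMasterWithLaggingReplicas"
)

// Snapshot is a hypothetical view of an intermediate master as seen by
// the monitoring service.
type Snapshot struct {
	Reachable         bool    // could the intermediate master be reached?
	ReplicaLagSeconds []int64 // last known lag of each of its replicas
}

// Classify returns UnreachableIntermediateMasterWithLaggingReplicas only
// when the intermediate master is unreachable AND every one of its
// replicas lags by more than lagThresholdSeconds (an assumed parameter).
func Classify(s Snapshot, lagThresholdSeconds int64) AnalysisCode {
	if s.Reachable || len(s.ReplicaLagSeconds) == 0 {
		return NoMatch
	}
	for _, lag := range s.ReplicaLagSeconds {
		if lag <= lagThresholdSeconds {
			// At least one replica is keeping up, so "all lagging" fails.
			return NoMatch
		}
	}
	return UnreachableIntermediateMasterWithLaggingReplicas
}
```

For example, `Classify(Snapshot{Reachable: false, ReplicaLagSeconds: []int64{120, 300}}, 10)` matches the new analysis, while a single replica within the threshold makes it fall through to `NoMatch`.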

The remediation is similar to that of UnreachableMasterWithLaggingReplicas: orchestrator emergently restarts the replication IO_thread on all replicas of said intermediate master.

In scenarios like the one depicted in #999, the replicas then quickly identify themselves as broken. Thus, the next failure detection by orchestrator is expected to analyze a DeadIntermediateMaster and kick off a failover.

cc @jfg956

Shlomi Noach added 2 commits November 5, 2019 11:06

Supporting UnreachableIntermediateMasterWithLaggingReplicas

2c658ad

Supporting UnreachableIntermediateMasterWithLaggingReplicas

cf6c319

shlomi-noach mentioned this pull request Nov 5, 2019

Orchestrator not detecting intermediate master failure with relay_log_space_limit. #999

Closed

Shlomi Noach added 4 commits November 5, 2019 11:30

Merge branch 'master' into im-with-lagging-replicas

637388d

Merge branch 'master' into im-with-lagging-replicas

621ced7

Merge branch 'master' into im-with-lagging-replicas

7d25121

Merge branch 'master' into im-with-lagging-replicas

dac201e

shlomi-noach merged commit 684d6e2 into master Nov 24, 2019

shlomi-noach deleted the im-with-lagging-replicas branch November 24, 2019 10:56
