This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Supporting UnreachableIntermediateMasterWithLaggingReplicas #1005

Merged 6 commits on Nov 24, 2019


shlomi-noach
Collaborator

Fixes #999

This PR introduces the UnreachableIntermediateMasterWithLaggingReplicas analysis. As the name suggests, orchestrator makes this analysis when it cannot reach an intermediate master and, in addition, all of that intermediate master's replicas are lagging.
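The condition described above can be sketched as a small predicate. This is only an illustration under stated assumptions, not orchestrator's actual implementation: the types `ReplicaView` and `IntermediateMasterView` and the function `matchesUnreachableIMWithLaggingReplicas` are hypothetical names, and how "lagging" is decided (the threshold) is not modeled.

```go
package main

import "fmt"

// ReplicaView is a hypothetical, simplified view of one replica.
// It is NOT an orchestrator type.
type ReplicaView struct {
	Host    string
	Lagging bool // assumed to be precomputed; the lag threshold is not modeled
}

// IntermediateMasterView is a hypothetical view of an intermediate master
// and its direct replicas.
type IntermediateMasterView struct {
	Host      string
	Reachable bool // whether the last check of this server succeeded
	Replicas  []ReplicaView
}

// matchesUnreachableIMWithLaggingReplicas reports whether the intermediate
// master is unreachable AND every one of its replicas is lagging.
// An intermediate master with no replicas does not match in this sketch.
func matchesUnreachableIMWithLaggingReplicas(im IntermediateMasterView) bool {
	if im.Reachable || len(im.Replicas) == 0 {
		return false
	}
	for _, r := range im.Replicas {
		if !r.Lagging {
			return false
		}
	}
	return true
}

func main() {
	im := IntermediateMasterView{
		Host:      "im-1",
		Reachable: false,
		Replicas: []ReplicaView{
			{Host: "replica-1", Lagging: true},
			{Host: "replica-2", Lagging: true},
		},
	}
	fmt.Println(matchesUnreachableIMWithLaggingReplicas(im))
}
```

A single replica that is not lagging is enough for the predicate to return false, which mirrors the "all of its replicas" wording of the analysis.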

The remediation is similar to that of UnreachableMasterWithLaggingReplicas: orchestrator emergently restarts the replication IO_thread on all replicas of the intermediate master.

In scenarios like the one depicted in #999, the replicas then quickly identify themselves as broken. Thus, the next failure detection by orchestrator is expected to yield a DeadIntermediateMaster analysis and kick off a failover.

cc @jfg956

@shlomi-noach shlomi-noach merged commit 684d6e2 into master Nov 24, 2019
@shlomi-noach shlomi-noach deleted the im-with-lagging-replicas branch November 24, 2019 10:56
Development

Successfully merging this pull request may close these issues.

Orchestrator not detecting intermediate master failure with relay_log_space_limit.