
Add support for recovery of async/semisync replicas of failed replication group members #1254

Merged: 1 commit into openark:master on Oct 19, 2020

Conversation

@ejortegau (Contributor) commented on Oct 16, 2020:

Related issue: #1253

Description

This PR addresses the issue mentioned above. It does so by adding failure detection and recovery for replication group members that have traditional async/semi-sync replicas.
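
For context on how this plugs into orchestrator's recovery flow, here is a minimal sketch of the wiring (the constant name and the dispatch shape below are assumptions, mirroring how DeadIntermediateMaster is handled; the real code is in the inline excerpts further down):

// Sketch only; names are assumptions, not a verbatim excerpt from this PR.
// In go/inst/analysis.go, a dedicated analysis code for the new failure type:
const DeadReplicationGroupMemberWithReplicas AnalysisCode = "DeadReplicationGroupMemberWithReplicas"

// In the recovery dispatch, that code is mapped to its recovery function,
// just as DeadIntermediateMaster maps to checkAndRecoverDeadIntermediateMaster:
switch analysisEntry.Analysis {
case inst.DeadIntermediateMaster:
	checkAndRecoverFunction = checkAndRecoverDeadIntermediateMaster
case inst.DeadReplicationGroupMemberWithReplicas:
	checkAndRecoverFunction = checkAndRecoverDeadGroupMemberWithReplicas
}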

cc @sjmudd, @dveeden, @luisyonaldo.

@shlomi-noach (Collaborator) left a comment:


please see inline comments

go/inst/analysis.go (review thread: outdated, resolved)
go/inst/instance_dao.go (review thread: outdated, resolved)
// failure of a group member with replicas is akin to failure of an intermediate master.
func checkAndRecoverDeadGroupMemberWithReplicas(analysisEntry inst.ReplicationAnalysis, candidateInstanceKey *inst.InstanceKey, forceInstanceRecovery bool, skipProcesses bool) (bool, *TopologyRecovery, error) {
// Don't proceed with recovery unless it was forced or automatic intermediate source recovery is enabled.
// We consider failed group members akin to failed intermediate masters, so we re-use the configuration for
@shlomi-noach (Collaborator) commented:

"so we re-use the configuration"

but in analysis_dao.go it seems like you've changed that: intermediate master recovery only takes place under

if !a.IsReplicationGroupMember {

@ejortegau (Contributor, Author) replied:

What I mean here is that we re-use the analysisEntry.ClusterDetails.HasAutomatedIntermediateMasterRecovery configuration to decide whether to fail over group members, as opposed to having a separate configuration. As mentioned in the method's doc comment, we operate under the assumption that group secondaries with replicas are akin to intermediate masters, in the sense that they perform a very similar function in the replication chain: they receive and apply changes from the primary (via group replication rather than the binlog) and distribute them to replicas (via the binlog). I hope this clarifies my intent.
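
Concretely, the gate amounts to something like this (a simplified reading of the check quoted above, not a verbatim excerpt):

// Simplified sketch: reuse the intermediate-master recovery switch instead of
// introducing a dedicated setting for replication group members.
if !(forceInstanceRecovery || analysisEntry.ClusterDetails.HasAutomatedIntermediateMasterRecovery) {
	// Recovery was neither forced nor enabled for this cluster; bail out.
	return false, nil, nil
}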

topologyRecovery.SuccessorKey = &recoveredToGroupMember.Key
topologyRecovery.SuccessorAlias = recoveredToGroupMember.InstanceAlias
// For the same reasons that were mentioned above, we re-use the post intermediate master fail-over hooks
executeProcesses(config.Config.PostIntermediateMasterFailoverProcesses, "PostIntermediateMasterFailoverProcesses", topologyRecovery, false)
@shlomi-noach (Collaborator) commented:

I don't run Group Replication myself, but I think it can be debatable whether it is correct to run PostIntermediateMasterFailoverProcesses. For now, let's keep it at that, but I predict that someone in the future will argue against this.

@ejortegau (Contributor, Author) commented on Oct 18, 2020:

For now, our use case does not seem to require different hooks for these. If the need arises (or someone comes knocking at your door asking for it), I'd be happy to change this to have separate GR and intermediate source hooks.
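
Since both recovery types currently share the hook, a single entry in the orchestrator JSON configuration covers them. A minimal example (the command and log path are made up for illustration; the placeholders are standard hook placeholders):

{
  "PostIntermediateMasterFailoverProcesses": [
    "echo 'Recovered {failureType}: {failedHost}:{failedPort} -> {successorHost}:{successorPort}' >> /var/log/orchestrator-recovery.log"
  ]
}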

@shlomi-noach merged commit 37c255e into openark:master on Oct 19, 2020.