
Add support for recovery of async/semisync replicas of failed replication group members #1254

Merged: 1 commit into openark:master on Oct 19, 2020

Conversation

@ejortegau (Contributor) commented on Oct 16, 2020:

Related issue: #1253

Description

This PR addresses the issue mentioned above. It does so by adding failure detection and recovery for replication group members that have traditional async/semi-sync replicas.
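
For context on how this plugs into orchestrator's recovery flow, here is a minimal sketch of the wiring (the constant name and the dispatch shape below are assumptions, mirroring how DeadIntermediateMaster is handled; the real code is in the inline excerpts further down):

// Sketch only; names are assumptions, not a verbatim excerpt from this PR.
// In go/inst/analysis.go, a dedicated analysis code for the new failure type:
const DeadReplicationGroupMemberWithReplicas AnalysisCode = "DeadReplicationGroupMemberWithReplicas"

// In the recovery dispatch, that code is mapped to its recovery function,
// just as DeadIntermediateMaster maps to checkAndRecoverDeadIntermediateMaster:
switch analysisEntry.Analysis {
case inst.DeadIntermediateMaster:
	checkAndRecoverFunction = checkAndRecoverDeadIntermediateMaster
case inst.DeadReplicationGroupMemberWithReplicas:
	checkAndRecoverFunction = checkAndRecoverDeadGroupMemberWithReplicas
}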

cc @sjmudd, @dveeden, @luisyonaldo.

@shlomi-noach (Collaborator) left a comment:


please see inline comments

go/inst/analysis.go (review thread: outdated, resolved)
go/inst/instance_dao.go (review thread: outdated, resolved)
// failure of a group member with replicas is akin to failure of an intermediate master.
func checkAndRecoverDeadGroupMemberWithReplicas(analysisEntry inst.ReplicationAnalysis, candidateInstanceKey *inst.InstanceKey, forceInstanceRecovery bool, skipProcesses bool) (bool, *TopologyRecovery, error) {
// Don't proceed with recovery unless it was forced or automatic intermediate source recovery is enabled.
// We consider failed group members akin to failed intermediate masters, so we re-use the configuration for
@shlomi-noach (Collaborator) commented:

"so we re-use the configuration"

but in analysis_dao.go it seems like you've changed that: intermediate master recovery only takes place under

if !a.IsReplicationGroupMember {

@ejortegau (Contributor, Author) replied:

What I mean here is that we re-use the analysisEntry.ClusterDetails.HasAutomatedIntermediateMasterRecovery configuration to decide whether to fail over group members, as opposed to having a separate configuration. As mentioned in the method's doc comment, we operate under the assumption that group secondaries with replicas are akin to intermediate masters, in the sense that they perform a very similar function in the replication chain: they receive and apply changes from the primary (via group replication rather than the binlog) and distribute them to replicas (via the binlog). I hope this clarifies my intent.
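
Concretely, the gate amounts to something like this (a simplified reading of the check quoted above, not a verbatim excerpt):

// Simplified sketch: reuse the intermediate-master recovery switch instead of
// introducing a dedicated setting for replication group members.
if !(forceInstanceRecovery || analysisEntry.ClusterDetails.HasAutomatedIntermediateMasterRecovery) {
	// Recovery was neither forced nor enabled for this cluster; bail out.
	return false, nil, nil
}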

topologyRecovery.SuccessorKey = &recoveredToGroupMember.Key
topologyRecovery.SuccessorAlias = recoveredToGroupMember.InstanceAlias
// For the same reasons that were mentioned above, we re-use the post intermediate master fail-over hooks
executeProcesses(config.Config.PostIntermediateMasterFailoverProcesses, "PostIntermediateMasterFailoverProcesses", topologyRecovery, false)
@shlomi-noach (Collaborator) commented:

I don't run Group Replication myself, but I think it can be debatable whether it is correct to run PostIntermediateMasterFailoverProcesses. For now, let's keep it at that, but I predict that someone in the future will argue against this.

@ejortegau (Contributor, Author) commented on Oct 18, 2020:

For now, our use case does not seem to require different hooks for these. If the need arises (or someone comes knocking at your door asking for it), I'd be happy to change this to have separate GR and intermediate source hooks.
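
Since both recovery types currently share the hook, a single entry in the orchestrator JSON configuration covers them. A minimal example (the command and log path are made up for illustration; the placeholders are standard hook placeholders):

{
  "PostIntermediateMasterFailoverProcesses": [
    "echo 'Recovered {failureType}: {failedHost}:{failedPort} -> {successorHost}:{successorPort}' >> /var/log/orchestrator-recovery.log"
  ]
}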

@shlomi-noach merged commit 37c255e into openark:master on Oct 19, 2020.