Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VTOrc: Refactor and reload of ephemeral information for remaining recovery functions #10150

Merged
merged 9 commits into from
May 4, 2022

Conversation

GuptaManan100
Copy link
Member

@GuptaManan100 GuptaManan100 commented Apr 27, 2022

Description

This is the follow up PR for #10115.

In that PR, we had added the functionality to VTOrc to refresh its ephemeral information before running any cluster operation. But according to our discussions, we should be doing this for all failure scenarios. For example, we see that a tablet has its replication stopped, but we might be using old information and actually it is a BACKUP tablet_type. So, even in these cases, we should not be going ahead with the recovery to start the replication, since it might break other parts of Vitess. This PR addresses that concern and adds this reload of information for the other recovery functions as well.

In this PR, we have also adds some unit tests to VTOrc. We have also refactored the replicationAnalysis codepath to do the locking of the shard and refreshing of the information before we call the recoveryFunction, to prevent repeated code. This PR also deletes any unused failure scenarios that were coming from orchestrator legacy code. Vitess doesn't support Co-primary or intermediate-primary configuration, so it is safe to remove these failure scenarios. We anyways didn't have any associated recoveries for them.

Moreover, the codepath for PrimaryHasPrimary failure has been refactored, to only reset the replication on the primary instance, instead of running a force DeadPrimary recovery.

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

…n analysis info to be provided for adding unit tests for GetReplicationAnalysis

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…ommon code path for repair functions

Signed-off-by: Manan Gupta <manan@planetscale.com>
… have the same recovery function

Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VTorc Vitess Orchestrator integration release notes labels Apr 27, 2022
Signed-off-by: Manan Gupta <manan@planetscale.com>
@@ -929,7 +929,7 @@ func MakeCoPrimary(instanceKey *InstanceKey) (*Instance, error) {
}
log.Infof("Will make %+v co-primary of %+v", instanceKey, primary.Key)

var gitHint OperationGTIDHint = GTIDHintNeutral
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a linter fix

Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you link the catch-all vtorc issue into this PR?
I went and looked at how we decide on PrimaryHasPrimary and now I'm having second thoughts about simply removing the replication. Even though the tablet type is PRIMARY, we should make sure that the underlying mysqld is writable as well. That doesn't seem to be something that is considered when setting IsClusterPrimary to true.

@GuptaManan100
Copy link
Member Author

The PRIMARY tablet not being writable is a failure scenario in itself. If the designated primary is read only, it will spawn the PrimaryIsReadOnly failure which will set the primary to read-write.
I am unsure of what to do about PrimaryHasPrimary too, the failure could be as simple as someone messed up when entering the tablet host port when they wanted to the set replication source (maybe for experimental tablets, etc?). The earlier code assumed that something was seriously wrong for the Primary to have a replication source set, and ran an ERS forcefully.

…ical errors later

Signed-off-by: Manan Gupta <manan@planetscale.com>
@deepthi deepthi merged commit 3c53dd2 into vitessio:main May 4, 2022
@deepthi deepthi deleted the vtorc-refactor branch May 4, 2022 00:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: VTorc Vitess Orchestrator integration Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants