MoveTables: some failures while switching primary tablet traffic can cause downtime #17326

Open
rohit-nayak-ps opened this issue Dec 3, 2024 · 1 comment

rohit-nayak-ps commented Dec 3, 2024

Overview of the Issue

Problem

Currently, a workflow can get into a bad state during either SwitchTraffic or ReverseTraffic when primary and replica tablets are switched in the same call. The impact is severe: the tables being moved can become unavailable. This inconsistent state can occur as follows:

  1. SwitchTraffic succeeds for both replicas and primaries, which now point to the target. The state is Reads Switched, Writes Switched.
  2. ReverseTraffic hits a tablet roll after the replicas are reversed but while the primary is being reversed. The reported state is Reads Not Switched, Writes Switched: replicas point correctly to the source, and writes still (apparently correctly) point to the target, since the reverse failed for the primary.
  3. However, while the routing rules (which the API uses to check the state) still point correctly to the target, the tablet roll happened after the denied tables had been added to the target and deleted from the source, in preparation for pointing writes back to the source.
  4. As a consequence, vtgate still thinks MoveTables is in progress. The routing rules point the table to the target, but the target still lists the table as denied, which is taken as a signal that MoveTables has started but not completed. So effectively we have downtime.
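
The condition described in steps 3 and 4 can be expressed as a small check: writes are routed to a keyspace whose primary still denies the table. The following is only a sketch under stated assumptions. `WorkflowState`, its fields, and the keyspace and table names are hypothetical and are not Vitess types or APIs; a real check would read routing rules and denied tables from the topo.

```go
package main

import "fmt"

// WorkflowState is a hypothetical, simplified snapshot of the topo state for
// one table in a MoveTables workflow. It is not a Vitess type.
type WorkflowState struct {
	Table          string
	WritesRoutedTo string              // keyspace the routing rules send primary traffic to
	DeniedTables   map[string][]string // keyspace -> tables denied on its primary
}

// deniedOn reports whether the table is in the denied list of the keyspace.
func (s WorkflowState) deniedOn(keyspace string) bool {
	for _, t := range s.DeniedTables[keyspace] {
		if t == s.Table {
			return true
		}
	}
	return false
}

// Inconsistent reports whether writes are routed to a keyspace that denies
// the table, which is the state described in the issue.
func (s WorkflowState) Inconsistent() bool {
	return s.deniedOn(s.WritesRoutedTo)
}

func main() {
	// State after the failed ReverseTraffic: routing rules still point to the
	// target, but the target already has the table in its denied list.
	failed := WorkflowState{
		Table:          "customer",
		WritesRoutedTo: "target_ks",
		DeniedTables:   map[string][]string{"target_ks": {"customer"}},
	}
	// State after a clean SwitchTraffic: only the source denies the table.
	healthy := WorkflowState{
		Table:          "customer",
		WritesRoutedTo: "target_ks",
		DeniedTables:   map[string][]string{"source_ks": {"customer"}},
	}
	fmt.Println(failed.Inconsistent(), healthy.Inconsistent()) // prints "true false"
}
```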

In fact, this is a known issue with catastrophic failures in the middle of primary traffic switches, because we don't have a distributed transaction for the topo changes. In the past, such a failure would result in a message informing the user of the inconsistent state and expecting a manual fix. However, we now have users who call the vtctld RPCs directly. They have no way of knowing what went wrong or how to fix it, since they would need to manually invoke other RPCs depending on what went wrong.

Recent changes, including locking of tables when writes are being switched, have significantly lengthened the window during which we are susceptible to such crashes.

Current Resolution Options

Currently we need to fix this manually using the vtctldclient commands SetShardTabletControl and RebuildKeyspaceGraph. The steps are:

  1. Figure out which keyspace must have denied tables removed (this depends on whether the failure happened during SwitchTraffic or ReverseTraffic)
  2. Remove the incorrect denied tables
  3. Rebuild the keyspace
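
The three steps above can be sketched as a small planning function. This is an illustration under assumptions, not a Vitess API: `Step` and `PlanRepair` are hypothetical names, the rule "the keyspace that routing rules send writes to must not deny the moved tables" is an assumption drawn from the failure described in this issue, and shards are ignored for brevity even though the real SetShardTabletControl command operates on a keyspace/shard. No command flags are shown.

```go
package main

import "fmt"

// Step is a hypothetical description of one manual action. It is not a
// Vitess type; Command only names the vtctldclient command from the issue.
type Step struct {
	Command  string
	Keyspace string
	Tables   []string // denied tables to remove; empty for RebuildKeyspaceGraph
}

// PlanRepair assumes the keyspace that writes are routed to must not deny any
// of the moved tables. It returns the manual steps to clear stale entries,
// or nil if nothing needs fixing.
func PlanRepair(writesRoutedTo string, moved []string, denied map[string][]string) []Step {
	deniedSet := make(map[string]bool)
	for _, t := range denied[writesRoutedTo] {
		deniedSet[t] = true
	}
	var stale []string
	for _, t := range moved {
		if deniedSet[t] {
			stale = append(stale, t)
		}
	}
	if len(stale) == 0 {
		return nil
	}
	return []Step{
		// Step 2: remove the incorrect denied tables.
		{Command: "SetShardTabletControl", Keyspace: writesRoutedTo, Tables: stale},
		// Step 3: rebuild the keyspace (assumed to be the one just changed).
		{Command: "RebuildKeyspaceGraph", Keyspace: writesRoutedTo},
	}
}

func main() {
	steps := PlanRepair(
		"target_ks",
		[]string{"customer", "orders"},
		map[string][]string{"target_ks": {"customer", "orders"}},
	)
	for i, s := range steps {
		fmt.Printf("%d. %s keyspace=%s tables=%v\n", i+1, s.Command, s.Keyspace, s.Tables)
	}
}
```

Returning the plan as data rather than executing it keeps the decision (which keyspace to fix) separate from the action, so an operator can review it before running anything.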

This is true as of main (v22 dev) 1058621

Desired Solution

One possible option is a Repair, which could be a separate sub-command that the user invokes when Switch/Reverse fails. We will know about failures because an error is reported when the gRPC connection breaks.

Repair can check for an inconsistent state and report it, fix it, or both. In addition, Switch/Reverse Traffic should check for an inconsistent state each time and fail if one is found.

@rohit-nayak-ps

Further review of this specific incident revealed that vtctld did try to roll back the changes to revert to the previous state, but the client had a short context timeout and hence the rollback also failed. The vttablet roll caused the table locking step to take a long time before it failed, and that delay also triggered the context cancellation in vtctld.

A simple solution is to run the cancel operations with a separate context. That is the initial approach we will take, since the other solutions are a lot more complex.
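
To illustrate the idea of a separate context for cancel operations, here is a minimal Go sketch. It is not the vtctld implementation: `switchWrites`, `rollback`, `switchWithRollback`, `runDemo`, and the `rollbackTimeout` value are all hypothetical stand-ins. The point it shows is that the rollback runs under a fresh context with its own deadline, so an expired client context no longer aborts the cleanup.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// rollbackTimeout is an assumed cleanup budget; the issue gives no value.
const rollbackTimeout = 500 * time.Millisecond

// switchWrites is a hypothetical stand-in for the slow step (for example,
// table locking during a tablet roll). It returns early if ctx is done.
func switchWrites(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// rollback is a hypothetical stand-in for reverting the topo changes. Like
// most context-aware calls, it fails if its context is already done.
func rollback(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("rollback aborted: %w", err)
	}
	return nil
}

// switchWithRollback runs the switch under the caller's context. On failure
// it rolls back under a separate context, so the caller's expired deadline
// does not also cancel the cleanup.
func switchWithRollback(ctx context.Context, stepDuration time.Duration) (switchErr, rollbackErr error) {
	switchErr = switchWrites(ctx, stepDuration)
	if switchErr == nil {
		return nil, nil
	}
	cleanupCtx, cancel := context.WithTimeout(context.Background(), rollbackTimeout)
	defer cancel()
	return switchErr, rollback(cleanupCtx)
}

// runDemo simulates a client with a short timeout and a switch step that
// takes much longer than that timeout.
func runDemo() (switchErr, rollbackErr error) {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	return switchWithRollback(ctx, time.Second)
}

func main() {
	switchErr, rollbackErr := runDemo()
	fmt.Println("switch timed out:", errors.Is(switchErr, context.DeadlineExceeded))
	fmt.Println("rollback error:", rollbackErr)
}
```

If `rollback` were called with the original `ctx` instead of `cleanupCtx`, it would fail with the same deadline error as the switch, which matches the incident described above. On Go 1.21 and later, `context.WithoutCancel(ctx)` can replace `context.Background()` as the parent to keep request-scoped values while still dropping the cancellation.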
