MoveTables: some failures while switching primary tablet traffic can cause downtime #17326

rohit-nayak-ps · 2024-12-03T09:17:29Z

Overview of the Issue

Problem

Currently, a workflow can get into a bad state during either SwitchTraffic or ReverseTraffic when both primary and replica tablets are switched at once. The result is severe: the tables being moved can become unavailable. This inconsistent state can occur through the following sequence:

SwitchTraffic worked correctly, with both Replicas and Primaries switched to the target, resulting in the state Reads Switched, Writes Switched.
ReverseTraffic experienced a tablet roll after Replicas were reversed, but in the middle of reversing the Primary. This resulted in the state Reads Not Switched, Writes Switched: Replicas pointed correctly to the Source, and Writes (apparently, and correctly) still went to the Target, since the reverse failed for the Primary.
BUT: while the Routing Rules (which are what the API uses to check the state) were pointing correctly to the target, the tablet roll happened after the denied tables had been added to the Target and deleted from the Source, in preparation for pointing writes back to the Source.
One of the consequences of this is that vtgate still thinks MoveTables is in progress. The table is pointing to the target via routing rules, but the target still has the table as a denied table, which is taken as a signal that MoveTables has started but not completed. So effectively we have downtime.

In fact, this is a known issue with catastrophic failures in the middle of Primary traffic switches, because we don't really have a distributed transaction for the topo changes. In the past, such a failure would result in a message informing the user of the inconsistent state, with a manual fix expected. However, we now have users who call the vtctld RPCs directly and have no way of knowing what went wrong or how to fix it, since they would need to manually invoke other RPCs depending on what went wrong.

Recent changes, including locking of tables when writes are being switched, have significantly increased the window during which we are susceptible to such crashes.

Current Resolution Options

Currently, this needs to be fixed manually using the vtctldclient commands SetShardTabletControl and RebuildKeyspaceGraph. This is what needs to be done:

Figure out which keyspace must have denied tables removed (this depends on whether the failure happened during SwitchTraffic or ReverseTraffic)
Remove the incorrect denied tables
Rebuild the keyspace

This is true as of main (v22 dev) 1058621

Desired Solution

One possible option here is a Repair, which can be a separate sub-command that the user can invoke when Switch/Reverse fails. We will know about failures because an error will be reported when the gRPC connection breaks.

Repair can check for an inconsistent state and report it, fix it, or both. In addition, Switch/Reverse Traffic should check for an inconsistent state each time and fail if one is found.
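To make the check concrete, here is a minimal sketch in Go of what such a consistency check could look like. It is not Vitess code: `WorkflowState` and `keyspaceToRepair` are hypothetical names, and the state is a simplified snapshot of the routing rules and denied tables. The first case is the one described in this issue (writes routed to the target while the target still denies the tables). The second, mirrored case for a failed SwitchTraffic is an assumption.

```go
package main

import "fmt"

// WorkflowState is a hypothetical, simplified snapshot of the topo state
// for one MoveTables workflow. It is not a Vitess type.
type WorkflowState struct {
	SourceKeyspace       string
	TargetKeyspace       string
	WritesRoutedToTarget bool     // routing rules send primary traffic to the target
	SourceDeniedTables   []string // denied tables on the source primary
	TargetDeniedTables   []string // denied tables on the target primary
}

// keyspaceToRepair reports whether the state is inconsistent and, if so,
// which keyspace has denied tables that would need to be removed.
func keyspaceToRepair(s WorkflowState) (keyspace string, inconsistent bool) {
	// Case from this issue: routing rules point at the target, but the
	// target still denies the tables, so the tables are unavailable.
	if s.WritesRoutedToTarget && len(s.TargetDeniedTables) > 0 {
		return s.TargetKeyspace, true
	}
	// Assumed mirror case: routing rules point at the source, but the
	// source already denies the tables.
	if !s.WritesRoutedToTarget && len(s.SourceDeniedTables) > 0 {
		return s.SourceKeyspace, true
	}
	return "", false
}

func main() {
	// State left behind by the failed ReverseTraffic described above.
	// Keyspace and table names are made up for the example.
	state := WorkflowState{
		SourceKeyspace:       "commerce",
		TargetKeyspace:       "customer",
		WritesRoutedToTarget: true,
		TargetDeniedTables:   []string{"orders"},
	}
	if ks, bad := keyspaceToRepair(state); bad {
		fmt.Printf("inconsistent: remove denied tables from %q, then rebuild the keyspace\n", ks)
	}
}
```

A real Repair would have to read this state from the topo and cover every shard. The sketch only illustrates the decision of which side holds the stale denied tables.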


rohit-nayak-ps · 2024-12-05T11:10:13Z

Further review of this specific incident revealed that vtctld did try to roll back the changes to revert to the previous state, but the client had a short context timeout and hence the rollback also failed. In other words, the vttablet roll caused the table locking step to take a long time before it failed, and that delay also triggered the context cancellation in vtctld.

A simple solution for this would be to do the cancel operations with a separate context. That is the initial approach we will take since other solutions are a lot more complex.

rohit-nayak-ps added Type: Bug Component: VReplication labels Dec 3, 2024

rohit-nayak-ps self-assigned this Dec 3, 2024

rohit-nayak-ps mentioned this issue Dec 5, 2024

SwitchTraffic: use separate context while canceling a migration #17340

Merged

