Overview of the Issue
Problem
Currently a workflow can get into a bad state during either SwitchTraffic or ReverseTraffic when both primary and replica tablets are switched at once. The result is severe, since it can make the tables being moved unavailable. This inconsistent state can occur when:
SwitchTraffic worked correctly with both Replicas and Primaries switched correctly to the target, resulting in the state Reads Switched, Writes Switched
ReverseTraffic experienced a tablet roll after the Replicas were reversed but in the middle of reversing the Primary, resulting in the state Reads Not Switched, Writes Switched. Replicas pointed correctly to the Source, and Writes (apparently, and correctly) still pointed to the Target, since the reverse failed for the Primary.
BUT: while the Routing Rules (which the API uses to check the state) pointed correctly to the target, the tablet roll happened after the denied tables had been added to the Target and deleted from the Source, in preparation for pointing writes back to the Source.
One consequence of this is that vtgate still thinks MoveTables is in progress. The table points to the target via routing rules, but the target still has the table as a denied table, which is taken as a signal that MoveTables has started but not completed. So effectively we have downtime.
In fact, this is a known issue with catastrophic failures in the middle of Primary traffic switches, because we don't really have a distributed transaction for the topo changes. In the past, such a failure would result in a message informing the user of the inconsistent state and expecting a manual fix. However, we now have users who call the vtctld RPCs directly. They have no way of knowing what went wrong or how to fix it, since they would need to manually invoke other RPCs depending on what went wrong.
Recent changes, including locking of tables when writes are being switched, have significantly lengthened the window during which we are susceptible to such crashes.
Current Resolution Options
Currently this has to be fixed manually using the vtctldclient commands SetShardTabletControl and RebuildKeyspaceGraph. This is what needs to be done:
Figure out which keyspace must have its denied tables removed (this depends on whether the failure happened during SwitchTraffic or ReverseTraffic)
Desired Solution
One possible option here is a Repair sub-command that the user can invoke when Switch/Reverse fails. We will know about failures because an error is reported when the grpc connection breaks.
Repair can check for an inconsistent state and report it, fix it, or both. In addition, Switch/Reverse Traffic should check for an inconsistent state each time and fail if one is found.
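The check that Repair and Switch/Reverse Traffic would run can be sketched as below. This is only an illustration of the condition described in this issue (routing rules send writes to a side that still denies the table). The types `Side` and `TableState` and the function `checkConsistency` are hypothetical and are not part of the Vitess API; a real implementation would read routing rules and denied tables from the topo.

```go
package main

import "fmt"

// Side identifies one end of the workflow. Hypothetical type for this sketch.
type Side string

const (
	Source Side = "source"
	Target Side = "target"
)

// TableState is a hypothetical snapshot of what the topo says about one table.
type TableState struct {
	Table          string
	WritesRoutedTo Side // where the routing rules send primary traffic
	DeniedOnSource bool // table is in the source's denied tables
	DeniedOnTarget bool // table is in the target's denied tables
}

// checkConsistency returns an error when the routing rules send writes to a
// side that also lists the table as denied, which is the state described in
// this issue.
func checkConsistency(s TableState) error {
	denied := (s.WritesRoutedTo == Source && s.DeniedOnSource) ||
		(s.WritesRoutedTo == Target && s.DeniedOnTarget)
	if denied {
		return fmt.Errorf("table %q: writes are routed to the %s, which still denies the table",
			s.Table, s.WritesRoutedTo)
	}
	return nil
}

func main() {
	// State after the failed ReverseTraffic: routing rules still point to
	// the target, but the denied tables were already moved to the target.
	broken := TableState{Table: "t1", WritesRoutedTo: Target, DeniedOnTarget: true}
	fmt.Println(checkConsistency(broken))

	// State after a clean SwitchTraffic: only the source denies the table.
	healthy := TableState{Table: "t1", WritesRoutedTo: Target, DeniedOnSource: true}
	fmt.Println(checkConsistency(healthy))
}
```

Running the same check at the start of every Switch/Reverse call would let the command fail early instead of building on a broken state.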
Further review of this specific incident revealed that vtctld did try to roll back the changes to revert to the previous state, but the client had a short context timeout and hence the rollback also failed. So the vttablet roll caused the table locking step to take a long time before it failed, and that also triggered the context cancellation in vtctld.
A simple solution for this would be to do the cancel operations with a separate context. That is the initial approach we will take since other solutions are a lot more complex.
The current resolution options above are true as of main (v22 dev) 1058621.