
Bug Report: VTGate persistent query errors after MoveTables SwitchTraffic #15707

Closed
mattlord opened this issue Apr 13, 2024 · 2 comments · Fixed by #15701

Comments

mattlord (Contributor) commented Apr 13, 2024

Overview of the Issue

VTGate can buffer queries (parking their sessions) when it sees that there is currently no healthy tablet to forward a query to. Since #13507, a vtgate detects when a MoveTables SwitchTraffic/ReverseTraffic command is in progress and buffers incoming queries until the traffic switch completes.

This can virtually eliminate the errors seen by application users during a planned event, in this case switching traffic for a MoveTables workflow.
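
For context, this buffering behavior is governed by a handful of vtgate flags. A minimal illustrative invocation (the values shown are examples, not the settings from the affected deployment):

  vtgate \
    --enable_buffer \
    --buffer_max_failover_duration=20s \
    --buffer_min_time_between_failovers=1m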

This generally works very well today, but under heavy load (very high QPS) it has been observed that two buffer windows can effectively overlap, so that we enter a new buffer window immediately after the previous one has ended. Because this second window opens concurrently with vtgate processing the keyspace event that resulted from the traffic switch (the event that drains the buffers), no further keyspace event arrives that would cause us to immediately drain the buffers again. All traffic is therefore paused until the max buffer failover window is hit and the buffers are drained on timeout, which results in significant additional query latency and query errors.

It was also observed that, when this occurred, some common queries failed persistently because they were using the cached query plan from the query that kicked off the buffering (that plan routed to the side that traffic had been switched away from). This persisted until another keyspace event / srv vschema change was processed, since that clears the query plan cache. The observed query errors all had the error code Internal (ERInternalError / 1815), because there was no healthy shard to send them to given the denied-tables rules detected during what vtgate believed was an active traffic switch, when in fact it had already ended concurrently.
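
To make the race concrete, here is a hypothetical, heavily simplified Go sketch of the failure mode (this is not the actual vtgate code; it only models the event flow described above):

  // Hypothetical simplification of the race described above; not vtgate code.
  package main

  import (
      "fmt"
      "time"
  )

  func main() {
      // The traffic switch completes and emits exactly one keyspace event.
      keyspaceEvent := make(chan struct{}, 1)
      keyspaceEvent <- struct{}{}

      // Buffer window 1 drains promptly when the event is processed.
      <-keyspaceEvent
      fmt.Println("window 1: drained on keyspace event")

      // Window 2 opens concurrently with that drain. No further keyspace
      // event is coming, so it can only end when the max failover window
      // (--buffer_max_failover_duration) expires.
      maxFailoverWindow := 50 * time.Millisecond
      select {
      case <-keyspaceEvent:
          fmt.Println("window 2: drained on keyspace event") // never happens
      case <-time.After(maxFailoverWindow):
          fmt.Println("window 2: drained only on timeout: extra latency and errors")
      }
  }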

Reproduction Steps

You can observe these behaviors on main if you ONLY apply the endtoend test changes from #15701 and run the endtoend test locally:

 go test -v -failfast -count 1 -timeout 10m -run ^TestPartialMoveTablesBasic/vtctld$ vitess.io/vitess/go/test/endtoend/vreplication

Those test changes are tailor-made to trigger the issue:

  • Changes made to the loadGenerator take us from ~500 QPS to ~10,000 QPS during the test run (a hypothetical sketch of such a generator follows this list)
  • We perform many more traffic switches than before, checking for any errors throughout
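
A hypothetical sketch of what such a rate-ramped load generator can look like (the real changes live in the endtoend test in #15701; the names here are illustrative only):

  // Illustrative only; not the test's actual loadGenerator implementation.
  package main

  import "time"

  // runLoad fires queryFn at roughly the requested QPS for the given duration.
  func runLoad(qps int, d time.Duration, queryFn func()) {
      ticker := time.NewTicker(time.Second / time.Duration(qps))
      defer ticker.Stop()
      deadline := time.Now().Add(d)
      for time.Now().Before(deadline) {
          <-ticker.C
          go queryFn() // fire-and-forget so slow queries don't cap the rate
      }
  }

  func main() {
      // Stand-in for the ramped phase: ~10,000 QPS for one second
      // against a no-op query.
      runLoad(10000, time.Second, func() {})
  }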

Binary Version

❯ vtgate --version
vtgate version Version: 20.0.0-SNAPSHOT (Git revision 1f81b8a57273710a02e9ce460f72ebf3abf3bd92 branch 'ks_events_improvements') built on Sat Apr 13 12:48:58 EDT 2024 by matt@pslord.local using go1.22.2 darwin/arm64

Operating System and Environment details

N/A

Log Fragments

No response

deepthi (Member) commented Apr 17, 2024

What was the value of buffer_min_time_between_failovers? I'm not seeing how you end up with the following unless that is set to 0.

but under heavy load (very high QPS) it has been observed that two buffer windows can effectively overlap, so that we enter a new buffer window immediately after the previous one has ended.

mattlord (Contributor, Author) commented Apr 17, 2024

What was the value of buffer_min_time_between_failovers? I'm not seeing how you end up with the following unless that is set to 0.

but under heavy load (very high QPS) it has been observed that two buffer windows can effectively overlap, so that we enter a new buffer window immediately after the previous one has ended.

Yes, in the case that was being investigated the value was set to 0.
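
For reference, that corresponds to running vtgate with buffering enabled but no enforced gap between buffering windows, e.g. (illustrative invocation only):

  vtgate --enable_buffer --buffer_min_time_between_failovers=0s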
