Overview of the Issue

VTGate can buffer queries / park sessions when it sees that there is currently no healthy tablet to forward a query to. Since #13507, vtgate detects when a MoveTables SwitchTraffic/ReverseTraffic command is in progress and buffers incoming queries until the traffic switch completes. This can virtually eliminate errors seen by application users during a planned event — in this case, switching traffic for a MoveTables workflow.

This generally works very well today, but under heavy load (very high QPS) we have observed that two buffer windows can effectively overlap: a new buffer window begins immediately after the previous one ends. Because this second window starts concurrently with vtgate processing the keyspace event resulting from the traffic switch (and thus draining the buffers), no further keyspace event arrives that would drain the buffers again. All traffic is therefore paused until the max buffer failover window is hit and the buffers are finally drained, producing significant additional query latency and query errors. We also saw that, when this occurred, some common queries failed persistently because they used the cached query plan from the query that kicked off the buffering (that plan routed to the side traffic had been switched away from). This persisted until another keyspace event / SrvVSchema change was processed, which clears the query plan cache. The observed query errors all carried the error code Internal (ERInternalError / 1815), as there was no healthy shard to send them to, due to the denied-tables rules detected during what vtgate believed was an active traffic switch that had in fact already ended.
Reproduction Steps
You can observe these behaviors on main if you ONLY apply the endtoend test changes from #15701 and run the endtoend test locally:
go test -v -failfast -count 1 -timeout 10m -run ^TestPartialMoveTablesBasic/vtctld$ vitess.io/vitess/go/test/endtoend/vreplication
Those test changes are tailor-made to trigger the issue:
- The changes to the loadGenerator take us from ~500 QPS to ~10,000 QPS during the test run.
- We perform many more traffic switches than before, checking for errors after each one.
Binary Version
❯ vtgate --version
vtgate version Version: 20.0.0-SNAPSHOT (Git revision 1f81b8a57273710a02e9ce460f72ebf3abf3bd92 branch 'ks_events_improvements') built on Sat Apr 13 12:48:58 EDT 2024 by matt@pslord.local using go1.22.2 darwin/arm64
Operating System and Environment details
N/A
Log Fragments
No response
What was the value of buffer_min_time_between_failovers? I'm not seeing how you end up with the following unless that is set to 0:

> but under heavy load (very high QPS) it's been observed that we can have 2 buffer windows ~ overlap so that we go into a new buffer window immediately after the previous one ended.
Yes, in the case that was being investigated the value was set to 0.