
Bug Report: VTGate persistent query errors after MoveTables SwitchTraffic #15707

Closed
mattlord opened this issue Apr 13, 2024 · 2 comments · Fixed by #15701

Comments

mattlord (Contributor) commented Apr 13, 2024

Overview of the Issue

VTGate can buffer queries (parking their sessions) when it sees that there is currently no healthy tablet to forward a query to. Since #13507, a vtgate detects when a MoveTables SwitchTraffic/ReverseTraffic command is in progress and buffers incoming queries until the traffic switch completes.

This can virtually eliminate the errors seen by application users during a planned event, in this case switching traffic for a MoveTables workflow.
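
For context, this buffering behavior is governed by a handful of vtgate flags. A minimal illustrative invocation (the values shown are examples, not the settings from the affected deployment):

  vtgate \
    --enable_buffer \
    --buffer_max_failover_duration=20s \
    --buffer_min_time_between_failovers=1m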

This generally works very well today, but under heavy load (very high QPS) it has been observed that two buffer windows can effectively overlap, so that we enter a new buffer window immediately after the previous one has ended. Because this second window opens concurrently with vtgate processing the keyspace event that resulted from the traffic switch (the event that drains the buffers), no further keyspace event arrives that would cause us to immediately drain the buffers again. All traffic is therefore paused until the max buffer failover window is hit and the buffers are drained on timeout, which results in significant additional query latency and query errors.

It was also observed that, when this occurred, some common queries failed persistently because they were using the cached query plan from the query that kicked off the buffering (that plan routed to the side that traffic had been switched away from). This persisted until another keyspace event / srv vschema change was processed, since that clears the query plan cache. The observed query errors all had the error code Internal (ERInternalError / 1815), because there was no healthy shard to send them to given the denied-tables rules detected during what vtgate believed was an active traffic switch, when in fact it had already ended concurrently.
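
To make the race concrete, here is a hypothetical, heavily simplified Go sketch of the failure mode (this is not the actual vtgate code; it only models the event flow described above):

  // Hypothetical simplification of the race described above; not vtgate code.
  package main

  import (
      "fmt"
      "time"
  )

  func main() {
      // The traffic switch completes and emits exactly one keyspace event.
      keyspaceEvent := make(chan struct{}, 1)
      keyspaceEvent <- struct{}{}

      // Buffer window 1 drains promptly when the event is processed.
      <-keyspaceEvent
      fmt.Println("window 1: drained on keyspace event")

      // Window 2 opens concurrently with that drain. No further keyspace
      // event is coming, so it can only end when the max failover window
      // (--buffer_max_failover_duration) expires.
      maxFailoverWindow := 50 * time.Millisecond
      select {
      case <-keyspaceEvent:
          fmt.Println("window 2: drained on keyspace event") // never happens
      case <-time.After(maxFailoverWindow):
          fmt.Println("window 2: drained only on timeout: extra latency and errors")
      }
  }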

Reproduction Steps

You can observe these behaviors on main if you ONLY apply the endtoend test changes from #15701 and run the endtoend test locally:

 go test -v -failfast -count 1 -timeout 10m -run ^TestPartialMoveTablesBasic/vtctld$ vitess.io/vitess/go/test/endtoend/vreplication

Those test changes are tailor-made to trigger the issue:

  • Changes made to the loadGenerator take us from ~500 QPS to ~10,000 QPS during the test run (a hypothetical sketch of such a generator follows this list)
  • We perform many more traffic switches than before, checking for any errors throughout
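
A hypothetical sketch of what such a rate-ramped load generator can look like (the real changes live in the endtoend test in #15701; the names here are illustrative only):

  // Illustrative only; not the test's actual loadGenerator implementation.
  package main

  import "time"

  // runLoad fires queryFn at roughly the requested QPS for the given duration.
  func runLoad(qps int, d time.Duration, queryFn func()) {
      ticker := time.NewTicker(time.Second / time.Duration(qps))
      defer ticker.Stop()
      deadline := time.Now().Add(d)
      for time.Now().Before(deadline) {
          <-ticker.C
          go queryFn() // fire-and-forget so slow queries don't cap the rate
      }
  }

  func main() {
      // Stand-in for the ramped phase: ~10,000 QPS for one second
      // against a no-op query.
      runLoad(10000, time.Second, func() {})
  }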

Binary Version

❯ vtgate --version
vtgate version Version: 20.0.0-SNAPSHOT (Git revision 1f81b8a57273710a02e9ce460f72ebf3abf3bd92 branch 'ks_events_improvements') built on Sat Apr 13 12:48:58 EDT 2024 by matt@pslord.local using go1.22.2 darwin/arm64

Operating System and Environment details

N/A

Log Fragments

No response

deepthi (Member) commented Apr 17, 2024

What was the value of buffer_min_time_between_failovers? I'm not seeing how you end up with the following unless that is set to 0.

but under heavy load (very high QPS) it has been observed that two buffer windows can effectively overlap, so that we enter a new buffer window immediately after the previous one has ended.

mattlord (Contributor, Author) commented Apr 17, 2024

What was the value of buffer_min_time_between_failovers? I'm not seeing how you end up with the following unless that is set to 0.

but under heavy load (very high QPS) it has been observed that two buffer windows can effectively overlap, so that we enter a new buffer window immediately after the previous one has ended.

Yes, in the case that was being investigated the value was set to 0.
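
For reference, that corresponds to running vtgate with buffering enabled but no enforced gap between buffering windows, e.g. (illustrative invocation only):

  vtgate --enable_buffer --buffer_min_time_between_failovers=0s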
