server: enhance graceful stop by closing connections after finish the ongoing txn (#32111) #48905
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: tiancaiamao, xhebox.
Codecov Report
Additional details and impacted files:

@@           Coverage Diff            @@
##      release-6.5   #48905   +/-  ##
================================================
  Coverage        ?   73.6375%
================================================
  Files           ?   1087
  Lines           ?   349429
  Branches        ?   0
================================================
  Hits            ?   257311
  Misses          ?   75605
  Partials        ?   16513
…nish the ongoing txn (pingcap#32111) (pingcap#48905)" This reverts commit b504f0b.
… ongoing txn (pingcap#32111) (pingcap#48905) (pingcap#29) close pingcap#32110 Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
This is an automated cherry-pick of #32111
What problem does this PR solve?
Issue Number: close #32110
Problem Summary:
enhance graceful stop by closing the connection after finishing the ongoing txn
What is changed and how it works?
I tested the transaction failures caused by a rolling update of tidb, and found that basically all currently running transactions fail.
How to test
k apply -f tc.yaml
deploys a cluster with version v5.4.0-pre (3 tidb, 1 tikv, 1 pd).
Write a simple test program that opens n connections concurrently and runs transactions continuously, each transaction simply executing two replace statements. The code is here.
k apply -f test.yaml
runs a pod with the test program. Then modify the annotations of the tidb spec to trigger a rolling update of tidb.
The number of failed transactions after restarting the three tidbs is 1275. Note that some of the errors are reported during begin, and some while executing statements.
Why it failed
To understand why it fails, we need to understand how tidb handles exit and how the go sql driver works.
How tidb handles exits
Let's take a look at the code that handles signal exit. Here, graceful will be true only if the signal is QUIT (it is unclear why); we can ignore the false case for now, because we send SIGTERM to stop tidb and then send SIGKILL directly after a certain period of time. The rest of the shutdown work then happens in svr.Close().
svr.Close() mainly does the following work (code): it sets inShutdownMode to true and waits for s.cfg.GracefulWaitBeforeShutdown, so that the LB can discover and remove this tidb first, and then closes the Listeners to reject new connections. cleanup() mainly comes down to its last call, GracefulDown(): s.kickIdleConnection() in GracefulDown() scans s.clients (all maintained connections) and closes any connection that is not currently in a transaction. Note that this check runs once per second, so a connection that is actively running transactions may survive many checks. After gracefulCloseConnectionsTimeout (15s), the connection is closed directly regardless of its current state. The errors reported during exec statement mentioned above all come from this direct close.
go sql driver
Here we use the driver https://github.com/go-sql-driver/mysql. The driver does not manage the connection pool itself; that is done by Go's database/sql package. The driver implements some interfaces in the database/sql/driver package, and it tells the sql package that a connection is invalid (for example, the server side closed it) by returning driver.ErrBadConn, in which case the sql package retries with a new connection. The connection-check logic of go-sql-driver/mysql is mainly in conncheck.go, see pr924. The main idea is to read from the connection in a non-blocking way when a statement is executed for the first time after taking the connection from the pool. If no data can be read and the error is syscall.EAGAIN or syscall.EWOULDBLOCK, the connection is healthy; otherwise ErrBadConn is returned. In the test, the transactions that fail at begin do so because the client side has not yet learned that our server is about to close (or has already closed) the connection, and then fails when running the "START TRANSACTION" statement on it.
Reason summary
You can see that tidb tries to close connections between transactions (or while they are idle). One type of failure is the race between the server closing the connection and the client checking the connection status. The other type is caused by tidb directly closing all remaining connections once gracefulCloseConnectionsTimeout is reached (the part where the exec statement fails).
Optimize
Each connection has a goroutine running func (cc *clientConn) Run(ctx context.Context). What the Run function does is keep reading a packet (a mysql protocol packet) and then processing it. We can change it so that the Run loop of this clientConn itself notices that the server is shutting down and closes the connection at the right time (that is, between transactions), instead of relying on external periodic checks. In this way, each connection is closed immediately after its current transaction finishes, and there is no forced close after the timeout. However, the begin-failure case remains: because of the race, at the moment we close the connection the client may already have used it to send the statement opening a transaction, and sent it successfully at the TCP level. There is nothing we can do about that part.
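A minimal simulation of this change (names like clientConn and the command slice are illustrative stand-ins, not TiDB's actual implementation):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// server holds the shutdown flag that each connection's Run loop checks.
type server struct{ shuttingDown atomic.Bool }

// clientConn is a toy stand-in: the cmds slice plays the role of packets
// read from the wire, and done records what was actually processed.
type clientConn struct {
	srv   *server
	inTxn bool
	done  []string
}

// Run processes commands one by one, but between transactions it checks
// the shutdown flag itself and closes, instead of waiting for an
// external scan or a forced close after a timeout.
func (cc *clientConn) Run(cmds []string) {
	for _, cmd := range cmds {
		if cc.srv.shuttingDown.Load() && !cc.inTxn {
			return // transaction boundary: close the connection now
		}
		switch cmd {
		case "begin":
			cc.inTxn = true
		case "commit":
			cc.inTxn = false
		}
		cc.done = append(cc.done, cmd)
	}
}

func main() {
	srv := &server{}
	cc := &clientConn{srv: srv, inTxn: true} // already mid-transaction
	srv.shuttingDown.Store(true)             // shutdown begins now
	// The ongoing txn ("replace" + "commit") is allowed to finish; the
	// next "begin" is never processed because Run closes at the boundary.
	cc.Run([]string{"replace", "commit", "begin", "replace", "commit"})
	fmt.Println(cc.done) // [replace commit]
}
```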
After modification and retest, failCount: 642, the failure is less than half of the previous one. In order to test the error situation when running transactions that are not very active, I changed the test program to let each goroutine take a connection in the DB to run a transaction and sleep for one second, and then retest:
It can be seen that the optimized version does not have any failures (there is no guarantee that there will be no failures).
A further possible optimization is to close the tcp read side before processing the commit statement on the server side (ref: shutdown, TCPConn.CloseRead). If the driver's connCheck were implemented to check whether the connection's read side is closed, then theoretically, for our test program, there would be no failures at all when rolling-updating tidb.
Summarize
The optimized version of tidb reduces client-side failures caused by restarting tidb when a connection pool is used: it closes each connection after finishing the ongoing txn, and no longer closes connections forcibly once gracefulCloseConnectionsTimeout is exceeded (assuming the transaction itself does not run longer than gracefulCloseConnectionsTimeout).
Check List
Tests
Side effects
Documentation
Release note