-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
database/sql: context cancellation allows statements to execute after rollback #34775
Comments
Standalone repro: package main
import (
"context"
"database/sql"
"fmt"
"log"
"os"
"time"
_ "github.com/go-sql-driver/mysql"
)
func main() {
if len(os.Args) != 2 {
log.Fatalf("usage: %s <dbname>", os.Args[0])
}
dbName := os.Args[1]
db, err := sql.Open("mysql", "root@/"+dbName)
if err != nil {
log.Fatal(err)
}
defer db.Close()
db.SetMaxOpenConns(100)
db.SetMaxIdleConns(100)
if _, err := db.Exec("drop table if exists t"); err != nil {
log.Fatal(err)
}
if _, err := db.Exec("create table t (id int not null auto_increment primary key)"); err != nil {
log.Fatal(err)
}
ins, err := db.Prepare("insert into t (id) values (null)")
if err != nil {
log.Fatal(err)
}
sel, err := db.Prepare("select count(*) from t")
if err != nil {
log.Fatal(err)
}
for i := 0; i < 10; i++ {
go func() {
for i := 0; ; i++ {
runTx(db, ins)
}
}()
}
for range time.Tick(time.Second) {
var n int
r := sel.QueryRow()
if err := r.Scan(&n); err != nil {
fmt.Println("select", err)
continue
}
fmt.Printf("%d rows (should be 0)\n", n)
}
}
func runTx(db *sql.DB, stmt *sql.Stmt) error {
ctx, cancel := context.WithCancel(context.Background())
quit := make(chan struct{})
done := make(chan struct{})
tx, err := db.BeginTx(ctx, nil)
if err != nil {
fmt.Println("begin", err)
return err
}
defer tx.Rollback()
stmt = tx.Stmt(stmt)
go func() {
timer := time.NewTimer(time.Microsecond)
defer timer.Stop()
select {
case <-quit:
case <-timer.C:
}
cancel()
close(done)
}()
defer func() {
close(quit)
<-done
}()
for i := 0; i < 10; i++ {
if _, err := stmt.Exec(); err != nil {
return err
}
}
_ = tx.Rollback()
return nil
} outputs when you should never succeed in inserting any rows, since the transaction is always rolled back. $ go version |
This comment has been minimized.
This comment has been minimized.
|
I'm sorry, I've missed it. I confirm your repro. |
Here's another log trace, via the mysql driver, of driver.Tx.Rollback being called in the middle of a prepared driver.Stmt.ExecContext and wreaking havoc, because that driver is not expecting those calls to be interleaved:
|
It would be great if someone more familiar with this code could take a look. I think I've pretty much nailed down the problem, but I'm not really in a good position to propose a fix to this code which doesn't introduce new problems or break some intricate assumptions. I've confirmed the following race: Goroutine 1: Goroutine 2: Goroutine 1: Goroutine 2: It definitely allows a statement to execute on a connection after a rollback caused by context cancellation, which under mysql autocommits that statement unexpectedly. Examining the code, it seems like we should hold tx.closemu for the duration of tx.Close or tx.Rollback, certainly while we're changing the state of tx.done. I tried the following workaround: diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index 0f5bbc01c9..abc892dc08 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2021,8 +2021,8 @@ var ErrTxDone = errors.New("sql: transaction has already been committed or rolle
func (tx *Tx) close(err error) {
tx.cancel()
- tx.closemu.Lock()
- defer tx.closemu.Unlock()
+ //tx.closemu.Lock()
+ //defer tx.closemu.Unlock()
tx.releaseConn(err)
tx.dc = nil
@@ -2087,6 +2087,10 @@ func (tx *Tx) Commit() error {
}
return tx.ctx.Err()
}
+
+ tx.closemu.Lock()
+ defer tx.closemu.Unlock()
+
if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
return ErrTxDone
}
@@ -2104,6 +2108,12 @@ func (tx *Tx) Commit() error {
// rollback aborts the transaction and optionally forces the pool to discard
// the connection.
func (tx *Tx) rollback(discardConn bool) error {
+ // Problem: We can't lock tx.closemu while an oustanding
+ // Rows holding tx.closemu in R mode.
+ // We must call tx.cancel to release that first.
+ tx.cancel()
+ tx.closemu.Lock()
+ defer tx.closemu.Unlock()
if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
return ErrTxDone
} It was unsuccessful. First, I broke Issues 18719 #18719. The fix for that issue (https://go-review.googlesource.com/c/go/+/35550/) introduced the closemu mutex, which was supposed to be held exclusively while closing a tx and releasing its conn back to the pool, while queries hold closemu in read mode to prevent the close from occurring mid-query. Its test inserted a hook into tx.grabConn which expects to be able to wait for tx.done to change from 0 to 1 while it sits there holding tx.closemu in read mode. That no longer works, but the test could be changed. The fix to Issue 18719 caused another issue: #20575. Namely, *Rows objects hold closemu in read mode, which prevents us from rolling back a transaction without first calling (*Rows).Close. The fix for that issue (https://go-review.googlesource.com/c/go/+/44812/) sticks a call to tx.cancel() in tx.close() prior to obtaining tx.closemu, which the (*Rows) code catches in order to unlock tx.closemu. My attempted fix broke this, because I moved the call to tx.closemu.Lock() earlier. I can somewhat workaround this problem by inserting an extra call to tx.cancel() in tx.Rollback, but that's hacky and introduces a race between tx.rollback and tx.awaitDone also calling tx.rollback. I think I'm close to a reasonable solution, but I don't quite know how to resolve the problems with tx.closemu. We need the exclusion between the various queries we can run on a transaction, and commit/rollback. But, perhaps a mutex is not the right solution to hold across multiple calls while iterating over a resultset. Or perhaps context cancellation is not the right mechanism to reclaim a connection while iterating over a rowset. |
I now have a plausible solution, and a test case that it fixes (in addition to the repro program already posted). I'm awaiting approval from my employer to submit the change for review, but here it is in the meantime if anyone's interested. My company only uses even numbered versions of Go, so I'd like to get this fix into 1.14 if possible: diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index 0f5bbc01c9..a0fd05423c 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2018,12 +2018,10 @@ var ErrTxDone = errors.New("sql: transaction has already been committed or rolle
// close returns the connection to the pool and
// must only be called by Tx.rollback or Tx.Commit.
+//
+// The transaction context must already have been canceled,
+// and closemu must locked.
func (tx *Tx) close(err error) {
- tx.cancel()
-
- tx.closemu.Lock()
- defer tx.closemu.Unlock()
-
tx.releaseConn(err)
tx.dc = nil
tx.txi = nil
@@ -2090,6 +2088,18 @@ func (tx *Tx) Commit() error {
if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
return ErrTxDone
}
+
+ // cancel is called after done transitions from 0 to 1, but
+ // before we attempt to grab closemu, since context cancellation
+ // will cause any operations holding closemu in read mode to
+ // release their locks.
+ tx.cancel()
+
+ // closemu must be acquired exclusively before rolling back to
+ // allow any oustanding operations to finish.
+ tx.closemu.Lock()
+ defer tx.closemu.Unlock()
+
var err error
withLock(tx.dc, func() {
err = tx.txi.Commit()
@@ -2107,6 +2117,18 @@ func (tx *Tx) rollback(discardConn bool) error {
if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
return ErrTxDone
}
+
+ // cancel is called after done transitions from 0 to 1, but
+ // before we attempt to grab closemu, since context cancellation
+ // will cause any operations holding closemu in read mode to
+ // release their locks.
+ tx.cancel()
+
+ // closemu must be acquired exclusively before rolling back to
+ // allow any oustanding operations to finish.
+ tx.closemu.Lock()
+ defer tx.closemu.Unlock()
+
var err error
withLock(tx.dc, func() {
err = tx.txi.Rollback() |
Your patch looks good to me. Would you create a pull request? |
I have a pull request ready to go, as soon as I get the okay regarding
company CLA.
…-Dan
On Fri, Nov 8, 2019 at 4:49 AM Inada Naoki ***@***.***> wrote:
Your patch looks good to me. Would you create a pull request?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#34775?email_source=notifications&email_token=ANNV6MCDQ5L5PHYFIAFE7LTQSVN47A5CNFSM4I6WCSP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDRMTGY#issuecomment-551733659>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ANNV6MH2P5HO7SX4RSZUR5LQSVN47ANCNFSM4I6WCSPQ>
.
|
Change https://golang.org/cl/216240 mentions this issue: |
CL Ready, just waiting for a reviewer. |
I've got you! |
Will this be released as a patch? Are there any mitigations we can apply in the meantime? Having some DB queries executed instead of rolled-back is potentially catastrophic. |
@ire-and-curses so this issue has existed since Go1.12 and unfortunately because it wasn't a regression but rather a long standing bug, the policy is that we can only cherry pick to prior releases only regressions and security issues. Go1.15 release candidates and betas will be made between now and August 2020 when Go1.15 shall be released. Sorry about that. You can however, clone Go, add the patch to a branch and build Go, and backport the respective fixes you'd like. |
@odeke-em I believe this fix satisfies the requirements for a backport according to the minor version release policy because the issue is a serious issue that is causing databases to get into bad state where no work around is possible and updating to the latest release Go 1.14.3 does not fix the issue. If patching Go is the suggested work around that suggests there is no work around in an application. Unfortunately building custom Go isn't an option for everyone as our product is open source and so we'd need to ask everyone who uses it to build with a custom Go. @gopherbot please consider this for backport because this issue breaks production applications resulting in unexpected database state and cannot be fixed with a work around including updating the version of Go to the latest version. |
Backport issue(s) opened: #39101 (for 1.14). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases. |
Awesome, thank you @leighmcculloch for the advocacy and for citing the backporting clause, it will be useful for folks to get it on Go1.14. The cherry pick has been approved and we shall get it up and approved shortly. |
We're hitting this bug in production. Really looking forward to seeing it fixed in 1.14.5. |
Change https://golang.org/cl/242101 mentions this issue: |
Change https://golang.org/cl/242102 mentions this issue: |
Based on #34775 (comment):
The backport policy requires that a fix must either be backported to both 1.14 and 1.13, or neither (see #34536 (comment)). I've opened #40205 for this fix to be considered for backporting to 1.13 as well. |
Change https://golang.org/cl/242522 mentions this issue: |
Thank you for doing that work, @odeke-em. The 1.13 backport issue is still in CherryPickCandidate state, so the release managers will need to decide whether it should be approved for backport. Having the backport CL ready makes that process easier. |
Manually backported the subject CLs, because of lack of Gerrit "forge-author" permissions, but also because the prior cherry picks didn't apply cleanly, due to a tight relation chain. The backport comprises of: * CL 174122 * CL 216197 * CL 223963 * CL 216240 * CL 216241 Note: Due to the restrictions that we cannot retroactively introduce API changes to Go1.13.13 that weren't in Go1.13, the Conn.Validator interface (from CL 174122, CL 223963) isn't exposed, and drivers will just be inspected, for if they have an IsValid() bool method implemented. For a description of the content of each CL: * CL 174122: database/sql: process all Session Resets synchronously Adds a new interface, driver.ConnectionValidator, to allow drivers to signal they should not be used again, separatly from the session resetter interface. This is done now that the session reset is done after the connection is put into the connection pool. Previous behavior attempted to run Session Resets in a background worker. This implementation had two problems: untested performance gains for additional complexity, and failures when the pool size exceeded the connection reset channel buffer size. * CL 216197: database/sql: check conn expiry when returning to pool, not when handing it out With the original connection reuse strategy, it was possible that when a new connection was requested, the pool would wait for an an existing connection to return for re-use in a full connection pool, and then it would check if the returned connection was expired. If the returned connection expired while awaiting re-use, it would return an error to the location requestiong the new connection. The existing call sites requesting a new connection was often the last attempt at returning a connection for a query. This would then result in a failed query. This change ensures that we perform the expiry check right before a connection is inserted back in to the connection pool for while requesting a new connection. If requesting a new connection it will no longer fail due to the connection expiring. * CL 216240: database/sql: prevent Tx statement from committing after rollback It was possible for a Tx that was aborted for rollback asynchronously to execute a query after the rollback had completed on the database, which often would auto commit the query outside of the transaction. By W-locking the tx.closemu prior to issuing the rollback connection it ensures any Tx query either fails or finishes on the Tx, and never after the Tx has rolled back. * CL 216241: database/sql: on Tx rollback, retain connection if driver can reset session Previously the Tx would drop the connection after rolling back from a context cancel. Now if the driver can reset the session, keep the connection. * CL 223963 database/sql: add test for Conn.Validator interface This addresses comments made by Russ after https://golang.org/cl/174122 was merged. It addes a test for the connection validator and renames the interface to just "Validator". Updates #31480 Updates #32530 Updates #32942 Updates #34775 Fixes #40205 Change-Id: I6d7307180b0db0bf159130d91161764cf0f18b58 Reviewed-on: https://go-review.googlesource.com/c/go/+/242522 Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Daniel Theophanes <kardianos@gmail.com>
Manually backported the subject CLs, because of lack of Gerrit "forge-author" permissions, but also because the prior cherry picks didn't apply cleanly, due to a tight relation chain. The backport comprises of: * CL 174122 * CL 216197 * CL 223963 * CL 216240 * CL 216241 Note: Due to the restrictions that we cannot retroactively introduce API changes to Go1.14.6 that weren't in Go1.14, the Conn.Validator interface (from CL 174122, CL 223963) isn't exposed, and drivers will just be inspected, for if they have an IsValid() bool method implemented. For a description of the content of each CL: * CL 174122: database/sql: process all Session Resets synchronously Adds a new interface, driver.ConnectionValidator, to allow drivers to signal they should not be used again, separatly from the session resetter interface. This is done now that the session reset is done after the connection is put into the connection pool. Previous behavior attempted to run Session Resets in a background worker. This implementation had two problems: untested performance gains for additional complexity, and failures when the pool size exceeded the connection reset channel buffer size. * CL 216197: database/sql: check conn expiry when returning to pool, not when handing it out With the original connection reuse strategy, it was possible that when a new connection was requested, the pool would wait for an an existing connection to return for re-use in a full connection pool, and then it would check if the returned connection was expired. If the returned connection expired while awaiting re-use, it would return an error to the location requestiong the new connection. The existing call sites requesting a new connection was often the last attempt at returning a connection for a query. This would then result in a failed query. This change ensures that we perform the expiry check right before a connection is inserted back in to the connection pool for while requesting a new connection. If requesting a new connection it will no longer fail due to the connection expiring. * CL 216240: database/sql: prevent Tx statement from committing after rollback It was possible for a Tx that was aborted for rollback asynchronously to execute a query after the rollback had completed on the database, which often would auto commit the query outside of the transaction. By W-locking the tx.closemu prior to issuing the rollback connection it ensures any Tx query either fails or finishes on the Tx, and never after the Tx has rolled back. * CL 216241: database/sql: on Tx rollback, retain connection if driver can reset session Previously the Tx would drop the connection after rolling back from a context cancel. Now if the driver can reset the session, keep the connection. * CL 223963 database/sql: add test for Conn.Validator interface This addresses comments made by Russ after https://golang.org/cl/174122 was merged. It addes a test for the connection validator and renames the interface to just "Validator". Updates #31480 Updates #32530 Updates #32942 Updates #34775 Fixes #39101 Change-Id: I043d2d724a367588689fd7d6f3cecb39abeb042c Reviewed-on: https://go-review.googlesource.com/c/go/+/242102 Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Daniel Theophanes <kardianos@gmail.com>
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
I haven't been able to run with the latest version, but examination of the code shows that the problem is still there.
What did you do?
Very rarely, I have observed problems resulting from transactions rolled back as a result of context cancellation. Namely, I have observed writes from presumably aborted transactions persist nonetheless. I added enough debugging output to the mysql driver to observe the following sequence:
driver.Tx.Rollback()
driver.Stmt.ExecContext()
driver.Conn.Close()
What did you expect to see?
The driver documentation at https://golang.org/pkg/database/sql/driver/#ConnBeginTx:
Furthermore, the documentation promises that driver connections won't be used concurrently by multiple goroutines:
Conn is a connection to a database. It is not used concurrently by multiple goroutines.
What did you see instead?
I believe that contract was violated and that there is a gap between the "rollback" and the "close" in which a statement was allowed to execute. Since MySQL defaults to autocommit, that statement which we thought was executing within a transaction (which got rolled back) actually executed outside the transaction and committed.
I believe the bug is in tx.rollback in sql.go:
The connection is locked to call driver.Tx.Rollback(), but nothing prevents a statement from executing before we get to tx.close(err).
The text was updated successfully, but these errors were encountered: