Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

packets.go: read tcp [ip-address]: connection timed out #257

Closed
eduncan911 opened this issue Jul 11, 2014 · 58 comments
Closed

packets.go: read tcp [ip-address]: connection timed out #257

eduncan911 opened this issue Jul 11, 2014 · 58 comments

Comments

@eduncan911
Copy link

UPDATE: My resolution was to remove all "Idle" connections from the pool. See this comment:

#257 (comment)

I am currently experiencing a stalling or broken web app after a period of idle between 15 to 48 minutes. The most critical issue is described below:

  • Visit a URL, any url on the site, and load the page completely (as in, the page actually loads and the logs show a complete page has loaded).
  • Close browser, and wait.

A typical request is logged like this:

2014/07/13 15:29:54 INFO template rendering: index
2014/07/13 15:29:54 METRIC, URL: /, HANDLER TIME: 7.2339ms, CTX TIME: 5.0894ms, TOTAL TIME: 12.3258ms

After a long period of time (ranging from 15m to 48m), the system all of a sudden logs these lines below with no interaction - the web app has been idle this entire time:

[MySQL] 2014/07/13 16:00:09 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 16:00:09 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 16:00:10 INFO template rendering: index
2014/07/13 16:00:10 METRIC, URL: /, HANDLER TIME: 8.8574ms, CTX TIME: 31m19.2606723s, TOTAL TIME: 31m19.2695329s

Notice the "TOTAL TIME" is 31 minutes and 19 seconds? Also, notice the MySql driver error that is logged at the same time?

There was no activity / no web request made. The web app was simply idle.

The most critical issue is what comes next after these log messages: _the very next web request is stalls completely, never returning a response_:

user@govm1:~$ wget http://localhost
--2014-07-13 17:11:18--  http://localhost/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... Read error (Connection timed out) in headers.
Retrying.

--2014-07-13 17:26:19--  (try: 2)  http://localhost/
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html.4'

    [ <=>                                                                                  ] 6,310       --.-K/s   in 0.001s

2014-07-13 17:26:20 (9.61 MB/s) - `index.html.4' saved [6310]

And it sits idle, no response, for 15 minutes until wget times out.

Now, if I make a 2nd or 3rd request immediately after this one is stalled and anytime while it is stalled, the go web app responds and returns a full page for other requests. No issues. And then, the cycle starts over from the last request I make and let it site idle.

After this 15m, you can guess exactly what is logged next:

[MySQL] 2014/07/13 17:26:57 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 17:26:57 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 17:26:57 INFO template rendering: index
2014/07/13 17:26:57 METRIC, URL: /, HANDLER TIME: 6.8938ms, CTX TIME: 15m39.1718434s, TOTAL TIME: 15m39.1787398s

Another 15m wait time.

I eliminated Windows Azure, the Cluster VIP and Firewall/Linux VM running the go web app as an issue because I ran wget http://localhost locally on the same box, and I get this "stalled" request that never completes and never sends back anything.

There are a number of factors in my web app so I will try to outline them accordingly.

Using:

  • Go 1.3
  • go-sql-driver/mysql ## Version 1.2 (2014-06-03)
  • Ubuntu 12.04 LTS, ~June 2014 Updates
  • Windows Azure

Do note that the Linux box running MySql is a different Linux box running the cluster of GoLang apps - and they are in separate dedicated Cloud Services. The MySql vm is a single VM, no cluserting.

Here is some related code:

// global handler for our DB
var db *sql.DB

// CLI parameter
var dbdsn string

func init() {

    flag.StringVar(&dbdsn, "dbdsn", "root:root@tcp(localhost:3306)/prod?timeout=5s&tls=false&autocommit=true", "Specifies the MySql DSN connection.")
    flag.Parse()

    var err error
    db, err = sql.Open("mysql", dbdsn)
    if err != nil {
        log.Printf("ERROR in sql.Open(): %v", err)
    }

    //db.SetMaxIdleConns(5)

    // verify the DSN is setup properly1
    err = db.Ping()
    if err != nil {
        panic("PANIC when pinging db: " + err.Error()) // proper error handling instead of panic in your app
    }
}

// **********
// * omitted is the Gorilla MUX router and http handler registrations
// **********

func ArticleHandler(w http.ResponseWriter, r *http.Request, c *Context) (err error) {

    m := NewArticle(c)
    id := c.Vars["id"]

    var pid int
    var title, body, excerpt, date, slug, fi, fv, region, region_slug string
    err = db.QueryRow(
        "SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, p.post_date, p.post_name, "+
            "(SELECT fpim.meta_value FROM wp_postmeta fpim WHERE fpim.meta_key = '_wp_attached_file' AND fpim.post_id = (SELECT fpim2.meta_value FROM wp_postmeta fpim2 WHERE fpim2.post_id = p.ID AND fpim2.meta_key = '_thumbnail_id' LIMIT 1) LIMIT 1) AS featured_image, "+
            "(SELECT fpim3.meta_value FROM wp_postmeta fpim3 WHERE fpim3.meta_key = 'fv_video' AND fpim3.post_id = p.ID LIMIT 1) AS featured_video, "+
            "t.name as region, t.slug as region_slug "+
            "FROM wp_posts p "+
            "JOIN wp_term_relationships tr ON tr.object_id=p.ID "+
            "JOIN wp_term_taxonomy tt ON tt.term_taxonomy_id=tr.term_taxonomy_id "+
            "JOIN wp_terms t ON t.term_id=tt.term_id "+
            "WHERE p.post_name=? AND p.post_type='post' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP()"+
            "AND tr.object_id=p.ID AND tt.parent = (SELECT t3.term_id FROM wp_terms t3 WHERE t3.name=? LIMIT 1) LIMIT 1",
        id, RegionsParentCategory).
        Scan(&pid, &title, &body, &excerpt, &date, &slug, &fi, &fv, &region, &region_slug)

    if err != nil {
        if err == sql.ErrNoRows {

            // snipped code for redirects

            // article was not found
            return handleNotFound(w, r, c)

        } else {
            log.Printf("ERROR in .Scan(): %v", err)
        }
    } else {
        m.Region = Region{
            Name: region,
            Slug: region_slug,
        }
        m.Id = pid
        m.Title = title
        m.Body = template.HTML(body) // render the raw html
        m.Excerpt = excerpt
        m.Datetime = date
        m.Slug = slug
        m.FeaturedImageUrl = fi
        m.FeaturedVideoUrl = fv
    }

    web.RenderTemplate(w, "article", m)
    return
}

5 more DB queries, per request

In addition to this query, my "Context" you see being passed into the handler runs 4 to 6 additional SQL queries. Therefore, each "article" handler that loads runs about 5 to 7 SQL queries, minimal, using the exact same pattern and *db global variable you see above.

Timeouts / errors are always on the same DB query

Here's one of the "context" queries as a comparison:

rows2, err := db.Query(
    "SELECT p.post_title, p.post_name "+
        "FROM wp_posts p "+
        "WHERE p.post_type='page' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP() "+
        "AND p.post_parent = (SELECT p2.ID FROM wp_posts p2 WHERE p2.post_name=? LIMIT 1) "+
        "ORDER BY p.menu_order",
    FooterPagesParentNameSlug)
if err != nil {
    log.Printf("ERROR in AllPages .Query() : %v", err)
} else {
    defer rows2.Close()
    c.AllFooterPages = make([]FooterPage, 0)
    for rows2.Next() {
        var name, slug string
        err := rows2.Scan(&name, &slug)
        if err != nil {
            log.Printf("ERROR in AllPages row.Scan() : %v", err)
        } else {
            p := FooterPage{
                Page: Page{
                    Title: name,
                    Slug:  slug,
                },
            }
            c.AllFooterPages = append(c.AllFooterPages, p)
        }
    }
}

Nothing special there.

I do call defer rows2.Close() only if there was no error. Perhaps that is part of the issue? This particular SQL query seems to log errors under load tests as no response or mysql driver timing out.

Questions

Why am I getting request timeouts logged in excess of 15 to 30 minutes, from an idle site? That seems like a bug with the mysql driver I am using, possibly holding a connection open. But, the last http request was successful and returned a complete page + template.

I even have the Timeout set in the connection string, which is 5 seconds. Even if it is a problem with the mysql server, why the 15 minute timeout/request logged? Where did that request come from?

It still could be a MySql driver issue, blocking the request from completing - maybe being blocked by the MySql dedicated VM and an issue there. If that is the case, then how come nothing is logged? What is this random timeout of 15m to 49m minutes? It is usually only 15m or 31m, but sometimes 48m is logged.

It is very interesting on the "15m" multiples there in the timeouts (@15m, 31m and 48m), allotting for some padding there in seconds.

Thanks in advance.

@eduncan911 eduncan911 changed the title packets.go: read tcp [ip-address]: connection timed out [MySQL] packets.go: read tcp [ip-address]: connection timed out Jul 11, 2014
@arnehormann
Copy link
Member

Regarding your biggest question: read this part of our README and pay special attention to the link at the end.
If I guess right, setting that parameter may help you and do what you expect from timeout. What you want is probably one of the server side timeouts, just read through the linked MySQL documentation for the matching one.

Answer 1: No, you should only Open() and Close() the DB once.
It's a connection pool, not a single connection. You only need one per user and server.
Your main problem with the code above is to defer Close() in init - do not do that. Defer it in main.
The deferred function is called when the scope is left - putting it in init will not help you there as init is done and Close is called before main is executed.

Answer 2: I don't know what problems you have with Prepare, but I guess it's that it is connected with the opening and closing of the connection pool. The prepared statements are invalid when the connection pool they are prepared in is closed.

Please get back to us here if that didn't help you.

@eduncan911
Copy link
Author

Thanks Arne. I have updated the SO with more debugging. It is an issue with the go app and/or the mysql driver explicitly. Please take another look at the SO as I completely re-worded it just now.

Also, I do set timeout=5s in my connection string. You can see the default one in the above example, and in production I am running:

./wwwgo -dbdsn="user:pwd@tcp(#.#.#.#:3306)/production_db?timeout=5s&tls=false&autocommit=true"

With the same timeout of 5s.

But as I stated in the updated SO, there is an interval of 15m timeouts (15m, 31m and 48m) that is happening, stalling the entire web request.

I don't think this issue should be closed. It still may be the MySql VM issue in Azure; but even then, why is there a 15 minute timeout happening in the mysql driver/packets?

Notice the logged datetime below from the MySql driver, and then the 31 minute request logged:

[MySQL] 2014/07/13 16:59:29 packets.go:32: read tcp #.#.#.#:3306: connection timed out
[MySQL] 2014/07/13 16:59:29 packets.go:118: write tcp #.#.#.#:3306: broken pipe
2014/07/13 16:59:29 INFO template rendering: article
2014/07/13 16:59:29 METRIC, URL: /new-york-city/test-1, HANDLER TIME: 4.5973ms, CTX TIME: 31m13.9383742s, TOTAL TIME: 31m13.9429739s

@julienschmidt julienschmidt reopened this Jul 13, 2014
@julienschmidt julienschmidt changed the title [MySQL] packets.go: read tcp [ip-address]: connection timed out packets.go: read tcp [ip-address]: connection timed out Jul 13, 2014
@eduncan911
Copy link
Author

I've updated the original post with the exact details now after more debugging.

I've confirmed the MySql driver is the one blocking for 15 minutes.

Once I make 5 to 8 requests to the homepage, I then let the wwwgo app sit idle for at least 30m. I then attempt to make a wget request and it gets blocked on db.Query().

user@govm1:~$ wget http://localhost
--2014-07-14 00:59:13--  http://localhost/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response...

And that's it. It sits idle until a wget error of a timeout response for 15 minutes. Do note that my other 15 minutes came from web browser requests, not wget.

I added additional logging and when the following line executes, it blocks the handler (web request) for 15 minutes. See my previous comment for those errors.

rows, err := db.Query(
    "SELECT t.name, t.slug "+
        "FROM wp_terms t "+
        "JOIN wp_term_taxonomy tt ON tt.term_id = t.term_id "+
        "WHERE tt.taxonomy = 'category' AND tt.parent = (SELECT t2.term_id FROM wp_terms t2 WHERE t2.name=? LIMIT 1)",
    RegionsParentCategory)

When that line executes, it blocks the method for 15 minutes.

Then after waiting with no activity, all of the sudden this is logged below. It happens in one of the 15m intervals previously mentioned (15m, 31m or 48m), this is logged:

[MySQL] 2014/07/14 01:14:52 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/14 01:14:52 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/14 01:14:52 INFO template rendering: index
2014/07/14 01:14:52 METRIC, END URL: /, HANDLER TIME: 7.9183ms, CTX TIME: 15m39.2565228s, TOTAL TIME: 15m39.2644567s

This is obviously a timeout in the MySql driver, perhaps in the packets.go package around line 32.

Again, this may be an issue with my MySql VM I have running causing a tcp issue. But even so, there is a 15 minute timeout happening in the Go application using this MySql driver. That should not be happening.

Ps: note the datetimes logged above. There was no activity (no web requests) on the go app during this period. It just sat and waited, in idle.

@arnehormann
Copy link
Member

@eduncan911 do you use a custom Dialer?

I only ask because the timeout parameter is not used by the driver directly, it is handed through to net.Dialer and any errors are propagated to database/sql.
If you use a custom dialer, the timeout is only included if you add it yourself.

@eduncan911
Copy link
Author

Not that I know of. Not using the "net" package, other than HTTP.

Packages:

"fmt"
"database/sql"
"flag"
"github.com/go-sql-driver/mysql"
"github.com/gorilla/mux"
"github.com/justinas/nosurf"
"html/template"
"log"
"net/http"
"runtime"
"time"

And the app/main is pretty boiler plate standard website setup stuff.

var router = mux.NewRouter()

region := router.PathPrefix("/{region}/").Subrouter()
region.HandleFunc("/{id}", defaultHandler(ArticleHandler)).Name("article")

csrfHandler := nosurf.New(http.DefaultServeMux)
csrfHandler.ExemptRegexp("/assets(.*)")
csrfHandler.ExemptRegexp("/favicon.ico")
csrfHandler.SetFailureHandler(http.HandlerFunc(csrfFailedHandler))

The defaultHandler is basically a wrapper around a normal http handler, that creates a new context and passes it into the new handlers. That's all it is.

@eduncan911
Copy link
Author

So I had a theory... Since the MySql query is stalling on execution of an "idle" application for a long period of time, perhaps Windows Azure does something to that TCP connection. Was there a persistent connection in the goland mysql driver that maybe Azure doesn't like across cloud services? (a cloud service is a collection of resources for scaling. e.g. all web servers in one cloud service to auto-scale up and down, all database servers in another cloud service, all backend jobs processing in another cloud service, etc). Perhaps Azure didn't like a new tcp connection after being idle for so long?

So i set out to investigate these... Also, the SetMaxIdleConns() had me courious...

sudo netstat -tnp | grep wwwgo

Sure enough, this showed 1 single persistent connection from my wwwgo app to my mysql IP address and 3306 port - even after letting it sit for several hours without any interation, it was still open and persistent.

This got me thinking... Maybe Azure didn't like long running tcp connections with no activity. So i set these:

db.SetMaxIdleConns(0)
db.SetMaxOpenConns(500) // matches my MySql my.cnf file settings

Running netstat shows no persistent connection now. And I have not seen the original problem since! I cannot reproduce the original problem any longer! Not only that, but now load-testing the site no longer gives too many files open error on a single CPu instance, nor do I have the i/o socket error error i got on multi-core MAXCPUS setting. Cool.

So the issue seems to be with a long running persistent connection that never closes.

I know I know.. The next words out of anyone's mouth are, "Are you sure you are calling rows.Close(), and that you are iterating over the entire collection? That will hold the connection open if you don't!"

The answer is that I either do Scan() on single rows, or I follow this pattern for rows.Close() and iterating:

rows2, err := db.Query(
    "SELECT p.post_title, p.post_name "+
        "FROM wp_posts p "+
        "WHERE p.post_type='page' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP() "+
        "AND p.post_parent = (SELECT p2.ID FROM wp_posts p2 WHERE p2.post_name=? LIMIT 1) "+
        "ORDER BY p.menu_order",
    FooterPagesParentNameSlug)
if err != nil {
    log.Printf("ERROR in AllPages .Query() : %v", err)
} else {
    defer rows2.Close()
    c.AllFooterPages = make([]FooterPage, 0)
    for rows2.Next() {
        var name, slug string
        err := rows2.Scan(&name, &slug)
        if err != nil {
            log.Printf("ERROR in AllPages row.Scan() : %v", err)
        } else {
            p := FooterPage{
                Page: Page{
                    Title: name,
                    Slug:  slug,
                },
            }
            c.AllFooterPages = append(c.AllFooterPages, p)
        }
    }
}

Note that I call defer rows2.Close() only if there was no error. I have also guaranteed I am always iterating and always rows.Close() in this pattern. Some may say, "well, there still may be data in the rows and you should iterate over them." well, during my load tests today of 1000 concurrent users off and on for 15m intervals several times today, there was only the 1 persistent connection open to mysql at all times. This tells me that rows.Close() is being called ,and I am iterating over all rows - else, I would have more persistent connections open, right?

Shouldn't a long-running idle connection be dropped?

Entity Framework, for Microsoft SQL Server, doesn't do this "long running" idle connection pooling. It closes all queries.

If it is to remain open, shouldn't there be a timed "ping" across the wire to keep it active?

Similar to how Android tcp connections are handled in where you constantly send a ping to keep the connection alive.

@arnehormann
Copy link
Member

It's great that you found the solution yourself. Congrats!

Still, this is out of scope for drivers.

SetMaxIdleConns and SetMaxOpenConns are database/sql functions. database/sql uses and wraps the database/sql/driver implementations provided by the driver. If the number of idle connections is more than 0, there will be open connections until the pool is closed.

The driver itself does not provide any pooling functionality, it is based on simple connections.
We do enable TCP keepalive by default.

Sending a ping-like query at regular intervals would require an additional goroutine per connection and synchronisation with any other query running at the same time. All of this would complicate maintenance of the driver and degrade its performance.

I also have a hunch Azure is misbehaving here.

IMO, we should not make any changes in the driver based on this issue.

@julienschmidt close if you agree, take over if you don't 😀

@eduncan911
Copy link
Author

When I ran netstat on the mysql VM, it did not show the connection - only the go app showed a connection

What I suspect is the packets.go code, or the mysql driver code that is calling it, isn't detecting that the connection has been broken by the network - nor is it timing out. It continues to think there is a connection open, and to attempts to use it an hour later.

This theory matches the exact sympons I originally had: after a long idle, attempting to make a SQL query blocks for 15m. But during that block, even 5 seconds after its was initially blocked, I could make another query and it was OK. I suspect a 2nd connection was created in the pool, which worked fine.

Therefore two issues need to be addressed here:

  • failure to detect a broken idle TCP pipe/connection
  • a 15 minute timeout, waiting for that TCP pipe to return data when it is reused (e.g. a new SQL query is issued).

If neither of these are part of the driver but instead the underlying Go packages, then I am more than happy to close the issue and take these two issues to the go team.

But I believe we can agree that there are two known issues here that do need to be addressed.

@arnehormann
Copy link
Member

@eduncan911 I think your theory is right. This is what's happening:

  • you issue the query. It is sent to the server.
  • something is broken - you wait 15 min for the reply.
  • packets.go code calls buffer.go and requests more data.
  • buffer.go fulfils this request by reading from an io.Reader - the connection to the server.
  • reading from the tcp connection fails and an error is returned
  • packets.go logs this and informs database/sql that the connection is stale by returning driver.ErrBadConn

The mystery of the second log line (packets.go:118) is the call to Close after logging the first error. We could make that a little smarter, but I don't think it's really a bug.

What is broken?
My new suspicion: running your queries takes more than 5 seconds, they always time out unless they are in the query cache. That also explains why you see multiple queries - the first ones fail.
For the ones after, database/sql establishes a new connection and tries again (10 times by default) if a connection fails.

Can you please try what happens if you remove the timeout from the dsn?

@eduncan911
Copy link
Author

Interesting. I'll remove the timeout; but, on average my pages load in 12ms overall, including all 7 SQL queries ran synchronously (I was going to move to channels later). It is lightning fast.

Therefore, the 5 second thought i don't think is valid - unless that accounts for dozens of page requests back to back - so the connection stays open for longer than 5 seconds as it is being used over and over again. But, even that doesn't hold water - as I can 1) start the go app, 2) make 1 single request (which ends in 38ms, for the first request at the start of the go app), and 3) wait an hour. The pipe is broken, so after 1 hour when I attempt to make a 2nd request - the first initial SQL query is blocked in this state.

When you said:

  • you issue the query. It is sent to the server.
  • something is broken - you wait 15 min for the reply.
  • packets.go code calls buffer.go and requests more data.
  • buffer.go fulfils this request by reading from an io.Reader - the connection to the server.
  • reading from the tcp connection fails and an error is returned
  • packets.go logs this and informs database/sql that the connection is stale by returning driver.ErrBadConn

Let me clarify it below...

  • you make a single request, page loads fine and all queries run within 38ms.
  • you wait 1 hour with the MySql idle connection remaining open on tcp port 3306.
  • you attempt to issue a new query, but nothing is sent to the server because the tcp pipe of the idle connection is now broken - it does not exist on the mysql VM, but yet the VM running the go app things it is still open as netstat shows (it is broken though and does not exist on the mysql VM).
  • something is broken - you wait 15 min for a timeout
  • packets.go code calls buffer.go and requests more data.
  • buffer.go fulfils this request by reading from an io.Reader - the connection to the server.
  • reading from the tcp connection fails and an error is returned
  • packets.go logs this and informs database/sql that the connection is stale by returning driver.ErrBadConn

And if that ErrBadConn is indeed this error:

`[MySQL] 2014/07/13 16:00:09 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out'

Then yes, that all sounds right.

@arnehormann
Copy link
Member

Please add if err != nil { panic(err) } to your code wherever this bug happens.
I'd like to see the full stacktrace for the driver - to get a clue in which connection phase this happens. You may edit out your own stuff, but I need it for the driver - readPacket from packets.go is called by a lot of different functions...
It's weird that you fail to send a query but the error happens when a packet is read. Could be in the handshake phase when a new connection is established - but I can't see why that should fail due to a timeout.

@eduncan911
Copy link
Author

Will do. Let me schedule some downtime... 👍

@arnehormann
Copy link
Member

cough of course you can also use debug.PrintStack and guard it with some kind of command line argument.
Probably better if you have to test in the live environment.

@arnehormann
Copy link
Member

@eduncan911 small but relevant correction - the PrintStack must be in the driver, not in your code. Just put it below the log line in packets.go.

@coldfire-x
Copy link

is there any progress yet?

I got this error in our production server.

@arnehormann
Copy link
Member

No progress on our side.
I don't get this error and don't have a reproducer, so I have to rely on users seeing the crashes.

If you also get this error, please edit your driver version, add PrintStack to packets.go as I described above.
When you get the stacktrace, add it in a comment.
I have to know where this happens before I can fix it.

@eduncan911
Copy link
Author

Sorry I haven't been able to "crash production" yet for an update to this issue. It's been fine since setting the SetMaxIdleConns() to zero, even under load tests across 3 go instances on 3 VMs and 1 MySql backend.

Quite surprised actually that MySql can take a beating like that on a single core VM with a limit of 300 IOPS for the VHD - it got about 700 RPS with heavy 5/6 queries per request, across 3 Go instances on 3 VMs. I account it to the 1.5 GB of memory the VM has since they are all READ queries. That's before I add indexes. I always maximize code performance and queries first, before moving to caching.

@coldfire-x
Copy link

actually, you can re-produce this debug by set mysql's wait_timeout = 2, then, after 2 seconds, issue some sql commands.

the stack log is:

/home/pengfei/Codes/Go/src/github.com/go-sql-driver/mysql/packets.go:33 (0x4a6673)
com/go-sql-driver/mysql.(_mysqlConn).readPacket: debug.PrintStack()
/home/pengfei/Codes/Go/src/github.com/go-sql-driver/mysql/packets.go:660 (0x4a9f2a)
com/go-sql-driver/mysql.(_mysqlStmt).readPrepareResultPacket: data, err := stmt.mc.readPacket()
/home/pengfei/Codes/Go/src/github.com/go-sql-driver/mysql/connection.go:148 (0x4a39c4)
com/go-sql-driver/mysql.(_mysqlConn).Prepare: columnCount, err := stmt.readPrepareResultPacket()
/usr/lib/go/src/pkg/database/sql/sql.go:945 (0x42fbbc)
(_DB).queryConn: si, err := dc.ci.Prepare(query)
/usr/lib/go/src/pkg/database/sql/sql.go:913 (0x42f8dd)
(_DB).query: return db.queryConn(ci, ci.releaseConn, query, args)
/usr/lib/go/src/pkg/database/sql/sql.go:899 (0x42f75b)
(_DB).Query: rows, err = db.query(query, args)
/usr/lib/go/src/pkg/database/sql/sql.go:977 (0x430037)
(_DB).QueryRow: rows, err := db.Query(query, args...)
/home/pengfei/Codes/Go/src/oupeng.com/mop/dwz/dwz.go:175 (0x40240d)
getUrl: db.QueryRow("SELECT url FROM url WHERE id = ?", id).Scan(&url)
/home/pengfei/Codes/Go/src/oupeng.com/mop/dwz/dwz.go:60 (0x400dcc)
(_DwzHandler).ServeHTTP: url := getUrl(hash)
/home/pengfei/Codes/Go/src/oupeng.com/mop/dwz/middlewares/middlewares.go:23 (0x4b438f)
com/mop/dwz/middlewares.(_CrossDomain).ServeHTTP: c.handler.ServeHTTP(w, r)
/usr/lib/go/src/pkg/net/http/server.go:1597 (0x46204e)
serverHandler.ServeHTTP: handler.ServeHTTP(rw, req)
/usr/lib/go/src/pkg/net/http/server.go:1167 (0x460057)
(_conn).serve: serverHandler{c.server}.ServeHTTP(w, w.req)
/usr/lib/go/src/pkg/runtime/proc.c:1394 (0x419540)
goexit: runtime·goexit(void)
[MySQL] 2014/07/25 12:42:30 packets.go:34: EOF
[MySQL] 2014/07/25 12:42:30 packets.go:120: write tcp 127.0.0.1:3306: broken pipe

@coldfire-x
Copy link

mysql closes the connection after a period of inactive time.

@ernestoalejo
Copy link

I can reproduce the "write part" of this issue consistenly in a vagrant machine and in production:

[MySQL] 2014/08/27 19:39:32 packets.go:118: write unix /var/lib/mysql/mysql.sock: broken pipe

@eduncan911
Copy link
Author

I found it to be the state of the connection pool, where the pool thinks there still is a connection open. But in fact, it has been closed remotely by the mysqld.

Scroll up for my entire story and debugging: I verified this with netstat where I saw the Go application having a connection to port 3306 on the remote server. But yet, the remote server running MySql no longer had any open pipes (after the timeout I noted above). I did verify that upon making the application active again, opening many mysql connections, that I saw both open pipes on both servers. But again, after some time, the remote mysqld server would drop the open "idle" connection (still not sure if it is mysqld, or Windows Azure doing it).

The root problem is that the net package, and/or this mysql driver, does not detect the dropped connection and thinks that there is still an open idle connection - there isn't. So upon the next attempt by the mysql driver to make a query, with this stale idle connection that doesn't exist, it creates a block and eventually times out after a long period of time. Any additional hits to the mysql driver for queries works fine, as the connection pool simply creates new connections going forward - after that 1 idle connection (e.g. pooling the connections, use the idle open one for the first 1, then create new connections). New connections are fine; it is that idle one that is the problem because it simply isn't connected.

I resolved my issues by getting rid of all idle connections. See my posts above for the history and resolutions. But in short:

db.SetMaxIdleConns(0) // this is the root problem! set it to 0 to remove all idle connections
db.SetMaxOpenConns(500) // or whatever is appropriate for your setup.

@ernestoalejo
Copy link

It's mysqld, as I'm only using Ubuntu/CentOS and MySQL all in the same machine (both vagrant and production).

Getting rid of all idle connections it's obviously a very good option (hasn't any appreciable performance hit in my responses times). Anyway I think we should bring this to the end and find out if it's a library problem or an issue of the standard one that should be reported and fixed.

At the very least if there's no way to solve this it should be documented where appropriate that a pool of permanent connections cannot be used with this library.

Anyway thanks for writing the solution in this issue explicitly.

@arnehormann
Copy link
Member

Anyone in this thread:
please change http://play.golang.org/p/RXeKHioTpo so it reliably triggers the problem and post the link to your version here when it does.
I have never hit this bug, even after investing quite some time in it.
Help me with a reproducer so this can finally get closed.

UPDATE if it only happens on a VPS or in a specific environment I'd like to have a look there. If so, I can send you my public key for access.

@eduncan911
Copy link
Author

@arnehormann Put a delay in for 15 or 20 minutes after the first query.

then attempt 2 SQL queries after that 20 minutes in two go routines (GOMAXPROCS = 1, so you can somewhat control the order they run in). you want to wait for the amount of time that it takes for mysql to drop that idle connection.

the first query after 20 minutes will eventually timeout/error after several minutes of waiting - the reason is that it attempts to use what the driver/netpipes thinks is an "idle" live-still-connected connection in the pool. but, mysql has dropped that connection remotely already and it doesn't exist.

the second query will execute almost instantly, and will return without error. you'll still be waiting for the first query to timeout over several minutes. the second query works fine because it did NOT use an idle connection; but instead, the pool created a new tcp connection for the query. this is why it works.

^- as long as you have set idle connections to 1, that is.

it may also require mysql to be installed a remote machine. perhaps mysql breaks the long-idle connection for remote connections only after a period of time (where perhaps long-idle connections locally are fine).

@arnehormann
Copy link
Member

@eduncan911 thanks, but please don't describe it - change my program linked above so it reliably hangs on your own machine when you start it there.
And I'd be glad if that happens faster than after 15 minutes, @pengfei-xue indicated in can be done after 2 seconds.
I just want a fast feedback loop when I debug this. And I don't want to guess concerning the details of how to hit the bug. I'd like some code, please...

@derekperkins
Copy link

👍 I think I'm experiencing this exact same issue on Google Cloud SQL, but my requests time out at 100 seconds, so I can't verify that they are lasting 15+ minutes. The symptoms appear to be the same, though because the request is halted, I don't get a concrete error back.

@arnehormann
Copy link
Member

I have an idea how to tackle this issue, but it's a little too brittle for my taste. I still don't see another way after a lot of hard thinking.

As I see it, the cause is the server cutting of the connection in a way that's not easily detectable by the client (same problem as ripping out the network cable). To detect this faster and to let the server know we are still there, we need a MySQL protocol based keepalive on the same connection, the TCP keepalive with it's overly long intervals doesn't help.

Ironically, @xaprb recently published a blog post praising our driver for not using timeouts... but what I propose is different to what other connectors do.

My idea:
the time of the last successfull Read can be used as a reliable liveness signal. If the time since last Read exceeds a keepalive duration, we have to issue a lightweight query (CMD_PING is most lightweight, Exec("DO 1") is a close second that's not a command packet) on the same MySQL connection leading to a Read. If that does not succeed, reconnect or close the connection.

Steps to do this:
Add fields to mysqlConn: lastRead time.Time, claimed bool and sync.Mutex. sync/atomic/... would be even better, but that probably won't do. Update lastRead whenever a Read happens.
Make existing code threadsafe enough for this (that's easy to get wrong and hard to check) - add locking and updates of claim and lastRead.
Introduce keepalive=<time.Duration interval> in the DSN.
If keepalive is set, spin up a goroutine when Open is called. It should have a time.Tick channel with that interval looping as long as the connection is not closed. Inside the loop, continue if the connection is claimed or if lastRead + interval > time.Now(). Otherwise, claim the connection and do a full CMD_PING roundtrip - if that does not finish before the next Tick, reconnect or call Close.
Make sure that goroutine exits when the connection is closed.
Claiming is needed so we never ever do this if the sequence id is going to be incremented as a result of the last command. I don't think calling CMD_PING is ok if we are inside a still open sequence...

Downsides:
Easy to get wrong, has to be kept in mind for all bugfixes, feature updates and additions (exactly like #302).
Has a (low) performance impact on connections without keepalive; locking and lastRead updates are required.
Introduces a small delay if the CMD_PING roundtrip is still active while a query is executed.
Adds network traffic (though that's negligable with a keepalive interval of 1 minute or so).
If we are in a sequence and the connection is claimed, the keepalive checks do not happen. Being slow in result processing can still lead to the error.

@julienschmidt
Copy link
Member

Ok TL;DR:

From #257 (comment)

  1. The TCP connection is killed by some network device / server but it doesn't tell the other end about it. This leaves the TCP connection in ESTABLISHED on the client's end while its state is actually gone from the server.
  2. Driver calls net.Conn.Write to send a query over, which ultimately effects a write() system call. The data fits into the OS' buffers, and thus the call doesn't block.
  3. Now, with the data in its buffers, the operating system's TCP stack starts sending the data to the remote end.
    4.The driver calls net.Conn.Read. This will block since there's nothing to read yet since the server hasn't replied (it hasn't even received the query, since there's no healthy connection).
  4. The TCP stack tries to send, re-send, re-send, etc. the data to the server. It will keep doing this for some specific amount of time (it was 15 minutes on our out-of-the-box Linux installation) before giving up, and then tells the client that the connection timed out.

So what we have to avoid is that we write to an connection, which we think is still alive but in reality the server closed it already.

We could could also do something like golang/go#9851 driver side by returning ErrBadConn when time.Now() - conn.LastCmdTime > maxIdleTime

And regarding Arne's keepalive feature... Déjà-vu? 9d66799

@arnehormann
Copy link
Member

No Déjà-vu. My first commit was 2 or 3 months later - and I'm not much of a git historian 😀

@ghost
Copy link

ghost commented Dec 15, 2015

#394
If anyone interest, this pr should help.

@eduncan911
Copy link
Author

👍

@guotie
Copy link

guotie commented Jun 24, 2016

I have set the SetMaxIdleConns(0), And there are no acitve connections, but after a long peroid, the response is still long, about 19 seconds.

@steviesama
Copy link

steviesama commented Feb 20, 2017

@eduncan911

I was having a problem that intuitively seems like what you are describing. I am writing a login/authentication micro-service, and frankly just the initial prototype that generically checks against a user/pass fetching from the database was resulting in read time out, then a broken pipe, then a write time out on different lines.

I have since tried your solution of:

db.SetMaxIdleConns(0)
db.SetMaxOpenConns(151)

...and that seemed to do the trick. I was checking it for 1-2 hours later, and it wasn't hanging after the initial request. But I left the go server running over night and when I checked this morning, the first request failed. However, I don't get the errors logged in the console, but the behavior is identical, just a MUCH longer interval between the breaks.

The long period of time seems to be a trait from the post above me, but the response isn't long, it just breaks or blocks.

I figure I can probably work around this since the micro-service is for private consumption and I can just issue single retry from JS on the front-end since I can expect 1 failure on requests that make a trip to the database and the subsequent request will succeed, but clearly that is tacky and would prefer to avoid that.

Have you any more experience with this problem over a longer period of time of it running after an initial request from a connection?

@methane
Copy link
Member

methane commented Feb 20, 2017

@steviesama
Copy link

@methane

I am currently using the following from another solution I tried before this:

db.SetConnMaxLifetime(time.Second * 14400)

...but I'm not sure if you are suggesting to use it to extend it, or to set it to zero for infinite reuse?

@methane
Copy link
Member

methane commented Feb 20, 2017

14400 sec is too long. One minutes is enough for most use cases.
Even if you configure entire your DC (OS, switch, router, etc...), TCP connection may be lost from various reasons. (bug in router firmware, unstable power voltage, electric nose, etc...)

@steviesama
Copy link

@methane

Yeah, I didn't want it to be that long, but I had it in there from a previous solution attempt. I'll remove it and see what happens.

@eduncan911
Copy link
Author

eduncan911 commented Feb 21, 2017

@steviesama for the record, we migrated off to non-mysql datastores for that project. i haven't been part of projects that used mysql w/Go since.

@pjebs
Copy link

pjebs commented Oct 19, 2017

Had the same problem:
Didn't want to set SetMaxIdleConns to 0 since that would effectively disable connection pooling.
What I did was changed SetConnMaxLifetime to a value that was less than the setting on MYSQL server that closed connections that were idle. MySQL was set to 10seconds before it closed an idle connection so I changed it to 9 seconds. Bug was fixed. The default value keeps idle connections as long as possible, so the connection pool thinks a particular connection is alive when in fact MySQL had already closed it. The connection pool then attempts to use the connection and you get the error.

shusson added a commit to shusson/variantstore that referenced this issue Nov 26, 2017
@eexit
Copy link

eexit commented Mar 2, 2018

Hello guys,

From what I understood, the SetConnMaxLifetime overrides the MySQL value wait_timeout, right? I mean, default MySQL wait_timeout is 28800 so if I set, let's say, 28790 in SetConnMaxLifetime, it should never throw a write: connection timed out error.

However, I still get the error. Any insight here?
Thanks!

@methane
Copy link
Member

methane commented Mar 2, 2018

@eexit Did you read this comment?

@eexit
Copy link

eexit commented Mar 2, 2018

@methane Will try that, thanks!

@pinylin
Copy link

pinylin commented Aug 16, 2018

@arnehormann thanks,your answer is helpful

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests