packets.go: read tcp [ip-address]: connection timed out #257
Regarding your biggest question: read this part of our README and pay special attention to the link at the end.

Answer 1: No, you should only

Answer 2: I don't know what problems you have with Prepare, but I guess it's connected with the opening and closing of the connection pool. The prepared statements are invalid when the connection pool they are prepared in is closed.

Please get back to us here if that didn't help you.
Thanks Arne. I have updated the SO with more debugging. It is specifically an issue with the go app and/or the mysql driver. Please take another look at the SO as I completely re-worded it just now. Also, I do set
With the same timeout of 5s. But as I stated in the updated SO, there is a pattern of roughly 15-minute timeouts (15m, 31m and 48m) stalling the entire web request. I don't think this issue should be closed. It still may be the MySql VM issue in Azure; but even then, why is there a 15 minute timeout happening in the mysql driver/packets? Notice the logged datetime below from the MySql driver, and then the 31 minute request logged:
I've updated the original post with the exact details now after more debugging. I've confirmed the MySql driver is the one blocking for 15 minutes. Once I make 5 to 8 requests to the homepage, I then let the wwwgo app sit idle for at least 30m. I then attempt to make a wget request and it gets blocked on db.Query().
And that's it. It sits there idle until wget times out with an error after 15 minutes. Do note that my other 15-minute timeouts came from web browser requests, not wget. I added additional logging, and when the following line executes, it blocks the handler (web request) for 15 minutes. See my previous comment for those errors.
When that line executes, it blocks the method for 15 minutes. Then, after waiting with no activity, all of a sudden the following is logged. It happens at one of the 15m intervals previously mentioned (15m, 31m or 48m):
This is obviously a timeout in the MySql driver, perhaps in the

Again, this may be an issue with my MySql VM causing a tcp issue. But even so, there is a 15 minute timeout happening in the Go application using this MySql driver. That should not be happening.

P.S.: note the datetimes logged above. There was no activity (no web requests) on the go app during this period. It just sat and waited, idle.
@eduncan911 do you use a custom Dialer? I only ask because the
Not that I know of. Not using the "net" package, other than HTTP. Packages:
And the app/main is pretty boilerplate, standard website setup stuff.
The
So I had a theory... Since the MySql query is stalling on execution after the application has been "idle" for a long period of time, perhaps Windows Azure does something to that TCP connection. Was there a persistent connection in the golang mysql driver that maybe Azure doesn't like across cloud services? (a

So I set out to investigate these... Also, the SetMaxIdleConns() had me curious...
Sure enough, this showed 1 single persistent connection from my wwwgo app to my mysql IP address and 3306 port - even after letting it sit for several hours without any interaction, it was still open and persistent. This got me thinking... Maybe Azure didn't like long running tcp connections with no activity. So I set these:
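For illustration, a minimal sketch of the kind of change described here. Per the later comments in this thread, SetMaxIdleConns(0) was the key setting; the DSN and everything else in this snippet are placeholders, not the actual application code:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; timeout=5s is the dial timeout mentioned earlier in the thread.
	db, err := sql.Open("mysql", "user:pass@tcp(mysql-host:3306)/dbname?timeout=5s")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Keep no idle connections around, so a long-idle connection that was
	// silently dropped by the server (or Azure) can never be handed back
	// out of the pool.
	db.SetMaxIdleConns(0)
}
```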
Running
So the issue seems to be with a long running persistent connection that never closes. I know, I know... The next words out of anyone's mouth are, "Are you sure you are calling rows.Close(), and that you are iterating over the entire collection? That will hold the connection open if you don't!" The answer is that I either do
Note that I call defer

Shouldn't a long-running idle connection be dropped? Entity Framework, for Microsoft SQL Server, doesn't do this "long running" idle connection pooling. It closes all queries.

If it is to remain open, shouldn't there be a timed "ping" across the wire to keep it active? Similar to how Android tcp connections are handled in
where you constantly send a ping to keep the connection alive.
It's great that you found the solution yourself. Congrats! Still, this is out of scope for drivers.
The driver itself does not provide any pooling functionality, it is based on simple connections. Sending a ping-like query at regular intervals would require an additional goroutine per connection and synchronisation with any other query running at the same time. All of this would complicate maintenance of the driver and degrade its performance. I also have a hunch Azure is misbehaving here. IMO, we should not make any changes in the driver based on this issue. @julienschmidt close if you agree, take over if you don't 😀
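The driver leaves anything like this to the application. As a rough, application-level approximation only (this is not driver behavior, and note that db.Ping only exercises whichever pooled connection it happens to receive, not every idle one), a keepalive goroutine could look like this sketch:

```go
package main

import (
	"database/sql"
	"log"
	"time"
)

// keepAlive periodically pings the pool so at least one pooled connection
// sees regular traffic. db.Ping uses whichever connection the pool hands
// out, so this does not exercise every idle connection - one reason the
// driver itself does not implement a per-connection keepalive.
func keepAlive(db *sql.DB, interval time.Duration) {
	for {
		time.Sleep(interval)
		if err := db.Ping(); err != nil {
			log.Println("keepalive ping failed:", err)
		}
	}
}
```

It would be started with something like `go keepAlive(db, 30*time.Second)`; the interval is an assumption, not a recommendation from this thread.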
When I ran netstat on the mysql VM, it did not show the connection - only the go app showed a connection.

What I suspect is that the packets.go code, or the mysql driver code that is calling it, isn't detecting that the connection has been broken by the network - nor is it timing out. It continues to think there is a connection open, and attempts to use it an hour later. This theory matches the exact symptoms I originally had: after a long idle period, attempting to make a SQL query blocks for 15m. But during that block, even 5 seconds after it was initially blocked, I could make another query and it was OK. I suspect a 2nd connection was created in the pool, which worked fine. Therefore two issues need to be addressed here:
If neither of these are part of the driver but instead the underlying Go packages, then I am more than happy to close the issue and take these two issues to the go team. But I believe we can agree that there are two known issues here that do need to be addressed.
@eduncan911 I think your theory is right. This is what's happening:
The mystery of the second log line (

What is broken?

Can you please try what happens if you remove the timeout from the dsn?
Interesting. I'll remove the timeout; but, on average my pages load in 12ms overall, including all 7 SQL queries run synchronously (I was going to move to channels later). It is lightning fast. Therefore, I don't think the 5 second theory is valid - unless that accounts for dozens of page requests back to back, so the connection stays open for longer than 5 seconds as it is being used over and over again. But even that doesn't hold water, as I can 1) start the go app, 2) make 1 single request (which ends in 38ms, for the first request at the start of the go app), and 3) wait an hour. The pipe is broken, so after 1 hour when I attempt to make a 2nd request, the first initial SQL query is blocked in this state. When you said:
Let me clarify it below...
And if that
Then yes, that all sounds right.
Please add
Will do. Let me schedule some downtime... 👍
*cough* of course you can also use debug.PrintStack and guard it with some kind of command line argument.
@eduncan911 small but relevant correction - the
Is there any progress yet? I got this error on our production server.
No progress on our side. If you also get this error, please edit your driver version, add
Sorry I haven't been able to "crash production" yet for an update to this issue. It's been fine since setting SetMaxIdleConns() to zero, even under load tests across 3 go instances on 3 VMs and 1 MySql backend. Quite surprised actually that MySql can take a beating like that on a single core VM with a limit of 300 IOPS for the VHD - it got about 700 RPS with a heavy 5/6 queries per request, across 3 Go instances on 3 VMs. I attribute it to the 1.5 GB of memory the VM has, since they are all READ queries. That's before I add indexes. I always maximize code performance and queries first, before moving to caching.
Actually, you can reproduce this by setting mysql's wait_timeout = 2; then, after 2 seconds, issue some sql commands. The stack log is: /home/pengfei/Codes/Go/src/github.com/go-sql-driver/mysql/packets.go:33 (0x4a6673)
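For illustration, a sketch of that reproduction. The DSN is a placeholder, and depending on driver and Go versions the broken connection may be retried transparently instead of surfacing an error:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN.
	db, err := sql.Open("mysql", "user:pass@tcp(mysql-host:3306)/dbname")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pin the pool to a single connection so the same session is reused.
	db.SetMaxOpenConns(1)
	db.SetMaxIdleConns(1)

	// Tell the server to drop this session after 2 seconds of inactivity.
	if _, err := db.Exec("SET SESSION wait_timeout = 2"); err != nil {
		log.Fatal(err)
	}

	// Let the idle connection exceed the server-side timeout.
	time.Sleep(3 * time.Second)

	// The pool still believes the connection is alive. Depending on driver
	// and Go versions this either surfaces the packets.go error described
	// above or is retried on a fresh connection.
	var one int
	if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
		log.Println("query after idle period failed:", err)
	}
}
```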
mysql closes the connection after a period of inactivity.
I can reproduce the "write part" of this issue consistently in a vagrant machine and in production:
I found it to be the state of the connection pool, where the pool thinks there still is a connection open. But in fact, it has been closed remotely by the mysqld. Scroll up for my entire story and debugging: I verified this with netstat, where I saw the Go application having a connection to port 3306 on the remote server. Yet the remote server running MySql no longer had any open pipes (after the timeout I noted above). I did verify that upon making the application active again, opening many mysql connections, I saw open pipes on both servers. But again, after some time, the remote mysqld server would drop the open "idle" connection (still not sure if it is mysqld, or Windows Azure doing it). The root problem is that the net package, and/or this mysql driver, does not detect the dropped connection and thinks that there is still an open idle connection - there isn't. So upon the next attempt by the mysql driver to make a query with this stale idle connection that doesn't exist, it blocks and eventually times out after a long period of time. Any additional hits to the mysql driver for queries work fine, as the connection pool simply creates new connections going forward - after that 1 idle connection (i.e. when pooling the connections, the idle open one is used for the first query, then new connections are created). New connections are fine; it is that idle one that is the problem, because it simply isn't connected. I resolved my issues by getting rid of all idle connections. See my posts above for the history and resolutions. But in short:
It's mysqld, as I'm only using Ubuntu/CentOS and MySQL all on the same machine (both vagrant and production). Getting rid of all idle connections is obviously a very good option (it doesn't have any appreciable performance hit on my response times). Anyway, I think we should bring this to an end and find out if it's a problem in this library or an issue in the standard library that should be reported and fixed. At the very least, if there's no way to solve this, it should be documented where appropriate that a pool of permanent connections cannot be used with this library. Anyway, thanks for writing the solution in this issue explicitly.
Anyone in this thread: UPDATE - if it only happens on a VPS or in a specific environment, I'd like to have a look there. If so, I can send you my public key for access.
@arnehormann Put a delay in for 15 or 20 minutes after the first query. Then attempt 2 SQL queries after that 20 minutes in two go routines (GOMAXPROCS = 1, so you can somewhat control the order they run in). You want to wait for the amount of time that it takes for mysql to drop that idle connection. The first query after 20 minutes will eventually timeout/error after several minutes of waiting - the reason is that it attempts to use what the driver/net pipes think is an "idle" still-connected connection in the pool. But mysql has already dropped that connection remotely, and it doesn't exist. The second query will execute almost instantly, and will return without error. You'll still be waiting for the first query to timeout over several minutes. The second query works fine because it did NOT use an idle connection; instead, the pool created a new tcp connection for the query. This is why it works.

^- as long as you have set idle connections to 1, that is. It may also require mysql to be installed on a remote machine. Perhaps mysql breaks the long-idle connection only for remote connections after a period of time (where perhaps long-idle connections locally are fine).
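A sketch of that scenario under the stated assumptions (placeholder DSN; the 20-minute wait mirrors the description above and can be shortened with the wait_timeout trick mentioned earlier):

```go
package main

import (
	"database/sql"
	"log"
	"sync"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN pointing at a remote MySQL instance.
	db, err := sql.Open("mysql", "user:pass@tcp(mysql-host:3306)/dbname")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Keep exactly one idle connection, as described in the comment above.
	db.SetMaxIdleConns(1)

	// First query creates the connection that will later sit idle.
	if _, err := db.Exec("SELECT 1"); err != nil {
		log.Fatal(err)
	}

	// Wait long enough for the server (or the network in between) to drop
	// the idle connection; 20 minutes mirrors the description above.
	time.Sleep(20 * time.Minute)

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			start := time.Now()
			_, err := db.Exec("SELECT 1")
			log.Printf("query %d: err=%v, took %s", n, err, time.Since(start))
		}(i)
	}
	wg.Wait()
	// Expected outcome per the comment: one goroutine grabs the stale idle
	// connection and blocks or times out, the other gets a fresh connection
	// and returns almost immediately.
}
```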
@eduncan911 thanks, but please don't describe it - change my program linked above so it reliably hangs on your own machine when you start it there.
👍 I think I'm experiencing this exact same issue on Google Cloud SQL, but my requests time out at 100 seconds, so I can't verify that they are lasting 15+ minutes. The symptoms appear to be the same, though because the request is halted, I don't get a concrete error back.
I have an idea how to tackle this issue, but it's a little too brittle for my taste. I still don't see another way after a lot of hard thinking. As I see it, the cause is the server cutting off the connection in a way that's not easily detectable by the client (same problem as ripping out the network cable). To detect this faster and to let the server know we are still there, we need a MySQL protocol based keepalive on the same connection; the TCP keepalive with its overly long intervals doesn't help. Ironically, @xaprb recently published a blog post praising our driver for not using timeouts... but what I propose is different to what other connectors do. My idea: Steps to do this: Downsides:
Ok TL;DR: From #257 (comment)
So what we have to avoid is that we write to a connection which we think is still alive, but which in reality the server has already closed.
We could also do something like golang/go#9851 driver side by returning

And regarding Arne's keepalive feature... Déjà-vu? 9d66799
No Déjà-vu. My first commit was 2 or 3 months later - and I'm not much of a git historian 😀
#394
👍
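For completeness: later versions of the driver document readTimeout and writeTimeout DSN parameters, which bound how long a read or write on a dead connection can block and so avoid the unbounded hangs described above. Whether they are available should be checked against the driver README for the version in use; a hedged sketch with a placeholder DSN:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// timeout bounds the dial, readTimeout/writeTimeout bound I/O on an
	// established connection, so a silently dropped connection fails fast
	// instead of waiting for the OS-level TCP timeout.
	dsn := "user:pass@tcp(mysql-host:3306)/dbname?timeout=5s&readTimeout=30s&writeTimeout=30s"
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```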
I have set SetMaxIdleConns(0), and there are no active connections, but after a long period the response is still long, about 19 seconds.
I was having a problem that intuitively seems like what you are describing. I am writing a login/authentication micro-service, and frankly just the initial prototype that generically checks a user/pass fetched from the database was resulting in a read timeout, then a broken pipe, then a write timeout on different lines. I have since tried your solution of:
...and that seemed to do the trick. I was checking it for 1-2 hours afterwards, and it wasn't hanging after the initial request. But I left the go server running overnight and when I checked this morning, the first request failed. However, I don't get the errors logged in the console, but the behavior is identical, just a MUCH longer interval between the breaks. The long period of time seems to be a trait of the post above me, but the response isn't long, it just breaks or blocks. I figure I can probably work around this since the micro-service is for private consumption and I can just issue a single retry from JS on the front-end, since I can expect 1 failure on requests that make a trip to the database and the subsequent request will succeed, but clearly that is tacky and I would prefer to avoid it. Have you had any more experience with this problem over a longer period of it running after an initial request on a connection?
I am currently using the following from another solution I tried before this:
...but I'm not sure if you are suggesting to use it to extend it, or to set it to zero for infinite reuse?
14400 sec is too long. One minute is enough for most use cases.
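Assuming the 14400-second value above was being passed to database/sql's SetConnMaxLifetime, the shorter lifetime suggested here would look roughly like this:

```go
package main

import (
	"database/sql"
	"time"
)

// tunePool recycles pooled connections after one minute, well before a
// server-side wait_timeout (or anything in between) can silently drop them.
// The function name is illustrative; SetConnMaxLifetime is the relevant call.
func tunePool(db *sql.DB) {
	db.SetConnMaxLifetime(time.Minute)
}
```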
Yeah, I didn't want it to be that long, but I had it in there from a previous solution attempt. I'll remove it and see what happens.
@steviesama for the record, we migrated off to non-mysql datastores for that project. I haven't been part of projects that used mysql w/Go since.
Had the same problem: |
Hello guys, from what I understood, the

However, I still get the error. Any insight here?
@eexit Did you read this comment? |
@methane Will try that, thanks! |
@arnehormann thanks, your answer is helpful
UPDATE: My resolution was to remove all "Idle" connections from the pool. See this comment:
#257 (comment)
I am currently experiencing a stalling or broken web app after an idle period of between 15 and 48 minutes. The most critical issue is described below:
A typical request is logged like this:
After a long period of time (ranging from 15m to 48m), the system all of a sudden logs these lines below with no interaction - the web app has been idle this entire time:
Notice the "TOTAL TIME" is 31 minutes and 19 seconds? Also, notice the MySql driver error that is logged at the same time?
There was no activity / no web request made. The web app was simply idle.
The most critical issue is what comes next after these log messages: _the very next web request stalls completely, never returning a response_:
And it sits idle, no response, for 15 minutes until wget times out.
Now, if I make a 2nd or 3rd request immediately after this one is stalled, and at any time while it is stalled, the go web app responds and returns a full page for other requests. No issues. And then the cycle starts over from the last request I make before letting it sit idle.
After this 15m, you can guess exactly what is logged next:
Another 15m wait time.
I eliminated Windows Azure, the Cluster VIP and the Firewall/Linux VM running the go web app as the issue because I ran
wget http://localhost
locally on the same box, and I get this "stalled" request that never completes and never sends back anything.

There are a number of factors in my web app, so I will try to outline them accordingly.
Using:
Do note that the Linux box running MySql is a different Linux box from the ones running the cluster of GoLang apps - and they are in separate dedicated Cloud Services. The MySql VM is a single VM, no clustering.
Here is some related code:
5 more DB queries, per request
In addition to this query, my "Context" you see being passed into the handler runs 4 to 6 additional SQL queries. Therefore, each "article" handler that loads runs about 5 to 7 SQL queries, minimal, using the exact same pattern and *db global variable you see above.

Timeouts / errors are always on the same DB query
Here's one of the "context" queries as a comparison:
Nothing special there.
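For illustration, a minimal, hypothetical sketch of the pattern described above - a global *sql.DB, the 5-second Timeout in the DSN, and Query with a deferred Close. All names and queries here are placeholders, not the actual application code:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// db is the package-level handle shared by all handlers, as described above.
var db *sql.DB

func init() {
	var err error
	// Placeholder DSN; the 5s value mirrors the Timeout mentioned below.
	db, err = sql.Open("mysql", "user:pass@tcp(mysql-host:3306)/dbname?timeout=5s")
	if err != nil {
		log.Fatal(err)
	}
}

// loadArticles shows the Query / defer rows.Close() / iterate pattern with
// placeholder table and column names.
func loadArticles() error {
	rows, err := db.Query("SELECT id, title FROM articles")
	if err != nil {
		return err
	}
	// Close is only deferred once Query succeeded, matching the description.
	defer rows.Close()

	for rows.Next() {
		var (
			id    int64
			title string
		)
		if err := rows.Scan(&id, &title); err != nil {
			return err
		}
	}
	return rows.Err()
}

func main() {
	if err := loadArticles(); err != nil {
		log.Println("query failed:", err)
	}
}
```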
I do call defer rows2.Close() only if there was no error. Perhaps that is part of the issue? This particular SQL query seems to log errors under load tests as no response or the mysql driver timing out.

Questions
Why am I getting request timeouts logged in excess of 15 to 30 minutes, from an idle site? That seems like a bug with the mysql driver I am using, possibly holding a connection open. But, the last http request was successful and returned a complete page + template.
I even have the Timeout set in the connection string, which is 5 seconds. Even if it is a problem with the mysql server, why the 15 minute timeout/request logged? Where did that request come from?
It still could be a MySql driver issue, blocking the request from completing - maybe being blocked by the MySql dedicated VM and an issue there. If that is the case, then how come nothing is logged? What is this random timeout of 15m to 49m? It is usually only 15m or 31m, but sometimes 48m is logged.
It is very interesting that the timeouts land on rough multiples of 15m (at 15m, 31m and 48m), allowing for some padding in seconds.
Thanks in advance.