c.HTTPClient.Do (in c.Push) hanging intermittently #20
Obviously related to #17

lol, just wanted to report this bug :)

+1, we faced the same bug as well. Updating the crypto and http2 libraries didn't help. @sideshow any idea?
I've been digging into this issue and it turns out the problem is the dial having no timeout: if you give the dial a timeout, you will start receiving proper timeouts instead of indefinite hangs. As to why it hangs when dialing, I still have no idea, but at least this modification allows you to take proper actions (such as retrying the push). @sideshow, would you accept a PR with an optional timeout variable for the dialer?
I have noticed some weirdness too. Trying to find the source of the issue has been hard, because we have noticed Apple has intermittently been rejecting connections or timing out over the last 2-3 weeks. It's rare, but it has been happening. @c3mb0 If others can confirm that the timeout fixes their issue, then the PR makes sense. I am going to look into this over the next few days. Any more info you guys could share would be awesome.
Here are a few more findings and tests:
@sideshow If it is agreed upon that timing out is the right approach, I'll update the PR accordingly.

Edit: It seems like the error string can also contain an i/o timeout.

Edit 2: We have now started seeing some cases where redials do not end up opening a successful connection. Said pushes randomly start working after a few minutes. Even though these pushes don't end up succeeding right away, at least they are not hanging around forever.
Transparent retry logic would be nice. @c3mb0 The timeout can also be triggered when reading the response body, as documented here: https://golang.org/pkg/net/http/#Client. That would account for the i/o timeouts.
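A small, self-contained demonstration of that documented behavior, using a hypothetical local test server rather than APNs: `http.Client.Timeout` keeps counting while the body is being read, so a stalled body read surfaces as an i/o timeout.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// demoBodyTimeout shows that http.Client.Timeout covers the whole
// exchange, including reading the response body: the handler sends
// headers quickly but stalls mid-body, so the read fails with a
// timeout even though the request itself "succeeded".
func demoBodyTimeout() error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.(http.Flusher).Flush()           // headers go out immediately
		time.Sleep(500 * time.Millisecond) // then the body stalls
		fmt.Fprint(w, "late body")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 100 * time.Millisecond}
	resp, err := client.Get(srv.URL)
	if err != nil {
		return err // the timeout can also fire before headers arrive
	}
	defer resp.Body.Close()
	_, err = io.ReadAll(resp.Body)
	return err // a timeout error while reading the body
}

func main() {
	fmt.Println(demoBodyTimeout() != nil)
}
```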
@dwieeb Do you think this might indicate an improper termination of the connection?

Maybe? The errors from the Go http and net packages are notoriously obscure and unhelpful. The Go authors know about this, but can't fix it because it would break backwards compatibility. Let's hope it gets addressed in Go 2, which likely won't be a thing. 👍
We've got another finding. Monitoring the OS's open TCP connections, it seems like Apple sometimes does not respond with a SYN/ACK, which makes the dial halt (by default the dial blocks until the OS itself gives up, which can take minutes). Setting an explicit dial timeout gives consistent i/o timeout errors instead of indefinite hangs.
I was able to reproduce this in its simplest form with the following code: https://gist.github.com/sideshow/ae221a792261180c954d9bea72780f85. If I run it a few times, it eventually locks up.

@c3mb0 The lack of a connection timeout definitely appears to be the root of the issue, but I also think the default connection pool may be compounding it. I may be way off on this, but it looks like with the default connection pool, if a dial is already in-flight, it returns that dial rather than starting another; this is probably why it just hangs forever: https://github.com/golang/net/blob/master/http2/client_conn_pool.go#L72

For us, we are using this inside goroutines rather than a buffered channel. Because there's no back pressure (i.e. it's just thousands of goroutines and not a buffered channel), we are saturating the underlying transport connection. When the connection becomes saturated, the connection pool tries to make a new connection to APNs, and in some scenarios, because some connections fail, it will just hang. In our particular case, because we are slamming it, every new connection comes with the chance of hanging due to no timeout setting. I was verifying the number of open connections while this was happening.

Presumably there is a limit to the number of connections Apple allows per cert or IP, and this may or may not be compounding the issue. I think the changes to set some kind of default timeout are a good idea, but ultimately maybe there should also be a better connection pool that would allow us to set the min and max number of connections to APNs (as @dwieeb suggested in #21).
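The back-pressure point can be sketched with a buffered channel used as a semaphore; `send`, `pushAll`, and the limits here are hypothetical stand-ins for real Push calls:

```go
package main

import (
	"fmt"
	"sync"
)

// send is a stand-in for a real apns2 Push call (hypothetical here).
func send(id int) { _ = id }

// pushAll applies back pressure with a buffered channel used as a
// semaphore, so at most maxInFlight pushes hit the transport at once
// instead of thousands of goroutines saturating one connection.
func pushAll(n, maxInFlight int) int {
	sem := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks once maxInFlight sends are in flight
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }()
			send(id)
		}(i)
	}
	wg.Wait()
	return n
}

func main() {
	fmt.Println(pushAll(1000, 64))
}
```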
@sideshow I believe you are correct with your assumptions. For APNS2 connection rate limiting, I've actually written a package some time ago to use with this package. Even though it has horrible tests, it has been running in production without any problems so far. Even with maximum connection attempts capped at a low value, we still get a timeout every now and then. Worse, it sometimes happens with a single attempt.
The source of the problem seems to be Apple's APNS2 service (check here and here). The date for the posted error aligns with the first day we started experiencing these hanging problems. Also, OneSignal recently announced that they're using Rust for their APNS2 service, so the problem is not Go-specific. Thanks to @evrenios for the resources. Since I am almost certain that the root of the problem is SSL/TLS, I'm updating my pull request to give a timeout to the TLS handshake as well.
Cool, I also believe the problem is SSL/TLS. The related Apple bug is logged here. I have verified this by running the example code I posted as a gist above until it locks; leaving the Go program in its locked state, I then try to connect to the same APNs IP from the terminal.
Whilst we may patch with TLSHandshakeTimeout as an interim fix, I am unsure whether we need a default TLSHandshakeTimeout in apns2 or whether we should just document it. It almost seems like something the Go http2 client should take care of with better defaults, especially given the fact that ALL subsequent requests try to use the same locked connection.
As I observed, it is not a handshake problem in my case. If a connection (from apns2 to the Apple server) becomes idle (e.g. kept alive for 10 minutes without sending any request) and we then use it to send requests again, the Apple server just refuses to read the incoming data. I could see the Send-Q column for that socket building up. A workaround for my case is to specify a timeout so that the stalled write eventually fails instead of hanging.
It seems to me that there are 2 problems: Apple's service misbehaving on its end, and the Go http2 client having no default timeouts.

It might be a good idea to include a default timeout in apns2 until that is hopefully added in Go.
@zjx20 Today we experienced another halt despite giving a timeout to the TLS handshake, so based on your report, we are now giving a timeout to the HTTP client as well.
As @chimpmaster72 pointed out, there were two issues: Apple's service misbehaving on its end, and the missing client-side timeout defaults.

Since this was logged, we have noticed the Apple problem seems to have gone away. We have also pushed a change back to master (14b46f8), based on @c3mb0's code, that adds default timeouts for the TLS connection and the HTTPClient. Please pull master and confirm whether this resolves the problem for you guys, as I would like to close off this issue.
Closing this, as the issue seems to have been resolved since this pull request was merged. Feel free to reopen if it appears again.
As a preface, I'm reporting this to open a dialog. I think the ultimate problem will be with something I'm doing or a bug deep within the http2 libs of Go.

I've noticed that occasionally the call to `c.HTTPClient.Do` hangs indefinitely. The problem occurs intermittently and seemingly not because of the certificate used in the connection. Given enough retries (where the connection is remade by making a new handle on `apns2.Client`), it will succeed without error.

I'm not convinced this is a network issue. It seems something is deadlocking within the http2 libs. I set `c.HTTPClient.Timeout` to 1 second, which never triggers. Additionally, I spin up a timeout goroutine of 3 seconds, which is how I determine that something is hanging, and at which point I attempt a retry. (As a side note, I just realized this may not cleanly kill the connection. Perhaps I should call `CloseIdleConnections` on the http2 transport?) It doesn't seem to be a network issue: despite setting `GODEBUG=http2debug=2`, no http2 logs are output.

My sender is massively concurrent, having thousands of goroutines at any moment, but my interpretation is that if it gets to `c.HTTPClient.Do`, which is documented to be thread-safe, and then hangs, then there is a problem within the http2 libs.

Am I weirdly running out of possible connections or something? Does anyone have thoughts on this?