
c.HTTPClient.Do (in c.Push) hanging intermittently #20

Closed
imhoffd opened this issue Apr 29, 2016 · 19 comments

imhoffd (Contributor) commented Apr 29, 2016

As a preface, I'm reporting this to open a dialog. I think the ultimate problem will be with something I'm doing or a bug deep within the http2 libs of Go.

I've noticed that occasionally the call to c.HTTPClient.Do hangs indefinitely. The problem occurs intermittently and seemingly not because of the certificate used in the connection. Given enough retries (where the connection is remade by creating a new apns2.Client), it will succeed without error.

I'm not convinced this is a network issue. It seems something is deadlocking within the http2 libs. I set c.HTTPClient.Timeout to 1 second, which never triggers. Additionally, I spin up a timeout goroutine of 3 seconds, which is how I determine that something is hanging, at which point I attempt a retry. (As a side note, I just realized this may not cleanly kill the connection. Perhaps I should call CloseIdleConnections on the http2 transport?) Despite setting GODEBUG=http2debug=2, no http2 logs are output.
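
For clarity, the watchdog looks roughly like this (just a sketch; the helper name and error are illustrative, not apns2 API, and it assumes the errors, time and apns2 imports):

// Run Push in a goroutine and give up waiting after a deadline.
func pushWithWatchdog(c *apns2.Client, n *apns2.Notification) (*apns2.Response, error) {
    type result struct {
        res *apns2.Response
        err error
    }
    done := make(chan result, 1)
    go func() {
        res, err := c.Push(n) // calls c.HTTPClient.Do internally
        done <- result{res, err}
    }()
    select {
    case r := <-done:
        return r.res, r.err
    case <-time.After(3 * time.Second):
        // c.HTTPClient.Do never returned; the caller retries with a fresh apns2.Client.
        // Note: the goroutine above stays blocked until Do eventually returns (if ever).
        return nil, errors.New("apns push timed out")
    }
}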

My sender is massively concurrent, with thousands of goroutines running at any moment, but my interpretation is that if a goroutine reaches c.HTTPClient.Do, which is documented to be safe for concurrent use, and then hangs, the problem lies within the http2 libs.

Am I weirdly running out of possible connections or something? Does anyone have thoughts on this?

imhoffd (Contributor, Author) commented Apr 29, 2016

Obviously related to #17

sideshow added the bug label Apr 30, 2016

dryaf commented Apr 30, 2016

lol, just wanted to report this bug :)

evrenios commented May 1, 2016

+1, we faced the same bug as well. Updating the crypto and http2 libraries didn't help.

@sideshow any idea?

c3mb0 (Contributor) commented May 2, 2016

I've been digging into this issue and it turns out the problem is tls.Dial rather than http.Client.Do. If you modify the transport in this package's client.go as follows

transport := &http2.Transport{
    TLSClientConfig: tlsConfig,
    // Dial with an explicit timeout so a stalled TLS dial fails instead of blocking forever.
    DialTLS: func(network, addr string, cfg *tls.Config) (net.Conn, error) {
        return tls.DialWithDialer(&net.Dialer{Timeout: 10 * time.Second}, network, addr, cfg)
    },
}

you will start receiving proper timeouts. DialTLS is called whenever a cached connection is not available or has received a GOAWAY frame, which would explain the intermittent nature of the problem.

As to why it hangs when dialing, I still have no idea, but at least this modification allows you to take proper actions (such as trying to call client.Push again) when a timeout is received. I'll keep digging and report my findings.

@sideshow, would you accept a PR with an optional timeout variable for NewClient?

sideshow (Owner) commented May 2, 2016

I have noticed some weirdness too. Trying to find the source of the issue has been hard because we have noticed Apple intermittently rejecting connections or timing out over the last 2-3 weeks. It's rare, but it has been happening.

@c3mb0 if others can confirm that the timeout fixes their issue then the PR makes sense.
Just want to make sure it's not something else, such as not reading and closing off the response body correctly in the push method, or the transport deadlocking as @dwieeb mentioned.

I am going to look into this over the next few days. Any more info you guys could share would be awesome

c3mb0 (Contributor) commented May 2, 2016

Here are a few more findings and tests:

  • As of now, we are able to reproduce the timeout/hanging issue by making back-to-back requests using an expired certificate.
  • As mentioned here and here, I suspected that the root of the problem might be using json.Decode instead of ioutil.ReadAll. I've changed the Push function to use the latter, but alas the problem persists.
  • After some tests, we've concluded that a dial timeout of 1 second is more than enough for a successful connection. After calling Push, checking whether the error string contains DialWithDialer and retrying the push (which retriggers DialTLS) seems to work well; so far, no more than 2 tries were required. A sketch of this retry follows below.
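
The retry from the last bullet looks roughly like this (a sketch only, not package API; assumes the strings import):

// Retry Push when the dial timed out; a failed dial never produced a cached
// connection, so the retry triggers DialTLS again.
res, err := client.Push(notification)
for attempt := 0; err != nil && strings.Contains(err.Error(), "DialWithDialer") && attempt < 2; attempt++ {
    res, err = client.Push(notification)
}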

@sideshow If it is agreed upon that timing out DialTLS fixes hanging problems and has no side-effects, we might even consider integrating the retry logic within Push to make the process transparent to package users.

Edit: It seems like the error string can also contain i/o timeout. We are not sure what makes them alternate.

Edit 2: We have now started seeing some cases where redials do not end up opening a successful connection either. The affected pushes randomly start working again after a few minutes. Even though those pushes don't succeed in the meantime, at least they are not hanging around forever.

imhoffd (Contributor, Author) commented May 2, 2016

Transparent retry logic would be nice.

@c3mb0 The timeout can also be triggered while reading the response body, as documented here: https://golang.org/pkg/net/http/#Client. That would account for the i/o timeouts.

c3mb0 (Contributor) commented May 2, 2016

@dwieeb Do you think this might indicate an improper termination of the connection?

imhoffd (Contributor, Author) commented May 2, 2016

Maybe? The errors from the Go http and net packages are notoriously obscure and unhelpful. The Go authors know about this, but can't fix it because it would break backwards compatibility. Let's hope it gets addressed in Go 2, which likely won't be a thing. 👍

c3mb0 (Contributor) commented May 3, 2016

We've got another finding. Monitoring the OS's open TCP connections via netstat -c -anp --tcp shows that the halt happens at:

tcp 0 1 10.0.0.169:46348 17.172.238.203:443 SYN_SENT

It seems like Apple does not respond with SYN/ACK, which makes the dial hang (by default TLSHandshakeTimeout is infinite). Correct me if I'm wrong, but a quick glance reveals that TLSHandshakeTimeout is only settable on http.Transport, so modifying NewClient as follows

transport := &http.Transport{
    TLSClientConfig:     tlsConfig,
    // Fail the TLS handshake after 1 second instead of waiting forever.
    TLSHandshakeTimeout: 1 * time.Second,
}
// Upgrade the transport to HTTP/2
if err := http2.ConfigureTransport(transport); err != nil {
    return nil
}

gives consistent TLS handshake timeout errors. This seems like a more "natural" way to terminate the connection and may also be considered as a workaround.

sideshow (Owner) commented May 3, 2016

I was able to reproduce this in its simplest form with the following code https://gist.github.com/sideshow/ae221a792261180c954d9bea72780f85

If I run GODEBUG=http2debug=2 go run _example/apns.go a few times, it will eventually stall.

@c3mb0 The lack of a connection timeout definitely appears to be the root of the issue, but I also think the default connection pool may be compounding the issue.

I may be way off on this, but it looks like the default connection pool returns an in-flight dial rather than starting another one, which is probably why it just hangs forever. https://github.com/golang/net/blob/master/http2/client_conn_pool.go#L72

For us, we are using this inside goroutines rather than behind a buffered channel. Because there's no back pressure (i.e. it's just thousands of goroutines, not a buffered channel), we are saturating the underlying transport connection. When the connection becomes saturated, the connection pool tries to make a new connection to APNs, and in some scenarios, because some connections fail, it will just hang.
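
To illustrate the back pressure idea (a sketch only; the limit of 100 is arbitrary, and client and notifications are assumed to already exist):

// Cap the number of in-flight pushes instead of spawning an unbounded number of goroutines.
sem := make(chan struct{}, 100) // at most 100 concurrent pushes (arbitrary)
for _, n := range notifications {
    sem <- struct{}{} // blocks once 100 pushes are in flight
    go func(n *apns2.Notification) {
        defer func() { <-sem }()
        if _, err := client.Push(n); err != nil {
            log.Println("push failed:", err)
        }
    }(n)
}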

In our particular case, because we are slamming it, every new connection comes with the chance of hanging due to the lack of a timeout setting.

I was verifying the amount of connections open with lsof -i | grep "apns"

Presumably there is a limit to the number of connections Apple allows per cert or IP, and this may or may not be compounding the issue.

I think the changes to set some kind of default timeout are a good idea, but ultimately there should probably also be a better connection pool that would allow us to set the min and max number of connections to APNs (as @dwieeb suggested in #21).
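
To illustrate the idea only (this is not the #21 proposal, just a sketch; it assumes the crypto/tls and sync/atomic imports):

// Round-robin over several independent clients, each with its own HTTP/2 connection.
type clientPool struct {
    clients []*apns2.Client
    next    uint64
}

func newClientPool(cert tls.Certificate, size int) *clientPool {
    p := &clientPool{clients: make([]*apns2.Client, size)}
    for i := range p.clients {
        p.clients[i] = apns2.NewClient(cert).Production()
    }
    return p
}

func (p *clientPool) Push(n *apns2.Notification) (*apns2.Response, error) {
    i := atomic.AddUint64(&p.next, 1)
    return p.clients[i%uint64(len(p.clients))].Push(n)
}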

c3mb0 (Contributor) commented May 3, 2016

@sideshow I believe you are correct in your assumptions, since when we put an fmt.Println inside DialTLS and try to send 10 concurrent pushes, we see that DialTLS is called only once and they all fail together with tls: DialWithDialer timed out.

For APNS2 connection rate limiting, I actually wrote a package some time ago to use with this package. Even though it has horrible tests, it has been running in production without any problems so far.

Even with maximum connection attempts capped at a low value, we get a timeout every now and then. Worse, it sometimes happens with a single attempt.

c3mb0 (Contributor) commented May 3, 2016

The source of the problem seems to be Apple's APNS2 service (check here and here). The date for the posted error aligns with the first day we started experiencing these hanging problems. Also, OneSignal recently announced that they're using Rust for their APNS2 service, so the problem is not Go specific. Thanks to @evrenios for the resources.

Since I am almost certain that the root of the problem is SSL/TLS, I'm updating my pull request to set TLSHandshakeTimeout instead. Feel free to chime in regarding the timeout duration.

sideshow (Owner) commented May 4, 2016

Cool, I also believe the problem is SSL/TLS.

The related Apple bug is logged here

I have verified this by running the example code I posted as a gist here until it locks.
I then run lsof -i | grep "apns" and get the last IP that the connection has stalled on.

Leaving the Go program in its current locked state, I try to connect to the APNs IP in the terminal with openssl s_client -connect x.x.x.x:443 -cert ../cert.pem, where x.x.x.x is the last IP listed by the lsof command. Sure enough, it blocks, which confirms it's an Apple problem and is further proof that it is not Go specific.

openssl s_client -connect api.push.apple.com:443 -cert ../cert.pem without the bogus IP does not block and connects as expected.

Whilst we may patch with TLSHandshakeTimeout as an interim fix, I am unsure whether we need a default TLSHandshakeTimeout in apns2 or whether we should just document it. It almost seems like something the Go http2 client should take care of with better defaults, especially given the fact that ALL subsequent requests try to use the same locked connection.

zjx20 commented May 4, 2016

As I observed, it is not a handshake problem in my case. If a connection (from apns2 to the Apple server) becomes idle (e.g. kept alive for 10 minutes without sending any request) and we then use it to send requests again, the Apple server just refuses to read the incoming data. I could see the "Send-Q" column from netstat keep growing for each request.

A workaround for my case is to specify a Timeout on the http.Client.
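
i.e. something like this (a sketch; the 30-second value is only an example, and cert is assumed to be your tls.Certificate):

client := apns2.NewClient(cert).Production()
// Bound the whole request, including waiting on a connection that never reads our data.
client.HTTPClient.Timeout = 30 * time.Second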

Nenblereeyw commented

It seems to me that there are 2 problems:

  1. The Apple problem
  2. The Go problem: any HTTP server can cause an out-of-the-box Go HTTP client to hang forever

It might be a good idea to include a default timeout in apns2 until that is hopefully added in Go.
Otherwise any use of apns2 is broken by default until the user of the library adds their own timeout.

c3mb0 (Contributor) commented May 6, 2016

@zjx20 Today we experienced another halt despite giving the TLS handshake a timeout, so based on your report we also tried giving http.Client a timeout. Just a few hours later, we started receiving client timeouts for the same certificate. Thus, I can confirm that the lack of a timeout on http.Client can also cause infinite locks.

sideshow (Owner) commented

As @chimpmaster72 pointed out, there were two issues:

  1. The Apple problem
  2. The Go problem: any HTTP server can cause an out-of-the-box Go HTTP client to hang forever

Since this was logged, we have noticed the Apple problem seems to have gone away.

We have also pushed a change to master (14b46f8), based on @c3mb0's code, that adds default timeouts for the TLS dial and for the HTTP client. TLSDialTimeout and HTTPClientTimeout are both package vars and have been left reasonably large, but they can be configured.
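
If the defaults don't suit you, they can be overridden before creating clients, for example (values here are only illustrative):

// Override the package-level timeouts, then build clients as usual.
apns2.TLSDialTimeout = 20 * time.Second
apns2.HTTPClientTimeout = 60 * time.Second
client := apns2.NewClient(cert).Production()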

Please pull master and confirm if this works for you guys and resolves the problem, as I would like to close off this issue.

sideshow (Owner) commented

Closing this as the issue seems to have been resolved since this pull request was merged. Feel free to reopen if it appears again
