Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for HTTP2 ping or document net.Dialer KeepAlive #536

Closed
mwitkow opened this issue Feb 8, 2016 · 7 comments
Closed

Add support for HTTP2 ping or document net.Dialer KeepAlive #536

mwitkow opened this issue Feb 8, 2016 · 7 comments

Comments

@mwitkow
Copy link
Contributor

mwitkow commented Feb 8, 2016

For the last couple of days we've been trying to debug an issue that was going whenever we held a long-lasting gRPC connection open to an instance on GCE behind the Google L3 Load Balancer. This was puzzling since we were using long-standing connections inside our private Network for a long time now.

Apparently the L3 load balancer, and the GCE egress, have a 600s timeout on long-lived connections
https://cloud.google.com/compute/docs/troubleshooting#communicatewithinternet

It seems that this could be mitigated in 3 ways:
a) clients sending a ping_pong message on HTTP2 every X configurable seconds , which would have the benefit of inducing a GO_AWAY message and being able to rebalance earlier
b) making sure that net.Dialer used by default has a KeepAlive duration (https://golang.org/src/net/dial.go) of < 600s
c) at least documenting that setting a KeepAlive is useful and that using a custom WithDialer is recommended

@ejona86 since this may affect other versions of gRPC.

@iamqizhao
Copy link
Contributor

In your use case, you only need hold a health checking service on your server (https://github.com/grpc/grpc-go/blob/master/health/health.go). And then your client can periodically health check the server to keep the connection on.

This does not mean Ping API is not needed. We will add it later.

@mwitkow
Copy link
Contributor Author

mwitkow commented Feb 9, 2016

@iamqizhao, I do agree that having a periodic health check would solve the problem. However, I don't agree that expecting a user to do user-code level solutions is appropriate when the issue is on the transport level.

The issue with GCE LB, and many other routers, terminating long running connections is due to Connection Tracking that stateful routing/firewalling mechanisms have. If the tracking expires, the connection gets dropped, without a TCP RST packet sent to either the server nor the client. This is incredibly hard to debug if no KeepAlive is present, since the gRPC transport layer has no idea that it's sending stuff into a "void", as the underlying TCP socket will never return a timeout (the default system timeout is like 2h nowadays). This manifested itself in two ways for us:

  • RPCs would reach the server and finish with OK, but no response would be passed to the client calle
  • the client caller would hang forever (in absence of context timeouts) since none of the ConnectionTimeout gRPC errors would be raised, and the request would be consumed by the "void" TCP stream and never reach the server

@bradfitz, does HTTP2 support TCP KeepAlives like HTTP1.1? If so, can we document that using KeepAlives for using gRPC over public internet?

@bradfitz
Copy link
Contributor

bradfitz commented Feb 9, 2016

@mwitkow, yes, net/http.Server accepts connections and sets TCP keep-alives on connections, before http1 or http2 is determined.

@iamqizhao
Copy link
Contributor

There are 3 different layers relevant to this issue:
i) transport layer (e.g., TCP): You can use grpc.WithDialer to pass a Dialer with KeepAlives on;
ii) rpc layer (e.g., PING frame if we use http2):
a) user-initiated PING: we will have API to support this;
b) grpc-initated PING: under debating.
iii) application layer: health check service as I mentioned
In principle, we lean to letting users initiate this health signal by themselves instead of grpc lib because it could generate non-trivial traffic which could be charged.

Overall, I do not see any problem here.

@mwitkow
Copy link
Contributor Author

mwitkow commented Feb 9, 2016

Do we document it somehow for other users, so they don't have to waste time
troubleshooting this? ☺️

On Tue, 9 Feb 2016, 19:02 Qi Zhao notifications@github.com wrote:

There are 3 different layers relevant to this issue:
i) transport layer (e.g., TCP): You can use grpc.WithDialer to pass a
Dialer with KeepAlives on;
ii) rpc layer (e.g., PING frame if we use http2):
a) user-initiated PING: we will have API to support this;
b) grpc-initated PING: under debating.
iii) application layer: health check service as I mentioned
In principle, we lean to letting users initiate this health signal by
themselves instead of grpc lib because it could generate non-trivial
traffic which could be charged.

Overall, I do not see any problem here.


Reply to this email directly or view it on GitHub
#536 (comment).

@iamqizhao
Copy link
Contributor

I think we should. Need talk to the team.

@mwitkow
Copy link
Contributor Author

mwitkow commented Feb 9, 2016

Cool, thanks!

On Tue, 9 Feb 2016, 20:45 Qi Zhao notifications@github.com wrote:

I think we should. Need talk to the team.


Reply to this email directly or view it on GitHub
#536 (comment).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants