Add support for HTTP2 ping or document net.Dialer KeepAlive #536

mwitkow · 2016-02-08T23:42:20Z

For the last couple of days we've been trying to debug an issue that was going whenever we held a long-lasting gRPC connection open to an instance on GCE behind the Google L3 Load Balancer. This was puzzling since we were using long-standing connections inside our private Network for a long time now.

Apparently the L3 load balancer, and the GCE egress, have a 600s timeout on long-lived connections
https://cloud.google.com/compute/docs/troubleshooting#communicatewithinternet

It seems that this could be mitigated in 3 ways:
a) clients sending a ping_pong message on HTTP2 every X configurable seconds , which would have the benefit of inducing a GO_AWAY message and being able to rebalance earlier
b) making sure that net.Dialer used by default has a KeepAlive duration (https://golang.org/src/net/dial.go) of < 600s
c) at least documenting that setting a KeepAlive is useful and that using a custom WithDialer is recommended

@ejona86 since this may affect other versions of gRPC.

The text was updated successfully, but these errors were encountered:

iamqizhao · 2016-02-09T00:16:44Z

In your use case, you only need hold a health checking service on your server (https://github.com/grpc/grpc-go/blob/master/health/health.go). And then your client can periodically health check the server to keep the connection on.

This does not mean Ping API is not needed. We will add it later.

mwitkow · 2016-02-09T07:32:20Z

@iamqizhao, I do agree that having a periodic health check would solve the problem. However, I don't agree that expecting a user to do user-code level solutions is appropriate when the issue is on the transport level.

The issue with GCE LB, and many other routers, terminating long running connections is due to Connection Tracking that stateful routing/firewalling mechanisms have. If the tracking expires, the connection gets dropped, without a TCP RST packet sent to either the server nor the client. This is incredibly hard to debug if no KeepAlive is present, since the gRPC transport layer has no idea that it's sending stuff into a "void", as the underlying TCP socket will never return a timeout (the default system timeout is like 2h nowadays). This manifested itself in two ways for us:

RPCs would reach the server and finish with OK, but no response would be passed to the client calle
the client caller would hang forever (in absence of context timeouts) since none of the ConnectionTimeout gRPC errors would be raised, and the request would be consumed by the "void" TCP stream and never reach the server

@bradfitz, does HTTP2 support TCP KeepAlives like HTTP1.1? If so, can we document that using KeepAlives for using gRPC over public internet?

bradfitz · 2016-02-09T16:29:16Z

@mwitkow, yes, net/http.Server accepts connections and sets TCP keep-alives on connections, before http1 or http2 is determined.

iamqizhao · 2016-02-09T19:02:12Z

There are 3 different layers relevant to this issue:
i) transport layer (e.g., TCP): You can use grpc.WithDialer to pass a Dialer with KeepAlives on;
ii) rpc layer (e.g., PING frame if we use http2):
a) user-initiated PING: we will have API to support this;
b) grpc-initated PING: under debating.
iii) application layer: health check service as I mentioned
In principle, we lean to letting users initiate this health signal by themselves instead of grpc lib because it could generate non-trivial traffic which could be charged.

Overall, I do not see any problem here.

mwitkow · 2016-02-09T19:19:01Z

Do we document it somehow for other users, so they don't have to waste time
troubleshooting this? ☺️

On Tue, 9 Feb 2016, 19:02 Qi Zhao notifications@github.com wrote:

There are 3 different layers relevant to this issue:
i) transport layer (e.g., TCP): You can use grpc.WithDialer to pass a
Dialer with KeepAlives on;
ii) rpc layer (e.g., PING frame if we use http2):
a) user-initiated PING: we will have API to support this;
b) grpc-initated PING: under debating.
iii) application layer: health check service as I mentioned
In principle, we lean to letting users initiate this health signal by
themselves instead of grpc lib because it could generate non-trivial
traffic which could be charged.

Overall, I do not see any problem here.

—
Reply to this email directly or view it on GitHub
#536 (comment).

iamqizhao · 2016-02-09T20:45:50Z

I think we should. Need talk to the team.

mwitkow · 2016-02-09T22:36:45Z

Cool, thanks!

On Tue, 9 Feb 2016, 20:45 Qi Zhao notifications@github.com wrote:

I think we should. Need talk to the team.

—
Reply to this email directly or view it on GitHub
#536 (comment).

iamqizhao closed this as completed Feb 9, 2016

mwitkow mentioned this issue Apr 8, 2016

Add KeepAlive support grpc/grpc-java#1648

Closed

ejona86 mentioned this issue Jun 27, 2016

Stream seems to silently fall over after indeterminite time, possibly caused by flakey networking #734

Closed

lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for HTTP2 ping or document net.Dialer KeepAlive #536

Add support for HTTP2 ping or document net.Dialer KeepAlive #536

mwitkow commented Feb 8, 2016

iamqizhao commented Feb 9, 2016

mwitkow commented Feb 9, 2016

bradfitz commented Feb 9, 2016

iamqizhao commented Feb 9, 2016

mwitkow commented Feb 9, 2016

iamqizhao commented Feb 9, 2016

mwitkow commented Feb 9, 2016

Add support for HTTP2 ping or document net.Dialer KeepAlive #536

Add support for HTTP2 ping or document net.Dialer KeepAlive #536

Comments

mwitkow commented Feb 8, 2016

iamqizhao commented Feb 9, 2016

mwitkow commented Feb 9, 2016

bradfitz commented Feb 9, 2016

iamqizhao commented Feb 9, 2016

mwitkow commented Feb 9, 2016

iamqizhao commented Feb 9, 2016

mwitkow commented Feb 9, 2016