Add support for HTTP2 ping or document net.Dialer KeepAlive #536
Comments
In your use case, you only need to run a health-checking service on your server (https://github.com/grpc/grpc-go/blob/master/health/health.go). Your client can then periodically health check the server to keep the connection alive. This does not mean a Ping API is not needed. We will add it later.
@iamqizhao, I agree that a periodic health check would solve the problem. However, I don't think it is appropriate to expect users to work around a transport-level issue in user code. GCE LB, like many other routers, terminates long-running connections because of the connection tracking that stateful routing and firewalling mechanisms use. If the tracking entry expires, the connection gets dropped without a TCP RST packet being sent to either the server or the client. This is incredibly hard to debug if no KeepAlive is present, since the gRPC transport layer has no idea that it's sending data into a "void": the underlying TCP socket will not report a timeout for a very long time (the default system timeout is around 2h nowadays). This manifested itself in two ways for us:
@bradfitz, does HTTP2 support TCP KeepAlives like HTTP1.1? If so, can we document the use of KeepAlives when running gRPC over the public internet?
@mwitkow, yes, net/http.Server accepts connections and sets TCP keep-alives on connections, before http1 or http2 is determined.
There are 3 different layers relevant to this issue: Overall, I do not see any problem here.
Do we document it somehow for other users, so they don't have to waste time?
I think we should. I need to talk to the team.
Cool, thanks!
For the last couple of days we've been trying to debug an issue that occurred whenever we held a long-lasting gRPC connection open to an instance on GCE behind the Google L3 Load Balancer. This was puzzling, since we had been using long-standing connections inside our private network for a long time.
Apparently the L3 load balancer, and the GCE egress, have a 600s timeout on long-lived connections
https://cloud.google.com/compute/docs/troubleshooting#communicatewithinternet
It seems that this could be mitigated in 3 ways:
a) clients sending a ping_pong message on HTTP2 every X configurable seconds, which would have the benefit of inducing a GO_AWAY message and being able to rebalance earlier
b) making sure that the net.Dialer used by default has a KeepAlive duration (https://golang.org/src/net/dial.go) of < 600s
c) at least documenting that setting a KeepAlive is useful and that using a custom WithDialer is recommended
@ejona86 since this may affect other versions of gRPC.