kvserver: reconsider gRPC server-side keepalive timeout #109317
Comments
109186: pkg/util/log: flush buffered network sinks on panic r=knz a=abarganier

Previously, our crash reporter system would flush file log sinks as part of the process to handle a panic. This was an incomplete process, since buffered network sinks were not included in this flush. This means that many times, panic logs would not make it to the network target, leading to a loss in observability.

This patch introduces `log.FlushAllSync()`, which flushes both file and buffered network log sinks. It then updates the crash reporter to call into this, instead of just flushing file log sinks. `FlushAllSync()` contains timeout logic to prevent the flush from hanging if one of the underlying child sinks that a bufferedSink wraps becomes unavailable/hangs on its `output()` call.

We originally attempted to fix this in #101562, but a bug in the bufferedSink code led us to roll back those changes. The bug in the bufferedSink code has since been fixed (#108928), so we can safely introduce this logic again.

Release note: none

Fixes: #106345

109578: rpc: increase gRPC server timeout from 1x to 2x NetworkTimeout r=andrewbaptist a=erikgrinaker

This is intended as a conservative backport that changes as little as possible. For 23.2, we should restructure these settings a bit, possibly by removing NetworkTimeout and using independent timeouts for each component/parameter, since they have unique considerations (e.g. whether they are enforced above the Go runtime or by the OS, to what extent they are subject to RPC head-of-line blocking, etc.).

---

This patch increases the gRPC server timeout from 1x to 2x NetworkTimeout. This timeout determines how long the server will wait for a TCP send to receive a TCP ack before automatically closing the connection. gRPC enforces this via the OS TCP stack by setting `TCP_USER_TIMEOUT` on the network socket.

While NetworkTimeout should be sufficient here, we have seen instances where this is affected by node load or other factors, so we set it to 2x NetworkTimeout to avoid spurious closed connections. An aggressive timeout is not particularly beneficial here, because the client-side timeout (in our case the CRDB RPC heartbeat) is what matters for recovery time following network or node outages -- the server side doesn't really care if the connection remains open for a bit longer.

Touches #109317.

Epic: none

Release note (ops change): The default gRPC server-side send timeout has been increased from 2 seconds to 4 seconds (1x to 2x of COCKROACH_NETWORK_TIMEOUT), to avoid spurious connection failures in certain scenarios. This can be controlled via the new environment variable COCKROACH_RPC_SERVER_TIMEOUT.

109610: kv: remove assertions around non-txn'al locking reqs r=nvanbenschoten a=nvanbenschoten

Closes #107860.
Closes #109222.
Closes #109581.
Closes #109582.

We might want to re-introduce these assertions in the future and reject these requests higher up the stack. For now, just remove them to deflake tests.

Release note: None

Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
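The PR above mentions that gRPC enforces the server-side send timeout through the OS TCP stack by setting `TCP_USER_TIMEOUT` on the socket. As a point of reference, here is a minimal Go sketch (not CockroachDB or gRPC source) of what setting that option looks like directly; the helper name is illustrative, and the option is Linux-specific:

```go
// A minimal sketch, not CockroachDB or gRPC source, of applying
// TCP_USER_TIMEOUT to a TCP connection in Go. This is the OS-level
// mechanism referred to above: the kernel drops the connection if sent
// data stays unacknowledged by the peer for longer than d.
// TCP_USER_TIMEOUT is Linux-specific.
package tcptimeout

import (
	"net"
	"time"

	"golang.org/x/sys/unix"
)

// SetTCPUserTimeout applies TCP_USER_TIMEOUT (in milliseconds) to conn.
func SetTCPUserTimeout(conn *net.TCPConn, d time.Duration) error {
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP,
			unix.TCP_USER_TIMEOUT, int(d.Milliseconds()))
	}); err != nil {
		return err
	}
	return sockErr
}
```

gRPC applies the equivalent option itself when the server-side keepalive timeout is configured, so in CockroachDB this surfaces only as the timeout value discussed in this issue.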
Turns out the timeout is also used for gRPC's own keepalive pings, which unlike TCP_USER_TIMEOUT are subject to scheduling latency and such. This explains why we saw this behavior in cases such as #108157.
We increased the server-side timeout in #109578 (2 to 4 seconds) and the keepalive ping interval in #109715 (1 to 2 seconds). However, we see a possible regression in the case of an asymmetric partition.

Consider a client C sending a request to the partitioned leaseholder L. C will normally have to wait out the RPC heartbeat timeout (6 seconds) before failing and trying someone else. However, L's server-side timeout would have fired after 3 seconds (1s keepalive interval + 2s server-side timeout), closing the connection and forcing the client elsewhere. It can do this since the outbound TCP RST packets can still make it through (only inbound packets C->L are dropped).

Thus, increasing the server-side timeout will increase the time until the TCP RST is sent, and the client will have to fall back to the RPC heartbeat timeout, increasing the overall recovery time. Considering the timeout is sensitive to node overload and Go runtime scheduling latencies and such, and this kind of asymmetric partition is relatively rare, this seems like a justified tradeoff.
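To make the timing in the scenario above concrete, here is a small back-of-the-envelope sketch in Go; the constants are the default values discussed in this issue, not values read from the code:

```go
// Recovery-time arithmetic for the asymmetric-partition scenario above, on an
// otherwise idle connection: the partitioned leaseholder L only closes the
// connection after a keepalive ping has gone unacknowledged for the full
// server-side timeout. All values are the defaults discussed in this issue.
package main

import (
	"fmt"
	"time"
)

func main() {
	const rpcHeartbeatTimeout = 6 * time.Second // client-side fallback

	// Old defaults: 1s keepalive interval + 2s server-side timeout.
	oldServerClose := 1*time.Second + 2*time.Second
	// New defaults: 2s keepalive interval + 4s server-side timeout.
	newServerClose := 2*time.Second + 4*time.Second

	fmt.Printf("old defaults: TCP RST after ~%s (beats the %s heartbeat timeout)\n",
		oldServerClose, rpcHeartbeatTimeout)
	fmt.Printf("new defaults: TCP RST after ~%s (no longer beats the %s heartbeat timeout)\n",
		newServerClose, rpcHeartbeatTimeout)
}
```

With the new defaults the server-side close arrives at roughly the same time as the client's own 6-second heartbeat timeout, which is the regression described above.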
109715: rpc: increase gRPC server keepalive interval to 2x PingInterval r=erikgrinaker a=erikgrinaker

This patch increases the gRPC server keepalive interval from 1x to 2x PingInterval. These keepalive pings are used both to keep the connection alive, and to detect and close failed connections from the server-side. Keepalive pings are only sent and checked if there is no other activity on the connection.

We set this to 2x PingInterval, since we expect RPC heartbeats to be sent regularly (obviating the need for the keepalive ping), and there's no point sending keepalive pings until we've waited long enough for the RPC heartbeat to show up.

An environment variable COCKROACH_RPC_SERVER_KEEPALIVE_INTERVAL is also added to tune this.

Touches #109317.

Epic: none

Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
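For illustration, here is a hedged sketch of how the two knobs from #109578 and #109715 could be wired into a gRPC server's keepalive parameters. The `envDuration` helper and the default constants are assumptions made for this sketch; the actual wiring lives in pkg/rpc/keepalive.go:

```go
// A sketch, not CockroachDB source, of a gRPC server configured with the
// keepalive interval and timeout discussed in this issue, each overridable
// via an environment variable.
package main

import (
	"os"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// envDuration returns the duration from the named environment variable, or
// def if it is unset or unparsable. (Illustrative helper.)
func envDuration(name string, def time.Duration) time.Duration {
	if v := os.Getenv(name); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

func main() {
	const pingInterval = time.Second        // assumed base PingInterval default
	const networkTimeout = 2 * time.Second  // COCKROACH_NETWORK_TIMEOUT default

	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// Only send keepalive pings after the connection has been idle for
		// 2x PingInterval; regular RPC heartbeats cover the common case.
		Time: envDuration("COCKROACH_RPC_SERVER_KEEPALIVE_INTERVAL", 2*pingInterval),
		// Close the connection if a ping (or any send) goes unacknowledged
		// for 2x NetworkTimeout; gRPC also applies this via TCP_USER_TIMEOUT.
		Timeout: envDuration("COCKROACH_RPC_SERVER_TIMEOUT", 2*networkTimeout),
	}))
	_ = srv // register services and call srv.Serve(listener) in real use
}
```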
The gRPC server-side keepalive timeout defaults to 2 seconds, via `COCKROACH_NETWORK_TIMEOUT`. This sets `TCP_USER_TIMEOUT` on the socket, instructing the OS to close connections if a send remains unacknowledged by the peer for 2 seconds.

We would ideally want this on the gRPC client side instead, since that would more quickly detect network outages and allow the client to fail over to a different node, reducing the duration of unavailability. Unfortunately, gRPC does not allow client-side keepalive timeouts below 10 seconds. Instead, we use application-level client-side RPC heartbeats, but unlike `TCP_USER_TIMEOUT` these are subject to head-of-line blocking in the RPC stack (including effects like scheduling latency). We therefore have to use a large RPC heartbeat timeout of 6 seconds to tolerate heavy loads in high-latency clusters (see #93397).

Given this, it isn't clear that an aggressive server-side gRPC timeout provides much benefit. It doesn't help in the case of a network outage -- the server is unlikely to care about the connection remaining open, and the client has to wait for its own timeout to fire. It would more quickly detect asymmetric partitions, allowing the dialback mechanism to promote it to a symmetric partition (see #94778), but this is likely rare enough that a higher timeout is ok. Under persistently high latency it might be helpful to eagerly close the connection and let the client fail over to a lower-latency node, but this assumes the lease also moves and that the client won't block and re-dial the node (mostly true in 23.2 with #99191). It can also be detrimental, since it can cause additional disruption with transient latency spikes like intermittent packet loss.
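Since gRPC will not accept client-side keepalive settings this aggressive, failure detection on the client falls to the application-level RPC heartbeats mentioned above, which share the connection with regular traffic and are therefore subject to head-of-line blocking. The following is a minimal sketch of such a heartbeat loop, assuming a hypothetical `ping` callback and the 6-second timeout discussed above; it is not the actual CockroachDB heartbeat implementation:

```go
// A sketch of an application-level heartbeat loop. Unlike TCP_USER_TIMEOUT,
// each ping travels through the RPC stack, so it is subject to head-of-line
// blocking and Go scheduling latency -- hence the generous timeout.
package main

import (
	"context"
	"log"
	"time"
)

// heartbeatFn stands in for the actual heartbeat RPC (hypothetical).
type heartbeatFn func(ctx context.Context) error

// runHeartbeatLoop pings the peer every interval and returns once a single
// ping has gone unanswered for timeout (or the outer context is canceled).
func runHeartbeatLoop(ctx context.Context, ping heartbeatFn, interval, timeout time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			hbCtx, cancel := context.WithTimeout(ctx, timeout)
			err := ping(hbCtx)
			cancel()
			if err != nil {
				return err // caller tears down the connection and fails over
			}
		}
	}
}

func main() {
	// Toy usage: a ping that always succeeds, with the loop bounded to 3s.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	err := runHeartbeatLoop(ctx, func(context.Context) error { return nil },
		time.Second, 6*time.Second)
	log.Printf("heartbeat loop exited: %v", err)
}
```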
We should reconsider the value for the server-side keepalive timeout, subject to availability benchmarks (especially with asymmetric partitions and latency spikes).
See also #93007.
cockroach/pkg/rpc/keepalive.go, lines 26 to 47 at e97fe41
Jira issue: CRDB-30871