
kvserver: reconsider gRPC server-side keepalive timeout #109317

Open
erikgrinaker opened this issue Aug 23, 2023 · 2 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Aug 23, 2023

The gRPC server-side keepalive timeout defaults to 2 seconds, via COCKROACH_NETWORK_TIMEOUT. This sets TCP_USER_TIMEOUT on the socket, instructing the OS to close connections if a send remains unacknowledged by the peer for 2 seconds.
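
For reference, this is roughly what setting TCP_USER_TIMEOUT looks like at the socket level on Linux. It is a minimal sketch using golang.org/x/sys/unix, not the actual gRPC or CockroachDB code; the function name and address are illustrative.

package main

import (
	"log"
	"net"
	"time"

	"golang.org/x/sys/unix"
)

// setTCPUserTimeout sets TCP_USER_TIMEOUT on a TCP connection, telling the
// kernel to abort the connection if sent data remains unacknowledged for
// longer than the given duration (Linux-only).
func setTCPUserTimeout(conn *net.TCPConn, timeout time.Duration) error {
	rawConn, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	if err := rawConn.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP,
			unix.TCP_USER_TIMEOUT, int(timeout.Milliseconds()))
	}); err != nil {
		return err
	}
	return sockErr
}

func main() {
	conn, err := net.Dial("tcp", "localhost:26257") // address is illustrative
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if err := setTCPUserTimeout(conn.(*net.TCPConn), 2*time.Second); err != nil {
		log.Fatal(err)
	}
}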

We would ideally want this on the gRPC client side instead, since that would more quickly detect network outages and allow the client to fail over to a different node, reducing the duration of unavailability. Unfortunately, gRPC does not allow client-side keepalive timeouts below 10 seconds. Instead, we use application-level client-side RPC heartbeats, but unlike TCP_USER_TIMEOUT these are subject to head-of-line blocking in the RPC stack (including effects like scheduling latency). We therefore have to use a large RPC heartbeat timeout of 6 seconds to tolerate heavy loads in high-latency clusters (see #93397).

Given this, it isn't clear that an aggressive server-side gRPC timeout provides much benefit. It doesn't help in the case of a network outage -- the server is unlikely to care about the connection remaining open, and the client has to wait for its own timeout to fire anyway. It would detect asymmetric partitions more quickly, allowing the dialback mechanism to promote them to symmetric partitions (see #94778), but these are likely rare enough that a higher timeout is ok. Under persistently high latency it might be helpful to eagerly close the connection and let the client fail over to a lower-latency node, but this assumes the lease also moves and that the client won't block and re-dial the node (mostly true in 23.2 with #99191). An aggressive timeout can also be detrimental, since it can cause additional disruption during transient latency spikes or intermittent packet loss.

We should reconsider the value for the server-side keepalive timeout, subject to availability benchmarks (especially with asymmetric partitions and latency spikes).

See also #93007.

// To prevent unidirectional network partitions from keeping an unhealthy
// connection alive, we use both client-side and server-side keepalive pings.
var clientKeepalive = keepalive.ClientParameters{
	// Send periodic pings on the connection.
	Time: minimumClientKeepaliveInterval,
	// If the pings don't get a response within the timeout, we might be
	// experiencing a network partition. gRPC will close the transport-level
	// connection and all the pending RPCs (which may not have timeouts) will
	// fail eagerly. gRPC will then reconnect the transport transparently.
	Timeout: minimumClientKeepaliveInterval,
	// Do the pings even when there are no ongoing RPCs.
	PermitWithoutStream: true,
}

var serverKeepalive = keepalive.ServerParameters{
	// Send periodic pings on the connection when there is no other traffic.
	Time: base.PingInterval,
	// If the pings don't get a response within the timeout, we might be
	// experiencing a network partition. gRPC will close the transport-level
	// connection and all the pending RPCs (which may not have timeouts) will
	// fail eagerly.
	Timeout: base.NetworkTimeout,
}
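
For context, parameters like these are typically attached through gRPC options along the following lines. This is a sketch, not CockroachDB's actual wiring; the package and function names are made up, and TLS setup is omitted.

package rpcsketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// newServerAndClientConn shows roughly where keepalive parameters like the
// ones above get plugged in on each side of a connection.
func newServerAndClientConn(
	target string,
	serverKA keepalive.ServerParameters,
	clientKA keepalive.ClientParameters,
) (*grpc.Server, *grpc.ClientConn, error) {
	// Server side: enable server keepalive pings and the send timeout.
	srv := grpc.NewServer(grpc.KeepaliveParams(serverKA))

	// Client side: send keepalive pings on the outgoing connection.
	conn, err := grpc.Dial(target,
		grpc.WithKeepaliveParams(clientKA),
		grpc.WithTransportCredentials(insecure.NewCredentials()), // real code uses TLS
	)
	return srv, conn, err
}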

Jira issue: CRDB-30871

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team labels Aug 23, 2023
craig bot pushed a commit that referenced this issue Aug 28, 2023
109186: pkg/util/log: flush buffered network sinks on panic r=knz a=abarganier

Previously, our crash reporter system would flush file log sinks
as part of the process to handle a panic.

This was incomplete, since buffered network sinks were not included
in the flush. As a result, panic logs would often not make it to the
network target, leading to a loss in observability.

This patch introduces `log.FlushAllSync()`, which flushes both file
and buffered network log sinks. It then updates the crash reporter
to call into this, instead of just flushing file log sinks.

`FlushAllSync()` contains timeout logic to prevent the flush from
hanging if one of the underlying child sinks that a bufferedSink
wraps becomes unavailable or hangs on its `output()` call.

We originally attempted to fix this in #101562, but a bug in the 
bufferedSink code led us to roll back those changes. The bug in the 
bufferedSink code has since been fixed (#108928), so we can safely 
introduce this logic again.

Release note: none

Fixes: #106345

109578: rpc: increase gRPC server timeout from 1x to 2x NetworkTimeout r=andrewbaptist a=erikgrinaker

This is intended as a conservative backport that changes as little as possible. For 23.2, we should restructure these settings a bit, possibly by removing NetworkTimeout and using independent timeouts for each component/parameter, since they have unique considerations (e.g. whether they are enforced above the Go runtime or by the OS, to what extent they are subject to RPC head-of-line blocking, etc).

---

This patch increases the gRPC server timeout from 1x to 2x NetworkTimeout. This timeout determines how long the server will wait for a TCP send to receive a TCP ack before automatically closing the connection. gRPC enforces this via the OS TCP stack by setting TCP_USER_TIMEOUT on the network socket.

While NetworkTimeout should be sufficient here, we have seen instances where this is affected by node load or other factors, so we set it to 2x NetworkTimeout to avoid spurious connection closures. An aggressive timeout is not particularly beneficial here, because the client-side timeout (in our case the CRDB RPC heartbeat) is what matters for recovery time following network or node outages -- the server side doesn't really care if the connection remains open for a bit longer.

Touches #109317.

Epic: none
Release note (ops change): The default gRPC server-side send timeout has been increased from 2 seconds to 4 seconds (1x to 2x of COCKROACH_NETWORK_TIMEOUT), to avoid spurious connection failures in certain scenarios. This can be controlled via the new environment variable COCKROACH_RPC_SERVER_TIMEOUT.
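
As a rough illustration of how an override like COCKROACH_RPC_SERVER_TIMEOUT can layer on top of the 2x NetworkTimeout default, here is a sketch using plain os/time helpers; the helper and variable names are made up and do not reflect CockroachDB's actual envutil code.

package rpcsketch

import (
	"os"
	"time"
)

// networkTimeout mirrors COCKROACH_NETWORK_TIMEOUT's default of 2s.
var networkTimeout = durationFromEnv("COCKROACH_NETWORK_TIMEOUT", 2*time.Second)

// serverTimeout defaults to 2x NetworkTimeout (4s) and can be overridden via
// COCKROACH_RPC_SERVER_TIMEOUT, per the release note above.
var serverTimeout = durationFromEnv("COCKROACH_RPC_SERVER_TIMEOUT", 2*networkTimeout)

// durationFromEnv is a hypothetical helper: parse a duration (e.g. "4s") from
// the environment, falling back to the given default.
func durationFromEnv(name string, def time.Duration) time.Duration {
	if s, ok := os.LookupEnv(name); ok {
		if d, err := time.ParseDuration(s); err == nil {
			return d
		}
	}
	return def
}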

109610: kv: remove assertions around non-txn'al locking reqs r=nvanbenschoten a=nvanbenschoten

Closes #107860.
Closes #109222.
Closes #109581.
Closes #109582.

We might want to re-introduce these assertions in the future and reject these requests higher up the stack. For now, just remove them to deflake tests.

Release note: None

Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@erikgrinaker
Contributor Author

Turns out the timeout is also used for gRPC's own keepalive pings, which, unlike TCP_USER_TIMEOUT, are subject to scheduling latency and such. This explains why we saw this in cases such as #108157.

https://github.com/grpc/grpc-go/blob/9b9b38127038e347420ef08c42e5d74d181b8749/internal/transport/http2_server.go#L1175-L1204

@erikgrinaker
Contributor Author

erikgrinaker commented Aug 30, 2023

We increased the server-side timeout in #109578 (2 to 4 seconds) and the keepalive ping interval in #109715 (1 to 2 seconds).

However, we see a possible regression in failover/non-system/blackhole-recv on August 29, i.e. an asymmetric partition dropping inbound packets to a node. I think this is expected:

Consider a client C sending a request to the partitioned leaseholder L. C will normally have to wait out the RPC heartbeat timeout (6 seconds) before failing and trying someone else. However, with the previous defaults, L's server-side timeout would have fired after 3 seconds (1s keepalive interval + 2s server-side timeout), closing the connection and forcing the client elsewhere. It can do this because the outbound TCP RST packets still make it through (only inbound packets C->L are dropped). Thus, increasing the server-side timeout increases the time until the TCP RST is sent, and the client has to fall back to the RPC heartbeat timeout, increasing the overall recovery time.
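
To make the arithmetic explicit, here is a small sketch using the default values quoted in this thread; the constant names are illustrative, not actual CockroachDB identifiers.

package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		rpcHeartbeatTimeout = 6 * time.Second // client-side RPC heartbeat timeout

		// Previous defaults.
		oldKeepaliveInterval = 1 * time.Second
		oldServerTimeout     = 2 * time.Second

		// New defaults after #109578 and #109715.
		newKeepaliveInterval = 2 * time.Second
		newServerTimeout     = 4 * time.Second
	)

	// Worst-case time for the partitioned server to close the connection and
	// send a TCP RST: one keepalive interval plus the server-side timeout.
	oldDetect := oldKeepaliveInterval + oldServerTimeout // 3s: beats the 6s heartbeat timeout
	newDetect := newKeepaliveInterval + newServerTimeout // 6s: no longer beats it

	fmt.Printf("old: %s, new: %s, heartbeat: %s\n", oldDetect, newDetect, rpcHeartbeatTimeout)
}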

Considering the timeout is sensitive to node overload and Go runtime scheduling latencies and such, and this kind of asymmetric partition is relatively rare, this seems like a justified tradeoff.

craig bot pushed a commit that referenced this issue Aug 30, 2023
109715: rpc: increase gRPC server keepalive interval to 2x PingInterval r=erikgrinaker a=erikgrinaker

This patch increases the gRPC server keepalive interval from 1x to 2x PingInterval. These keepalive pings are used both to keep the connection alive and to detect and close failed connections from the server side. Keepalive pings are only sent and checked if there is no other activity on the connection.

We set this to 2x PingInterval, since we expect RPC heartbeats to be sent regularly (obviating the need for the keepalive ping), and there's no point sending keepalive pings until we've waited long enough for the RPC heartbeat to show up.

An environment variable COCKROACH_RPC_SERVER_KEEPALIVE_INTERVAL is also added to tune this.
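
Putting the two changes together, the server-side keepalive configuration now looks roughly like this. This is a sketch; the variable name is made up, and the actual values come from base.PingInterval, base.NetworkTimeout, and the environment variables named above.

var serverKeepaliveSketch = keepalive.ServerParameters{
	// Ping an idle connection every 2x PingInterval (default 2s);
	// overridable via COCKROACH_RPC_SERVER_KEEPALIVE_INTERVAL.
	Time: 2 * base.PingInterval,
	// Close the connection if a keepalive ping or TCP send goes unacknowledged
	// for 2x NetworkTimeout (default 4s); overridable via COCKROACH_RPC_SERVER_TIMEOUT.
	Timeout: 2 * base.NetworkTimeout,
}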

Touches #109317.
Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>