kv: increase dist sender's RPC timeout #16088
Conversation
Yes, I'm not too worried about this change. This is a relatively rare occurrence and I'm not convinced there's a big win in allowing these ambiguous results to return after a wait. When this happens, it's likely that the live node (live because gRPC heartbeats are still working) is blocked up in a Raft group or similar. In that situation there are all kinds of other problems, so letting this pending request continue waiting isn't likely to make the system more responsive.
FWIW, I think what we want for the DistSender timeouts is some mechanism to differentiate between slow RPCs that have been accepted by the leaseholder versus RPCs that haven't even been accepted yet (and thus we don't really know whether the recipient is even the leaseholder). Lacking that, I've always disliked those timeouts, so I think this change is great. |
This isn't the timeout you want to remove. This one controls how long we wait when we already have multiple RPCs in flight, so removing it means we'll wait as long as it takes for them to complete. You want to remove SendNextTimeout to avoid having multiple RPCs in flight in the first place. |
@bdarnell I also chatted with @tamird because I thought the PR description suggested that's what he wanted to do, but it seems intentional that he's only removing this smaller bit now (which also seems reasonable to me, and perhaps less contentious). Perhaps the PR message should be clarified, though. |
Ah. Well in that case I disagree with this change. The reliance on a single leader means that we may not need to send RPCs to multiple replicas, but having sent them, they may be slow and we should use timeouts when waiting on secondary RPC attempts. |
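To keep the two timeouts being discussed straight, here is a minimal, self-contained Go sketch of the fan-out pattern the comments describe. The names `sendToReplicas`, `sendNextTimeout`, and `pendingTimeout`, and the `reply` type are illustrative assumptions, not the actual DistSender code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// reply is a stand-in for an RPC response; the real DistSender types differ.
type reply struct {
	from string
	err  error
}

// sendToReplicas illustrates the two timeouts under discussion: after
// sendNextTimeout elapses with no response, a speculative RPC is sent to the
// next replica; once the fan-out is done, already-in-flight RPCs are only
// waited on for pendingTimeout before being abandoned.
func sendToReplicas(replicas []string, rpc func(string) reply,
	sendNextTimeout, pendingTimeout time.Duration) (reply, error) {

	done := make(chan reply, len(replicas)) // buffered so the goroutines never block
	pending := 0

	for _, r := range replicas {
		r := r
		pending++
		go func() { done <- rpc(r) }()

		select {
		case rep := <-done:
			pending--
			if rep.err == nil {
				return rep, nil
			}
			// Error reply: try the next replica immediately.
		case <-time.After(sendNextTimeout):
			// SendNextTimeout analogue: leave the RPC in flight and fan out.
		}
	}

	// The timeout this PR changes: how long to wait on RPCs that are
	// already in flight before giving up on them.
	deadline := time.After(pendingTimeout)
	for ; pending > 0; pending-- {
		select {
		case rep := <-done:
			if rep.err == nil {
				return rep, nil
			}
		case <-deadline:
			return reply{}, errors.New("ambiguous result: RPCs still pending")
		}
	}
	return reply{}, errors.New("all replicas failed")
}

func main() {
	slow := func(name string) reply {
		time.Sleep(50 * time.Millisecond)
		return reply{from: name}
	}
	rep, err := sendToReplicas([]string{"n1", "n2", "n3"}, slow,
		10*time.Millisecond, time.Second)
	fmt.Println(rep.from, err)
}
```

In this sketch, removing `sendNextTimeout` would prevent the extra in-flight RPCs from being created at all, while removing `pendingTimeout` would make the final drain loop wait indefinitely; that is the distinction the comments above are drawing.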
pkg/kv/dist_sender.go
	}
} else if pendingCall.Reply.Error == nil {
	return pendingCall.Reply, nil
for ; pending > 0; pending-- {
There's actually no point in waiting for these pending RPCs unless haveCommit
is true. Let's change this line to:
for ; haveCommit && pending > 0; pending-- {
Done.
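For context on why the suggested `haveCommit` guard matters, here is a minimal sketch (hypothetical names, not the actual DistSender code): abandoning a pending RPC for a batch that carries a commit leaves the transaction's outcome unknown, whereas any other batch can simply be retried, so there is nothing to gain by waiting.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errAmbiguous = errors.New("ambiguous result: the commit may or may not have applied")

// drainPending only waits on in-flight RPCs when the batch carries a commit;
// for any other batch it is safe to abandon them, because the batch can be
// retried without risking a double-apply.
func drainPending(haveCommit bool, pending int, done <-chan error, timeout time.Duration) error {
	deadline := time.After(timeout)
	for ; haveCommit && pending > 0; pending-- {
		select {
		case err := <-done:
			if err == nil {
				return nil // a replica answered; no ambiguity
			}
		case <-deadline:
			return errAmbiguous
		}
	}
	return nil
}

func main() {
	done := make(chan error, 1)
	go func() {
		time.Sleep(10 * time.Millisecond)
		done <- nil
	}()
	fmt.Println(drainPending(true, 1, done, time.Second))  // waits for the reply: <nil>
	fmt.Println(drainPending(false, 1, done, time.Second)) // skips the wait entirely: <nil>
}
```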
I think @bdarnell has a point: removing this timeout entirely means a client can become permanently stuck if a range is jammed and not processing Raft commands. I think it might make more sense to increase the value of the timeout instead, perhaps substantially. We can rely on the gRPC heartbeat timeout to cover the case of a connection which has become too unhealthy to serve RPCs. Also, @tamird, please note my comment around haveCommit above. |
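For reference, the gRPC-side heartbeat mentioned above is configured through client keepalive parameters. A minimal sketch of that configuration follows; the address and durations are placeholders, not CockroachDB's actual settings:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// If the server stops acknowledging keepalive pings within Timeout, the
	// connection is torn down and in-flight RPCs fail promptly. That is the
	// failure mode the DistSender timeout no longer needs to cover.
	conn, err := grpc.Dial(
		"localhost:26257", // placeholder address
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping after 10s of inactivity
			Timeout:             3 * time.Second,  // declare the connection dead after 3s without an ack
			PermitWithoutStream: true,
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```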
This could only break multi-node benchmarks, right? If this is somehow affecting single-node benchmarks I think there is a bug somewhere. |
Revised this to increase the timeout instead of removing it, PTAL |
@petermattis indeed, only affects multi-node benchmarks. |
@tamird The multi-node sql benchmarks are somewhat uninteresting in that the nodes are all running on the same machine. That's not to dissuade this PR, but to help you unblock the benchmark storage work. I'd be fine in the short term with only recording the single-node sql benchmarks. |
Yep, I'm proceeding that way. |
lgtm |
Reviewed 6 of 7 files at r3, 1 of 2 files at r4, 1 of 1 files at r5. |
This timeout is believed to protect against hanging RPCs. Unfortunately, whenever the timeout fires and an RPC is abandoned, an ambiguous result error is returned to the client, which can be difficult to deal with. This change greatly increases the timeout in the hope of reducing these errors, at the expense of tail latencies. The upshot is that if such hanging RPCs are common, we'll be able to track them down.
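As a rough illustration of why ambiguous results are "difficult to deal with", here is a sketch with hypothetical helper names (`applyWrite`, `writeApplied`), not CockroachDB's client API: on an ambiguous error the caller cannot blindly retry a non-idempotent write, it first has to determine whether the abandoned RPC actually applied.

```go
package main

import (
	"errors"
	"fmt"
)

var errAmbiguous = errors.New("result is ambiguous")

// applyWrite and writeApplied are hypothetical stand-ins for a client's
// write path and a read-back check; the real API differs.
func applyWrite(key, val string) error    { return errAmbiguous }
func writeApplied(key, val string) bool   { return false }

// doWrite retries an ordinary error path implicitly by returning it to the
// caller, but on an ambiguous result it must first check whether the write
// landed before issuing it again.
func doWrite(key, val string) error {
	err := applyWrite(key, val)
	if errors.Is(err, errAmbiguous) {
		if writeApplied(key, val) {
			return nil // the abandoned RPC did apply; retrying would double-write
		}
		return applyWrite(key, val)
	}
	return err
}

func main() {
	fmt.Println(doWrite("k", "v"))
}
```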