stability: stuck requests after rebalancing-inducing downtime #19165
Comments
And to be more specific, the two main questions to follow up on are:
|
After adding more logging, I've noticed two things that could be contributing to this problem. The first is that after the node being cycled (node 1) starts up, other nodes start sending it requests through their dist senders, and those requests start timing out. I'm still looking into why node 1 isn't responding quickly enough. The second problem is that the timeout is currently 1 minute after PR #16088. That slows down all of the nodes, because they all end up sending requests to node 1 and waiting out the timeout period.
I looked at the lease histories of the ranges that were causing these timeouts and noticed that their leases were on node 1 before it was shut down, but those ranges weren't accessed again until node 1 started back up and was no longer the leaseholder. As part of the fix, I'm thinking that we should remove entries for nodes that have gone down from the dist sender leaseholder caches when we get node liveness updates from gossip.
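A rough sketch of the kind of eviction hook being proposed here; the `leaseHolderCache`, `LivenessUpdate`, and `OnLivenessUpdate` names are illustrative placeholders, not the actual CockroachDB types:

```go
// Hypothetical sketch: drop DistSender leaseholder cache entries for a node
// as soon as gossip reports it as non-live, so the first post-outage request
// doesn't go to the downed node and wait out the RPC timeout.
package distsketch

import "sync"

// LivenessUpdate is a stand-in for a node liveness record received via gossip.
type LivenessUpdate struct {
	NodeID int32
	Live   bool
}

type leaseHolderCache struct {
	mu           sync.Mutex
	leaseholders map[int64]int32 // rangeID -> node ID of last known leaseholder
}

// OnLivenessUpdate would be registered as a gossip callback. When a node
// transitions to non-live, every cached entry pointing at it is evicted.
func (c *leaseHolderCache) OnLivenessUpdate(u LivenessUpdate) {
	if u.Live {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	for rangeID, nodeID := range c.leaseholders {
		if nodeID == u.NodeID {
			delete(c.leaseholders, rangeID)
		}
	}
}
```
|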
Is this after the node has been down for >10s? What should happen is that the node loses its leadership for all of its ranges, so when it comes back up, nobody's really talking to it initially. Any idea why that doesn't happen? Are these ranges that aren't accessed while the node is down? Or is the node just down for <9s?
I'd be interested to hear where these requests are stuck. Sure, they'll be on the DistSender, but where's the "other end" of that request? In the worst case, it'll be on the socket waiting for …
Also, what does the "took X" say when you cycle the node in the startup message below? Does it bear any relation to how long the connections hang?
|
These ranges weren't accessed while the node was down, which is why their leaseholder cache entries weren't updated. I'm looking into the other side right now using traces, as you suggested. I'm seeing a lot of things that look like:
After the range gets GCed, we break out of the loop in redirectOnOrAcquireLease.
The node startup message doesn't seem to be related to the delays. From here it looks like we need to either prioritize replica GC, which at the moment is getting interleaved with applying snapshots, or at least error out of the redirectOnOrAcquireLease loop earlier. |
Glad to see the working theory checks out! I'll respond more tomorrow.
|
Interesting to see that we looped around in the lease code 21 times. Could you add the following diff and paste another trace?
diff --git a/pkg/storage/replica.go b/pkg/storage/replica.go
index 3351713f1..3051bcc3e 100644
--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -1403,6 +1403,7 @@ func (r *Replica) redirectOnOrAcquireLease(ctx context.Context) (LeaseStatus, *r
 		select {
 		case pErr = <-llChan:
 			if pErr != nil {
+				log.Eventf(ctx, "error: %s", pErr)
 				switch tErr := pErr.GetDetail().(type) {
 				case *roachpb.AmbiguousResultError:
 					// This can happen if the RequestLease command we sent has been
Do you also see traces for the replica GC queue? If you don't, I think the following is enough to get a rough idea from the log:
diff --git a/pkg/storage/replica_gc_queue.go b/pkg/storage/replica_gc_queue.go
index f360da754..81316c866 100644
--- a/pkg/storage/replica_gc_queue.go
+++ b/pkg/storage/replica_gc_queue.go
@@ -186,6 +186,8 @@ func replicaGCShouldQueueImpl(
 func (rgcq *replicaGCQueue) process(
 	ctx context.Context, repl *Replica, _ config.SystemConfig,
 ) error {
+	log.Warningf(ctx, "processing %s", repl)
+	defer log.Warningf(ctx, "done processing %s", repl)
 	// Note that the Replicas field of desc is probably out of date, so
 	// we should only use `desc` for its static fields like RangeID and
 	// StartKey (and avoid rng.GetReplica() for the same reason).
What we should see is that the "done processing" is pretty much immediately followed by the next "processing %s". (You may also want to add …) |
I added the logging for pErr:
The error is coming from … My question is: do we need this thing to return a NewAmbiguousResultError? If we return something else, we can check for it and break out of redirectOnOrAcquireLease before the range gets GCed. Here's a snippet of the GC queue log; line 181 is processing, line 258 is done processing:
Sometimes one is followed pretty closely by the next, and sometimes there's a gap of more than a second. |
I do think that returning an AmbiguousResultError is reasonable in the context of cancelling proposed commands on a removed replica. There are other situations where a replica's proposed command may have succeeded before it got removed from the raft group. However, what we could more safely do is set a field on the replica to tell it to refuse to propose any more commands (with a RangeNotFoundError) unless it gets assigned a new replica ID. That would short-circuit the infinite retry loop we're hitting in the lease acquisition code. |
The error is returned from this path (in the context that a Raft peer tells us that our Replica is too old, i.e. we've been removed, and we receive that repeatedly until the replica GC queue picks up the replica and actually removes it). I think Alex's suggestion would work: the first time we get this error, we set a flag on the replica (which is cleared whenever its `mu.replicaID` changes). With the flag set, new proposals return `NotLeaseholderError` (or, at least, lease acquisition attempts are no longer made).
// If the replica ID in the error matches (which is the usual
// case; the exception is when a replica has been removed and
// re-added rapidly), we know the replica will be removed and we
// can cancel any pending commands. This is sometimes necessary
// to unblock PushTxn operations that are necessary for the
// replica GC to succeed.
if tErr.ReplicaID == repl.mu.replicaID {
	repl.cancelPendingCommandsLocked()
}
@bdarnell: are there any false positives when a peer gives us
`ReplicaTooOld` and we still have that ReplicaID? As in, in that case, do
we need to perform the consistent RangeLookup?
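A minimal sketch of the flag-based approach suggested above, assuming the flag lives on the replica's `mu` struct; the field and method names here are hypothetical, not the actual CockroachDB code:

```go
// Sketch: once a Raft peer tells us our replica is too old, mark the replica
// so further proposals (including lease acquisitions) fail fast instead of
// retrying until the replica GC queue removes the replica. The flag is
// cleared when the replica is re-added under a new replica ID.
package replicasketch

import (
	"errors"
	"sync"
)

var errNotLeaseHolder = errors.New("replica pending GC; not lease holder")

type Replica struct {
	mu struct {
		sync.Mutex
		replicaID      int32
		pendingRemoval bool // hypothetical flag set on ReplicaTooOldError
	}
}

// markTooOld is called when a peer responds with ReplicaTooOldError for our
// current replica ID.
func (r *Replica) markTooOld(tooOldReplicaID int32) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if tooOldReplicaID == r.mu.replicaID {
		r.mu.pendingRemoval = true
	}
}

// setReplicaID clears the flag whenever the replica ID changes.
func (r *Replica) setReplicaID(id int32) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if id != r.mu.replicaID {
		r.mu.replicaID = id
		r.mu.pendingRemoval = false
	}
}

// checkCanPropose would be consulted before proposing commands or attempting
// a lease acquisition; with the flag set, callers bail out immediately.
func (r *Replica) checkCanPropose() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.mu.pendingRemoval {
		return errNotLeaseHolder
	}
	return nil
}
```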
|
Recapping our conversation at lunch: I think it's probably going to be more effective to make sure that we can return the NotLeaseHolderError quickly and move on to the correct node than to try to make replica GC faster (I had wondered if we were using rocksdb range deletions for replica GC yet; we are). To @tschottdorf's question, I think ReplicaTooOldError is unambiguous, so we could skip the consistent lookup, which might speed things up.
One detail I was missing is that the place where we're spending time is the DistSender's pending RPC timeout. This sounds like a context propagation issue. We should be cancelling the RPC's context when we go into the pending RPC wait loop so they return quickly instead of running out the clock.
We should probably also put an upper-bound timeout on all lease attempts: if we're waiting more than a few seconds, we should just return a NotLeaseHolderError instead of continuing to wait (the upper levels can retry if they want). But just putting a 10s timeout on all lease acquisitions would still be pretty bad, so this is just a backstop to prevent the worst case.
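As a rough illustration of the backstop described above, here's a hedged sketch of bounding the lease-acquisition wait with a context timeout; the timeout value and the error returned are assumptions, not the actual implementation:

```go
// Sketch: cap how long we wait on a pending lease acquisition before giving
// up with a retryable "not lease holder"-style error, rather than running
// out the DistSender's pending-RPC timeout.
package leasesketch

import (
	"context"
	"errors"
	"time"
)

var errNotLeaseHolder = errors.New("lease acquisition timed out; try another replica")

const maxLeaseAcquireWait = 5 * time.Second // illustrative backstop value

func waitForLease(ctx context.Context, llChan <-chan error) error {
	ctx, cancel := context.WithTimeout(ctx, maxLeaseAcquireWait)
	defer cancel()
	select {
	case err := <-llChan:
		// The pending lease request completed (successfully or not).
		return err
	case <-ctx.Done():
		// Give up early and let the upper layers retry elsewhere.
		return errNotLeaseHolder
	}
}
```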
It's correct for cancelPendingCommandsLocked to return an ambiguous error. But if that gets returned from redirectOnOrAcquireLease, it should get turned into an unambiguous error since we know that the client operation was never proposed. |
That's a good point, though what error would we return? |
(And this isn't to say that we shouldn't do the lease acquisition error termination - we should of course get rid of ambiguous errors wherever we can) |
@m-schneider What's the latest on this issue? |
I'm working on reproducing both the issue and the fix on an ephemeral cluster. If we can see that the fix from #19353 works, then we'll be able to close it.
|
This was closed by #19353, right? |
Yes.
|
@m-schneider I am trying to replicate this issue on v1.0.6. I've tried a couple of different strategies but have yet to succeed. Would you mind providing the exact steps you used to reproduce? |
In order to reproduce, bring up a cluster with 3 or more nodes. Leave it running for an hour with a load generator writing to every node. Then bring down one of the nodes for 10 minutes at a time; you should see latencies on the other nodes go up when you bring it back up. I had a cron job that killed the node at :00, :20 and :40 of every hour and brought it back up at :10, :30 and :50. The whole cluster ground to a halt at :10, :30 and :50.
|
Thanks, that helps already. Could you expand on the kind of load generator? Does the schema matter? Should I fill the database from all nodes concurrently, or can all inserts be sent to a single node? Do queries need a specific form, e.g., JOINs? When you say 'latencies', do you mean inserts become slow, or selects become slow, or both?
Would the following work?
1. 5 nodes
2. a table with a single column consisting of integers
3. perform serial inserts against a single node, using a single client, records consisting of a single incrementing integer.
4. after 1 hour, stop inserting.
5. perform select * from numbers where numbers.num=1 on each of the 5 nodes, serially and continuously using a single client per node.
6. stop a node, sleep 10mins, start the node, sleep 10mins, repeat.
My timezone is pretty much 'opposite side of the planet' to SF, which is why I want to make sure I understand the instructions clearly. Thanks @m-schneider |
No problem. I'm not quite in SF land myself :) The schema doesn't really matter and neither does the load generator, so I think what you have should work (a rough sketch of such a load loop follows after this list). The core of the problem was that the node being cycled didn't reply in a reasonable amount of time for the replicas for which it was the leaseholder before the restart.
So for your points:
1. 5 nodes is a good amount.
2. The schema is fine.
3. Serial inserts are reasonable.
4. Fine.
5. You should perform selects that touch more of the data, because you want to make sure that you're touching a replica that was on the node being restarted but isn't there anymore.
6. Sounds good.
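If it helps, here's a rough Go sketch of the load/read loop described above; the connection string, table name, and timings are placeholders, and the node up/down cycling itself would still be driven externally (e.g. by cron or by hand):

```go
// Repro sketch: serial inserts from a single client for about an hour, then
// continuous point reads whose latency should spike when the cycled node
// comes back up. Connection string and timings are illustrative.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS numbers (num INT PRIMARY KEY)`); err != nil {
		log.Fatal(err)
	}

	// Phase 1: serial inserts for roughly an hour.
	deadline := time.Now().Add(time.Hour)
	for i := 0; time.Now().Before(deadline); i++ {
		if _, err := db.Exec(`INSERT INTO numbers VALUES ($1)`, i); err != nil {
			log.Printf("insert %d: %v", i, err)
		}
	}

	// Phase 2: continuous reads spread over the keyspace; watch for slow
	// queries while a node is stopped for ~10 minutes and then restarted.
	for i := 0; ; i++ {
		start := time.Now()
		var n int
		err := db.QueryRow(`SELECT num FROM numbers WHERE num = $1`, i%100000).Scan(&n)
		if err != nil && err != sql.ErrNoRows {
			log.Printf("select: %v", err)
		}
		if d := time.Since(start); d > time.Second {
			log.Printf("slow read: %s (key %d)", d, i%100000)
		}
	}
}
```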
|
When running an experiment for issue #10972, node 1 on navy was set up to be shut down for 10 minutes, then started back up for 10 more minutes, repeating indefinitely.
Here's a graph of SQL queries going through node 1 and through the whole cluster.
As you can see, when the node is down the QPS of the cluster drops, but when it comes back online the QPS drops even further. If you look at any node other than node 1, you see a graph like this:
The logs on node2 looked like the following:
So the dist sender was blocked and it was trying to send preemptive snapshots to node 1.
Looking for r140 on the freshly started node 1, the logs looked like the following:
It looks like node 1 still thought that it held the lease for r140 until r140 got GCed.