stability: stuck requests after rebalancing-inducing downtime #19165

Closed
m-schneider opened this issue Oct 10, 2017 · 20 comments

@m-schneider
Contributor

While running an experiment for issue #10972, node 1 on navy was set up to cycle indefinitely: shut down for 10 minutes, then brought back up for 10 minutes, and so on.

Here's a graph of SQL queries going through node 1 and through the whole cluster.

[screenshot: grafana - cockroach sql 1]

[screenshot: grafana - cockroach sql]

As you can see, the cluster's QPS drops while the node is down, but when it comes back online the QPS drops even further. If you look at any node other than node 1, you see a graph like this:

[screenshot: grafana - cockroach sql 2]

The logs on node 2 looked like the following:

I171010 18:53:24.898329 161 storage/replicate_queue.go:230  [replicate,n2,s2,r464/3:/Table/53/1/1{098745…-134768…}] snapshot failed: (n1,s1):?: remote declined snapshot: store busy applying snapshots and/or removing replicas
I171010 18:53:26.899992 167 storage/allocator_scorer.go:612  [replicate,n2,s2,r367/2:/Table/53/1/-39{43355…-07282…}] s6: should-rebalance(ranges-overfull): rangeCount=331, mean=308.67, overfull-threshold=325
I171010 18:53:26.900521 161 storage/allocator_scorer.go:612  [replicate,n2,s2,r367/2:/Table/53/1/-39{43355…-07282…}] s3: should-rebalance(ranges-overfull): rangeCount=335, mean=308.67, overfull-threshold=325
I171010 18:53:26.901887 161 storage/replica_raftstorage.go:526  [replicate,n2,s2,r367/2:/Table/53/1/-39{43355…-07282…}] generated preemptive snapshot 849a4da6 at index 259745
I171010 18:53:26.904338 161 storage/replicate_queue.go:230  [replicate,n2,s2,r367/2:/Table/53/1/-39{43355…-07282…}] snapshot failed: (n1,s1):?: remote declined snapshot: store busy applying snapshots and/or removing replicas
I171010 18:53:27.789343 105 server/status/runtime.go:219  [n2] runtime stats: 1.1 GiB RSS, 340 goroutines, 269 MiB/378 MiB/784 MiB GO alloc/idle/total, 156 MiB/204 MiB CGO alloc/total, 3113.85cgo/sec, 0.32/0.11 %(u/s)time, 0.00 %gc (1x)
I171010 18:53:28.102891 167 storage/allocator_scorer.go:612  [replicate,n2,s2,r456/2:/Table/53/1/-38{71223…-35138…}] s3: should-rebalance(ranges-overfull): rangeCount=335, mean=308.67, overfull-threshold=325
I171010 18:53:28.103119 161 storage/allocator_scorer.go:612  [replicate,n2,s2,r456/2:/Table/53/1/-38{71223…-35138…}] s6: should-rebalance(ranges-overfull): rangeCount=331, mean=308.67, overfull-threshold=325
I171010 18:53:28.104017 161 storage/replica_raftstorage.go:526  [replicate,n2,s2,r456/2:/Table/53/1/-38{71223…-35138…}] generated preemptive snapshot 306b5882 at index 260051
I171010 18:53:28.106350 161 storage/replicate_queue.go:230  [replicate,n2,s2,r456/2:/Table/53/1/-38{71223…-35138…}] snapshot failed: (n1,s1):?: remote declined snapshot: store busy applying snapshots and/or removing replicas
W171010 18:53:29.728460 389 kv/dist_sender.go:1147  [client=127.0.0.1:48766,user=root,n2] have been waiting 1m0s sending RPC to r291 for batch: [txn: ecb538a7], BeginTransaction [/Table/53/1/-2669039724765244081/0,/Min), Put [/Table/53/1/-2669039724765244081/0,/Min), EndTransaction [/Table/53/1/-2669039724765244081/0,/Min)
W171010 18:53:30.156390 394 kv/dist_sender.go:1147  [client=127.0.0.1:48780,user=root,n2] have been waiting 1m0s sending RPC to r133 for batch: [txn: 5a487fc4], BeginTransaction [/Table/53/1/-3873090155649392032/0,/Min), Put [/Table/53/1/-3873090155649392032/0,/Min), EndTransaction [/Table/53/1/-3873090155649392032/0,/Min)
W171010 18:53:30.184079 360 kv/dist_sender.go:1147  [client=127.0.0.1:48708,user=root,n2] have been waiting 1m0s sending RPC to r509 for batch: [txn: e80dfaed], BeginTransaction [/Table/53/1/164539215546815089/0,/Min), Put [/Table/53/1/164539215546815089/0,/Min), EndTransaction [/Table/53/1/164539215546815089/0,/Min)
W171010 18:53:30.720522 401 kv/dist_sender.go:1147  [client=127.0.0.1:48792,user=root,n2] have been waiting 1m0s sending RPC to r140 for batch: [txn: 2294b96e], BeginTransaction [/Table/53/1/8822121958297262286/0,/Min), Put [/Table/53/1/8822121958297262286/0,/Min), EndTransaction [/Table/53/1/8822121958297262286/0,/Min)

So the DistSender was blocked, and node 2 was meanwhile trying (unsuccessfully) to send preemptive snapshots to node 1.

Looking for r140 on the freshly started node 1, the logs looked like the following:

I171010 18:50:08.191224 662 storage/store.go:3241  [n1,s1,r140/1:/Table/53/1/8{791903…-827853…}] added to replica GC queue (peer suggestion)
I171010 18:50:13.255711 157 storage/allocator_scorer.go:612  [replicate,n1,s1,r140/1:/Table/53/1/8{791903…-827853…}] s6: should-rebalance(ranges-overfull): rangeCount=333, mean=311.00, overfull-threshold=327
...
I171010 18:56:42.705486 157 storage/allocator_scorer.go:612  [replicate,n1,s1,r140/1:/Table/53/1/8{791903…-827853…}] s2: should-rebalance(ranges-overfull): rangeCount=331, mean=305.67, overfull-threshold=321
I171010 18:56:53.137619 152 storage/store.go:2165  [replicaGC,n1,s1,r140/1:/Table/53/1/8{791903…-827853…}] removing replica
I171010 18:56:53.141272 152 storage/replica.go:763  [replicaGC,n1,s1,r140/1:/Table/53/1/8{791903…-827853…}] removed 590434 (590424+10) keys in 3ms [clear=2ms commit=1ms]

It looks like node 1 still thought it held the lease for r140 until the replica got GCed.

m-schneider self-assigned this Oct 10, 2017
@a-robinson
Contributor

And to be more specific, the two main questions to follow up on are:

  1. Why did node 1 think that it had a valid lease even though the other replicas of the stuck ranges didn't even consider node 1's replica a part of the raft group?
  2. Why weren't any of the valid replicas (the ones not on node 1) responding to requests for the range? Were requests even being sent to them by the distsender?

@m-schneider
Contributor Author

After adding more logging, I've noticed two things that could potentially be contributing to this problem. The first is that after the node being cycled (node 1) starts up, other nodes start sending it requests through their DistSenders, and those requests start timing out. I'm still looking into why node 1 isn't responding quickly enough.

The second problem is that the timeout is currently 1 minute (since PR #16088). That slows down all of the nodes, because they all end up sending requests to node 1 and then waiting out the timeout period. I looked at the lease histories of the ranges that were causing these timeouts and noticed that their leases were on node 1 before it was shut down, but the ranges weren't accessed again until node 1 started back up, by which point it was no longer the leaseholder.

As part of the fix for this, I'm thinking that we should remove entries for downed nodes from the DistSender leaseholder caches when we get node liveness updates from gossip.
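
To illustrate the idea, here's a minimal sketch of that eviction, assuming a gossip liveness callback hook; the type and function names (leaseHolderCache, onLivenessUpdate) are made up for illustration and are not the actual DistSender code:

package kv

import (
	"sync"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// Sketch only: drop any cached leaseholder that lives on a node that gossip
// just reported as non-live, so the DistSender re-discovers the lease instead
// of waiting out the 1-minute RPC timeout against that node.
type leaseHolderCache struct {
	mu      sync.Mutex
	entries map[roachpb.RangeID]roachpb.ReplicaDescriptor // rangeID -> cached leaseholder
}

func (c *leaseHolderCache) evictNode(nodeID roachpb.NodeID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for rangeID, repl := range c.entries {
		if repl.NodeID == nodeID {
			delete(c.entries, rangeID)
		}
	}
}

// onLivenessUpdate would be registered as a callback on liveness gossip keys.
func onLivenessUpdate(c *leaseHolderCache, nodeID roachpb.NodeID, live bool) {
	if !live {
		c.evictNode(nodeID)
	}
}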

@tbg
Member

tbg commented Oct 12, 2017

Is this after the node having been down for >10s? What should happen is that the node loses its leadership for all of its ranges and so when it comes back up, nobody's really talking to it initially. Any idea why that doesn't happen? Are these ranges that aren't accessed while the node is down? Or is the node just down for <9s? I'd be interested to hear where these requests are stuck. Sure, they'll be on the DistSender, but where's the "other end" of that request? In the worst case, it'll be on the socket waiting for Accept() (I think we open the listener fairly early but then it can take a bit until we handle connections), but perhaps it's elsewhere.

Also, what does the "took X" say in the startup message below when you cycle the node? Does it bear any relation to how long the connections hang?

CockroachDB node starting at 2017-10-12 18:22:20.293357639 +0000 UTC (took 0.7s)

@m-schneider
Contributor Author

These ranges weren't accessed while the node was down, which is why their leaseholder cache entries weren't updated. I'm looking into the other side right now using traces, as you suggested. I'm seeing a lot of things that look like:

2017/10/12 20:31:13.306622 | 62.405386 | /cockroach.roachpb.Internal/Batch
-- | -- | --
20:31:13.306624 | .     2 | ... span.kind:server
20:31:13.306625 | .     1 | ... component:gRPC
20:31:13.306626 | .     1 | ... txnid:8a353734-c06b-4de0-ac24-c9ee5f5ed3dd
20:31:13.306652 | .    25 | ... [n1] 1 Put, 1 BeginTxn, 1 EndTxn
20:31:13.306662 | .    10 | ... [n1,s1] executing 3 requests
20:31:13.306670 | .     9 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] read-write path
20:31:13.306681 | .    10 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] command queue
20:31:13.306736 | .    56 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #1)
20:31:13.759605 | .452868 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #2)
20:31:13.773860 | . 14256 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #3)
20:31:18.158867 | 4.385006 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #4)
20:31:21.163256 | 3.004389 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #5)
20:31:21.177350 | . 14094 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #6)
20:31:24.959592 | 3.782242 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #7)
20:31:24.992235 | . 32643 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #8)
20:31:28.158014 | 3.165779 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #9)
20:31:32.159787 | 4.001773 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #10)
20:31:32.176981 | . 17193 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #11)
20:31:35.558960 | 3.381979 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #12)
20:31:39.158253 | 3.599294 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #13)
20:31:44.358320 | 5.200066 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #14)
20:31:48.997910 | 4.639590 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #15)
20:31:53.558277 | 4.560367 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #16)
20:31:57.359522 | 3.801244 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #17)
20:31:57.365505 | .  5983 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #18)
20:32:02.759262 | 5.393758 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #19)
20:32:08.558803 | 5.799540 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #20)
20:32:13.760520 | 5.201717 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #21)
20:32:15.664269 | 1.903750 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] request range lease (attempt #22)
20:32:15.711998 | . 47729 | ... [n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] replica.Send got error: r859 was not found
20:32:15.712004 | .     6 | ... [n1] *roachpb.RangeNotFoundError

After the range gets GCed, we break out of the loop in redirectOnOrAcquireLease.

I171012 20:32:15.708618 166 storage/replica.go:763  [replicaGC,n1,s1,r859/10:/Table/53/1/7{587239…-605278…}] removed 548965 (548955+10) keys in 45ms [clear=1ms commit=45ms]

The node startup message doesn't seem to be related to the delays.

From here it looks like we either need to prioritize replica GC, which at the moment is getting interleaved with applying snapshots, or at least error out of the redirectOnOrAcquireLease loop earlier.

@tbg
Member

tbg commented Oct 13, 2017 via email

@tbg
Member

tbg commented Oct 13, 2017

Interesting to see that we looped around in the lease code 21 times. Could you add the following diff and paste another trace?

diff --git a/pkg/storage/replica.go b/pkg/storage/replica.go
index 3351713f1..3051bcc3e 100644
--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -1403,6 +1403,7 @@ func (r *Replica) redirectOnOrAcquireLease(ctx context.Context) (LeaseStatus, *r
                                select {
                                case pErr = <-llChan:
                                        if pErr != nil {
+                                               log.Eventf(ctx, "error: %s", pErr)
                                                switch tErr := pErr.GetDetail().(type) {
                                                case *roachpb.AmbiguousResultError:
                                                        // This can happen if the RequestLease command we sent has been

Do you also see traces for the replica gc queue? If you don't, I think the following is enough to get a rough idea from the log:

diff --git a/pkg/storage/replica_gc_queue.go b/pkg/storage/replica_gc_queue.go
index f360da754..81316c866 100644
--- a/pkg/storage/replica_gc_queue.go
+++ b/pkg/storage/replica_gc_queue.go
@@ -186,6 +186,8 @@ func replicaGCShouldQueueImpl(
 func (rgcq *replicaGCQueue) process(
        ctx context.Context, repl *Replica, _ config.SystemConfig,
 ) error {
+       log.Warningf(ctx, "processing %s", repl)
+       defer log.Warningf(ctx, "done processing %s", repl)
        // Note that the Replicas field of desc is probably out of date, so
        // we should only use `desc` for its static fields like RangeID and
        // StartKey (and avoid rng.GetReplica() for the same reason).

What we should see is that the "done processing" is pretty much immediately followed by the next "processing %s". (You may also want to add tBegin := timeutil.Now() and log time.Since(tBegin) in the defer to avoid having to look at the log timestamps; a sketch follows below.) Presumably the processing will take longer than we'd like (perhaps due to the snapshots), or there's some reason replicas aren't taken off the queue as quickly as we'd like, or there are simply a lot of replicas in the queue and so it takes a bit until it gets to the one we're looking at.
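
For reference, that timing variant could look roughly like this (a sketch on top of the diff above; it assumes the time and timeutil packages are available in that file):

diff --git a/pkg/storage/replica_gc_queue.go b/pkg/storage/replica_gc_queue.go
--- a/pkg/storage/replica_gc_queue.go
+++ b/pkg/storage/replica_gc_queue.go
@@ -186,6 +186,9 @@  func replicaGCShouldQueueImpl(
 func (rgcq *replicaGCQueue) process(
        ctx context.Context, repl *Replica, _ config.SystemConfig,
 ) error {
+       tBegin := timeutil.Now()
+       log.Warningf(ctx, "processing %s", repl)
+       defer func() { log.Warningf(ctx, "done processing %s in %s", repl, time.Since(tBegin)) }()
        // Note that the Replicas field of desc is probably out of date, so
        // we should only use `desc` for its static fields like RangeID and
        // StartKey (and avoid rng.GetReplica() for the same reason).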

tbg changed the title from "stability: stuck requests after chaos" to "stability: stuck requests after rebalancing-inducing downtime" Oct 13, 2017
@m-schneider
Contributor Author

I added the logging for pErr:

2017/10/12 21:10:06.441921 | 25.412361 | /cockroach.roachpb.Internal/Batch
-- | -- | --
21:10:06.441924 | .     3 | ... span.kind:server
21:10:06.441924 | .     1 | ... component:gRPC
21:10:06.441926 | .     2 | ... txnid:54e68336-c2ae-4fad-890b-80c80345e31a
21:10:06.441953 | .    27 | ... [n1] 1 Put, 1 BeginTxn, 1 EndTxn
21:10:06.441965 | .    12 | ... [n1,s1] executing 3 requests
21:10:06.441976 | .    11 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] read-write path
21:10:06.441989 | .    13 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] command queue
21:10:06.442021 | .    32 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] request range lease (attempt #1)
21:10:09.969594 | 3.527573 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] error from attempt result is ambiguous (removing replica)
21:10:09.969610 | .    16 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] request range lease (attempt #2)
21:10:09.994949 | . 25338 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] error from attempt result is ambiguous (removing replica)
21:10:09.994954 | .     5 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] request range lease (attempt #3)
21:10:14.717450 | 4.722496 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] error from attempt result is ambiguous (removing replica)
21:10:14.717455 | .     5 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] request range lease (attempt #4)
21:10:19.517515 | 4.800061 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] error from attempt result is ambiguous (removing replica)
21:10:19.517519 | .     3 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] request range lease (attempt #5)
21:10:23.117569 | 3.600050 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] error from attempt result is ambiguous (removing replica)
21:10:23.117602 | .    33 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] request range lease (attempt #6)
21:10:23.132076 | . 14474 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] error from attempt result is ambiguous (removing replica)
21:10:23.132095 | .    19 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] request range lease (attempt #7)
21:10:28.918464 | 5.786370 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] error from attempt result is ambiguous (removing replica)
21:10:28.918468 | .     4 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] request range lease (attempt #8)
21:10:31.851473 | 2.933005 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] error from attempt result is ambiguous (removing replica)
21:10:31.851643 | .   170 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] request range lease (attempt #9)
21:10:31.854258 | .  2616 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] error from attempt r167 was not found
21:10:31.854273 | .    14 | ... [n1,s1,r167/14:/Table/53/1/-53{45628…-27670…}] replica.Send got error: r167 was not found
21:10:31.854277 | .     5 | ... [n1] *roachpb.RangeNotFoundError

The error is coming from line 771: func (r *Replica) cancelPendingCommandsLocked().

My question is: do we need this to return an AmbiguousResultError? If we return something else, we can check for it and break out of redirectOnOrAcquireLease before the range gets GCed.

Here's a snippet of the GC queue log; replica_gc_queue.go:189 is the "processing" line and :258 is the "done processing" line:

W171016 15:11:01.697146 64 storage/replica_gc_queue.go:189  [replicaGC,n1,s1,r626/22:/Table/53/1/63{58714…-67781…}] processing [n1,s1,r626/22:/Table/53/1/63{58714…-67781…}]
W171016 15:11:01.698820 64 storage/replica_gc_queue.go:258  [replicaGC,n1,s1,r626/22:/Table/53/1/63{58714…-67781…}] done processing [n1,s1,r626/22:/Table/53/1/63{58714…-67781…}]
W171016 15:11:02.870510 64 storage/replica_gc_queue.go:189  [replicaGC,n1,s1,r762/24:/Table/53/1/38{63714…-72749…}] processing [n1,s1,r762/24:/Table/53/1/38{63714…-72749…}]
W171016 15:11:02.872341 64 storage/replica_gc_queue.go:258  [replicaGC,n1,s1,r762/24:/Table/53/1/38{63714…-72749…}] done processing [n1,s1,r762/24:/Table/53/1/38{63714…-72749…}]
W171016 15:11:03.448207 64 storage/replica_gc_queue.go:189  [replicaGC,n1,s1,r2101/7:/Table/53/1/75{06442…-15415…}] processing [n1,s1,r2101/7:/Table/53/1/75{06442…-15415…}]
W171016 15:11:03.451160 64 storage/replica_gc_queue.go:258  [replicaGC,n1,s1,r2101/7:/Table/53/1/75{06442…-15415…}] done processing [n1,s1,r2101/7:/Table/53/1/75{06442…-15415…}]
W171016 15:11:05.759834 64 storage/replica_gc_queue.go:189  [replicaGC,n1,s1,r2141/6:/Table/53/1/-5{60562…-59670…}] processing [n1,s1,r2141/6:/Table/53/1/-5{60562…-59670…}]
W171016 15:11:11.663542 64 storage/replica_gc_queue.go:258  [replicaGC,n1,s1,r2141/6:/Table/53/1/-5{60562…-59670…}] done processing [n1,s1,r2141/6:/Table/53/1/-5{60562…-59670…}]
W171016 15:11:15.330663 64 storage/replica_gc_queue.go:189  [replicaGC,n1,s1,r2038/9:/Table/53/1/70{01092…-10077…}] processing [n1,s1,r2038/9:/Table/53/1/70{01092…-10077…}]
W171016 15:11:26.027591 64 storage/replica_gc_queue.go:258  [replicaGC,n1,s1,r2038/9:/Table/53/1/70{01092…-10077…}] done processing [n1,s1,r2038/9:/Table/53/1/70{01092…-10077…}]

Sometimes one is followed pretty closely by the next, and sometimes there's a gap of more than a second.

@a-robinson
Contributor

My question is: do we need this to return an AmbiguousResultError? If we return something else, we can check for it and break out of redirectOnOrAcquireLease before the range gets GCed.

I do think that returning an AmbiguousResultError is reasonable in the context of cancelling proposed commands on a removed replica. There are other situations where a replica's proposed command may have succeeded before it got removed from the raft group. However, what we could more safely do is set a field on the replica to tell it to refuse to propose any more commands (with a RangeNotFoundError) unless it gets assigned a new replica ID. That would short-circuit the infinite retry loop we're hitting in the lease acquisition code.
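
A rough sketch of that idea, assuming a flag on the replica; the type and method names here (replicaStub, pendingRemoval, checkProposableLocked) are illustrative, not the actual Replica implementation:

package storage

import (
	"sync"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// Sketch only: a replica whose pending commands were cancelled because it is
// being removed refuses further proposals until it gets a new replica ID.
type replicaStub struct {
	rangeID roachpb.RangeID
	mu      struct {
		sync.Mutex
		replicaID      roachpb.ReplicaID
		pendingRemoval bool // set when pending commands are cancelled on removal
	}
}

// checkProposableLocked is what proposal paths (including lease requests from
// redirectOnOrAcquireLease) would consult, failing fast with RangeNotFoundError
// instead of retrying until the replica GC queue gets around to the range.
func (r *replicaStub) checkProposableLocked() *roachpb.Error {
	if r.mu.pendingRemoval {
		return roachpb.NewError(&roachpb.RangeNotFoundError{RangeID: r.rangeID})
	}
	return nil
}

// setReplicaIDLocked clears the flag if the range is re-added with a new ID.
func (r *replicaStub) setReplicaIDLocked(id roachpb.ReplicaID) {
	r.mu.replicaID = id
	r.mu.pendingRemoval = false
}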

@tbg
Member

tbg commented Oct 17, 2017 via email

@bdarnell
Contributor

Recapping our conversation at lunch: I think it's probably going to be more effective to make sure that we can return the NotLeaseHolderError quickly and move on to the correct node than to try and make replica GC faster (I had wondered if we were using rocksdb range deletions for replica GC yet; we are). To @tschottdorf's question, I think ReplicaTooOldError is unambiguous so we could skip the consistent lookup, so that might speed things up.

One detail I was missing is that the place where we're spending time is with the DistSender's pending RPC timeout. This sounds like a context propagation issue. We should be cancelling the RPC's context when we go into the pending RPC wait loop so they'll return quickly instead of running out the clock.
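
A minimal sketch of the shape of that fix, assuming a generic fan-out helper; sendToReplicas here is an illustrative stand-in, not the actual DistSender code:

package kv

import "context"

// Sketch only: derive a cancellable context for the outstanding RPCs so that
// once we stop caring about them (e.g. one replica already answered, or we
// fall into the pending-RPC wait loop), cancelling the context unwinds the
// rest immediately instead of letting them run out the 1-minute clock.
func sendToReplicas(
	ctx context.Context, addrs []string, send func(context.Context, string) error,
) error {
	rpcCtx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels any still-pending RPCs as soon as we return

	errCh := make(chan error, len(addrs))
	for _, addr := range addrs {
		addr := addr
		go func() { errCh <- send(rpcCtx, addr) }()
	}

	var lastErr error
	for range addrs {
		err := <-errCh
		if err == nil {
			return nil // first success wins; the deferred cancel stops the rest
		}
		lastErr = err
	}
	return lastErr
}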

We should probably also put an upper bound timeout on all lease attempts - if we're waiting more than a few seconds we should just return a NotLeaseHolderError instead of continuing to wait (the upper levels can retry if they want). But just putting a 10s timeout on all lease acquisitions would still be pretty bad so this is just a backstop to prevent the worst case.
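
As a sketch of that backstop (the 10s value and the helper name waitForLease are placeholders taken from this discussion, not actual code):

package storage

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// maxLeaseWait is the placeholder backstop value discussed above.
const maxLeaseWait = 10 * time.Second

// waitForLease bounds the wait on a pending lease acquisition and converts a
// timeout into a NotLeaseHolderError so the caller (and ultimately the
// DistSender) can move on to another replica; upper levels may retry.
func waitForLease(
	ctx context.Context, llChan <-chan *roachpb.Error, rangeID roachpb.RangeID,
) *roachpb.Error {
	timer := time.NewTimer(maxLeaseWait)
	defer timer.Stop()
	select {
	case pErr := <-llChan:
		return pErr
	case <-timer.C:
		// Stop looping in redirectOnOrAcquireLease; tell the client to look
		// for the lease elsewhere.
		return roachpb.NewError(&roachpb.NotLeaseHolderError{RangeID: rangeID})
	case <-ctx.Done():
		return roachpb.NewError(ctx.Err())
	}
}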

The error is coming from line 771: func (r *Replica) cancelPendingCommandsLocked().

My question is: do we need this to return an AmbiguousResultError? If we return something else, we can check for it and break out of redirectOnOrAcquireLease before the range gets GCed.

It's correct for cancelPendingCommandsLocked to return an ambiguous error. But if that gets returned from redirectOnOrAcquireLease, it should get turned into an unambiguous error since we know that the client operation was never proposed.

@tbg
Member

tbg commented Oct 18, 2017

It's correct for cancelPendingCommandsLocked to return an ambiguous error. But if that gets returned from redirectOnOrAcquireLease, it should get turned into an unambiguous error since we know that the client operation was never proposed.

That's a good point, though what error would we return? NotLeaseholder? That's reasonable, though it wouldn't completely solve the loopy trace above because there could be a lot of DistSenders with the stale cache, plus after eviction the DistSenders will try randomly and might just hit the same node again. I think that setting (Replica).mu.destroyed from cancelPendingCommandsLocked() (or its relevant caller) would be more reliable. (Assuming we reset it appropriately when the replica gets recreated).

@tbg
Member

tbg commented Oct 18, 2017

(And this isn't to say that we shouldn't do the lease acquisition error termination - we should of course get rid of ambiguous errors wherever we can)

cuongdo added this to the 1.2 milestone Oct 30, 2017
@cuongdo
Contributor

cuongdo commented Oct 30, 2017

@m-schneider What's the latest on this issue?

@m-schneider
Contributor Author

m-schneider commented Oct 30, 2017 via email

@a-robinson
Contributor

This was closed by #19353, right?

@m-schneider
Contributor Author

m-schneider commented Nov 21, 2017 via email

@gpaul

gpaul commented Nov 28, 2017

@m-schneider I am trying to replicate this issue on v1.0.6.

I've tried a couple of different strategies but have yet to succeed. Would you mind providing the exact steps you used to reproduce?

@m-schneider
Contributor Author

m-schneider commented Nov 28, 2017 via email

@gpaul

gpaul commented Nov 28, 2017

Thanks, that helps already. Could you expand on the kind of load generator? Does the schema matter? Should I fill the database from all nodes concurrently, or can all inserts be sent to a single node? Do queries need a specific form, e.g. JOINs? When you say 'latencies', do you mean that inserts become slow, selects become slow, or both?

Would the following work?

  1. 5 nodes
  2. a table with a single column consisting of integers
  3. perform serial inserts against a single node, using a single client, with records consisting of a single incrementing integer.
  4. after 1 hour, stop inserting.
  5. perform select * from numbers where numbers.num=1 against each of the 5 nodes, serially and continuously, using a single client per node.
  6. stop a node, sleep 10 minutes, start the node, sleep 10 minutes, repeat.

My timezone is pretty much 'opposite side of the planet' from SF, which is why I want to make sure I understand the instructions clearly. Thanks @m-schneider

@m-schneider
Contributor Author

m-schneider commented Nov 28, 2017 via email
