stability: PreVote can lead to unavailable ranges #18151

Closed
petermattis opened this issue Sep 1, 2017 · 14 comments

@petermattis (Collaborator)

Possibly related to #17741.

blue is running 90840ef. It has been experiencing 0 QPS for most of the last week. The only spikes of throughput occur immediately after a node recovers from chaos. I've now disabled chaos, enabled some additional debugging settings and stopped load on all nodes except blue-01.

/debug/requests on blue-01 shows a common pattern:

2017/09/01 17:32:45.440569 | 23.573349 | sql txn implicit
-- | -- | --
17:32:45.440633 | .    64 | ... [client=127.0.0.1:51556,user=root,n1] executing 1/1: UPSERT INTO test.kv(k, v) VALUES ($1, $2)
17:32:45.440703 | .    70 | ... [client=127.0.0.1:51556,user=root,n1] client.Txn did AutoCommit. err: <nil> txn: "unnamed" id=d4d211c5 key=/Min rw=false pri=0.03355085 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1504287165.440651232,0 orig=1504287165.440651232,0 max=1504287165.940651232,0 wto=false rop=false seq=1
17:32:45.440719 | .    16 | ... [client=127.0.0.1:51556,user=root,n1] added table 'test.kv' to table collection
17:32:45.440798 | .    80 | ... [client=127.0.0.1:51556,user=root,n1] query not supported for distSQL: mutations not supported
17:32:45.440876 | .    77 | ... txnID:4dc3decd-2e34-4be6-aeb8-5860ba5f9e90
17:32:45.440879 | .     4 | ... txnID:4dc3decd-2e34-4be6-aeb8-5860ba5f9e90
17:32:45.440908 | .    29 | ... [client=127.0.0.1:51556,user=root,n1] querying next range at /Table/51/1/-2175475428815365171/0
17:32:45.440979 | .    71 | ... [client=127.0.0.1:51556,user=root,n1] r1530: sending batch 1 Put, 1 BeginTxn, 1 EndTxn to (n2,s2):52
17:32:45.441017 | .    38 | ... [client=127.0.0.1:51556,user=root,n1] sending request to cockroach-blue-02:26257

Every single request is stuck sending to blue-02. On blue-02 we see:


2017/09/01 17:21:42.192069 | 773.548566 | /cockroach.roachpb.Internal/Batch
-- | -- | --
17:21:42.192073 | .     3 | ... span.kind:server
17:21:42.192074 | .     1 | ... component:gRPC
17:21:42.192075 | .     2 | ... txnid:576b8304-2787-4093-b5ac-43799a20b7e5
17:21:42.192114 | .    38 | ... [n2] 1 Scan
17:21:42.192126 | .    12 | ... [n2,s2] executing 1 requests
17:21:42.192138 | .    12 | ... [n2,s2,r1782/40:/Table/51/1/469{36192…-81511…}] read-only path
17:21:42.192146 | .     8 | ... [n2,s2,r1782/40:/Table/51/1/469{36192…-81511…}] request range lease (attempt #1)
17:21:45.957045 | 3.764899 | ... [n2,s2,r1782/40:/Table/51/1/469{36192…-81511…}] request range lease (attempt #2)
17:21:51.342252 | 5.385207 | ... [n2,s2,r1782/40:/Table/51/1/469{36192…-81511…}] request range lease (attempt #3)
...

In the above instance, there are now 145 lease attempts. Would be nice to know why they are failing and why another node isn't grabbing the lease.

Node liveness looks reasonable on every node except for blue-02:

[screenshot: node liveness graphs, 2017-09-01 1:36 PM]

Hmm, something seems to have unwedged itself as I just saw a burst of throughput, but now it is falling again.

@petermattis petermattis added this to the 1.1 milestone Sep 1, 2017
@petermattis petermattis self-assigned this Sep 1, 2017
@petermattis (Collaborator, Author)

Strange, the badness seems to have migrated to blue-05 as evidenced by where requests on blue-01 are now stuck. On blue-05:

2017/09/01 17:38:01.851387 | 180.396263 | /cockroach.roachpb.Internal/Batch
-- | -- | --
17:38:01.851390 | .     3 | ... span.kind:server
17:38:01.851391 | .     1 | ... component:gRPC
17:38:01.851393 | .     2 | ... txnid:66b8fceb-cf9f-4e0a-92eb-c412937a8d53
17:38:01.851434 | .    41 | ... [n5] 1 Put, 1 BeginTxn, 1 EndTxn
17:38:01.851448 | .    13 | ... [n5,s5] executing 3 requests
17:38:01.851460 | .    12 | ... [n5,s5,r2041/20:/Table/51/1/64{77451…-81949…}] read-write path
17:38:01.851475 | .    15 | ... [n5,s5,r2041/20:/Table/51/1/64{77451…-81949…}] command queue
17:38:01.851574 | .   100 | ... [n5,s5,r2041/20:/Table/51/1/64{77451…-81949…}] request range lease (attempt #1)
17:39:01.851835 | 60.000261 | ... storage/replica.go:1378 [n5,s5,r2041/20:/Table/51/1/64{77451…-81949…}] have been waiting 1m0s attempting to acquire lease

r2041 has only 2 replicas:

[screenshot: r2041 replica list, 2017-09-01 1:42 PM]

@petermattis (Collaborator, Author)

Trying to piece together what happened to r2041. On n3, I see:

I170825 02:18:31.174687 80060 storage/replica_raftstorage.go:718  [n3,s3,r2041/?:{-}] applying preemptive snapshot at index 573633 (id=39a11ddd, encoded size=56930303, 217 rocksdb batches, 177 log entries)
I170825 02:18:33.313474 80060 storage/replica_raftstorage.go:726  [n3,s3,r2041/?:/Table/51/1/64{77451…-81949…}] applied preemptive snapshot in 2139ms [clear=614ms batch=857ms entries=0ms commit=666ms]
I170825 02:21:12.232413 79 storage/store.go:2146  [replicaGC,n3,s3,r2041/?:/Table/51/1/64{77451…-81949…}] removing replica
I170825 02:21:12.236977 79 storage/replica.go:733  [replicaGC,n3,s3,r2041/?:/Table/51/1/64{77451…-81949…}] removed 568275 (568263+12) keys in 4ms [clear=3ms commit=1ms]

This seems to correspond to this rebalance on n7:

I170825 02:18:23.784999 91 storage/replica_raftstorage.go:509  [replicate,n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] generated preemptive snapshot 39a11ddd at index 573633
I170825 02:18:31.169945 91 storage/store.go:3453  [replicate,n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] streamed snapshot to (n3,s3):?: kv pairs: 568321, log entries: 177, rate-limit: 8.0 MiB/sec, 7385ms
I170825 02:18:33.315863 91 storage/replica_command.go:3639  [replicate,n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] change replicas (ADD_REPLICA (n3,s3):26): read existing descriptor r2041:/Table/51/1/64{77451831742334078-81949168901650570} [(n5,s5):20, (n7,s7):23, next=26]

Note that the range already has only 2 replicas at this point.

@petermattis (Collaborator, Author)

Ah, the next few log lines on n7 show:

W170825 02:19:23.782155 680410 storage/replica.go:2089  [n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] context deadline exceeded while in command queue: [txn: 768f8aa9], BeginTransaction [/Local/Range/Table/51/1/6477451831742334078/RangeDescriptor,/Min), ConditionalPut [/Local/Range/Table/51/1/6477451831742334078/RangeDescriptor,/Min)
E170825 02:19:23.782662 91 storage/queue.go:656  [replicate,n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] change replicas of r2041 failed: context deadline exceeded

So that rebalance never succeeded, which is probably why the replica was eventually GC'd by n3. Looking further, I'm seeing that every rebalance attempt for this range is failing.

n5 (the other node containing a replica of r2041) has lots of the following messages:

W170825 02:18:22.374618 303543 storage/store.go:3216  [n5,s5] got error from r2041, replica (n3,s3):25: raft group deleted
W170825 02:18:28.170909 303543 storage/store.go:3216  [n5,s5] got error from r2041, replica (n3,s3):25: raft group deleted
W170825 02:18:33.971905 303543 storage/store.go:3216  [n5,s5] got error from r2041, replica (n3,s3):25: raft group deleted

@bdarnell (Contributor) commented Sep 1, 2017

It's normal that every subsequent rebalance would fail if neither of the two existing replicas is the raft leader.

This appears to be a bug in PreVote: the replica on node 7 is ahead of the one on node 5 (by raft log LastIndex), but it has a lower term. Node 7 will never vote (pre-vote or regular) for node 5 because node 5's LastIndex is lower, and node 7's own MsgPreVote is getting dropped (early in the raft message pipeline) because its term is too low. Without prevote, terms are adjusted more often, so I don't think you could get stuck like this. (This is probably also an artifact of the transition from CheckQuorum to PreVote: node 5 wouldn't have increased its term under PreVote, but now that it has, the higher term is persistent.)

So before we can re-enable preVote, we need another change to etcd/raft to allow positive responses to preVotes with an older term (as long as they're higher than our last log term).
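
To make the interaction concrete, here is a minimal, self-contained Go sketch of the two checks described above. It is not etcd/raft's actual code: the types, function names, and term values are illustrative, and only the log indexes come from the numbers reported for n7 and n5 later in this thread.

```go
package main

import "fmt"

// replica is a toy model of the state that matters here (illustrative only).
type replica struct {
	id           uint64
	term         uint64 // current term, persisted across restarts
	lastLogIndex uint64
	lastLogTerm  uint64
}

// logUpToDate is the usual Raft vote criterion: the candidate's log must be
// at least as up-to-date as the voter's.
func logUpToDate(candidate, voter replica) bool {
	if candidate.lastLogTerm != voter.lastLogTerm {
		return candidate.lastLogTerm > voter.lastLogTerm
	}
	return candidate.lastLogIndex >= voter.lastLogIndex
}

// grantsPreVote models the pre-fix behavior described above: a MsgPreVote
// whose term is below the voter's current term is dropped before the log
// comparison ever runs.
func grantsPreVote(candidate, voter replica) bool {
	if candidate.term < voter.term {
		return false // dropped early in the raft message pipeline
	}
	return logUpToDate(candidate, voter)
}

func main() {
	// Terms are made up; the indexes match the r2041 figures reported below.
	n5 := replica{id: 5, term: 20, lastLogIndex: 582182, lastLogTerm: 15}
	n7 := replica{id: 7, term: 15, lastLogIndex: 738999, lastLogTerm: 15}

	fmt.Println("n7 grants pre-vote to n5:", grantsPreVote(n5, n7)) // false: n5's log is behind
	fmt.Println("n5 grants pre-vote to n7:", grantsPreVote(n7, n5)) // false: n7's term is too low

	// Neither replica can assemble a pre-vote quorum of the two live
	// replicas, so the range has no leader and no lease.
}
```

If I read the proposed change correctly, it would let the log comparison run even when the pre-vote carries an older term, so the second call above would succeed and n7 could win the election.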

@petermattis (Collaborator, Author)

Here is when r2041 dropped from 3 to 2 replicas:

I170825 01:42:21.177613 91 storage/replica_command.go:3639  [replicate,n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] change replicas (REMOVE_REPLICA (n3,s3):25): read existing descriptor r2041:/Table/51/1/64{77451831742334078-81949168901650570} [(n5,s5):20, (n7,s7):23, (n3,s3):25, next=26]
I170825 01:42:46.484066 640116 storage/replica.go:3005  [n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] proposing REMOVE_REPLICA((n3,s3):25): updated=[(n5,s5):20 (n7,s7):23] next=26

Along with that removal I see:

W170825 01:42:47.646765 640081 kv/txn_coord_sender.go:775  [replicate,s7,r2041/23:/Table/51/1/64{77451…-81949…},n7] heartbeat to "change-replica" id=358f57e7 key=/Local/Range/Table/51/1/6477451831742334078/RangeDescriptor rw=true pri=0.02760799 iso=SERIALIZABLE stat=PENDING epo=0 ts=1503625341.173597875,0 orig=1503625341.173597875,0 max=1503625341.673597875,0 wto=false rop=false seq=9 failed: context canceled

@petermattis (Collaborator, Author)

@bdarnell Do you think the cluster will recover if I restart with pre-vote disabled?

@bdarnell (Contributor) commented Sep 1, 2017

Yes, this range should recover when pre-vote is disabled.

@petermattis (Collaborator, Author)

Bold declaration. I'll restart the cluster with pre-vote disabled to see if you're right.

@bdarnell (Contributor) commented Sep 1, 2017

Not too bold. I specifically said "this range", not the whole cluster :)

@petermattis (Collaborator, Author)

Ok, restarted with pre-vote disabled and r2041 elected n7 as the leader.

@petermattis (Collaborator, Author)

Throughput on the cluster is dropping to 0 again due to r2041, but I think it will resolve itself. We're unable to get a lease for that range because we've built up an enormous Raft log: n7 is at Raft log index 738999 while n5 is at 582182. n5 is slowly catching up, but this will likely take a while.
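
For reference, a minimal sketch of where numbers like these come from, assuming direct access to an etcd/raft Node and the pre-module "github.com/coreos/etcd/raft" import path in use at the time (CockroachDB surfaces similar per-range information in its debug pages; this just goes straight to the raft API):

```go
package main

import (
	"fmt"

	"github.com/coreos/etcd/raft"
)

// followerLag reports, from the leader's point of view, how many log entries
// each follower still has to replicate (lastIndex minus the follower's Match).
// Status().Progress is only populated on the leader.
func followerLag(n raft.Node, lastIndex uint64) map[uint64]uint64 {
	lag := make(map[uint64]uint64)
	for id, pr := range n.Status().Progress {
		if pr.Match < lastIndex {
			lag[id] = lastIndex - pr.Match
		}
	}
	return lag
}

func main() {
	// With the figures above: n7 (the leader) is at index 738999 and n5 has
	// only reached 582182, so n5 still has roughly this many entries to go:
	fmt.Println(738999-582182, "log entries") // 156817
}
```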

@petermattis (Collaborator, Author)

r2041 caught up about a minute ago and throughput on blue recovered at the same time.

@irfansharif (Contributor)

Filed upstream: etcd-io/etcd#8501.
This sufficiently explains #18129, so I'm closing that in favor of this one in the interim. It aligns with the fact that problem ranges didn't always persist through chaos: with enough leadership changes, replicas stuck in the PreCandidate phase with higher terms were eventually overtaken by healthy replica pairs (whose terms kept increasing).

@bdarnell (Contributor) commented Feb 1, 2018

I've been doing some chaos testing and I haven't seen any persistently-stuck ranges, so I'm tentatively marking this as fixed by etcd-io/etcd#8525. Discussion will continue in #16950.

@bdarnell bdarnell closed this as completed Feb 1, 2018