stability: PreVote can lead to unavailable ranges #18151

Closed
petermattis opened this issue Sep 1, 2017 · 14 comments

@petermattis (Collaborator)

Possibly related to #17741.

blue is running 90840ef. It has been experiencing 0 QPS for most of the last week. The only spikes of throughput occur immediately after a node recovers from chaos. I've now disabled chaos, enabled some additional debugging settings and stopped load on all nodes except blue-01.

/debug/requests on blue-01 shows a common pattern:

2017/09/01 17:32:45.440569 | 23.573349 | sql txn implicit
-- | -- | --
17:32:45.440633 | .    64 | ... [client=127.0.0.1:51556,user=root,n1] executing 1/1: UPSERT INTO test.kv(k, v) VALUES ($1, $2)
17:32:45.440703 | .    70 | ... [client=127.0.0.1:51556,user=root,n1] client.Txn did AutoCommit. err: <nil> txn: "unnamed" id=d4d211c5 key=/Min rw=false pri=0.03355085 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1504287165.440651232,0 orig=1504287165.440651232,0 max=1504287165.940651232,0 wto=false rop=false seq=1
17:32:45.440719 | .    16 | ... [client=127.0.0.1:51556,user=root,n1] added table 'test.kv' to table collection
17:32:45.440798 | .    80 | ... [client=127.0.0.1:51556,user=root,n1] query not supported for distSQL: mutations not supported
17:32:45.440876 | .    77 | ... txnID:4dc3decd-2e34-4be6-aeb8-5860ba5f9e90
17:32:45.440879 | .     4 | ... txnID:4dc3decd-2e34-4be6-aeb8-5860ba5f9e90
17:32:45.440908 | .    29 | ... [client=127.0.0.1:51556,user=root,n1] querying next range at /Table/51/1/-2175475428815365171/0
17:32:45.440979 | .    71 | ... [client=127.0.0.1:51556,user=root,n1] r1530: sending batch 1 Put, 1 BeginTxn, 1 EndTxn to (n2,s2):52
17:32:45.441017 | .    38 | ... [client=127.0.0.1:51556,user=root,n1] sending request to cockroach-blue-02:26257

Every single request is stuck sending to blue-02. On blue-02 we see:


2017/09/01 17:21:42.192069 | 773.548566 | /cockroach.roachpb.Internal/Batch
-- | -- | --
17:21:42.192073 | .     3 | ... span.kind:server
17:21:42.192074 | .     1 | ... component:gRPC
17:21:42.192075 | .     2 | ... txnid:576b8304-2787-4093-b5ac-43799a20b7e5
17:21:42.192114 | .    38 | ... [n2] 1 Scan
17:21:42.192126 | .    12 | ... [n2,s2] executing 1 requests
17:21:42.192138 | .    12 | ... [n2,s2,r1782/40:/Table/51/1/469{36192…-81511…}] read-only path
17:21:42.192146 | .     8 | ... [n2,s2,r1782/40:/Table/51/1/469{36192…-81511…}] request range lease (attempt #1)
17:21:45.957045 | 3.764899 | ... [n2,s2,r1782/40:/Table/51/1/469{36192…-81511…}] request range lease (attempt #2)
17:21:51.342252 | 5.385207 | ... [n2,s2,r1782/40:/Table/51/1/469{36192…-81511…}] request range lease (attempt #3)
...

In the above instance, there are now 145 lease attempts. Would be nice to know why they are failing and why another node isn't grabbing the lease.

Node liveness looks reasonable on every node except for blue-02:

[screenshot: node liveness graphs, 2017-09-01 1:36 PM]

Hmm, something seems to have unwedged itself as I just saw a burst of throughput, but now it is falling again.

@petermattis petermattis added this to the 1.1 milestone Sep 1, 2017
@petermattis petermattis self-assigned this Sep 1, 2017
@petermattis (Collaborator, Author)

Strange, the badness seems to have migrated to blue-05 as evidenced by where requests on blue-01 are now stuck. On blue-05:

2017/09/01 17:38:01.851387 | 180.396263 | /cockroach.roachpb.Internal/Batch
-- | -- | --
17:38:01.851390 | .     3 | ... span.kind:server
17:38:01.851391 | .     1 | ... component:gRPC
17:38:01.851393 | .     2 | ... txnid:66b8fceb-cf9f-4e0a-92eb-c412937a8d53
17:38:01.851434 | .    41 | ... [n5] 1 Put, 1 BeginTxn, 1 EndTxn
17:38:01.851448 | .    13 | ... [n5,s5] executing 3 requests
17:38:01.851460 | .    12 | ... [n5,s5,r2041/20:/Table/51/1/64{77451…-81949…}] read-write path
17:38:01.851475 | .    15 | ... [n5,s5,r2041/20:/Table/51/1/64{77451…-81949…}] command queue
17:38:01.851574 | .   100 | ... [n5,s5,r2041/20:/Table/51/1/64{77451…-81949…}] request range lease (attempt #1)
17:39:01.851835 | 60.000261 | ... storage/replica.go:1378 [n5,s5,r2041/20:/Table/51/1/64{77451…-81949…}] have been waiting 1m0s attempting to acquire lease

r2041 has only 2 replicas:

[screenshot: r2041 replica list, 2017-09-01 1:42 PM]

@petermattis (Collaborator, Author)

Trying to piece together what happened to r2041. On n3, I see:

I170825 02:18:31.174687 80060 storage/replica_raftstorage.go:718  [n3,s3,r2041/?:{-}] applying preemptive snapshot at index 573633 (id=39a11ddd, encoded size=56930303, 217 rocksdb batches, 177 log entries)
I170825 02:18:33.313474 80060 storage/replica_raftstorage.go:726  [n3,s3,r2041/?:/Table/51/1/64{77451…-81949…}] applied preemptive snapshot in 2139ms [clear=614ms batch=857ms entries=0ms commit=666ms]
I170825 02:21:12.232413 79 storage/store.go:2146  [replicaGC,n3,s3,r2041/?:/Table/51/1/64{77451…-81949…}] removing replica
I170825 02:21:12.236977 79 storage/replica.go:733  [replicaGC,n3,s3,r2041/?:/Table/51/1/64{77451…-81949…}] removed 568275 (568263+12) keys in 4ms [clear=3ms commit=1ms]

This seems to correspond to this rebalance on n7:

I170825 02:18:23.784999 91 storage/replica_raftstorage.go:509  [replicate,n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] generated preemptive snapshot 39a11ddd at index 573633
I170825 02:18:31.169945 91 storage/store.go:3453  [replicate,n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] streamed snapshot to (n3,s3):?: kv pairs: 568321, log entries: 177, rate-limit: 8.0 MiB/sec, 7385ms
I170825 02:18:33.315863 91 storage/replica_command.go:3639  [replicate,n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] change replicas (ADD_REPLICA (n3,s3):26): read existing descriptor r2041:/Table/51/1/64{77451831742334078-81949168901650570} [(n5,s5):20, (n7,s7):23, next=26]

Note that the range already has only 2 replicas at this point.

@petermattis (Collaborator, Author)

Ah, the next few log lines on n7 show:

W170825 02:19:23.782155 680410 storage/replica.go:2089  [n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] context deadline exceeded while in command queue: [txn: 768f8aa9], BeginTransaction [/Local/Range/Table/51/1/6477451831742334078/RangeDescriptor,/Min), ConditionalPut [/Local/Range/Table/51/1/6477451831742334078/RangeDescriptor,/Min)
E170825 02:19:23.782662 91 storage/queue.go:656  [replicate,n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] change replicas of r2041 failed: context deadline exceeded

So that rebalance never succeeded, which is probably why the replica was eventually GC'd by n3. Looking further, I'm seeing that every rebalance attempt for this range is failing.

n5 (the other node containing a replica of r2041) has lots of the following messages:

W170825 02:18:22.374618 303543 storage/store.go:3216  [n5,s5] got error from r2041, replica (n3,s3):25: raft group deleted
W170825 02:18:28.170909 303543 storage/store.go:3216  [n5,s5] got error from r2041, replica (n3,s3):25: raft group deleted
W170825 02:18:33.971905 303543 storage/store.go:3216  [n5,s5] got error from r2041, replica (n3,s3):25: raft group deleted

@bdarnell (Contributor) commented Sep 1, 2017

It's normal that every subsequent rebalance would fail if neither of the two existing replicas is the raft leader.

This appears to be a bug in PreVote: the replica on node 7 is ahead of the one on node 5 (by raft log LastIndex), but it has a lower term. Node 7 will never vote (pre-vote or regular) for node 5 because node 5's LastIndex is lower, and node 7's own MsgPreVote is getting dropped (early in the raft message pipeline) because its term is too low. Without prevote, terms are adjusted more often, so I don't think you could get stuck like this. (This is probably also an artifact of the transition from CheckQuorum to PreVote: node 5 wouldn't have increased its term under PreVote, but now that it has, the higher term is persistent.)

So before we can re-enable preVote, we need another change to etcd/raft to allow positive responses to preVotes with an older term (as long as they're higher than our last log term).
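
To make the interaction concrete, here is a minimal, self-contained Go sketch of the two checks described above. It is not etcd/raft's actual code: the types, function names, and term values are illustrative, and only the log indexes come from the numbers reported for n7 and n5 later in this thread.

```go
package main

import "fmt"

// replica is a toy model of the state that matters here (illustrative only).
type replica struct {
	id           uint64
	term         uint64 // current term, persisted across restarts
	lastLogIndex uint64
	lastLogTerm  uint64
}

// logUpToDate is the usual Raft vote criterion: the candidate's log must be
// at least as up-to-date as the voter's.
func logUpToDate(candidate, voter replica) bool {
	if candidate.lastLogTerm != voter.lastLogTerm {
		return candidate.lastLogTerm > voter.lastLogTerm
	}
	return candidate.lastLogIndex >= voter.lastLogIndex
}

// grantsPreVote models the pre-fix behavior described above: a MsgPreVote
// whose term is below the voter's current term is dropped before the log
// comparison ever runs.
func grantsPreVote(candidate, voter replica) bool {
	if candidate.term < voter.term {
		return false // dropped early in the raft message pipeline
	}
	return logUpToDate(candidate, voter)
}

func main() {
	// Terms are made up; the indexes match the r2041 figures reported below.
	n5 := replica{id: 5, term: 20, lastLogIndex: 582182, lastLogTerm: 15}
	n7 := replica{id: 7, term: 15, lastLogIndex: 738999, lastLogTerm: 15}

	fmt.Println("n7 grants pre-vote to n5:", grantsPreVote(n5, n7)) // false: n5's log is behind
	fmt.Println("n5 grants pre-vote to n7:", grantsPreVote(n7, n5)) // false: n7's term is too low

	// Neither replica can assemble a pre-vote quorum of the two live
	// replicas, so the range has no leader and no lease.
}
```

If I read the proposed change correctly, it would let the log comparison run even when the pre-vote carries an older term, so the second call above would succeed and n7 could win the election.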

@petermattis (Collaborator, Author)

Here is when r2041 dropped from 3 to 2 replicas:

I170825 01:42:21.177613 91 storage/replica_command.go:3639  [replicate,n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] change replicas (REMOVE_REPLICA (n3,s3):25): read existing descriptor r2041:/Table/51/1/64{77451831742334078-81949168901650570} [(n5,s5):20, (n7,s7):23, (n3,s3):25, next=26]
I170825 01:42:46.484066 640116 storage/replica.go:3005  [n7,s7,r2041/23:/Table/51/1/64{77451…-81949…}] proposing REMOVE_REPLICA((n3,s3):25): updated=[(n5,s5):20 (n7,s7):23] next=26

Along with that removal I see:

W170825 01:42:47.646765 640081 kv/txn_coord_sender.go:775  [replicate,s7,r2041/23:/Table/51/1/64{77451…-81949…},n7] heartbeat to "change-replica" id=358f57e7 key=/Local/Range/Table/51/1/6477451831742334078/RangeDescriptor rw=true pri=0.02760799 iso=SERIALIZABLE stat=PENDING epo=0 ts=1503625341.173597875,0 orig=1503625341.173597875,0 max=1503625341.673597875,0 wto=false rop=false seq=9 failed: context canceled

@petermattis (Collaborator, Author)

@bdarnell Do you think the cluster will recover if I restart with pre-vote disabled?

@bdarnell (Contributor) commented Sep 1, 2017

Yes, this range should recover when pre-vote is disabled.

@petermattis (Collaborator, Author)

Bold declaration. I'll restart the cluster with pre-vote disabled to see if you're right.

@bdarnell (Contributor) commented Sep 1, 2017

Not too bold. I specifically said "this range", not the whole cluster :)

@petermattis (Collaborator, Author)

Ok, restarted with pre-vote disabled and r2041 elected n7 as the leader.

@petermattis (Collaborator, Author)

Throughput on the cluster is dropping to 0 again due to r2041, but I think it will resolve itself. We're unable to get a lease for that range because we've built up an enormous Raft log: n7 is at Raft log index 738999 while n5 is at 582182. n5 is slowly catching up, but this will likely take a while.
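
For reference, a minimal sketch of where numbers like these come from, assuming direct access to an etcd/raft Node and the pre-module "github.com/coreos/etcd/raft" import path in use at the time (CockroachDB surfaces similar per-range information in its debug pages; this just goes straight to the raft API):

```go
package main

import (
	"fmt"

	"github.com/coreos/etcd/raft"
)

// followerLag reports, from the leader's point of view, how many log entries
// each follower still has to replicate (lastIndex minus the follower's Match).
// Status().Progress is only populated on the leader.
func followerLag(n raft.Node, lastIndex uint64) map[uint64]uint64 {
	lag := make(map[uint64]uint64)
	for id, pr := range n.Status().Progress {
		if pr.Match < lastIndex {
			lag[id] = lastIndex - pr.Match
		}
	}
	return lag
}

func main() {
	// With the figures above: n7 (the leader) is at index 738999 and n5 has
	// only reached 582182, so n5 still has roughly this many entries to go:
	fmt.Println(738999-582182, "log entries") // 156817
}
```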

@petermattis (Collaborator, Author)

r2041 caught up about a minute ago and throughput on blue recovered at the same time.

@irfansharif (Contributor)

Filed upstream: etcd-io/etcd#8501.
This sufficiently explains #18129, so I'm closing that in favor of this one in the interim. It aligns with the fact that problem ranges didn't always persist through chaos: with enough leadership changes, replicas stuck in the PreCandidate phase with higher terms were eventually overtaken by healthy replica pairs (whose terms kept increasing).

@bdarnell (Contributor) commented Feb 1, 2018

I've been doing some chaos testing and I haven't seen any persistently-stuck ranges, so I'm tentatively marking this as fixed by etcd-io/etcd#8525. Discussion will continue in #16950.

@bdarnell bdarnell closed this as completed Feb 1, 2018