stability: stuck requests after rebalancing-inducing downtime #19165
Comments
And to be more specific, the two main questions to follow up on are:
|
After adding more logging, I've noticed two things that could be contributing to this problem. The first is that after the node being cycled (node 1) starts up, other nodes start sending it requests through their dist senders, and those requests start timing out. I'm still looking into why node 1 isn't responding quickly enough. The second problem is that the timeout is currently 1 minute after PR #16088. That slows down all of the nodes, because they all end up sending requests to node 1 and waiting out the timeout period.
I looked at the lease histories of the ranges that were causing these timeouts and noticed that their leases were on node 1 before it was shut down, but those ranges weren't accessed again until node 1 started back up and was no longer the leaseholder. As part of the fix, I'm thinking that we should remove entries for nodes that have gone down from the dist sender leaseholder caches when we get node liveness updates from gossip.
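A rough sketch of the kind of eviction hook being proposed here; the `leaseHolderCache`, `LivenessUpdate`, and `OnLivenessUpdate` names are illustrative placeholders, not the actual CockroachDB types:

```go
// Hypothetical sketch: drop DistSender leaseholder cache entries for a node
// as soon as gossip reports it as non-live, so the first post-outage request
// doesn't go to the downed node and wait out the RPC timeout.
package distsketch

import "sync"

// LivenessUpdate is a stand-in for a node liveness record received via gossip.
type LivenessUpdate struct {
	NodeID int32
	Live   bool
}

type leaseHolderCache struct {
	mu           sync.Mutex
	leaseholders map[int64]int32 // rangeID -> node ID of last known leaseholder
}

// OnLivenessUpdate would be registered as a gossip callback. When a node
// transitions to non-live, every cached entry pointing at it is evicted.
func (c *leaseHolderCache) OnLivenessUpdate(u LivenessUpdate) {
	if u.Live {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	for rangeID, nodeID := range c.leaseholders {
		if nodeID == u.NodeID {
			delete(c.leaseholders, rangeID)
		}
	}
}
```
|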
Is this after the node has been down for >10s? What should happen is that the node loses its leadership for all of its ranges, so when it comes back up, nobody's really talking to it initially. Any idea why that doesn't happen? Are these ranges that aren't accessed while the node is down? Or is the node just down for <9s?
I'd be interested to hear where these requests are stuck. Sure, they'll be on the DistSender, but where's the "other end" of that request? In the worst case, it'll be on the socket waiting for …
Also, what does the "took X" say when you cycle the node in the startup message below? Does it bear any relation to how long the connections hang?
|
These ranges weren't accessed while the node was down, which is why their leaseholder cache entries weren't updated. I'm looking into the other side right now using traces, as you suggested. I'm seeing a lot of things that look like:
After the range gets GCed, we break out of the loop in redirectOnOrAcquireLease.
The node startup message doesn't seem to be related to the delays. From here it looks like we need to either prioritize replica GC, which at the moment is getting interleaved with applying snapshots, or at least error out of the redirectOnOrAcquireLease loop earlier. |
Glad to see the working theory checks out! I'll respond more tomorrow.
|
Interesting to see that we looped around in the lease code 21 times. Could you add the following diff and paste another trace?
diff --git a/pkg/storage/replica.go b/pkg/storage/replica.go
index 3351713f1..3051bcc3e 100644
--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -1403,6 +1403,7 @@ func (r *Replica) redirectOnOrAcquireLease(ctx context.Context) (LeaseStatus, *r
 		select {
 		case pErr = <-llChan:
 			if pErr != nil {
+				log.Eventf(ctx, "error: %s", pErr)
 				switch tErr := pErr.GetDetail().(type) {
 				case *roachpb.AmbiguousResultError:
 					// This can happen if the RequestLease command we sent has been
Do you also see traces for the replica GC queue? If you don't, I think the following is enough to get a rough idea from the log:
diff --git a/pkg/storage/replica_gc_queue.go b/pkg/storage/replica_gc_queue.go
index f360da754..81316c866 100644
--- a/pkg/storage/replica_gc_queue.go
+++ b/pkg/storage/replica_gc_queue.go
@@ -186,6 +186,8 @@ func replicaGCShouldQueueImpl(
 func (rgcq *replicaGCQueue) process(
 	ctx context.Context, repl *Replica, _ config.SystemConfig,
 ) error {
+	log.Warningf(ctx, "processing %s", repl)
+	defer log.Warningf(ctx, "done processing %s", repl)
 	// Note that the Replicas field of desc is probably out of date, so
 	// we should only use `desc` for its static fields like RangeID and
 	// StartKey (and avoid rng.GetReplica() for the same reason).
What we should see is that the "done processing" is pretty much immediately followed by the next "processing %s". (You may also want to add …) |
I added the logging for pErr:
The error is coming from … My question is: do we need this thing to return a NewAmbiguousResultError? If we return something else, we can check for it and break out of redirectOnOrAcquireLease before the range gets GCed. Here's a snippet of the GC queue log; line 181 is processing, line 258 is done processing:
Sometimes one is followed pretty closely by the next, and sometimes there's a gap of more than a second. |
I do think that returning an AmbiguousResultError is reasonable in the context of cancelling proposed commands on a removed replica. There are other situations where a replica's proposed command may have succeeded before it got removed from the raft group. However, what we could more safely do is set a field on the replica to tell it to refuse to propose any more commands (with a RangeNotFoundError) unless it gets assigned a new replica ID. That would short-circuit the infinite retry loop we're hitting in the lease acquisition code. |
The error is returned from this path (in the context that a Raft peer tells us that our Replica is too old, i.e. we've been removed, and we receive that repeatedly until the replica GC queue picks up the replica and actually removes it). I think Alex's suggestion would work: the first time we get this error, we set a flag on the replica (which is cleared whenever its `mu.replicaID` changes). With the flag set, new proposals return `NotLeaseholderError` (or, at least, lease acquisition attempts are no longer made).
// If the replica ID in the error matches (which is the usual
// case; the exception is when a replica has been removed and
// re-added rapidly), we know the replica will be removed and we
// can cancel any pending commands. This is sometimes necessary
// to unblock PushTxn operations that are necessary for the
// replica GC to succeed.
if tErr.ReplicaID == repl.mu.replicaID {
	repl.cancelPendingCommandsLocked()
}
@bdarnell: are there any false positives when a peer gives us
`ReplicaTooOld` and we still have that ReplicaID? As in, in that case, do
we need to perform the consistent RangeLookup?
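A minimal sketch of the flag-based approach suggested above, assuming the flag lives on the replica's `mu` struct; the field and method names here are hypothetical, not the actual CockroachDB code:

```go
// Sketch: once a Raft peer tells us our replica is too old, mark the replica
// so further proposals (including lease acquisitions) fail fast instead of
// retrying until the replica GC queue removes the replica. The flag is
// cleared when the replica is re-added under a new replica ID.
package replicasketch

import (
	"errors"
	"sync"
)

var errNotLeaseHolder = errors.New("replica pending GC; not lease holder")

type Replica struct {
	mu struct {
		sync.Mutex
		replicaID      int32
		pendingRemoval bool // hypothetical flag set on ReplicaTooOldError
	}
}

// markTooOld is called when a peer responds with ReplicaTooOldError for our
// current replica ID.
func (r *Replica) markTooOld(tooOldReplicaID int32) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if tooOldReplicaID == r.mu.replicaID {
		r.mu.pendingRemoval = true
	}
}

// setReplicaID clears the flag whenever the replica ID changes.
func (r *Replica) setReplicaID(id int32) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if id != r.mu.replicaID {
		r.mu.replicaID = id
		r.mu.pendingRemoval = false
	}
}

// checkCanPropose would be consulted before proposing commands or attempting
// a lease acquisition; with the flag set, callers bail out immediately.
func (r *Replica) checkCanPropose() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.mu.pendingRemoval {
		return errNotLeaseHolder
	}
	return nil
}
```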
|
Recapping our conversation at lunch: I think it's probably going to be more effective to make sure that we can return the NotLeaseHolderError quickly and move on to the correct node than to try to make replica GC faster (I had wondered if we were using rocksdb range deletions for replica GC yet; we are). To @tschottdorf's question, I think ReplicaTooOldError is unambiguous, so we could skip the consistent lookup, which might speed things up.
One detail I was missing is that the place where we're spending time is the DistSender's pending RPC timeout. This sounds like a context propagation issue. We should be cancelling the RPC's context when we go into the pending RPC wait loop so they return quickly instead of running out the clock.
We should probably also put an upper-bound timeout on all lease attempts: if we're waiting more than a few seconds, we should just return a NotLeaseHolderError instead of continuing to wait (the upper levels can retry if they want). But just putting a 10s timeout on all lease acquisitions would still be pretty bad, so this is just a backstop to prevent the worst case.
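As a rough illustration of the backstop described above, here's a hedged sketch of bounding the lease-acquisition wait with a context timeout; the timeout value and the error returned are assumptions, not the actual implementation:

```go
// Sketch: cap how long we wait on a pending lease acquisition before giving
// up with a retryable "not lease holder"-style error, rather than running
// out the DistSender's pending-RPC timeout.
package leasesketch

import (
	"context"
	"errors"
	"time"
)

var errNotLeaseHolder = errors.New("lease acquisition timed out; try another replica")

const maxLeaseAcquireWait = 5 * time.Second // illustrative backstop value

func waitForLease(ctx context.Context, llChan <-chan error) error {
	ctx, cancel := context.WithTimeout(ctx, maxLeaseAcquireWait)
	defer cancel()
	select {
	case err := <-llChan:
		// The pending lease request completed (successfully or not).
		return err
	case <-ctx.Done():
		// Give up early and let the upper layers retry elsewhere.
		return errNotLeaseHolder
	}
}
```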
It's correct for cancelPendingCommandsLocked to return an ambiguous error. But if that gets returned from redirectOnOrAcquireLease, it should get turned into an unambiguous error since we know that the client operation was never proposed. |
That's a good point, though what error would we return? |
(And this isn't to say that we shouldn't do the lease acquisition error termination - we should of course get rid of ambiguous errors wherever we can) |
@m-schneider What's the latest on this issue? |
I'm working on reproducing both the issue and the fix on an ephemeral cluster. If we can see that the fix from #19353 works, then we'll be able to close it.
|
This was closed by #19353, right? |
Yes.
|
@m-schneider I am trying to replicate this issue on v1.0.6. I've tried a couple of different strategies but have yet to succeed. Would you mind providing the exact steps you used to reproduce? |
In order to reproduce, bring up a cluster with 3 or more nodes. Leave it running for an hour with a load generator writing to every node. Then bring down one of the nodes for 10 minutes at a time; you should see latencies on the other nodes go up when you bring it back up. I had a cron job that killed the node at :00, :20 and :40 of every hour and brought it back up at :10, :30 and :50. The whole cluster ground to a halt at :10, :30 and :50.
|
Thanks, that helps already. Could you expand on the kind of load generator? Does the schema matter? Should I fill the database from all nodes concurrently, or can all inserts be sent to a single node? Do queries need a specific form, e.g., JOINs? When you say 'latencies', do you mean inserts become slow, or selects become slow, or both?
Would the following work?
1. 5 nodes
2. a table with a single column consisting of integers
3. perform serial inserts against a single node, using a single client, records consisting of a single incrementing integer.
4. after 1 hour, stop inserting.
5. perform select * from numbers where numbers.num=1 on each of the 5 nodes, serially and continuously using a single client per node.
6. stop a node, sleep 10mins, start the node, sleep 10mins, repeat.
My timezone is pretty much 'opposite side of the planet' to SF, which is why I want to make sure I understand the instructions clearly. Thanks @m-schneider |
No problem. I'm not quite in SF land myself :) The schema doesn't really matter and neither does the load generator, so I think what you have should work (a rough sketch of such a load loop follows after this list). The core of the problem was that the node being cycled didn't reply in a reasonable amount of time for the replicas for which it was the leaseholder before the restart.
So for your points:
1. 5 nodes is a good amount.
2. The schema is fine.
3. Serial inserts are reasonable.
4. Fine.
5. You should perform selects that touch more of the data, because you want to make sure that you're touching a replica that was on the node being restarted but isn't there anymore.
6. Sounds good.
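If it helps, here's a rough Go sketch of the load/read loop described above; the connection string, table name, and timings are placeholders, and the node up/down cycling itself would still be driven externally (e.g. by cron or by hand):

```go
// Repro sketch: serial inserts from a single client for about an hour, then
// continuous point reads whose latency should spike when the cycled node
// comes back up. Connection string and timings are illustrative.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS numbers (num INT PRIMARY KEY)`); err != nil {
		log.Fatal(err)
	}

	// Phase 1: serial inserts for roughly an hour.
	deadline := time.Now().Add(time.Hour)
	for i := 0; time.Now().Before(deadline); i++ {
		if _, err := db.Exec(`INSERT INTO numbers VALUES ($1)`, i); err != nil {
			log.Printf("insert %d: %v", i, err)
		}
	}

	// Phase 2: continuous reads spread over the keyspace; watch for slow
	// queries while a node is stopped for ~10 minutes and then restarted.
	for i := 0; ; i++ {
		start := time.Now()
		var n int
		err := db.QueryRow(`SELECT num FROM numbers WHERE num = $1`, i%100000).Scan(&n)
		if err != nil && err != sql.ErrNoRows {
			log.Printf("select: %v", err)
		}
		if d := time.Since(start); d > time.Second {
			log.Printf("slow read: %s (key %d)", d, i%100000)
		}
	}
}
```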
|
When running an experiment for issue #10972, node 1 on navy was set up to be shut down for 10 minutes, then started back up for 10 more minutes, repeating indefinitely.
Here's a graph of SQL queries going through node 1 and through the whole cluster.
As you can see, when the node is down the QPS of the cluster drops, but when it comes back online the QPS drops even further. If you look at any node other than node 1, you see a graph like this:
The logs on node2 looked like the following:
So the dist sender was blocked and it was trying to send preemptive snapshots to node 1.
Looking for r140 on the freshly started node 1, the logs looked like the following:
It looks like node 1 still thought that it held the lease for r140 until r140 got GCed.