production: Admin UI not displaying any data on indigo #17478
Timeseries writes got blocked nearly 18 hours ago, and apparently don't have a way to get unstuck:
Since they get done from just a single goroutine, any blocked write prevents all future writes.
This is an interesting amount of recursion in the dist_sender, but it did appear to eventually bottom out:
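For context on the single-goroutine point above, the failure mode has roughly the shape below. This is a simplified sketch, not the actual CockroachDB code; `storeData`, `pollLoop`, and the 10-second interval are illustrative stand-ins.

```go
package main

import (
	"context"
	"time"
)

// storeData stands in for persisting one interval's worth of timeseries
// samples to the timeseries range. If that range is unavailable, the
// call can block indefinitely. (Hypothetical stand-in, not the real code.)
func storeData(ctx context.Context) error {
	// ... issue the KV batch for this interval's samples ...
	return nil
}

// pollLoop sketches a single-goroutine flush loop: there is no
// concurrency and no deadline, so one storeData call that never returns
// blocks every later tick, i.e. all future timeseries writes.
func pollLoop(ctx context.Context) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			_ = storeData(ctx) // a hang here wedges the loop for good
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	pollLoop(context.Background())
}
```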
Should we put a timeout on the timeseries writes? Cc @mrtracy Any clues on why the write got blocked?
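Such a timeout could look roughly like the following, building on the sketch above. This is illustrative only; the helper name and the 15-second bound are assumptions, not the actual code.

```go
// storeDataWithTimeout wraps a single flush in a deadline so that a write
// stuck on an unavailable range returns an error (context.DeadlineExceeded,
// assuming storeData respects ctx cancellation) instead of wedging the
// flush loop forever.
func storeDataWithTimeout(ctx context.Context, maxWait time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, maxWait)
	defer cancel()
	return storeData(ctx)
}
```

A call site could use something like `storeDataWithTimeout(ctx, 15*time.Second)` on each tick, so that a stuck write surfaces an error to retry on the next tick rather than blocking all subsequent writes.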
Yeah, one of the timeseries ranges got borked. The logs from when things got stuck are gone, but if I had to guess I'd guess it's related to #17198. The current symptom is the current leaseholder repeatedly logging:
And less frequently (but still repeatedly) logging:
While node 5 (which holds the uninitialized replica) just repeatedly logs:
This isn't really specific to the time series system; this got stuck in
However, in this case the ultimate result (no timeseries data available) would not change; if the time series range is not writable, then the requests will just repeatedly time out and nothing will be recorded. Even without a timeout, if the range eventually comes back, the hanging request would presumably succeed and the loop would proceed normally. A timeout would be useful in two cases:
Whoops, I was wrong up above where I said node 9 is the leaseholder. It's the raft leader, but node 5 is somehow the leaseholder despite not having an initialized replica. That makes me think this may be different from #17198, particularly since there couldn't have been any rebalancing in this scenario given that the replica IDs are 1, 2, and 3. So the question is how node 5's replica got into this state of not having a valid range descriptor.
How is replica 3 the leaseholder when it isn't initialized? We're only supposed to transfer leases to up-to-date replicas.
Well, its raft state appears up to date; it's just missing a valid range descriptor.
Is there anything in the logs about failing snapshots for that range?
Nope, the only logs are the ones I've included here (and a few from the various queues on nodes 2 and 9 indicating that they can't process the range because they aren't the leaseholder).
Its log size is significantly less than the other two replicas', so it's possible that truncation was involved in the issue, although I don't see any other evidence pointing to that.
It'd be nice to be able to see what's in the command queue, if anything (cc #9797). It's unclear where the requests would even be blocked given node 5's weird state.
I have a couple of updates here, but no answer:
I think I'll take node 5 offline and manually inspect its raft state unless anyone has anything specific to check out about the running process. I'm not sure how much more I can derive from it.
@petermattis @tschottdorf - anything else you can think of that I should look at before taking node 5 offline?
I should also mention that the reason node 5 is unwilling to become raft leader is that it doesn't have an entry for itself in its own progress map.
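For background on why that entry matters: etcd/raft only lets a node campaign for leadership if it finds its own ID in its progress map, which is built from the raft configuration (in CockroachDB's case, the replica IDs in the range descriptor). Conceptually the check amounts to something like the following simplified sketch; this is not the actual upstream code, and the plain map stands in for the real progress-tracking structure.

```go
// canCampaign is a simplified version of the check raft performs before
// campaigning: a node will not try to become leader unless its own ID
// appears in its progress map. A replica with no valid range descriptor
// never sees itself here and therefore never campaigns.
func canCampaign(selfID uint64, progress map[uint64]struct{}) bool {
	_, ok := progress[selfID]
	return ok
}
```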
I can't think of anything else to look at, but perhaps @bdarnell does. Ben is also the most knowledgeable about the Raft internals and various states.
Since we know there was no rebalancing (replica IDs 1, 2, 3), these replicas were created by a split. One way we could have an uninitialized replica on node 5 is if node 5's replica of the pre-split range was stuck and unable to process the split. But that's inconsistent with some of the other things we see here (such as the apparently valid MVCC stats and a non-empty raft log; these wouldn't exist unless the split had been processed).

The raft log is the most suspicious thing here. The first (truncation) and last log indexes are the same on all replicas, but the size on node 5 is wrong. How did it get to be non-empty without matching the others? I don't think it could have completed a truncation without becoming initialized along the way.

Or could it be related to #17051? We know there's a bug in which the raft hard state can be written incorrectly on splits, and the fix for this is not enabled until the 1.1 migrations are done.
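To make the log observation concrete, the suspicious pattern is roughly the following. This is an illustrative sketch only: the struct, function, and threshold are made up, and the numbers would come from the per-replica debug pages.

```go
// replicaLogInfo is a hypothetical per-replica summary of what the debug
// pages show for this range's raft log.
type replicaLogInfo struct {
	nodeID     int
	firstIndex uint64 // index of the oldest entry kept after truncation
	lastIndex  uint64 // index of the newest entry
	logBytes   int64  // approximate on-disk size of the raft log
}

// suspiciousLogSizes flags replicas whose first/last indexes match the
// reference replica (so they look caught up) but whose log size differs
// wildly, which is the inconsistency observed on node 5's replica.
func suspiciousLogSizes(ref replicaLogInfo, others []replicaLogInfo) []int {
	var flagged []int
	for _, r := range others {
		sameBounds := r.firstIndex == ref.firstIndex && r.lastIndex == ref.lastIndex
		diff := r.logBytes - ref.logBytes
		if diff < 0 {
			diff = -diff
		}
		if sameBounds && diff > ref.logBytes/2 { // arbitrary threshold
			flagged = append(flagged, r.nodeID)
		}
	}
	return flagged
}
```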
I thought the progress map was only supposed to be populated on the leader?

I'm currently looking at a user's cluster that also lost its timeseries ranges. Data attached. There's definitely a problem with the quota pool on that cluster. @irfansharif, any thoughts on the below? It might be an artifact of the ranges being horked, which the attached range pages can hopefully illustrate.
So it'd be a very weird raft bug if that was only meant to be populated on the leader given how it's used in this follower-specific code. |
Oh, apologies. I was thinking of something else (we use the progress map in Replica sometimes, but not on such a fundamental level).
@a-robinson if you've already cooked up promising vmodule invocations, would you mind sharing them?
I wouldn't say I have anything particularly promising given my lack of progress, so it may be better if you come up with your own. Definitely make sure to include
Random observations from the cluster I'm looking at: the clock offset messages talk about "17 nodes are within the configured ....". The cluster has 9 nodes. I think this is probably explained by nodes having been removed from the cluster.
Well, taking node 5 offline and investigating its raft state a little more closely didn't turn up anything that I was able to notice. Here are the few relevant debug dumps: range-descriptors.txt. The most notable thing I derived from that is that the on-disk range descriptor is present and correct, so we must have done something bad within the process to mess up the in-memory state without destroying the on-disk state.

After I killed node 5, the cluster very quickly up-replicated and its performance took off from 300 qps to 3000 qps. Also, the number of leader-not-leaseholder ranges dropped from ~50 to 0. I've cordoned off the data directory into /mnt/data2 and restarted the node with an empty /mnt/data1 to try to determine whether the bad performance was caused by the VM or by something in the store.

After adding node 5 back with an empty data directory, the cluster's performance is slowly dipping as more ranges are added to it, and node 5's latency is crawling back up. That leads me to think that something is wrong with either the disk or the VM. I already checked how the disk is formatted, and it's formatted the same as the other nodes (so we're not hitting that issue again). Although, even if the VM's/disk's performance is garbage, the replica shouldn't have gotten into the state that it was in, and I don't know what to look into at this point. I could swap node 5 back to its original data directory at some point, but I expect that wouldn't have any great effect given that all of its replicas will just get GC'ed once it rejoins.
Heh, the reason the disk's performance is so bad appears to be that it's getting overloaded by more than 1000
Any idea what all these are doing here, @mberhault?
Yeah, my googling led to the same place. The weird thing, though, is that disabling the OMS agent (
@mberhault, let me know if you care to look into the slow disk issue. If I don't hear from you, I'm going to delete and recreate the VM/disk tomorrow under the assumption that it's just Azure being Azure.
Yeah, I think this can be replaced.
I replaced node 5 a few hours ago and things are much happier now. That leaves us with no leads on the original issue, though, so I won't be looking into it until it strikes again or some sort of inspiration hits me.
@a-robinson OK, let me shift the milestone to later so that we keep a record of this while still cleaning out our 1.1 milestone.
Closing as not actionable.
The cluster seems to be working fine by other metrics - both reads and writes from kv are being served at the same rate as usual, and I can interactively query/create tables. There aren't any unavailable ranges in the cluster according to the metrics in grafana and the info on the debug pages.
I'll keep digging into what's going on.