Cluster becomes unresponsive and does not elect new leader after disk latency spike on leader #3552
Comments
@armon @slackpad This has been happening to us a lot lately while our hosting provider works on their storage, and it seems like a critical bug? We built most of our systems to cope with losing Consul for a while, but for those relying more heavily on Consul, it is extremely disturbing.
Fantastic, thank you!
Hi @slackpad, we're still seeing this when running 1.0.0. What can I do to help you track down the cause?
Hi @christoe I'll kick this open - can you provide some log output from the Consul servers when you get into a bad state?
Hi @christoe thanks for the additional info. Can you use something like …
Hi @slackpad, sure - I can try to do that next time the cluster breaks down. However, it seems as if the fd depletion is more of an effect of the cluster being broken, rather than the cause (the cluster broke down ~15:02 on the graphs above).
Hi again @slackpad! The cluster broke down again this weekend (it feels as if 1.0.0 is worse than before) and I managed to get a list of open fds. There are two categories that stick out: localhost-to-localhost:8500 connections, and IPv6 sockets. Together, these make up almost all of the 1024 fds available to the consul user.
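For anyone hitting the same state, a minimal sketch of how such an fd breakdown can be gathered on a Linux host; the commands are assumptions about standard tooling, not the exact ones used above:

```sh
# Inspect the consul process's fd usage (assumes a single consul process, procfs and lsof).
CONSUL_PID=$(pgrep -o consul)

# Current fd count vs. the per-process limit (1024 in the report above).
ls /proc/"$CONSUL_PID"/fd | wc -l
grep 'open files' /proc/"$CONSUL_PID"/limits

# Group open sockets by type and endpoint to spot the dominant buckets,
# e.g. loopback connections to :8500 or idle IPv6 sockets.
lsof -n -p "$CONSUL_PID" | awk '{print $5, $9}' | sort | uniq -c | sort -rn | head -20
```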
Hey guys, we have also been seeing errors of this sort. Our cluster will go several days without any issues, and then we will suddenly get a slew of errors like:
Most requests to vault/consul start returning 503s at that point until we kill the consul leader node. After consul elects a new leader, activity returns to normal. I have not been able to verify whether this was connected to an fd overflow, though. This is affecting our production environment; does anyone know if there is a version of consul that is not impacted by this issue? Also, since this is the third time, I expect this to happen again - if there are any diagnostics you'd like me to run the next time it happens, I would be happy to.
Hi @326TimesBetter and @christoe this looks very similar to #3700 which we were able to reproduce during our 1.0.1 release testing and get a fix into 1.0.1. If you can try again with 1.0.1 I'd appreciate it. If you still see problems the best thing is probably to turn on https://www.consul.io/docs/agent/options.html#enable_debug and then save off the contents of /debug/pprof/goroutine?debug=1 on all three servers when they get into a bad state.
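For concreteness, this is roughly what that capture looks like; enable_debug is the documented option, while the config path and HTTP address below are assumptions about a typical setup:

```sh
# Turn on the debug/pprof endpoints on each server (requires an agent restart).
cat <<'EOF' | sudo tee /etc/consul.d/debug.json
{
  "enable_debug": true
}
EOF

# While the cluster is in the bad state, save a goroutine dump from each server.
curl -sS 'http://127.0.0.1:8500/debug/pprof/goroutine?debug=1' \
  -o "goroutines-$(hostname)-$(date +%s).txt"
```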
We are working on upgrading our consul servers to 1.0.1 right now. Does this require us to upgrade all consul agents as well?
For the fix in #3700 you just need the servers at 1.0.1.
Thank you @slackpad we have upgraded our servers and also enabled debugging, so if it happens again we can dump that endpoint. It has usually taken 2-4 days in between these types of crashes so we won't be able to be confident for a while, but I will definitely get back to you here.
Thanks @slackpad, I too will install 1.0.1 on our servers and get back to you with the results!
@slackpad I am happy to say that we have not had another incident in which our consul leaders crashed since we upgraded our servers to 1.0.1.
We're also cautiously optimistic. Consul has been running without a restart since the upgrade to 1.0.1 - even with quite a few I/O-wait peaks. Thanks @slackpad!
Is it possible that in a large (~1000 nodes) and busy cluster, once this behavior had started, it would persist even after a leadership change? This cluster was upgraded to 1.0.0 on the 11th of December. We saw … During this period … Eventually we had to shut down the cluster, clean out the data directory and restart. The KV store was restored from snapshot and we haven't had a repeat of this in the subsequent two days running on version 1.0.1.
We are currently seeing this on a cluster that has been upgraded to 1.0.1. I have collected some data here: https://gist.github.com/nickwales/deed318a6296b709cf5f3f57096feb98 I have destroyed and restored the cluster twice now but we immediately end up in the same situation with huge load on the leader and this repeated on the follower servers.
Update: pulled a heap with … Update 2: downgraded servers to 0.8.3 and we're back in business. Of note during the outage, running …
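The exact heap command above was cut off; one plausible way to pull and inspect a heap profile from a debug-enabled server (endpoint and tooling assumed, not confirmed by the original comment):

```sh
# Grab a heap profile over the local HTTP API and summarize the top allocators.
curl -sS 'http://127.0.0.1:8500/debug/pprof/heap' -o consul-heap.pprof
go tool pprof -top consul-heap.pprof   # requires a local Go toolchain
```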
@slackpad thanks for looking into this. Is there a way we can replicate the behavior? It seems to come out of nowhere. We're able to help test any bug fixes you're working on.
We moved to 1.0.6 and all of this got resolved :)
@jippi can you confirm how long you've been running 1.0.6 without issue before deciding it is resolved? We're happy to hear you're not seeing issues any more, but since we still don't have a full understanding of the failures after the fixes in 1.0.0 and 1.0.1, we're cautious about closing this early. @nickwales do you continue to see this issue since your last update in January? If not, which version are you settled on? Thanks.
@banks since ~12th of February :)
Hello, the raft protocol used is 2 because of the lower Consul version of the agents.
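For reference, the Raft protocol version is pinned in the server configuration; an illustrative snippet (the file path is an assumption about the local setup):

```sh
# Pin the Raft protocol version on the servers for compatibility with older agents.
cat <<'EOF' | sudo tee /etc/consul.d/raft.json
{
  "raft_protocol": 2
}
EOF
```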
@banks We're currently running 1.0.1 in dev and test and 0.8.3 in stage and production. Dev and Test have never had an outage.
At this point we were nervous about running 1.0.x in customer-facing environments, so we decided to downgrade those two environments to the previous version, 0.8.3. We kept dev and test at 1.0.1 in an effort to provide the core dumps should we see a failure, but I don't have any to report currently.
Can we somehow increase timeouts to deal with this issue? https://github.com/hashicorp/consul/blob/master/agent/consul/config.go#L428-L429, like …
I've tried to downgrade and faced #3361 during the upgrade.
We are seeing this with consul 1.0.2, in a five-server setup.
This is happening on three nodes simultaneously and makes the cluster fail.
@jacob-koren-zooz thanks for the info - can you confirm whether your failure occurs following a disk IO/latency issue? It's not clear from the discussion if that is a common factor between all the incidents reported, or if the issue is triggered by anything that causes timing to fall into a specific pattern. @avoidik interesting, what made you think that raft protocol v2 would help? I'd need to look into the specific differences to be sure, but it's not obvious to me that it would be a factor.
Increasing that might reduce the frequency somewhat, but fundamentally the autopilot health checks should not cause a cluster to be unable to re-elect a new leader even if they fail every time. I think you would need to set it like …
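The concrete setting suggested here was lost in the quote above; purely as an illustration, these are the documented knobs for relaxing Raft timing and autopilot health thresholds on servers with slow disks (the values are placeholders, not a recommendation from the thread):

```sh
# Looser Raft timing plus more tolerant autopilot health thresholds.
cat <<'EOF' | sudo tee /etc/consul.d/tuning.json
{
  "performance": {
    "raft_multiplier": 5
  },
  "autopilot": {
    "last_contact_threshold": "1s",
    "server_stabilization_time": "30s"
  }
}
EOF
```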
Hey we just had an outage on a 0.8.3 cluster... I was organized enough to have gathered the goroutine and thread dumps from each machine this time. Is there somewhere / someone I can send these to on a more private channel?
@nickwales if you are comfortable sending them to jack@hashicorp.com I'll get it to the right place. Thanks for doing that.
We're experiencing a similar issue. Don't have a goroutine thread dump, but I do have a syslog output. We are on 0.8.3
While we don't have any more direct news on this, we recently found #5047 which could explain the failure of the leader to step down. That issue is to do with a bug where … The unknown link with this issue, which would be the clincher that it's the same root cause, would be if we can figure out a way that high disk IO could be a cause for something in … I've not had time to look back through the logs gathered here yet to see if there are signs that the establishLeadership bug is taking effect.
👋 we just had this problem (one cluster node slowly leaking fds, ending up with a non-working cluster due to hitting open fd limits and crashing) with consul-v0.9.3 [both clients/servers]. I'm a bit confused about where we are at in terms of what is fixed and what is remaining, due to conflicting comments (as in 1.0.2 not showing the problem vs. having the issue with 1.0.2 as well). It would be really great if you could provide a status update if you have one! Also please let me know if logs or some more info regarding our issue would help ⭐️
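The 1024-fd ceiling mentioned earlier is a common distro default; raising it only buys time while the leak is diagnosed, but for a systemd-managed agent it can be done with a unit override (the unit name and limit value are assumptions about the local setup):

```sh
# Raise the open-file limit for the consul unit via a drop-in override.
sudo mkdir -p /etc/systemd/system/consul.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/consul.service.d/limits.conf
[Service]
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload
sudo systemctl restart consul
```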
@caglar10ur our best guess is that the leadership hanging state is a result of #5047. The difficult underlying changes to raft to make that possible are now done, so we hope to have the fix in the next release. The hope is that we'll not see any more instances of this issue once that is merged, but since we have never been able to reproduce it or establish a certain root cause it's hard to say for certain!
@caglar10ur thanks for bearing with us. #5247 is merged and will be in the next release. It fixes a problem we had in consul that could have caused your issue.
@i0rek great. I'll keep my eyes on here and test once you release a new version. Thanks!
@caglar10ur …
@i0rek will try and let you know ASAP. Thanks again 🥇
Just letting you know that we haven't forgotten this. We are slowly upgrading clusters and observing their behaviors for a while before moving on to another cluster - so far so good. I'll keep you updated with our progress.
Much appreciated!
Completed migrating 10 clusters with various numbers of agents/servers from 0.9.2 to 1.5.2. We normally got hit by this issue every other week, so I still need to observe before declaring victory, but let me tell you this: it was one of the easiest migrations I have ever done in my professional career. Even though I was migrating from an ancient version to the latest one, nothing broke and everything worked as designed/expected (OK, something was broken, but it was just the UI, because it now uses the new API calls to render) - thank you so much for that ❤️! Will report back in a couple of weeks...
I am glad to hear that everything went well for you!
Hey @caglar10ur Just picking up on the mention of the UI here. I'm not sure if you have time to go into further detail on the UI issue you mention above. I'm half guessing it was just due to the rebuild of the UI we did a good few minor versions back (looking at the version jump you did, you probably went from UI v1 to UI v2). I thought I'd quickly check in with you to double check - it sounds like the UI thing you mention above was a temporary thing and everything is fine now? If you have time to give some more detail on the issue that'd be great, but don't worry if not. If there is something to look into I can start up another issue so I don't hijack this one. Thanks!
@johncowen ah sorry - I should have been more clear 🤦‍♂️ - the UI partially worked during the upgrade and everything is fine now. As far as I can see, the Services section of the UI was the only non-working piece while the cluster was in its mixed form, with both 0.9.2 and 1.5.2 servers. Once all servers were upgraded, the UI started to render the Services section again without throwing 500 errors.
👋 just wanted to give an update as promised. It's been almost a month and things are looking really good on our end, both in terms of the number of file descriptors and also the stability of the clusters. Thanks!
Great to hear that @caglar10ur! I will close this issue then. Thank you for your help!
Description of the Issue (and unexpected/desired result)
Twice in one week we've now had a situation where the leader node's VM experienced high CPU iowait levels for a few (~3) minutes, and disk latencies of 800+ milliseconds. This seems to lead to writes to the log failing and getting retried indefinitely, even after disk access times are back to normal. During this time, it spews lines to the log like
consul.kvs: Apply failed: timed out enqueuing operation
(and also for consul.session). For some reason, this does not trigger a leader election. It appears incoming connections are enqueued until all file descriptors are consumed. Restarting the Consul service seems to be the only way to recover. The non-leader servers also run out of fds at about the same time.
So, some questions here: Why does this not trigger a leader election as soon as the timeouts start happening? And why can Consul not recover after a few minutes of high disk latency?
(Regarding the reason for the iowait: we're hosted in a public cloud, and according to the provider there is a possibility for other tenants to consume high amounts of IO when booting new VMs. They are working on throttling this in a good way, but regardless, Consul should handle this better.)
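Rough checks that can correlate such an iowait spike with the agent's state, assuming sysstat and procfs on the hosts described below (not part of the original report):

```sh
iostat -x 5                                          # watch await/%util on the data-dir device
watch -n 5 'ls /proc/$(pgrep -o consul)/fd | wc -l'  # fd growth on the consul process
```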
consul version
Server:
v0.8.2
consul info
Server:
(This is post-restart of the server, so not sure if it is much use)
Operating system and Environment details
CentOS 7.2, OpenStack VM