Election triggered every minute #1150
Comments
@evanccnyc Could you provide some more information? It is pretty hard to diagnose or understand the issue with the given information. Specifically, more of the following:
Unfortunately, it was our production Consul cluster, so we didn't have time to debug it much. We had to completely recycle the cluster in order to get it working again. We think it was related to consul-template, which we had deployed on a large number of nodes and which may have made it unstable.
There's a similar issue in hashicorp/consul-template#331 that's tracking performance issues when there are a large number of watches. Take a look at the linked thread, which has some info about upcoming changes to make this better.
@slackpad That may well be it. It was our cluster that showed the elections as I said; I don't remember nodes leaving and joining every minute, but they could well have been.

@armon More details: it was a 0.5.2 cluster with 3 nodes running m4.mediums, one in each AZ. We rolled out consul-template to 100+ nodes and Consul never recovered. The first thing we noticed was that one of our nodes ran out of open files; Consul had something like 5000+ open files when we ran lsof. Most of them looked like connections from clients requesting the consul-template data. We recycled the node and turned off consul-template, but that node never seemed to recover, with the other nodes often complaining of timeouts to that one (500ms or more). At that point we recycled the entire cluster and kept consul-template turned off, and it appears stable.

The debug logs were not much help, unfortunately; they showed little more than what we had above. Again, this broke a huge amount of things in our setup, so I grabbed what I could before completely redoing the cluster.
@evanccnyc Cool, this helps. CT is the likely culprit. Can you share the template that you rolled out? I want to get a feel for how many dependencies / watchers it had.
@evanccnyc Hmm, so this looks like about ~35 service queries, and it sounds like a few hundred nodes were involved. My guess is this is similar to issues #1154 and #1165, one of which is the repro case @darron provided us. We are tracking those separately, but this seems like a related issue. We are working to address some of the read pressure issues CT causes in Consul 0.6 to increase the stability of the cluster under that kind of load. A big improvement to this is the approach of running CT on a single node (or a few under
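For context on why the watcher count matters: each service query in a template amounts to a long-running blocking query against the Consul servers, so ~35 queries rendered on 100+ nodes means thousands of concurrent watches hitting the leader. Below is a minimal Go sketch of what a single such watch looks like, using the official API client (github.com/hashicorp/consul/api); the service name "frontend" and the timings are illustrative, not taken from this issue or from consul-template's actual code.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (defaults to 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Roughly what one "service" dependency in a template turns into:
	// an endless loop of blocking queries against the servers.
	var waitIndex uint64
	for {
		entries, meta, err := client.Health().Service("frontend", "", true, &api.QueryOptions{
			WaitIndex: waitIndex,       // block until this index changes
			WaitTime:  5 * time.Minute, // or until the wait time elapses
		})
		if err != nil {
			log.Println("query failed:", err)
			time.Sleep(time.Second)
			continue
		}
		waitIndex = meta.LastIndex
		log.Printf("%d healthy instances; a re-render would happen here", len(entries))
	}
}
```

Multiply this loop by the number of service queries in the template and again by the number of nodes rendering it, and the read pressure on a 3-node server cluster adds up quickly.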
To build on what @armon said - we were never able to run Consul Template with service queries across our cluster reliably once we hit around 100 nodes. Leadership transitions happened over and over and were really unstoppable. If you can build the template on one of the server nodes - running under
A few things to watch out for:
I am fairly certain that I just ran into this case with around 35-40 nodes, while also using Consul Template pretty heavily. Normally the cluster has handled it well, up until yesterday when the cluster of 3 could never be queried and all the log output made it appear that networking issues were the culprit. It has since sorted itself out with lower traffic, lower usage, and fewer nodes up. @darron, a few questions for you if you are able to answer:
Consul 0.6 should address some of the performance problems - we're anxiously awaiting that - but we still likely won't move the build process to the nodes; it's not needed. If you're using AWS, we also found some issues with heavy usage and networking here - there were several things that helped to correct it - the Xen "rides the rocket" fix was very important.
Consul 0.6 did rework the state store and increased read throughput. We've also introduced some tuning options in https://www.consul.io/docs/guides/performance.html to help here, and Consul Template got https://github.com/hashicorp/consul-template#de-duplication-mode to help when there are a massive number of renderers of the same template.
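For anyone wanting to try de-duplication mode, it is turned on in the consul-template configuration file rather than in the template itself. A minimal sketch, assuming the HCL syntax described in the de-duplication docs linked above (the prefix shown is the documented default, not something from this issue):

```hcl
# Sketch of a consul-template config with de-duplication enabled.
# With this on, one elected instance renders the template data and stores
# it in Consul's KV store; the other instances watch that one key instead
# of each issuing all of the underlying service queries themselves.
deduplicate {
  enabled = true
  # prefix = "consul-template/dedup/"  # KV prefix for the shared data (documented default)
}
```

This trades many per-node blocking queries for a single renderer plus cheap KV watches, which is exactly the read-pressure pattern described earlier in this thread.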
Hi there,
We have a 3-node Consul cluster on AWS (1 server in each of us-east-1e, us-east-1d, and us-east-1c). They constantly flap; every minute or so we get the log below. The systems are not under heavy load and have sub-second connectivity between them.