Cluster becomes unresponsive and does not elect new leader after disk latency spike on leader #3552

Closed
carlpett opened this issue Oct 8, 2017 · 52 comments
Labels
type/bug Feature does not function as expected
Comments

@carlpett
Contributor

carlpett commented Oct 8, 2017

Description of the Issue (and unexpected/desired result)

Twice in one week now, we've had a situation where the leader node's VM experienced high CPU iowait for a few (~3) minutes and disk latencies of 800+ milliseconds. This seems to lead to writes to the log failing and being retried indefinitely, even after disk access times are back to normal. During this time, it spews lines like consul.kvs: Apply failed: timed out enqueuing operation to the log (and the same for consul.session). For some reason, this does not trigger a leader election.

It appears incoming connections are enqueued until all file descriptors are consumed. Restarting the Consul service seems to be the only way to recover. The non-leader servers also run out of fds at about the same time.
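In case it's useful to anyone else hitting this, a rough way to watch the agent's descriptor usage against its limit is something like the following (just a sketch; it assumes the agent process is literally named consul and that you run it as root or the consul user so /proc is readable):

# Every 5 seconds: count the agent's open fds and show its limit
pid="$(pgrep -x consul | head -n1)"
watch -n 5 "ls /proc/${pid}/fd | wc -l; grep 'Max open files' /proc/${pid}/limits"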

So, some questions: why does this not trigger a leader election as soon as the timeouts start happening? And why can Consul not recover after a few minutes of high disk latency?

(Regarding the reason for the iowait: we're hosted in a public cloud, and according to the provider, other tenants can consume large amounts of IO when booting new VMs. They are working on throttling this in a good way, but regardless, Consul should handle this better.)

consul version

Server: v0.8.2

consul info

Server:

agent:
        check_monitors = 2
        check_ttls = 1
        checks = 13
        services = 14
build:
        prerelease =
        revision = 6017484
        version = 0.8.2
consul:
        bootstrap = false
        known_datacenters = 2
        leader = false
        leader_addr = 192.168.123.116:8300
        server = true
raft:
        applied_index = 249572007
        commit_index = 249572007
        fsm_pending = 0
        last_contact = 39.594µs
        last_log_index = 249572008
        last_log_term = 6062
        last_snapshot_index = 249567365
        last_snapshot_term = 6062
        latest_configuration = [{Suffrage:Voter ID:192.168.123.118:8300 Address:192.168.123.118:8300} {Suffrage:Voter ID:192.168.123.116:8300 Address:192.168.123.116:8300} {Suffrage:Voter ID:192.168.123.154:8300 Address:192.168.123.154:8300}]
        latest_configuration_index = 45140126
        num_peers = 2
        protocol_version = 2
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 6062
runtime:
        arch = amd64
        cpu_count = 4
        goroutines = 446
        max_procs = 4
        os = linux
        version = go1.8.1
serf_lan:
        encrypted = true
        event_queue = 0
        event_time = 3818
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 2318
        members = 54
        query_queue = 0
        query_time = 69
serf_wan:
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 232
        members = 6
        query_queue = 0
        query_time = 28

(This is from after the server restart, so I'm not sure how much use it is.)

Operating system and Environment details

Centos 7.2, Openstack VM

@carlpett
Contributor Author

@armon @slackpad This has been happening to us a lot lately while our hosting provider works on their storage, and it seems like a critical bug? We built most of our systems to cope with losing Consul for a while, but for those relying more heavily on Consul, it is extremely disruptive.
Any idea what it could be? Anything I can do to help explain it better?

@slackpad
Contributor

Hi @carlpett, we tracked this down to a code bug and the fix will be released in 1.0, hopefully early next week - #3545.

@carlpett
Contributor Author

Fantastic, thank you!

@christoe

Hi @slackpad, we're still seeing this when running 1.0.0. What can I do to help you track down the cause?

@slackpad
Contributor

slackpad commented Nov 1, 2017

Hi @christoe, I'll kick this back open - can you provide some log output from the Consul servers when you get into a bad state?

@slackpad slackpad reopened this Nov 1, 2017
@slackpad slackpad added this to the 1.0.1 milestone Nov 1, 2017
@slackpad slackpad added the type/bug Feature does not function as expected label Nov 1, 2017
@christoe

christoe commented Nov 7, 2017

Happened again yesterday. This time we can't see any indication of high iowait, but all the other symptoms are identical to what we've seen before.

Unfortunately there are no logs that seem to explain what's happening. The first indication that something is wrong:

consul: error getting server health from "<consul cluster server>": context deadline exceeded
[...]
consul: error getting server health from "<consul cluster server>": last request still outstanding
[...]
memberlist: Marking <consul member> as failed, suspect timeout reached (2 peer confirmations)
serf: EventMemberFailed: <consul member> <ip address>
consul: member '<consul member>' failed, marking health critical
serf: EventMemberJoin: <consul member> <ip address>
consul: member '<consul member>' joined, marking health alive
[...]
consul.kvs: Rejecting lock of <kv lock path> due to lock-delay until 2017-11-06 16:01:00
memberlist: Failed fallback ping: read tcp <consul server ip>:58940-><consul member ip>:8301: i/o timeout
agent: Check 'vault:<consul server ip>:8200:vault-sealed-check' missed TTL, is now critical

After this, Consul just spews timeout messages until the leader is restarted:

consul.session: Apply failed: timed out enqueuing operation
consul.kvs: Apply failed: timed out enqueuing operation
[...]

During this time, the Consul cluster is completely non-functional. As Carl described above, we think Consul enqueues operations until all of the cluster servers run out of fds. The total number of fds allocated on the Consul server nodes is shown below (note the big drop when the "leader" node is restarted):
[graph: allocated file descriptors per Consul server node]

@christoe

christoe commented Nov 7, 2017

We can't find any metrics that indicate something is wrong. We do see quite a lot of member flapping a little while before the cluster acts up. Also (not sure whether this is helpful or not), we can see Raft FSM pending tasks increasing at the time the cluster starts misbehaving:
[graph: Raft FSM pending tasks]

@slackpad
Contributor

slackpad commented Nov 8, 2017

Hi @christoe, thanks for the additional info. Can you use something like lsof to see what kind of FDs are accumulating? I'd be curious whether you have heavy inbound client traffic during these times (lots of TCP connections to :8500), whether something is happening at the RPC/Raft layer (TCP connections to :8300), or whether it's something else.
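Something along these lines gives a quick breakdown of established connections by destination port (just a sketch; it assumes the agent process is named consul and that lsof is installed - adjust the pgrep for your setup):

# Group the agent's established TCP connections by destination port
sudo lsof -nP -a -p "$(pgrep -x consul)" -i TCP -s TCP:ESTABLISHED \
  | awk 'NR > 1 { sub(/.*:/, "", $9); print $9 }' | sort | uniq -c | sort -rn

A pile of 8500 entries would point at client HTTP traffic, 8300 at the RPC/Raft layer.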

@slackpad slackpad modified the milestones: 1.0.1, Unplanned Nov 8, 2017
@christoe

christoe commented Nov 8, 2017

Hi @slackpad, sure - I can try that the next time the cluster breaks down. However, it seems the fd depletion is more an effect of the cluster being broken than the cause (the cluster broke down around 15:02 in the graphs above).

@christoe

Hi again @slackpad! The cluster broke down again this weekend (it feels as if 1.0.0 is worse than before) and I managed to get a list of open fds. Two categories stick out:

Localhost to localhost:8500:
consul-ag 907 consul 6u IPv6 1277992252 0t0 TCP localhost:59218->localhost:8500 (ESTABLISHED)

And IPv6 sockets:
consul-ag 907 consul 7u sock 0,6 0t0 1277993586 protocol: TCPv6

Together, these make up almost all of the 1024 fds available to the consul user.

@MaxFlanders

Hey guys, we have also been seeing errors of this sort. Our cluster will go several days without any issues, and then we will suddenly get a slew of errors like:

Nov 21 22:02:30 consul3 consul[2075]: consul.kvs: Apply failed: timed out enqueuing operation
Nov 21 22:02:30 consul3 consul[2075]: consul.kvs: Apply failed: timed out enqueuing operation

Most requests to Vault/Consul start returning 503s at that point until we kill the Consul leader node. After Consul elects a new leader, activity returns to normal. I have not been able to verify whether this was connected to fd exhaustion, though.

This is affecting our production environment - does anyone know if there is a version of Consul that is not impacted by this issue?

Also, I expect this to happen again, since this is the third time; if there are any diagnostics you'd like me to run the next time it happens, I'd be happy to.

@slackpad
Contributor

Hi @326TimesBetter and @christoe, this looks very similar to #3700, which we were able to reproduce during our 1.0.1 release testing and fix in 1.0.1. If you can try again with 1.0.1 I'd appreciate it.

If you still see problems, the best thing is probably to turn on https://www.consul.io/docs/agent/options.html#enable_debug and then save off the contents of /debug/pprof/goroutine?debug=1 on all three servers when they get into a bad state.
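For example, something like this run once the servers are in a bad state would capture what we need (the server hostnames and the 8500 HTTP port are placeholders for your setup, and enable_debug must be set on each server first):

# Save a goroutine dump from each server's HTTP API port
for srv in consul-server-1 consul-server-2 consul-server-3; do
  curl -s "http://${srv}:8500/debug/pprof/goroutine?debug=1" > "goroutines-${srv}.txt"
done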

@MaxFlanders

We are working on upgrading our Consul servers to 1.0.1 right now. Does this require us to upgrade all Consul agents as well?

@slackpad
Contributor

For the fix in #3700 you just need the servers at 1.0.1.

@MaxFlanders

Thank you @slackpad, we have upgraded our servers and also enabled debugging, so if it happens again we can dump that endpoint.

It has usually taken 2-4 days between these types of crashes, so we won't be able to be confident for a while, but I will definitely get back to you here.

@christoe

Thanks @slackpad, I too will install 1.0.1 on our servers and get back to you with the results!

@MaxFlanders

@slackpad I am happy to say that we have not had another incident in which our Consul leaders crashed since we upgraded our servers to 1.0.1.

@christoe

We're also cautiously optimistic. Consul has been running without a restart since the upgrade to 1.0.1 - even with quite a few I/O-wait peaks. Thanks @slackpad!

@nickwales
Contributor

nickwales commented Dec 17, 2017

Is it possible that in a large (~1000 node), busy cluster, once this behavior has started it persists even after a leadership change? This cluster was upgraded to 1.0.0 on the 11th of December.

We saw consul.kvs: Apply failed: timed out enqueuing operation and a huge jump in system utilization. Repeatedly moving leadership didn't fix the issue, and upgrading to 1.0.1 during this time didn't solve it either.

During this period, consul info on agents could take minutes to run, agents were unable to update the catalog so service discovery was impacted, and containerpilot health checks started to fail because they could not register.

Eventually we had to shut down the cluster, clean out the data directory, and restart. The KV store was restored from a snapshot and we haven't had a repeat of this in the subsequent two days running on version 1.0.1.

@nickwales
Contributor

nickwales commented Jan 5, 2018

We are currently seeing this on a cluster that has been upgraded to 1.0.1. I have collected some data here: https://gist.github.com/nickwales/deed318a6296b709cf5f3f57096feb98

I have destroyed and restored the cluster twice now, but we immediately end up in the same situation, with huge load on the leader and the following repeated on the follower servers:

2018/01/05 14:23:43 [WARN] consul: error getting server health from "ip-10-122-4-24": context deadline exceeded
2018/01/05 14:23:44 [WARN] consul: error getting server health from "ip-10-122-10-181": last request still outstanding

Update: pulled a heap dump with enable_debug turned on: https://gist.github.com/0cf3c3ff50f2ebd2dbae58c66324fb9a

Update 2: downgraded the servers to 0.8.3 and we're back in business.

Of note during the outage: running consul info on 0.8.3 agents responded immediately, while on 1.0.2 agents in particular it could hang for up to a minute. 1.0.1 seemed somewhere in the middle.

@slackpad slackpad modified the milestones: Unplanned, 1.0.3 Jan 9, 2018
@nickwales
Contributor

@slackpad thanks for looking into this.

Is there a way we can replicate the behavior? It seems to come out of nowhere.

We're able to help test any bug fixes you're working on.

@slackpad slackpad added this to the Next milestone Jan 23, 2018
@jippi
Contributor

jippi commented Mar 6, 2018

We moved to 1.0.6 and all of this got resolved :)

@banks
Member

banks commented Mar 14, 2018

@jippi can you confirm how long you've been running 1.0.6 without issues before deciding it's resolved?

We're happy to hear you're not seeing issues any more, but since we still don't have a full understanding of the failures after the fixes in 1.0.0 and 1.0.1, we're cautious about closing this early.

@nickwales have you continued to see this issue since your last update in January? If not, which version have you settled on?

Thanks.

@jippi
Contributor

jippi commented Mar 14, 2018

@banks since ~12th of February :)

@m4rcc

m4rcc commented Mar 21, 2018

Hello,
I have just upgraded Consul from 0.7.5 to the latest 1.0.6 and encountered this issue as well. The upgrade was done only on the servers; the clients are still at 0.7.5 since I saw the error. The errors appear only on the leader node.
consul-server-1-gls-dus10-ci-efood-real-de.log
consul-server-1-gls-dus21-ci-efood-real-de.log
consul-server-2-gls-dus10-ci-efood-real-de.log
consul-server-2-gls-dus21-ci-efood-real-de.log
consul-server-3-gls-dus21-ci-efood-real-de.log

The Raft protocol in use is 2 because of the agents' lower Consul version.

@nickwales
Contributor

@banks We're currently running 1.0.1 in dev and test and 0.8.3 in stage and production.

Dev and Test have never had an outage.

  • Production had an outage in December, 11 days after upgrading from 0.8.3 to 1.0.0; during the outage we upgraded from 1.0.0 to 1.0.1 and the outage resolved.
  • Stage then had an issue in January while running 1.0.1.

At this point we were nervous about running 1.0.x in customer-facing environments, so we decided to downgrade those two environments to the previous version, 0.8.3.

We kept dev and test at 1.0.1 in an effort to provide the core dumps should we see a failure, but I don't have any to report currently.

@avoidik

avoidik commented Apr 13, 2018

Can we somehow increase timeouts to deal with this issue?

https://github.com/hashicorp/consul/blob/master/agent/consul/config.go#L428-L429

like

"consul": {
  "server": {
    "health_interval": 10
  }
}

@avoidik

avoidik commented Apr 13, 2018

I've tried downgrading raft_protocol to v2 and re-spinning the cluster, and it seems the problem is gone; I only have 1.0.6 in consul members.

I faced #3361 during the upgrade.

@jacob-koren-zooz

jacob-koren-zooz commented May 13, 2018

We are seeing this with Consul 1.0.2, in a five-server setup.

2018/05/12 03:47:47 [DEBUG] manager: Rebalanced 5 servers, next active server is xxx (Addr: tcp/11.1.5.51:27039) (DC: dc1)
2018/05/12 03:47:50 [DEBUG] memberlist: Initiating push/pull sync with: 11.1.4.123:xxx
2018/05/12 03:47:50 [DEBUG] memberlist: Stream connection from=11.1.5.51:xxx
2018/05/12 03:47:51 [DEBUG] http: Request GET /v1/status/leader (33.952µs) from=11.1.5.61:xxx
2018/05/12 03:47:51 [DEBUG] http: Request GET /v1/operator/autopilot/health (1.074998ms) from=11.1.5.61:xxx
2018/05/12 03:47:54 [WARN] consul: error getting server health from "xxx": context deadline exceeded
2018/05/12 03:47:55 [WARN] consul: error getting server health from "xxx": last request still outstanding
2018/05/12 03:47:55 [DEBUG] http: Request GET /v1/operator/autopilot/health (942.004µs) from=127.0.0.1:xxx
Received task health update, healthy: false
2018/05/12 03:47:57 [WARN] consul: error getting server health from "xxx": last request still outstanding
2018/05/12 03:47:59 [WARN] consul: error getting server health from "xxx": last request still outstanding
2018/05/12 03:48:01 [DEBUG] http: Request GET /v1/status/leader (30.218µs) from=11.1.5.61:xxx
2018/05/12 03:48:01 [DEBUG] http: Request GET /v1/operator/autopilot/health (946.665µs) from=11.1.5.61:xxx
2018/05/12 03:48:01 [WARN] consul: error getting server health from "xxx": last request still outstanding
2018/05/12 03:48:01 [DEBUG] Skipping remote check "serfHealth" since it is managed automatically
2018/05/12 03:48:01 [DEBUG] agent: Node info in sync
2018/05/12 03:48:03 [WARN] consul: error getting server health from "xxx": last request still outstanding
2018/05/12 03:48:05 [WARN] consul: error getting server health from "xxx": last request still outstanding
2018/05/12 03:48:05 [DEBUG] http: Request GET /v1/operator/autopilot/health (1.016149ms) from=127.0.0.1:xxx
Received task health update, healthy: false
2018/05/12 03:48:06 [DEBUG] http: Request GET /v1/status/leader (34.273µs) from=11.1.4.191:xxx
2018/05/12 03:48:07 [WARN] consul: error getting server health from "xxx": last request still outstanding
2018/05/12 03:48:09 [WARN] consul: error getting server health from "xxx": last request still outstanding
2018/05/12 03:48:11 [DEBUG] http: Request GET /v1/status/leader (35.576µs) from=11.1.5.61:xxx
2018/05/12 03:48:11 [DEBUG] http: Request GET /v1/operator/autopilot/health (1.027931ms) from=11.1.5.61:xxx
2018/05/12 03:48:11 [WARN] consul: error getting server health from "xxx": last request still outstanding
2018/05/12 03:48:13 [WARN] consul: error getting server health from "xxx": last request still outstanding
2018/05/12 03:48:15 [WARN] consul: error getting server health from "xxx": last request still outstanding
2018/05/12 03:48:15 [DEBUG] http: Request GET /v1/operator/autopilot/health (897.07µs) from=127.0.0.1:xxx
Received task health update, healthy: false
Received killTask for task consul-framework.consul-1.b586c018-8e6c-4c63-856f-17f08147f16a

This is happening on three nodes simultaneously and makes the cluster fail.

@banks
Member

banks commented May 14, 2018

@jacob-koren-zooz thanks for the info. Can you confirm whether your failure occurs following a disk IO/latency issue? It's not clear from the discussion whether that is a common factor between all the incidents reported, or whether the issue is triggered by anything that causes timing to fall into a specific pattern.

@avoidik interesting, what made you think that raft protocol v2 would help? I'd need to look into the specific differences to be sure, but it's not obvious to me that that would be a factor.

can we somehow increase timeouts to deal with this issue?

Increasing that might reduce the frequency somewhat, but fundamentally the autopilot health checks should not cause a cluster to be unable to re-elect a new leader even if they fail every time. I think you would need to set it like {"server_health_interval": "10s"}, but I've not tested that.
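In the meantime, it may help to capture what autopilot itself reports while the cluster is degraded. The health endpoint already showing up in the logs above can be queried directly (this sketch assumes a local agent on the default 8500 HTTP port):

# Show the autopilot configuration and the operator view of server health
consul operator autopilot get-config
curl -s http://127.0.0.1:8500/v1/operator/autopilot/health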

@nickwales
Contributor

Hey, we just had an outage on a 0.8.3 cluster... I was organized enough to have gathered the goroutine and thread dumps from each machine this time. Is there somewhere/someone I can send these to on a more private channel?

@pearkes
Contributor

pearkes commented Jul 10, 2018

@nickwales if you are comfortable sending them to jack@hashicorp.com I'll get it to the right place. Thanks for doing that.

@NikolaeVarius

We're experiencing a similar issue. We don't have a goroutine dump, but I do have syslog output. We are on 0.8.3.

@banks
Member

banks commented Dec 7, 2018

While we don't have any more direct news on this, we recently found #5047, which could explain the failure of the leader to step down.

That issue concerns a bug where establishLeadership, which is called when a node first becomes leader, may fail; if it does, the leader currently remains in the leader loop, so it doesn't step down and allow an election, but it isn't fully set up either.

The missing link with this issue, and the clincher that it's the same root cause, would be figuring out how high disk IO could cause something in establishLeadership to fail. It's not obvious to me yet what that would look like, but we should keep in mind that the disk IO may be a symptom of generally high load or some other situation that could also cause this failure mode.

I've not yet had time to look back through the logs gathered here to see whether there are signs of the establishLeadership bug taking effect.

@caglar10ur

👋 We just had this problem (one cluster node slowly leaking fds, ending up with a non-working cluster after hitting the open-fd limit and crashing) with Consul v0.9.3 [both clients and servers].

I'm a bit confused about where things stand in terms of what has been fixed and what remains, given the conflicting comments (e.g. 1.0.2 not showing the problem vs. the issue also occurring on 1.0.2).

It would be really great if you could provide a status update if you have one!

Also, please let me know if logs or more info regarding our issue would help ⭐️

@banks
Member

banks commented Jun 3, 2019

@caglar10ur our best guess is that the leadership hanging state is a result of #5047; the difficult underlying changes to raft needed to make that fix possible are now done, so we hope to have the fix in the next release.

The hope is that we won't see any more instances of this issue once that is merged, but since we have never been able to reproduce it or establish a definite root cause, it's hard to say for certain!

@hanshasselberg
Member

@caglar10ur thanks for bearing with us. #5247 is merged and will be included in the next release. It fixes a problem in Consul that could have caused your issue.
I will ping this thread once we have a release you can play around with.

@caglar10ur

@i0rek great. I'll keep an eye on this thread and test once you release a new version. Thanks!

@hanshasselberg
Member

@caglar10ur 1.5.2 has been released and includes a possible fix.

@caglar10ur

@i0rek will try and let you know ASAP. Thanks again 🥇

@caglar10ur

caglar10ur commented Jul 30, 2019

Just letting you know that we haven't forgotten this. We are slowly upgrading clusters and observing their behavior for a while before moving on to the next one - so far so good. I'll keep you updated on our progress.

@hanshasselberg
Member

Much appreciated!

@caglar10ur

caglar10ur commented Aug 7, 2019

Completed migrating 10 clusters, with various numbers of agents and servers, from 0.9.2 to 1.5.2. We normally got hit by this issue every other week, so I still need to observe for a while before declaring victory, but let me say this: it was one of the easiest migrations I have ever done in my professional career. Even though I was migrating from an ancient version to the latest one, nothing broke and everything worked as designed/expected (OK, one thing was broken, but it was just the UI, because it now uses new API calls to render) - thank you so much for that ❤️!

Will report back in a couple of weeks...

@hanshasselberg
Member

I am glad to hear that everything went well for you!

@johncowen
Contributor

Hey @caglar10ur

Just picking up on the mention of the UI here. I'm not sure if you have time to go into further detail on the UI issue you mention above. I'm half guessing it was just due to the rebuild of the UI we did a good few minor versions back (looking at the version jump, you probably went from UI v1 to UI v2). I thought I'd quickly check in with you to make sure; it sounds like the UI issue was a temporary thing and everything is fine now?

If you have time to give some more detail on the issue, that'd be great, but don't worry if not. If there is something to look into, I can open another issue so I don't hijack this one.

Thanks!

@caglar10ur

@johncowen ah sorry, I should have been clearer 🤦‍♂ - the UI partially worked during the upgrade, and everything is fine now.

As far as I can see, the Services section of the UI was the only non-working piece while the cluster was in its mixed state, with both 0.9.2 and 1.5.2 servers. Once all servers were upgraded, the UI started rendering the Services section again without throwing 500 errors.

@caglar10ur

👋 Just wanted to give an update as promised. It's been almost a month and things are looking really good on our end, both in terms of the number of file descriptors and the stability of the clusters.

Thanks!

@hanshasselberg
Member

Great to hear that @caglar10ur! I will close this issue then. Thank you for your help!
