Nodes locking up #206

Closed
keyneston opened this issue Jun 11, 2014 · 18 comments

Comments

@keyneston

I just upgraded to commit 6d43b8c and am now seeing a small percentage of nodes constantly failing their serf checks. There is no consistency in which nodes fail out. Restarting Consul resolves the issue, but only temporarily.

Consul is running:

nobody   14838  0.3  0.0 160016 13604 ?        Ssl  15:43   0:01 /usr/bin/consul agent -config-dir /etc/consul.d/

But consul members and other API requests hang and never respond.

There is nothing in the logs beyond messages like:

Jun 11 15:47:31 bar21 consul[14838]: serf: messageJoinType: foo52
Jun 11 15:47:31 bar21 consul[14838]: serf: messageJoinType: foo52
Jun 11 15:47:31 bar21 consul[14838]: serf: messageJoinType: foo52
Jun 11 15:47:31 bar21 consul[14838]: serf: messageJoinType: foo52
Jun 11 15:47:31 bar21 consul[14838]: serf: messageJoinType: foo52
Jun 11 15:47:32 bar21 consul[14838]: serf: messageJoinType: foo52
Jun 11 15:47:32 bar21 consul[14838]: serf: messageJoinType: foo52
Jun 11 15:47:33 bar21 consul[14838]: serf: EventMemberFailed: foo29 1.1.1.5
Jun 11 15:47:33 bar21 consul[14838]: serf: EventMemberFailed: zxy1 1.1.1.10
Jun 11 15:47:35 bar21 consul[14838]: serf: EventMemberFailed: baz1 1.1.1.15
Jun 11 15:47:36 bar21 consul[14838]: serf: messageJoinType: foo40
Jun 11 15:47:36 bar21 consul[14838]: serf: messageJoinType: foo40
Jun 11 15:47:36 bar21 consul[14838]: serf: messageJoinType: foo40
Jun 11 15:47:36 bar21 consul[14838]: serf: messageJoinType: foo40
Jun 11 15:47:36 bar21 consul[14838]: serf: messageJoinType: foo40
Jun 11 15:47:36 bar21 consul[14838]: serf: messageJoinType: foo40
@keyneston
Author

Config:

{
    "bind_addr": "<scrubbed>",
    "data_dir": "/var/lib/consul",
    "datacenter": "dc01",
    "enable_syslog": true,
    "server": false,
    "start_join": [
        "<scrubbed>",
        "<scrubbed>",
        "<scrubbed>",
        "<scrubbed>"
    ],
    "syslog_facility": "local3"
}

@armon
Member

armon commented Jun 11, 2014

Do you know what build you were running previously?

@keyneston
Author

2877124

@armon
Member

armon commented Jun 11, 2014

Are the nodes that lock up totally inoperable? It would be great if you could try to capture the output of sending a SIGUSR1 to get the metrics, followed by a SIGABRT to get a stack trace once they are hung.

@keyneston
Author

By the way, these only print to stdout and not syslog, which luckily isn't too bad of a problem since upstart on precise logs stdout automatically. Sadly, upstart on lucid doesn't, so I have to use syslog for consistent logging.

Metrics and stack trace: https://gist.github.com/tarrant/4d62bd6db4d9434ed8f4

@armon
Member

armon commented Jun 11, 2014

Yeah, those are super-debug-only and go directly to stderr. The SIGABRT stack dump is built into the Go runtime.

@armon
Member

armon commented Jun 11, 2014

Do you have something fancy going on with syslog? It seems that Consul is blocked writing to syslog. One goroutine is waiting to write to syslog, while holding a lock inside the Logger, so any goroutine that tries to log anything ends up deadlocking.
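
To make that failure mode concrete, here is a minimal standalone sketch (nothing here is Consul's actual code; blockingWriter simply stands in for a stuck syslog connection). The standard library's log.Logger holds its internal mutex across the call to the underlying Write, so one blocked write wedges every other goroutine that tries to log:

// Minimal sketch of the failure mode described above (not Consul's code):
// log.Logger holds an internal mutex while calling the underlying Write,
// so one blocked syslog write wedges every goroutine that logs afterwards.
package main

import (
	"log"
	"time"
)

// blockingWriter stands in for a syslog connection whose Write never returns.
type blockingWriter struct{}

func (blockingWriter) Write(p []byte) (int, error) {
	select {} // block forever, like a syslog write that never completes
}

func main() {
	logger := log.New(blockingWriter{}, "", log.LstdFlags)

	// This goroutine acquires the logger's mutex and blocks inside Write.
	go logger.Println("this write hangs")
	time.Sleep(100 * time.Millisecond)

	// Every later caller queues up behind that mutex and never returns,
	// which is what the hung agents looked like in the stack trace.
	logger.Println("this write waits on the mutex forever")
}

The second Println never returns, which is exactly the pattern described above: one goroutine stuck in the syslog write, everyone else queued behind the Logger's lock.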

@keyneston
Author

As far as I can tell, no, nothing fancy. But it makes sense that the issue is in syslog, since that's the reason I upgraded and the only config I changed when upgrading.

I'll poke around our syslog settings and see if I can discover anything.

@armon
Member

armon commented Jun 11, 2014

Let me know what you find. Otherwise, our only other choices are to either buffer logs indefinitely (bad) or just drop logs (worse).
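
For illustration, here is a rough sketch of what the "drop logs" option could look like (hypothetical; droppingWriter and its buffer depth are made up, and this is not anything Consul does). Lines are handed to a bounded channel and silently discarded when it is full, so a stuck downstream writer can never block the caller:

// Hypothetical sketch of the "drop logs" option: writes go into a bounded
// buffer and are discarded when it is full, so a stuck downstream writer
// can never block callers of the logger.
package main

import (
	"io"
	"log"
	"os"
)

type droppingWriter struct {
	ch chan []byte
}

func newDroppingWriter(dst io.Writer, depth int) *droppingWriter {
	w := &droppingWriter{ch: make(chan []byte, depth)}
	go func() {
		for msg := range w.ch {
			dst.Write(msg) // only this goroutine ever waits on the slow sink
		}
	}()
	return w
}

func (w *droppingWriter) Write(p []byte) (int, error) {
	msg := append([]byte(nil), p...) // copy; log.Logger reuses its buffer
	select {
	case w.ch <- msg:
	default:
		// buffer full: drop the line instead of blocking the caller
	}
	return len(p), nil
}

func main() {
	logger := log.New(newDroppingWriter(os.Stderr, 512), "", log.LstdFlags)
	logger.Println("this call returns immediately even if the sink is stuck")
}

A real implementation would also need to flush and close the channel on shutdown; this only illustrates the trade-off: the caller never blocks, but lines can be lost.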

@keyneston
Author

I can't find anything odd in our syslog setup. I'm attempting to replicate the issue with a small Go program and the go-syslog library.
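
A rough sketch of what such a repro could look like (hypothetical, not the exact program used here; it assumes go-syslog's NewLogger(priority, facility, tag) entry point). It hammers a single syslog writer from many goroutines; if the syslog daemon stops draining its socket, the writes stall and a SIGABRT dump should show the goroutines piled up inside Write:

// Hypothetical repro sketch: many goroutines hammering one syslog writer.
// If the syslog daemon stops draining its socket, the writes block and the
// goroutines never finish.
package main

import (
	"fmt"
	"log"
	"sync"

	gsyslog "github.com/hashicorp/go-syslog"
)

func main() {
	w, err := gsyslog.NewLogger(gsyslog.LOG_INFO, "local3", "syslog-repro")
	if err != nil {
		log.Fatalf("opening syslog: %v", err)
	}

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				// Syslogger implements io.Writer, so Fprintf works here.
				fmt.Fprintf(w, "goroutine %d message %d", id, j)
			}
		}(i)
	}
	wg.Wait()
	log.Println("finished without any writes blocking")
}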

@armon
Member

armon commented Jun 11, 2014

I'll be trying to replicate tomorrow as well. I hammered a perf cluster for a bit but never saw this; I wasn't using syslog, however.

@keyneston
Author

We only see this issue in our main cluster; our smaller clusters seem to be happy. This makes me believe it is a locking issue triggered by serf messages about nodes joining/leaving/etc., as this cluster sees significantly more of these messages.

@armon
Member

armon commented Jun 12, 2014

Did you see the same number of serf messages before the update? Or has there been an uptick in the new build?

@armon
Member

armon commented Jun 12, 2014

@tarrant On thinking about this more, I think a few more pieces are clear:

  1. There is increased Serf traffic because we now send the "build" tag with the version (e.g. 0.3rc). Since you are doing a rolling upgrade, all the clients are sending new "alive" messages to update the tags. This will cause some temporary congestion on a very large cluster like yours, since effective throughput is only about 40-80 updates/sec with Serf. This is not an issue, since Serf is designed to be eventually consistent.

  2. The increased activity causes increased logging, and for some reason the syslog write is blocking.

  3. Once the process locks up in the write, other nodes detect a failure (since the agent is not responding to ping messages).

So, I think the root issue here is the blocking syslog. The other pieces of this make sense.

@armon
Member

armon commented Jun 12, 2014

As a follow-up, it seems we are not alone in this. It's a known Go bug with syslog:
https://groups.google.com/forum/#!topic/Golang-Nuts/PMm8nH0yaoA
https://code.google.com/p/go/issues/detail?id=5932

@armon
Member

armon commented Jun 12, 2014

Just updated this: hashicorp/go-syslog@ac3963b

When you get a chance, can you see if this has fixed the issue?

@keyneston
Author

So far that seems to have fixed the issue.

@armon
Member

armon commented Jun 12, 2014

Super annoying that this issue exists, but glad it's fixed!
