memberlist: Suspect <name> has failed, no acks received #953

Closed
jdrago999 opened this issue May 18, 2015 · 5 comments

@jdrago999

Hi -
TL;DR - what EXACT network settings are required for Consul nodes to speak to one another and exchange the "acks" they need to stay healthy?

I'm setting up my first Consul cluster on EC2 (VPC, Ubuntu 14.04, Consul v0.5.1 amd64). Everything worked great locally in a docker-compose setup, but in EC2 it didn't.

My cluster (at this point) looked like this:

  • serverA runs consul with bootstrap=true,server=true
  • serverB (different subnet, same security group) runs consul with bootstrap=false,server=true

After launching consul on serverA, I would launch consul on serverB and have it join serverA.

The logs on serverA looked like this:

    2015/05/18 17:53:30 [INFO] consul: member 'serverB' joined, marking health alive
    2015/05/18 17:53:32 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:34 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:36 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:37 [INFO] memberlist: Marking serverB as failed, suspect timeout reached
    2015/05/18 17:53:37 [INFO] serf: EventMemberFailed: serverB 10.0.2.95
    2015/05/18 17:53:37 [INFO] consul: member 'serverB' failed, marking health critical
    2015/05/18 17:53:38 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:51 [INFO] serf: attempting reconnect to serverB 10.0.2.95:8301
    2015/05/18 17:53:51 [INFO] serf: EventMemberJoin: serverB 10.0.2.95
    2015/05/18 17:53:51 [INFO] consul: member 'serverB' joined, marking health alive
    2015/05/18 17:53:54 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:56 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:58 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:59 [INFO] memberlist: Marking serverB as failed, suspect timeout reached
    2015/05/18 17:53:59 [INFO] serf: EventMemberFailed: serverB 10.0.2.95

The logs on serverB looked the same, just s/serverB/serverA/sg

In the EC2 security group's networking settings I had opened ingress and egress for UDP and TCP ports 8300-8600 and all ICMP. Still no luck; I was getting the same errors as above.

The Solution

Finally I opened all egress traffic within the subnet as shown in the following screenshot. Consul just started working.

[Screenshot: EC2 security group outbound rules allowing all traffic within the subnet]

I don't know what extra ports need to be opened; as far as I can tell I've followed the Consul docs, but I still couldn't get it working.

This brings me to my question:

What EXACT network settings are required for Consul nodes to speak to one another and exchange the "acks" they need to stay healthy?

Also, really loving consul. Thank you.

@ryanbreen (Contributor)

http://www.consul.io/docs/agent/options.html

Look at the section "Ports Used." Please re-open if that doesn't cover it.
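
For reference, the defaults listed on that page for this era of Consul: 8300 (server RPC, TCP), 8301 (Serf LAN gossip, TCP and UDP), 8302 (Serf WAN gossip, TCP and UDP), 8400 (CLI RPC, TCP), 8500 (HTTP API, TCP), and 8600 (DNS, TCP and UDP). The "no acks received" message comes from the gossip failure detector, whose probes run primarily over UDP on 8301, so blocked UDP in either direction is the usual culprit. Below is a minimal CloudFormation-style sketch of security-group rules along those lines (resource names are hypothetical; it assumes one security group, ConsulSecurityGroup, shared by all nodes):

    # Hypothetical sketch: allow Consul server RPC and gossip between members
    # of a shared security group, for both TCP and UDP.
    Resources:
      ConsulTcpIngress:
        Type: AWS::EC2::SecurityGroupIngress
        Properties:
          GroupId: !Ref ConsulSecurityGroup             # assumed existing SG
          IpProtocol: tcp
          FromPort: 8300                                # server RPC + Serf LAN/WAN
          ToPort: 8302
          SourceSecurityGroupId: !Ref ConsulSecurityGroup
      ConsulUdpIngress:
        Type: AWS::EC2::SecurityGroupIngress
        Properties:
          GroupId: !Ref ConsulSecurityGroup
          IpProtocol: udp
          FromPort: 8301                                # Serf gossip also needs UDP
          ToPort: 8302
          SourceSecurityGroupId: !Ref ConsulSecurityGroup

A default security group allows all outbound traffic, but if egress is restricted (as in the original report), the same TCP and UDP port ranges have to be allowed outbound as well.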

@ChristianKniep

Hey guys,

I had the same issue, and I had opened all the necessary ports (at least IMHO). :)
Two bare-metal servers, each running a Docker container with Consul 0.5.2:

    ports:
      - "8500:8500"
      - "8300:8300"
      - "8400:8400"
      - "8301:8301/tcp"
      - "8302:8302/tcp"
      - "8301:8301/udp"
      - "8302:8302/udp"

For me it was a matter of the explicit UDP port mappings. After adding the /udp ports, the RPC went through... :)

@anroots commented Sep 13, 2015

TL;DR: Expose ports 8301 and 8302 explicitly for both protocols (TCP and UDP).

This is not a Consul issue but related to the way Docker exposes ports.

I encountered a similar problem: I could create a cluster of 3 Consul servers (DigitalOcean machines, Consul running as the gliderlabs/consul-server Docker image); the nodes could see each other and elect a leader, but would fail right after the election.

    2015/09/13 18:41:07 [INFO] consul: Attempting bootstrap with nodes: [<server-1-ip>:8300 <server-2-ip>:8300 <server-3-ip>:8300]
    2015/09/13 18:41:07 [WARN] raft: Heartbeat timeout reached, starting election
    2015/09/13 18:41:07 [INFO] raft: Node at <server-1-ip>:8300 [Candidate] entering Candidate state
    2015/09/13 18:41:07 [INFO] raft: Election won. Tally: 2
    2015/09/13 18:41:07 [INFO] raft: Node at <server-1-ip>:8300 [Leader] entering Leader state
    2015/09/13 18:41:07 [INFO] consul: cluster leadership acquired
    2015/09/13 18:41:07 [INFO] consul: New leader elected: consul-web4
    2015/09/13 18:41:07 [INFO] raft: pipelining replication to peer <server-3-ip>:8300
    2015/09/13 18:41:07 [INFO] consul: member 'consul-web4' joined, marking health alive
    2015/09/13 18:41:08 [WARN] raft: Remote peer <server-2-ip>:8300 does not have local node <server-1-ip>:8300 as a peer
    2015/09/13 18:41:08 [INFO] consul: member 'consul-web3' joined, marking health alive
    2015/09/13 18:41:08 [INFO] raft: pipelining replication to peer <server-2-ip>:8300
    2015/09/13 18:41:08 [INFO] agent: Synced service 'consul'
    2015/09/13 18:41:09 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
    2015/09/13 18:41:11 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
    2015/09/13 18:41:12 [INFO] memberlist: Suspect consul-web3 has failed, no acks received
    2015/09/13 18:41:13 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
    2015/09/13 18:41:14 [INFO] memberlist: Suspect consul-web3 has failed, no acks received
    2015/09/13 18:41:14 [INFO] memberlist: Marking consul-web2 as failed, suspect timeout reached
    2015/09/13 18:41:14 [INFO] serf: EventMemberLeave: consul-web2 <server-3-ip>
    2015/09/13 18:41:14 [INFO] consul: removing server consul-web2 (Addr: <server-3-ip>:8300) (DC: dc1)
    2015/09/13 18:41:14 [INFO] raft: Removed peer <server-3-ip>:8300, stopping replication (Index: 8)
    2015/09/13 18:41:14 [INFO] consul: removed server 'consul-web2' as peer
    2015/09/13 18:41:14 [INFO] consul: member 'consul-web2' left, deregistering

I had exposed the appropriate ports in docker-compose.yml:

    ports:
      - "8400:8400"
      - "8500:8500"
      - "8301:8301"
      - "8302:8302"
      - "8300:8300"
      - "8600:8600"

...but this did not seem to work. Explicitly defining tcp/udp ports as @ChristianKniep suggested did the trick:

    ports:
      - "8300:8300"
      - "8301:8301/tcp"
      - "8301:8301/udp"
      - "8302:8302/tcp"
      - "8302:8302/udp"
      - "8400:8400"
      - "8500:8500"
      - "8600:8600"

The logs after the change:

    2015/09/13 18:45:22 [INFO] raft: Election won. Tally: 2
    2015/09/13 18:45:22 [INFO] raft: Node at <server-1-ip>:8300 [Leader] entering Leader state
    2015/09/13 18:45:22 [INFO] consul: cluster leadership acquired
    2015/09/13 18:45:22 [INFO] consul: New leader elected: consul-web4
    2015/09/13 18:45:22 [INFO] raft: pipelining replication to peer <server-1-ip>:8300
    2015/09/13 18:45:22 [INFO] consul: member 'consul-web4' joined, marking health alive
    2015/09/13 18:45:22 [INFO] raft: pipelining replication to peer <server-2-ip>:8300
    2015/09/13 18:45:22 [INFO] consul: member 'consul-web3' joined, marking health alive
    2015/09/13 18:45:22 [INFO] consul: member 'consul-web2' joined, marking health alive
    2015/09/13 18:45:22 [INFO] agent: Synced service 'consul'

This might be because, by default, Docker publishes a port as TCP only, so you need to publish each port twice, once for each protocol.

Additionally, all of these publishing rules will default to tcp. If you need udp, simply tack it on to the end such as -p 1234:1234/udp. (source)
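
As a side note (not from this thread): newer Compose file formats (3.2 and later) also offer a long port syntax that makes the protocol explicit, which can read more clearly than repeating the short form. A small sketch of the same 8301 mapping:

    # Hypothetical long-syntax equivalent of "8301:8301/tcp" and "8301:8301/udp"
    ports:
      - target: 8301        # Serf LAN port inside the container
        published: 8301
        protocol: tcp
      - target: 8301
        published: 8301
        protocol: udp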

Related: #1465 and hashicorp/memberlist#37.

@asheshambasta

@anroots I've tried explicitly adding the ports and it doesn't seem to make any difference whatsoever.

I suspect it has something to do with the security group (since I'm trying this on EC2 instances and only one of the instances keeps failing).

@spawluk commented Aug 7, 2018

In my case the advertise address was wrong. I changed it and it worked.
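
For anyone hitting the same thing with the Docker setups above, here is a minimal docker-compose sketch showing where the advertise address is set, so other members reach the agent at an address that actually routes to it. The image is the one used earlier in this thread (assuming, as with gliderlabs/consul-server, that the command arguments are passed through to consul agent); the IP is a placeholder for the host's routable private address.

    # Sketch only: values are placeholders, adjust to your environment.
    consul:
      image: gliderlabs/consul-server
      # Advertise the host's routable private IP, not the container-internal one.
      command: -bootstrap-expect 3 -advertise 10.0.2.95
      ports:
        - "8300:8300"
        - "8301:8301/tcp"
        - "8301:8301/udp"
        - "8302:8302/tcp"
        - "8302:8302/udp"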
