memberlist: Suspect <name> has failed, no acks received #953

Closed
jdrago999 opened this issue May 18, 2015 · 5 comments

@jdrago999

Hi -
TL;DR - what EXACT network settings are required for Consul nodes to speak to one another and exchange the "acks" they need to stay healthy?

I'm setting up my first Consul cluster on EC2 (VPC, Ubuntu 14.04, Consul v0.5.1 amd64). Everything worked great locally in a docker-compose setup, but in EC2 it didn't.

My cluster (at this point) looked like this:

  • serverA runs consul with bootstrap=true,server=true
  • serverB (different subnet, same security group) runs consul with bootstrap=false,server=true

After launching consul on serverA, I would launch consul on serverB and have it join serverA.

The logs on serverA looked like this:

    2015/05/18 17:53:30 [INFO] consul: member 'serverB' joined, marking health alive
    2015/05/18 17:53:32 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:34 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:36 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:37 [INFO] memberlist: Marking serverB as failed, suspect timeout reached
    2015/05/18 17:53:37 [INFO] serf: EventMemberFailed: serverB 10.0.2.95
    2015/05/18 17:53:37 [INFO] consul: member 'serverB' failed, marking health critical
    2015/05/18 17:53:38 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:51 [INFO] serf: attempting reconnect to serverB 10.0.2.95:8301
    2015/05/18 17:53:51 [INFO] serf: EventMemberJoin: serverB 10.0.2.95
    2015/05/18 17:53:51 [INFO] consul: member 'serverB' joined, marking health alive
    2015/05/18 17:53:54 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:56 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:58 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/05/18 17:53:59 [INFO] memberlist: Marking serverB as failed, suspect timeout reached
    2015/05/18 17:53:59 [INFO] serf: EventMemberFailed: serverB 10.0.2.95

The logs on serverB looked the same, just s/serverB/serverA/sg

In the EC2 security group's networking settings I had opened ingress and egress for UDP and TCP ports 8300-8600 and all ICMP. Still no luck; I was getting the same errors as above.

The Solution

Finally I opened all egress traffic within the subnet as shown in the following screenshot. Consul just started working.

[Screenshot: EC2 security group outbound rules allowing all traffic within the subnet]

I don't know what extra ports need to be opened; as far as I can tell I've followed the Consul docs, but I still couldn't get it working.

This brings me to my question:

What EXACT network settings are required for Consul nodes to speak to one another and exchange the "acks" they need to stay healthy?

Also, really loving consul. Thank you.

@ryanbreen (Contributor)

http://www.consul.io/docs/agent/options.html

Look at the section "Ports Used." Please re-open if that doesn't cover it.
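
For reference, the defaults listed on that page for this era of Consul: 8300 (server RPC, TCP), 8301 (Serf LAN gossip, TCP and UDP), 8302 (Serf WAN gossip, TCP and UDP), 8400 (CLI RPC, TCP), 8500 (HTTP API, TCP), and 8600 (DNS, TCP and UDP). The "no acks received" message comes from the gossip failure detector, whose probes run primarily over UDP on 8301, so blocked UDP in either direction is the usual culprit. Below is a minimal CloudFormation-style sketch of security-group rules along those lines (resource names are hypothetical; it assumes one security group, ConsulSecurityGroup, shared by all nodes):

    # Hypothetical sketch: allow Consul server RPC and gossip between members
    # of a shared security group, for both TCP and UDP.
    Resources:
      ConsulTcpIngress:
        Type: AWS::EC2::SecurityGroupIngress
        Properties:
          GroupId: !Ref ConsulSecurityGroup             # assumed existing SG
          IpProtocol: tcp
          FromPort: 8300                                # server RPC + Serf LAN/WAN
          ToPort: 8302
          SourceSecurityGroupId: !Ref ConsulSecurityGroup
      ConsulUdpIngress:
        Type: AWS::EC2::SecurityGroupIngress
        Properties:
          GroupId: !Ref ConsulSecurityGroup
          IpProtocol: udp
          FromPort: 8301                                # Serf gossip also needs UDP
          ToPort: 8302
          SourceSecurityGroupId: !Ref ConsulSecurityGroup

A default security group allows all outbound traffic, but if egress is restricted (as in the original report), the same TCP and UDP port ranges have to be allowed outbound as well.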

@ChristianKniep

Hey guys,

I had the same issue, and I had opened all the necessary ports (at least IMHO). :)
Two bare-metal servers, each running a Docker container with Consul 0.5.2:

    ports:
      - "8500:8500"
      - "8300:8300"
      - "8400:8400"
      - "8301:8301/tcp"
      - "8302:8302/tcp"
      - "8301:8301/udp"
      - "8302:8302/udp"

For me it was a matter of the explicit UDP port mappings. After adding the /udp ports, the RPC went through... :)

@anroots commented Sep 13, 2015

TL;DR: Expose ports 8301 and 8302 explicitly for both protocols (TCP and UDP).

This is not a Consul issue but related to the way Docker exposes ports.

I encountered a similar problem: I could create a cluster of 3 Consul servers (DigitalOcean machines, Consul running as the gliderlabs/consul-server Docker image); the nodes could see each other and elect a leader, but would fail right after the election.

    2015/09/13 18:41:07 [INFO] consul: Attempting bootstrap with nodes: [<server-1-ip>:8300 <server-2-ip>:8300 <server-3-ip>:8300]
    2015/09/13 18:41:07 [WARN] raft: Heartbeat timeout reached, starting election
    2015/09/13 18:41:07 [INFO] raft: Node at <server-1-ip>:8300 [Candidate] entering Candidate state
    2015/09/13 18:41:07 [INFO] raft: Election won. Tally: 2
    2015/09/13 18:41:07 [INFO] raft: Node at <server-1-ip>:8300 [Leader] entering Leader state
    2015/09/13 18:41:07 [INFO] consul: cluster leadership acquired
    2015/09/13 18:41:07 [INFO] consul: New leader elected: consul-web4
    2015/09/13 18:41:07 [INFO] raft: pipelining replication to peer <server-3-ip>:8300
    2015/09/13 18:41:07 [INFO] consul: member 'consul-web4' joined, marking health alive
    2015/09/13 18:41:08 [WARN] raft: Remote peer <server-2-ip>:8300 does not have local node <server-1-ip>:8300 as a peer
    2015/09/13 18:41:08 [INFO] consul: member 'consul-web3' joined, marking health alive
    2015/09/13 18:41:08 [INFO] raft: pipelining replication to peer <server-2-ip>:8300
    2015/09/13 18:41:08 [INFO] agent: Synced service 'consul'
    2015/09/13 18:41:09 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
    2015/09/13 18:41:11 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
    2015/09/13 18:41:12 [INFO] memberlist: Suspect consul-web3 has failed, no acks received
    2015/09/13 18:41:13 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
    2015/09/13 18:41:14 [INFO] memberlist: Suspect consul-web3 has failed, no acks received
    2015/09/13 18:41:14 [INFO] memberlist: Marking consul-web2 as failed, suspect timeout reached
    2015/09/13 18:41:14 [INFO] serf: EventMemberLeave: consul-web2 <server-3-ip>
    2015/09/13 18:41:14 [INFO] consul: removing server consul-web2 (Addr: <server-3-ip>:8300) (DC: dc1)
    2015/09/13 18:41:14 [INFO] raft: Removed peer <server-3-ip>:8300, stopping replication (Index: 8)
    2015/09/13 18:41:14 [INFO] consul: removed server 'consul-web2' as peer
    2015/09/13 18:41:14 [INFO] consul: member 'consul-web2' left, deregistering

I had exposed the appropriate ports in docker-compose.yml:

    ports:
      - "8400:8400"
      - "8500:8500"
      - "8301:8301"
      - "8302:8302"
      - "8300:8300"
      - "8600:8600"

...but this did not seem to work. Explicitly defining tcp/udp ports as @ChristianKniep suggested did the trick:

    ports:
      - "8300:8300"
      - "8301:8301/tcp"
      - "8301:8301/udp"
      - "8302:8302/tcp"
      - "8302:8302/udp"
      - "8400:8400"
      - "8500:8500"
      - "8600:8600"

The logs after the change:

    2015/09/13 18:45:22 [INFO] raft: Election won. Tally: 2
    2015/09/13 18:45:22 [INFO] raft: Node at <server-1-ip>:8300 [Leader] entering Leader state
    2015/09/13 18:45:22 [INFO] consul: cluster leadership acquired
    2015/09/13 18:45:22 [INFO] consul: New leader elected: consul-web4
    2015/09/13 18:45:22 [INFO] raft: pipelining replication to peer <server-1-ip>:8300
    2015/09/13 18:45:22 [INFO] consul: member 'consul-web4' joined, marking health alive
    2015/09/13 18:45:22 [INFO] raft: pipelining replication to peer <server-2-ip>:8300
    2015/09/13 18:45:22 [INFO] consul: member 'consul-web3' joined, marking health alive
    2015/09/13 18:45:22 [INFO] consul: member 'consul-web2' joined, marking health alive
    2015/09/13 18:45:22 [INFO] agent: Synced service 'consul'

This might be because, by default, Docker publishes a port as TCP only, so you need to publish each port twice, once for each protocol.

Additionally, all of these publishing rules will default to tcp. If you need udp, simply tack it on to the end such as -p 1234:1234/udp. (source)
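
As a side note (not from this thread): newer Compose file formats (3.2 and later) also offer a long port syntax that makes the protocol explicit, which can read more clearly than repeating the short form. A small sketch of the same 8301 mapping:

    # Hypothetical long-syntax equivalent of "8301:8301/tcp" and "8301:8301/udp"
    ports:
      - target: 8301        # Serf LAN port inside the container
        published: 8301
        protocol: tcp
      - target: 8301
        published: 8301
        protocol: udp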

Related: #1465 and hashicorp/memberlist#37.

@asheshambasta

@anroots I've tried explicitly adding the ports and it doesn't seem to make any difference whatsoever.

I suspect it has something to do with the security group (since I'm trying this on EC2 instances and only one of the instances keeps failing).

@spawluk commented Aug 7, 2018

In my case the advertise address was wrong. I changed it and it worked.
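
For anyone hitting the same thing with the Docker setups above, here is a minimal docker-compose sketch showing where the advertise address is set, so other members reach the agent at an address that actually routes to it. The image is the one used earlier in this thread (assuming, as with gliderlabs/consul-server, that the command arguments are passed through to consul agent); the IP is a placeholder for the host's routable private address.

    # Sketch only: values are placeholders, adjust to your environment.
    consul:
      image: gliderlabs/consul-server
      # Advertise the host's routable private IP, not the container-internal one.
      command: -bootstrap-expect 3 -advertise 10.0.2.95
      ports:
        - "8300:8300"
        - "8301:8301/tcp"
        - "8301:8301/udp"
        - "8302:8302/tcp"
        - "8302:8302/udp"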
