Member has conflicting node ID #322
You could do as you're doing there. But it should be pulling the node ID from the hostname, which is going to be unique per container, isn't it?
Thanks Tim. What is the way to have the 'cat' correctly interpreted? Also, I do not understand why the node ID conflicts for one service in particular; I'll investigate.
The configuration interpolation doesn't run a subshell, so you'll need to inject it in some other way, for example by interpolating an environment variable.
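A minimal sketch of that idea, assuming ContainerPilot's `{{ .VAR }}` template syntax (the same form used later in this thread) and a hypothetical `CONSUL_NODE_ID` variable that was generated as a UUID and exported before ContainerPilot starts; the surrounding field names are placeholders and vary between ContainerPilot versions:

```json
{
  "consul": "localhost:8500",
  "coprocesses": [
    {
      "command": [
        "consul", "agent",
        "-node-id={{ .CONSUL_NODE_ID }}",
        "-data-dir=/var/lib/consul"
      ],
      "restarts": "unlimited"
    }
  ]
}
```

Note that, as pointed out further down in this thread, the interpolated value has to be something Consul will accept as a node ID, so a plain hostname is not enough.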
Yeah, I've never seen the problem you're experiencing. Is this on Triton?
@tgross FWIW I have the same issue without ContainerPilot, just plain Consul, in a Triton / SmartOS LX zone (CentOS image). I speculate that there is something in the LX zone implementation which causes the UUID generated as the node ID to always be the same for zones on the same CN.
@siepkes @lucj I'm having trouble figuring out how to replicate this on Triton. I've taken the blueprint in https://github.com/autopilotpattern/consul and run 3 nodes. And then I've added another instance to the cluster like so:
Note that I'm using an affinity filter here to ensure the 4th node lands on the same CN as the 1st node, and I can confirm this. @siepkes, do you have reproduction steps you can share? I must be missing something.
@tgross I haven't configured Docker in our private Triton instance, so unfortunately I can't run the test you describe. We use a combination of Packer, SaltStack and Terraform to create and deploy our images. I actually only have one LX zone image, because we needed V8 and that doesn't build on Illumos. I worked around the issue by having systemd generate a persistent unique node ID on first start:
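A minimal sketch of that kind of unit, assuming `uuidgen` is available in the image (the unit name and file paths are purely illustrative):

```ini
# /etc/systemd/system/consul-node-id.service (illustrative)
[Unit]
Description=Generate a persistent Consul node ID on first boot
Before=consul.service
# Only run if an ID has not already been generated on an earlier boot
ConditionPathExists=!/etc/consul/node-id

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'uuidgen > /etc/consul/node-id'

[Install]
WantedBy=multi-user.target
```

The Consul unit can then consume the generated value, for example through a small wrapper script that passes -node-id="$(cat /etc/consul/node-id)" to consul agent.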
The image I used with Consul in which I get this behavior is CentOS 7.
Ok, well we need to try to create a minimal reproduction here. What can you tell me about the topology of the clients/servers or the sequence of events, @siepkes?
Seems like this occurs with Consul 0.8.0, but it's fine with 0.7.5.
Seems to be fine when the local agents (the ones embedded in each service) are running Consul 0.8.1 and the server is running 0.7.5.
@lucj I encountered the issue when upgrading to 0.8. Because of the link to the Hashicorp issue I assumed it was clear that this is only an issue with Consul >= 0.8. Sorry, I should have been more explicit about that.
@siepkes No problem. I thought this would be fixed with 0.8.1 but it's not.
My bad, everything needs to be on a version < 0.8.0.
Good to know. Can you provide a link to the Hashicorp Consul issue where that might have been discussed and/or fixed?
In fact, I did not manage to get it working with 0.8.1.
Ok, I did a little digging and I think we can work around this. Try passing
You can go back with https://www.consul.io/docs/agent/options.html#_disable_host_node_id because -node-id={{ .HOSTNAME }} will not work due to
@EugenMayer that seems like the way to go then. We're also going to look into what might be generating the conflicting node IDs and try to suss out whether this is a Triton issue (which seems unlikely, as it's happening on Docker for Mac) or a Consul issue. (ref https://devhub.joyent.com/jira/browse/PRODSUP-16 for internal folks)
Probably kind of related topic: https://groups.google.com/forum/#!topic/consul-tool/9lm0HbyQVd4
@lucj I'm still unable to reproduce this, so it might help if we could get some more information from you about the conditions where you're seeing this problem.
@EugenMayer / @lucj: Those are unrelated. The background on those issues is that some providers use a timestamp-prefixed UUID, which makes the first 10 characters effectively useless. Hashing the full UUID, however, gives a sense of randomness to the first 10 characters. But Consul should, and does, properly detect node-ID collisions. If Consul is being seeded with a duplicate node ID, then this could happen. Consul is pointing out a discrepancy in the environment; where that discrepancy is, however, is what we're trying to figure out. CC @tgross
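To make the hashing point concrete, here is a small bash sketch with made-up UUIDs; the values and the use of sha256sum are purely illustrative and not what Consul does internally:

```bash
# Two node IDs that share a timestamp-derived prefix (made-up values)
a=59400000-1111-2222-3333-444444444444
b=59400000-5555-6666-7777-888888888888

# The raw prefixes collide...
echo "${a:0:10} ${b:0:10}"

# ...but hashing the full UUID spreads entropy into the leading characters
echo "$a" | sha256sum | cut -c1-10
echo "$b" | sha256sum | cut -c1-10
```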
@tgross, here is a Compose file that makes this error occur.
The service mq is based on https://github.com/lucj/autopilotpattern-rabbitmq, using Consul 0.8.0. The profile service is part of our application; it also uses Consul 0.8.0 and depends on the mq and db services. Below is the Dockerfile for this one:
Note: the CMD is overridden in the Compose file so that ContainerPilot runs as PID 1. Below is the part of the logs where profile is starting (moving out of the preStart once db and mq are up and running). You can see the node ID error in it.
To reply to your questions:
If all the services are using Consul 0.7.5 instead of 0.8.0, the problem does not occur.
You can pass "-disable-host-node-id" to the consul (0.8.1 and later) run command to make it generate a random node ID. It can't be reproduced on Triton, as it is a Consul-Docker issue. Cheers
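For reference, a minimal sketch of what that looks like on the agent command line; the data directory and join address below are placeholders:

```sh
# Generate a random node ID instead of deriving it from the host machine's
# UUID, avoiding collisions when several agents land on the same host
# (the flag is available in Consul 0.8.1 and later).
consul agent \
  -disable-host-node-id \
  -data-dir=/var/lib/consul \
  -retry-join=consul.example.internal
```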
Closing this, as it's a Consul issue.
@tgross are there any issues you refer to? I am very interested.
Just the Consul mailing list issue that you opened. Use -disable-host-node-id as described above.
I have several services running with ContainerPilot. For one of them, I get the following error when it runs, which prevents the local Consul agent from registering with the Consul server.
It seems this can be solved by fixing the node ID when running the Consul agent: hashicorp/consul#2877
How would you recommend setting the node-id in the consul subprocess?
Thanks a lot.