Member has conflicting node ID #322

Closed
lucj opened this issue Apr 12, 2017 · 26 comments

lucj commented Apr 12, 2017

I have several services running with ContainerPilot. For one of them, I get the following error when it runs, which prevents the local Consul agent from registering with the Consul server.

* Failed to join 172.18.0.2: Member '4b7d8eea9a88' has conflicting node ID '7ba08ece-d74e-4b39-b5fb-c9b18a396864' with this agent's ID

It seems this can be solved by setting the node ID explicitly when running the Consul agent: hashicorp/consul#2877

How would you recommend setting the node-id in the Consul subprocess?

    {
      "command": ["/usr/local/bin/consul", "agent",
                  "-node-id", "something like $(cat /proc/sys/kernel/random/uuid)", ?
                  "-data-dir=/data",
                  "-config-dir=/config",
                  "-rejoin",
                  "-retry-join", "{{ if .CONSUL }}{{ .CONSUL }}{{ else }}consul{{ end }}",
                  "-retry-max", "10",
                  "-retry-interval", "10s"],
      "restarts": "unlimited"
    }

Thanks a lot.

tgross added the usage label Apr 12, 2017

tgross commented Apr 12, 2017

How would you recommend setting the node-id in the Consul subprocess?

You could do as you're doing there. But it should be pulling the node ID from the hostname, which is going to be unique per container, isn't it?


lucj commented Apr 12, 2017

Thanks, Tim. What is the way to have the 'cat' correctly interpreted?

Also, I do not understand why the node ID conflicts for one service in particular; I'll investigate.


tgross commented Apr 13, 2017

What is the way to have the 'cat' correctly interpreted?

The configuration interpolation doesn't run a subshell, so you'll need to inject it in some other way. Either by using the interpolation of an environment variable like {{ .HOSTNAME }} or by wrapping the Consul subprocess in a shell script.
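
For the shell-script route, a minimal sketch (untested; the script path is hypothetical and the flags are just the ones from your config above). Note this generates a fresh ID on every start of the subprocess:

    #!/bin/sh
    # /usr/local/bin/consul-agent.sh (hypothetical path)
    # Generate a random node ID in a real shell, since ContainerPilot's
    # config interpolation doesn't run a subshell.
    NODE_ID=$(cat /proc/sys/kernel/random/uuid)
    # ${CONSUL:-consul} mirrors the {{ if .CONSUL }}...{{ end }} template.
    exec /usr/local/bin/consul agent \
        -node-id "$NODE_ID" \
        -data-dir=/data \
        -config-dir=/config \
        -rejoin \
        -retry-join "${CONSUL:-consul}" \
        -retry-max 10 \
        -retry-interval 10s

and the ContainerPilot "command" then becomes ["/bin/sh", "/usr/local/bin/consul-agent.sh"].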

Also, I do not understand why the node ID conflicts for one service in particular

Yeah, I've never seen the problem you're experiencing. Is this on Triton?


siepkes commented Apr 18, 2017

@tgross FWIW I have the same issue without ContainerPilot, just plain Consul, in a Triton / SmartOS LX zone (CentOS image). I speculate that there is something in the LX zone implementation that causes the UUID generated as the node ID to always be the same for zones on the same CN.


tgross commented Apr 20, 2017

@siepkes @lucj I'm having trouble figuring out how to replicate this on Triton. I've taken the blueprint in https://github.com/autopilotpattern/consul and run 3 nodes. Then I added another instance to the cluster like so:

docker run -d -p 8500 --env-file ./_env \
  --name consul_consul_4 \
  --label "triton.cns.services=consul" \
  -e "affinity:container==consul_consul_1" \
  autopilotpattern/consul:latest \
    /usr/local/bin/containerpilot /bin/consul agent -server \
      -bootstrap-expect 3 -config-dir=/etc/consul -ui-dir /ui

Note that I'm using an affinity filter here to ensure the 4th node lands on the same CN as the 1st node, and I can confirm this with:

    for i in {1,2,4}; do triton inst get consul_consul_$i | jq .compute_node; done

I end up with 4 nodes in Consul just fine, each with their own hostname and node ID.

@siepkes do you have reproduction steps you can share? I must be missing something.


siepkes commented Apr 20, 2017

@tgross I haven't configured Docker in our private Triton instance, so unfortunately I can't run the test you describe. We use a combination of Packer, SaltStack and Terraform to create and deploy our images. I actually only have 1 LX zone image because we needed V8 and that doesn't build on Illumos. I worked around the issue by having systemd generate a persistent unique node-id on first start (wrapped in bash -c so the shell conditional is valid in a unit file; $$ escapes the dollar sign for systemd):

ExecStartPre=/bin/bash -c "if [[ ! -f /var/lib/consul/node-id ]]; then uuidgen -r | awk '{print tolower($$0)}' > /var/lib/consul/node-id; fi"

The image I used with Consul, in which I get this behavior, is CentOS 7 (23ee2dbc-c155-11e6-ab6d-bf5689f582fd). I don't know if it is relevant, but the Consul agents I used were clients, not servers.


tgross commented Apr 20, 2017

Ok, well we need to try to create a minimal reproduction here. What can you tell me about the topology of the clients/servers or the sequence of events, @siepkes?


lucj commented Apr 20, 2017

@tgross @siepkes I got the error when running my compose file against a single host (using Docker for Mac). I'll try to come up with a simple setup to reproduce this error.


lucj commented Apr 24, 2017

It seems this occurs with Consul 0.8.0 but is fine with 0.7.5.
From my understanding, 0.8.1 should fix this; I'll test it.
@siepkes which version of Consul are you using for the server and the agents?


lucj commented Apr 25, 2017

Things seem fine when the local agents (the ones embedded in each service) run Consul 0.8.1 and the server runs 0.7.5.
Running the server on any version >= 0.8.0 raises the error.


siepkes commented Apr 25, 2017

@lucj I encountered the issue when upgrading to 0.8. Because of the link to the Hashicorp issue, I assumed it was clear that this is only an issue with Consul >= 0.8. Sorry, I should have been more explicit about that.


lucj commented Apr 25, 2017

@siepkes No problem. I thought this would be fixed with 0.8.1, but it's not.
Good thing the agents can use 0.8.1, though.


lucj commented Apr 25, 2017

My bad, everything needs to be on a version < 0.8.0.


tgross commented Apr 25, 2017

From my understanding, 0.8.1 should fix this; I'll test it.

Good to know. Can you provide a link to the Hashicorp Consul issue where that might have been discussed and/or fixed?


lucj commented Apr 25, 2017

In fact I did not manage to get it working with 0.8.1.
From the changelog https://github.com/hashicorp/consul/blob/master/CHANGELOG.md:

agent: Node IDs derived from host information are now hashed to prevent things like common server hardware from generating IDs with a common prefix across nodes. [GH-2884]


tgross commented Apr 25, 2017

Ok, I did a little digging and I think we can work around this. Try passing -node-id={{ .HOSTNAME }} as part of your ContainerPilot configuration for the Consul agent. That will bypass the check it's doing that appears to be pulling info from the underlying host and not the container.
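
Something like this, based on the config from the original question (an untested sketch):

    {
      "command": ["/usr/local/bin/consul", "agent",
                  "-node-id={{ .HOSTNAME }}",
                  "-data-dir=/data",
                  "-config-dir=/config",
                  "-rejoin",
                  "-retry-join", "{{ if .CONSUL }}{{ .CONSUL }}{{ else }}consul{{ end }}",
                  "-retry-max", "10",
                  "-retry-interval", "10s"],
      "restarts": "unlimited"
    }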

@EugenMayer

You can fall back to https://www.consul.io/docs/agent/options.html#_disable_host_node_id instead, because -node-id={{ .HOSTNAME }} will not work: the hostname is not a valid UUID, so the agent fails with Error starting agent: Failed to setup node ID: uuid string is wrong length
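
For reference, a minimal sketch of the flag on the agent command line (untested; requires Consul >= 0.8.1, with the paths from the config earlier in this thread):

    /usr/local/bin/consul agent \
        -disable-host-node-id \
        -data-dir=/data \
        -config-dir=/config \
        -retry-join consul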


tgross commented May 3, 2017

@EugenMayer that seems like the way to go then. We're also going to look into what might be generating the conflicting node IDs and try to suss out whether this is a Triton issue (which seems unlikely, as it's happening on Docker for Mac) or a Consul issue.

(ref https://devhub.joyent.com/jira/browse/PRODSUP-16 for internal folks)

@EugenMayer

Probably a related topic: https://groups.google.com/forum/#!topic/consul-tool/9lm0HbyQVd4


tgross commented May 3, 2017

@lucj I'm still unable to reproduce this, so it might help if we could get some more information from you about the conditions where you're seeing this problem.

  • What version(s) of Consul, ContainerPilot, and Docker work? Which versions don't?
  • Is the data directory for Consul being reused? I.e., are you using -v arguments to mount it as a volume?
  • Are you seeing the problem only on Docker for Mac, Docker Machine, or Triton, or on more than one?
  • If you're seeing the problem on Triton, can you get the PI (if on-prem) or provide the CN? (triton inst get mycontainer | json compute_node will get this, or you can email me the UUIDs of the containers and I can dig this up at our end.)


sean- commented May 3, 2017

@EugenMayer / @lucj : Those are unrelated. The background on those issues is that some providers use a timestamp-prefixed UUID, which makes the first 10 characters effectively useless. Hashing the full UUID, however, gives a sense of randomness to the first 10 characters.

But Consul should, and does, properly detect node-id collisions. If Consul is being seeded with a duplicate node-id, then this could happen. Consul is pointing out a discrepancy in the environment; where it is, however, is what we're trying to figure out. CC @tgross


lucj commented May 3, 2017

@tgross, here is a compose file that makes this error occur.

version: '2'
services:

  consul:
    image: consul:0.8.0
    command: agent -server -client=0.0.0.0 -bootstrap -ui
    dns:
      - 127.0.0.1
    ports:
      - "8500:8500"
    restart: always

  db:
    image: autopilotpattern/mongodb
    restart: always

  mq:
    image: autopilotpattern/rabbitmq
    build: ../../../development/rabbitmq
    restart: always

  profile:
    image: traxxs/profile:develop
    build: ../../../development/profile
    command: ["containerpilot", "npm", "start"]
    restart: always

The service mq is based on https://github.com/lucj/autopilotpattern-rabbitmq, using Consul 0.8.0.

The service profile is part of our application; it also uses Consul 0.8.0 and depends on the mq and db services. Below is the Dockerfile for this one:

FROM mhart/alpine-node:6.10
ENV LAST_UPDATED 20170407T151300

RUN apk update && apk add curl unzip

# Install consul
# RUN export CONSUL_VERSION=0.7.5 \
#     && export CONSUL_CHECKSUM=40ce7175535551882ecdff21fdd276cef6eaab96be8a8260e0599fadb6f1f5b8 \
RUN export CONSUL_VERSION=0.8.0 \
    && export CONSUL_CHECKSUM=f4051c2cab9220be3c0ca22054ee4233f1396c7138ffd97a38ffbcea44377f47 \
    && curl --retry 7 --fail -vo /tmp/consul.zip "https://releases.hashicorp.com/consul/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_linux_amd64.zip" \
    && echo "${CONSUL_CHECKSUM}  /tmp/consul.zip" | sha256sum -c \
    && unzip /tmp/consul -d /usr/local/bin \
    && rm /tmp/consul.zip \
    && mkdir /config

# Install ContainerPilot
ENV CONTAINERPILOT_VERSION 2.7.2
RUN export CP_SHA1=e886899467ced6d7c76027d58c7f7554c2fb2bcc \
    && curl -Lso /tmp/containerpilot.tar.gz \
         "https://github.com/joyent/containerpilot/releases/download/${CONTAINERPILOT_VERSION}/containerpilot-${CONTAINERPILOT_VERSION}.tar.gz" \
    && echo "${CP_SHA1}  /tmp/containerpilot.tar.gz" | sha1sum -c \
    && tar zxf /tmp/containerpilot.tar.gz -C /usr/local/bin \
    && rm /tmp/containerpilot.tar.gz

# Copy list of server side dependencies
COPY package.json /tmp/package.json

# Install dependencies
RUN cd /tmp && npm install

# Copy dependencies libraries
RUN mkdir /app && cp -a /tmp/node_modules /app/

# Copy src files
COPY . /app/

# Use /app working directory
WORKDIR /app

# COPY ContainerPilot configuration
ENV CONTAINERPILOT_PATH=/etc/containerpilot.json
COPY containerpilot.json ${CONTAINERPILOT_PATH}
ENV CONTAINERPILOT=file://${CONTAINERPILOT_PATH}

# Expose http port
EXPOSE 80

# Run application
CMD ["npm", "start"]

Note: the CMD is overridden in the compose file so that ContainerPilot is run as PID 1.

Below is the part of the logs where profile is starting (moving out of the preStart once the db and mq are up and running). You can see the Node ID error in it.

profile_1  | db is healthly, moving on...
profile_1  | mq is healthly, moving on...
profile_1  | 2017/05/03 19:37:20 ==> Starting Consul agent...
profile_1  | 2017/05/03 19:37:20 ==> Consul agent running!
profile_1  | 2017/05/03 19:37:20            Version: 'v0.8.0'
profile_1  | 2017/05/03 19:37:20            Node ID: '7ba08ece-d74e-4b39-b5fb-c9b18a396864'
profile_1  | 2017/05/03 19:37:20          Node name: 'ba77cfe3461f'
profile_1  | 2017/05/03 19:37:20         Datacenter: 'dc1'
profile_1  | 2017/05/03 19:37:20             Server: false (bootstrap: false)
profile_1  | 2017/05/03 19:37:20        Client Addr: 127.0.0.1 (HTTP: 8500, HTTPS: -1, DNS: 8600)
profile_1  | 2017/05/03 19:37:20       Cluster Addr: 172.18.0.2 (LAN: 8301, WAN: 8302)
profile_1  | 2017/05/03 19:37:20     Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
profile_1  | 2017/05/03 19:37:20              Atlas: <disabled>
profile_1  | 2017/05/03 19:37:20
profile_1  | 2017/05/03 19:37:20 ==> Log data will now stream in as it occurs:
profile_1  | 2017/05/03 19:37:20
profile_1  | 2017/05/03 19:37:20     2017/05/03 19:37:20 [INFO] serf: EventMemberJoin: ba77cfe3461f 172.18.0.2
profile_1  | 2017/05/03 19:37:20     2017/05/03 19:37:20 [WARN] manager: No servers available
profile_1  | 2017/05/03 19:37:20     2017/05/03 19:37:20 [INFO] agent: Joining cluster...
profile_1  | 2017/05/03 19:37:20     2017/05/03 19:37:20 [ERR] agent: failed to sync remote state: No known Consul servers
profile_1  | 2017/05/03 19:37:20     2017/05/03 19:37:20 [INFO] agent: (LAN) joining: [consul]
consul_1   |     2017/05/03 19:37:20 [INFO] serf: EventMemberJoin: ba77cfe3461f 172.18.0.2
consul_1   |     2017/05/03 19:37:20 [INFO] consul: member 'ba77cfe3461f' joined, marking health alive
profile_1  | 2017/05/03 19:37:20     2017/05/03 19:37:20 [INFO] agent: (LAN) joined: 0 Err: 1 error(s) occurred:
profile_1  | 2017/05/03 19:37:20
profile_1  | 2017/05/03 19:37:20 * Failed to join 172.18.0.5: Member '5a27f0cf077b' has conflicting node ID '7ba08ece-d74e-4b39-b5fb-c9b18a396864' with this agent's ID
profile_1  | 2017/05/03 19:37:20     2017/05/03 19:37:20 [WARN] agent: Join failed: <nil>, retrying in 10s
consul_1   |     2017/05/03 19:37:20 [INFO] consul.fsm: EnsureRegistration failed: failed inserting node: node ID "7ba08ece-d74e-4b39-b5fb-c9b18a396864" for node "ba77cfe3461f" aliases existing node "5a27f0cf077b"
mq_1       | 2017/05/03 19:37:20     2017/05/03 19:37:20 [WARN] memberlist: ignoring alive message for 'ba77cfe3461f': Member 'ba77cfe3461f' has conflicting node ID '7ba08ece-d74e-4b39-b5fb-c9b18a396864' with this agent's ID
mq_1       | 2017/05/03 19:37:21     2017/05/03 19:37:21 [WARN] memberlist: ignoring alive message for 'ba77cfe3461f': Member 'ba77cfe3461f' has conflicting node ID '7ba08ece-d74e-4b39-b5fb-c9b18a396864' with this agent's ID
profile_1  |
profile_1  | > profile@0.0.1 start /app
profile_1  | > node index.js
profile_1  |
profile_1  | --- www server listering on port [80] --

To reply to your questions:

  • ContainerPilot is 2.7.2 in both mq and profile.
  • I do not mount volumes
  • I only tested this on Docker for Mac

If all the services are using Consul 0.7.5 instead of 0.8.0 the problem does not occur.


mterron commented May 4, 2017

You can pass "-disable-host-node-id" to the consul (>= 0.8.1) run command to make it generate a random node-id. It can't be reproduced on Triton, as it is a Consul-Docker issue.

Cheers


tgross commented May 17, 2017

Closing this, as it's a Consul issue.

tgross closed this as completed May 17, 2017
@EugenMayer

@tgross are there any issues you're referring to? I am very interested.


tgross commented May 18, 2017

Just the Consul mailing list issue that you opened. Use -disable-host-node-id as was advised there.
