Consul join broken when containerized servers run on the same node #2877
Comments
@iamlittle If you're concerned about this happening, then you need to pass in a unique -node-id for each agent.
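For example (a minimal sketch, not from the original comment; the data dir and join address are illustrative), each container can generate its own UUID at startup and pass it via -node-id:

    # give every containerized agent its own node ID instead of the host's boot_id
    consul agent -server \
      -data-dir=/consul/data \
      -node-id="$(cat /proc/sys/kernel/random/uuid)" \
      -retry-join=X.X.X.X

Note that a fresh UUID on every start means the node ID changes across restarts, which is what the later comments about persisting it to the data dir address.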
@sean- Thanks! I was looking for something like that in the docs. Guess I missed it.
We will add a note to the docs and maybe even the error message to help people find the -node-id workaround.
@slackpad That sounds good, but I believe an option in consul to force the generation of the node id from another source would be very useful.
@mgiaccone
@iamlittle Thanks, I just solved it with the same command
@mgiaccone that's fair - depending on how many people bump into this we may need to add an option to generate a uuid internally - we've got the code in there, it's just a tradeoff on adding more config complexity.
Is it me or does the ...
This doesn't work for me. [update]
@slackpad These are long-running LXD containers which can stop and start over time. If I were to pass the node ID on the command line, it would change across restarts. For now, I have reverted to v0.7.5. Thanks and Regards,
... answering my own question ... 😄 As expected, the changing node ID was not a problem. For testing, if I restart the nodes (lxc containers) within a short span of time, I do see a message about it, but the node joins in successfully after the health checks, so for me things are working fine with v0.8.0 for now. Regards,
A "better" IMO way to set the node-id is with something like this:

    cat /proc/sys/kernel/random/uuid > "$CONSUL_DATA_DIR"/node-id

and then start your consul agent/server as per usual (pre 0.8) practice.
@mterron thanks! I will have to come up with a startup logic of "execute only once, if node-id file doesn't exist" in the init script and the systemctl equivalent, so that the node-id file gets generated only once! It's straightforward for the 14.04 upstart script, will check up on how to easily achieve this for the systemctl equivalent 😦 Thanks and Regards,
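A minimal sketch of that guard (CONSUL_DATA_DIR is assumed to point at the agent's data directory):

    # generate a node-id only once; subsequent restarts reuse the persisted file
    if [ ! -f "$CONSUL_DATA_DIR/node-id" ]; then
        cat /proc/sys/kernel/random/uuid > "$CONSUL_DATA_DIR/node-id"
    fi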
Changing this to enhancement - I think we should add a configuration to disable the host-based ID, which will make a random one if needed inside of Consul itself, and then save that to the data dir for persistence. This will make life easier for people trying to do this in Docker.
Thanks @slackpad
What's the scenario where you want consul to use the boot_id as node id? Generating a random node id by default seems more intuitive but I'm sure I'm missing something here. I mean, instead of having the -disable-host-node-id flag, I'd just add an -enable-host-node-id for the people that specifically need that behaviour.
@mterron Nomad uses the same host-based IDs so it's nice to have the two sync by default (you can see where a job is running and go to <node>.node.consul via Consul DNS kind of thing). It makes for some cool magic integration for applications like that, and in Consul you really don't want to be running two agents in the same cluster on the same host (unless you are testing or experimenting) so we made it opt-out for now.
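As an illustration of that lookup (not from the original comment; the node name and local agent address are made up), Consul's DNS interface listens on port 8600 by default:

    # resolve the address of the node named "web-1" via Consul DNS
    dig @127.0.0.1 -p 8600 web-1.node.consul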
I've never used Nomad so boot_id seemed like an arbitrary choice for a random identifier but it sort of makes sense from a Hashicorp ecosystem point of view. 2 lines on the documentation should be enough to explain the default behaviour so that users are not surprised. Something like: "By default Consul will use the machine boot_id (/proc/sys/kernel/random/boot_id) as the node-id. You can override this behaviour with the -disable-host-node-id flag or pass your own node-id using the -node-id flag." or something like that. Thanks for replying to a closed issue!
Hi @mterron we ended up adding something like that to the docs - https://www.consul.io/docs/agent/options.html#_node_id.
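A brief sketch of the opt-out discussed above (flag name per the linked docs; the data-dir path is illustrative):

    # skip the host-based (boot_id-derived) node ID; Consul generates a random one
    # and persists it in the data dir
    consul agent -server -data-dir=/consul/data -disable-host-node-id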
consul version for both Client and Server

Server:
Consul v0.8.0

consul info for both Client and Server

Server:
Operating system and Environment details
Ubuntu 16.04.1 LTS
Kubernetes 1.5
Description of the Issue (and unexpected/desired result)
Trying to join containerized Consul servers on the same machine will throw an error due to /proc/sys/kernel/random/boot_id being identical across all containers on a host.

Reproduction steps
Running Consul 0.8.0 in a 3-pod replica set on a single-node Kubernetes cluster (development machine).
Deployment definition
consul agent join X.X.X.X

throws the node ID error described above. I believe this to be a result of #2700. In any case, 0.8.0 could cause some serious problems in Kubernetes clusters if two Consul pods were to be scheduled on the same machine. This may not occur immediately.
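One way to confirm the boot_id collision described above (pod names are illustrative) is to compare the value across two pods on the same node:

    # both commands print the same UUID, because boot_id is per-host, not per-container
    kubectl exec consul-0 -- cat /proc/sys/kernel/random/boot_id
    kubectl exec consul-1 -- cat /proc/sys/kernel/random/boot_id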