Member has conflicting node ID #322
You could do as you're doing there. But it should be pulling the node ID from the hostname, which is going to be unique per container, isn't it?
Thanks Tim. What is the way to have the 'cat' correctly interpreted? Also, I do not understand why the node ID conflicts for one service in particular; I'll investigate.
The configuration interpolation doesn't run a subshell, so you'll need to inject it in some other way, for example by interpolating an environment variable.
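A minimal sketch of that idea, assuming ContainerPilot's `{{ .VAR }}` template syntax (the same form used later in this thread) and a hypothetical `CONSUL_NODE_ID` variable that was generated as a UUID and exported before ContainerPilot starts; the surrounding field names are placeholders and vary between ContainerPilot versions:

```json
{
  "consul": "localhost:8500",
  "coprocesses": [
    {
      "command": [
        "consul", "agent",
        "-node-id={{ .CONSUL_NODE_ID }}",
        "-data-dir=/var/lib/consul"
      ],
      "restarts": "unlimited"
    }
  ]
}
```

Note that, as pointed out further down in this thread, the interpolated value has to be something Consul will accept as a node ID, so a plain hostname is not enough.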
Yeah, I've never seen the problem you're experiencing. Is this on Triton?
@tgross FWIW I have the same issue without ContainerPilot, just plain Consul, in a Triton / SmartOS LX zone (CentOS image). I speculate that there is something in the LX zone implementation which causes the UUID generated as the node ID to always be the same for zones on the same CN.
@siepkes @lucj I'm having trouble figuring out how to replicate this on Triton. I've taken the blueprint in https://github.com/autopilotpattern/consul and run 3 nodes. And then I've added another instance to the cluster like so:
Note that I'm using an affinity filter here to ensure the 4th node lands on the same CN as the 1st node, and I can confirm this. @siepkes, do you have reproduction steps you can share? I must be missing something.
@tgross I haven't configured Docker in our private Triton instance, so unfortunately I can't run the test you describe. We use a combination of Packer, SaltStack and Terraform to create and deploy our images. I actually only have one LX zone image, because we needed V8 and that doesn't build on Illumos. I worked around the issue by having systemd generate a persistent unique node ID on first start:
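A minimal sketch of that kind of unit, assuming `uuidgen` is available in the image (the unit name and file paths are purely illustrative):

```ini
# /etc/systemd/system/consul-node-id.service (illustrative)
[Unit]
Description=Generate a persistent Consul node ID on first boot
Before=consul.service
# Only run if an ID has not already been generated on an earlier boot
ConditionPathExists=!/etc/consul/node-id

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'uuidgen > /etc/consul/node-id'

[Install]
WantedBy=multi-user.target
```

The Consul unit can then consume the generated value, for example through a small wrapper script that passes -node-id="$(cat /etc/consul/node-id)" to consul agent.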
The image I used with Consul in which I get this behavior is CentOS 7.
Ok, well we need to try to create a minimal reproduction here. What can you tell me about the topology of the clients/servers or the sequence of events, @siepkes?
Seems like this occurs with Consul 0.8.0, but it's fine with 0.7.5.
Seems to be fine when the local agents (the ones embedded in each service) are running Consul 0.8.1 and the server is running 0.7.5.
@lucj I encountered the issue when upgrading to 0.8. Because of the link to the Hashicorp issue I assumed it was clear that this is only an issue with Consul >= 0.8. Sorry, I should have been more explicit about that.
@siepkes No problem. I thought this would be fixed with 0.8.1 but it's not.
My bad, everything needs to be on a version < 0.8.0.
Good to know. Can you provide a link to the Hashicorp Consul issue where that might have been discussed and/or fixed?
In fact, I did not manage to get it working with 0.8.1.
Ok, I did a little digging and I think we can work around this. Try passing
You can go back with https://www.consul.io/docs/agent/options.html#_disable_host_node_id because -node-id={{ .HOSTNAME }} will not work due to
@EugenMayer that seems like the way to go then. We're also going to look into what might be generating the conflicting node IDs and try to suss out whether this is a Triton issue (which seems unlikely, as it's happening on Docker for Mac) or a Consul issue. (ref https://devhub.joyent.com/jira/browse/PRODSUP-16 for internal folks)
Probably kind of related topic: https://groups.google.com/forum/#!topic/consul-tool/9lm0HbyQVd4
@lucj I'm still unable to reproduce this, so it might help if we could get some more information from you about the conditions where you're seeing this problem.
@EugenMayer / @lucj: Those are unrelated. The background on those issues is that some providers use a timestamp-prefixed UUID, which makes the first 10 characters effectively useless. Hashing the full UUID, however, gives a sense of randomness to the first 10 characters. But Consul should, and does, properly detect node-ID collisions. If Consul is being seeded with a duplicate node ID, then this could happen. Consul is pointing out a discrepancy in the environment; where that discrepancy is, however, is what we're trying to figure out. CC @tgross
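To make the hashing point concrete, here is a small bash sketch with made-up UUIDs; the values and the use of sha256sum are purely illustrative and not what Consul does internally:

```bash
# Two node IDs that share a timestamp-derived prefix (made-up values)
a=59400000-1111-2222-3333-444444444444
b=59400000-5555-6666-7777-888888888888

# The raw prefixes collide...
echo "${a:0:10} ${b:0:10}"

# ...but hashing the full UUID spreads entropy into the leading characters
echo "$a" | sha256sum | cut -c1-10
echo "$b" | sha256sum | cut -c1-10
```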
@tgross, here is a Compose file that makes this error occur.
The service mq is based on https://github.com/lucj/autopilotpattern-rabbitmq, using Consul 0.8.0. The profile service is part of our application; it also uses Consul 0.8.0 and depends on the mq and db services. Below is the Dockerfile for this one:
Note: the CMD is overridden in the Compose file so that ContainerPilot runs as PID 1. Below is the part of the logs where profile is starting (moving out of the preStart once db and mq are up and running). You can see the node ID error in it.
To reply to your questions:
If all the services are using Consul 0.7.5 instead of 0.8.0, the problem does not occur.
You can pass "-disable-host-node-id" to the consul (0.8.1 and later) run command to make it generate a random node ID. It can't be reproduced on Triton, as it is a Consul-Docker issue. Cheers
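For reference, a minimal sketch of what that looks like on the agent command line; the data directory and join address below are placeholders:

```sh
# Generate a random node ID instead of deriving it from the host machine's
# UUID, avoiding collisions when several agents land on the same host
# (the flag is available in Consul 0.8.1 and later).
consul agent \
  -disable-host-node-id \
  -data-dir=/var/lib/consul \
  -retry-join=consul.example.internal
```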
Closing this, as it's a Consul issue.
@tgross are there any issues you refer to? I am very interested.
Just the Consul mailing list issue that you opened. Use -disable-host-node-id as described above.
I have several services running with ContainerPilot. For one of them, I get the following error when it runs, which prevents the local Consul agent from registering with the Consul server.
It seems this can be solved by fixing the node ID when running the Consul agent: hashicorp/consul#2877
How would you recommend setting the node-id in the consul subprocess?
Thanks a lot.