Lock acquisition failing #14
@cleung2010 Could it just be a blip that happened during a leader election? Or is this a more permanent error you are getting?
Somehow the service on all the nodes in dc2 had stopped overnight, with the logs ending on that message. I started them back up and they seem to be running fine now. I'll go ahead and close this issue, and re-open if I observe a recurrence. Thanks for the quick reply!
@armon I am still seeing this issue. Here is a more comprehensive log:
Here is the upstart script for the consul-replicate service, which calls `consul lock`:
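(The original script wasn't captured in this thread; the following is a hypothetical sketch of such an upstart job. The job name, lock path, and `-prefix` value are all assumptions, not the reporter's actual configuration.)

```
# Hypothetical sketch of an upstart job wrapping consul-replicate in a
# Consul lock; names, paths, and the replication prefix are placeholders.
description "consul-replicate held under a Consul lock"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
exec consul lock locks/consul-replicate \
  consul-replicate -prefix "global@dc1"
```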
@cleung2010 The error "Check 'serfHealth' is in critical state" means that the node has a check in the critical state. This will prevent lock acquisition, because the cluster already suspects that node of having failed. If the lock was granted to a failed node, it would potentially deadlock the cluster. You need to ensure the networking setup is healthy and that running agents are not marked as critical.
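A minimal sketch of how to spot the condition described above: query the agent's node-health endpoint and look for a critical `serfHealth` check. The node name and the response below are made up for illustration; a live query would hit the local agent's HTTP API.

```shell
# A live query would be (node name is a placeholder):
#   curl -s http://127.0.0.1:8500/v1/health/node/dc2-server-1
# A canned response keeps this snippet self-contained.
resp='[{"Node":"dc2-server-1","CheckID":"serfHealth","Status":"critical"}]'

# If serfHealth is critical, lock acquisition will be refused.
if printf '%s' "$resp" | grep -q '"CheckID":"serfHealth","Status":"critical"'; then
  echo "serfHealth is critical: lock acquisition will be refused"
fi
```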
I just double-checked the ports on both the servers and the client, and they have the required TCP/UDP ports opened. The consul agent on the client is still running, even though the service that is using consul lock (i.e. consul-replicate) died.
This is the log from another consul agent that is running a different service, also using consul lock, that is experiencing the same issue. This agent also has all the necessary ports opened.
@armon these are the iptables rules for the server nodes:
It's good that you have inbound TCP and UDP open for 8301. Are there any rules that would affect outbound traffic on these machines? Also, could you please provide some more logs from before a node is marked as failed, so we can try to see what's going on? Thanks!
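For reference, a hedged sketch of the kind of rules under discussion, using Consul's default Serf LAN port (8301). The interface-agnostic form and chain names below are assumptions about a typical setup, not the reporter's actual ruleset:

```shell
# Serf LAN gossip on 8301 needs both protocols inbound:
iptables -A INPUT -p tcp --dport 8301 -j ACCEPT
iptables -A INPUT -p udp --dport 8301 -j ACCEPT
# Inspect the OUTPUT chain to confirm nothing drops outbound gossip:
iptables -L OUTPUT -n -v --line-numbers
```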
This is what I am getting from the logs of the service while it is running. @slackpad there shouldn't be anything blocking the outbound traffic, but let me double-check and see if I can grab some more logs.
Hi @cleung2010 - yeah we should take a look at the Consul logs on the agent node and the server to see when and why the cluster thinks the agent node is unhealthy.
@slackpad actually the consul-replicate service is running on the server node (so in this case the agent node is also the server?). I got some Consul logs from around the time the service stoppage was reported. Unfortunately, the consul-replicate logs didn't have timestamps (as seen from the previous comment's log). Side note: it would be nice if
@slackpad I was able to pinpoint the precise Consul logs for when the cluster lost its leader (thankful for email alerting!).
Hmm - it looks like TCP connections are dying from that last log. As a sanity check, can you take a look at this issue and see if you show any "skb rides the rocket" errors in syslog around the same time? hashicorp/consul#1154 (comment) The log before that shows memberlist pings failing, which are UDP and which I don't think are subject to the Xen bug, so I can't explain that yet.
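(For readers hitting that linked issue: the workaround commonly discussed for the Xen "skb rides the rocket" errors is disabling segmentation offloads on the affected interface. The interface name below is a placeholder, and whether these exact features apply should be verified against the linked thread before use:)

```shell
# Hedged sketch: turn off offloads implicated in the Xen bug; eth0 is a
# placeholder for the actual interface name.
ethtool -K eth0 tso off gso off sg off
```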
@slackpad I don't see any "skb rides the rocket" errors in syslog around that time.
@cleung2010 It's hard to tell from those logs whether the high restart rate is part of the cause, or just a symptom because the leadership is lost and it can't get the locks it needs. This part of the logs above makes me think there's still some networking problem at play:
The other option is that maybe the CPU is starved or something, so Consul isn't meeting its timing requirements. Do you have any other data about those nodes at the time of these events that we might be able to correlate? Also, if you can reproduce it, perhaps we could get some
Closing due to inactivity. Thanks!
I am running into the following issue:
The clusters are set up like so: 5 server nodes in dc1 and 5 in dc2. I am running multiple instances of consul-replicate through consul lock under upstart for high availability in dc2, which is trying to replicate a set of KVs from dc1. I verified leadership in both datacenters via the `/v1/status/leader` API call, and confirmed manually through `consul info` that the agent is indeed the leader.
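The leadership check described above can be sketched as follows. The `/v1/status/leader` endpoint returns a bare JSON string (the leader's address), or `""` when no leader is elected; the address below is made up so the snippet is self-contained.

```shell
# A live call would be:
#   curl -s http://127.0.0.1:8500/v1/status/leader
# Canned response for illustration (address is a placeholder):
leader='"172.17.0.2:8300"'

if [ "$leader" = '""' ]; then
  echo "no leader elected"
else
  echo "leader: $leader"
fi
```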