Node unable to rejoin after failure #107

Closed
matt0x6F opened this issue Aug 26, 2018 · 22 comments · Fixed by presslabs/docker-orchestrator#8 or #359

@matt0x6F

matt0x6F commented Aug 26, 2018

For some context, I am using the MySQL Operator by Presslabs on Kubernetes, which uses this application. My Kubernetes nodes are preemptible, which means they can occasionally die (usually once a day).

I'm observing an interesting behavior with a cluster of three orchestrators. They all work well until one of the nodes dies; when a new one comes up, the other two appear to ignore it.

Here are some orchestrator logs:

2018-08-27 18:18:27.000 CDT Successfully pulled image "quay.io/presslabs/orchestrator:v3.0.11-r21"
2018-08-27 18:18:27.000 CDT Created container 
2018-08-27 18:18:27.000 CDT Started container
2018-08-27 18:18:37.000 CDT Readiness probe failed: HTTP probe failed with statuscode: 500

The failing health check goes on perpetually.

This is emitted from the node that restarted:

I  [martini] Completed 500 Internal Server Error in 7.805308ms
I  [martini] Started GET /api/raft-health for 10.8.33.1:48672
E  2018/08/27 23:18:36 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/27 23:18:36 [WARN] raft: Election timeout reached, restarting election
E  2018/08/27 23:18:35 [DEBUG] raft: Vote granted from 10.8.33.10:10008. Tally: 1
E  2018/08/27 23:18:35 [DEBUG] raft: Votes needed: 2
E  2018/08/27 23:18:35 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:35 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:34 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/27 23:18:34 [WARN] raft: Election timeout reached, restarting election
E  2018/08/27 23:18:32 [DEBUG] raft: Vote granted from 10.8.33.10:10008. Tally: 1
E  2018/08/27 23:18:32 [DEBUG] raft: Votes needed: 2
E  2018/08/27 23:18:32 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:32 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:30 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/27 23:18:30 [WARN] raft: Heartbeat timeout from "" reached, starting election
E  2018/08/27 23:18:29 [INFO] raft: Node at 10.8.33.10:10008 [Follower] entering Follower state (Leader: "")
E  2018/08/27 23:18:29 [INFO] raft: Restored from snapshot 15915-17741-1535409376687
E  2018-08-27 23:18:27 FATAL 2018-08-27 23:18:27 ERROR failed to open raft store: lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR failed to open raft store: lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host

This is emitted from the other nodes:

E  2018/08/28 00:48:48 [DEBUG] raft: Votes needed: 2
E  2018/08/28 00:48:48 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/28 00:48:48 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/28 00:48:48 [WARN] raft: Election timeout reached, restarting election
E  2018/08/28 00:48:48 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
E  2018/08/28 00:48:48 [DEBUG] raft: Failed to contact 10.8.30.6:10008 in 1h34m3.397022732s
E  2018/08/28 00:48:48 [DEBUG] raft: Failed to contact 10.8.30.6:10008 in 1h34m2.919864839s
I  [martini] Started GET /api/lb-check for 10.8.31.1:60068
E  2018/08/28 00:48:48 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
I  k8s.io update kube-system:cluster-autoscaler cluster-autoscaler {"@type":"type.googleapis.com/google.cloud.audit.AuditLog","status":{},"authenticationInfo":{"principalEmail":"cluster-autoscaler"},"requestMetadata":{"callerIp":"::1"},"serviceName":"k8s.io","methodName":"io.k8s.core.v1.endpoints.update","authorizationInfo":[{"resource":"core/v1/namespaces/kube-sys… k8s.io update kube-system:cluster-autoscaler cluster-autoscaler 
E  2018/08/28 00:48:47 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
I  2018-08-28T00:48:47,943449832+00:00 requests.cpu needs updating. Is: '', want: '100m'.
E  Error from server (NotFound): daemonsets.extensions "fluentd-gcp-v3.0.0" not found
I  2018-08-28T00:48:47,791328941+00:00 fluentd-gcp-scaling-policy not found in namespace kube-system, using defaults.
E  Error from server (NotFound): scalingpolicies.scalingpolicy.kope.io "fluentd-gcp-scaling-policy" not found
E  2018/08/28 00:48:47 [DEBUG] raft: Votes needed: 2
E  2018/08/28 00:48:47 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/28 00:48:47 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state

It seems like a node should be able to rejoin after failure, even if it's using a different IP address.

@matt0x6F changed the title from "FlushNontrivialResolveCacheToDatabase() called, but HostnameResolveMethod is default" to "Node unable to rejoin after failure" on Aug 28, 2018
@matt0x6F
Author

I opened this issue on the GitHub Orchestrator repo to correlate, as I'm not really sure where the actual failure is occurring here.

@matt0x6F
Author

Based on the response I got from @shlomi-noach, it seems that this is actually an operator issue. He described the recovery steps, which match my experience. Let me know your thoughts.

@matt0x6F
Author

Also, I was actually using a Helm export of the manifests (mainly because I didn't want to run Tiller). It seems that may not have been enough to get things running properly (although things appeared to work). I have not seen this issue recur since, so I'll keep you informed.

@matt0x6F
Author

matt0x6F commented Sep 3, 2018

This is actually still happening, albeit more rarely. I'll start up a dev cluster and see if I can get the operator to notice the failure and rectify it. It should be noted that I regularly kill instances in my cluster, so my situation may be a bit extreme :)

@tuapuikia

May I know how to do a rolling restart of the orchestrator?

@AMecea
Contributor

AMecea commented Sep 28, 2018

Hi @tuapuikia, you can do a rolling restart by setting a new annotation on the orchestrator StatefulSet with kubectl; Kubernetes will then do a rolling update of the pods, for example as sketched below.
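
A minimal sketch of such a restart from the command line (the StatefulSet name mysql-operator-orchestrator and the annotation key restarted-at are assumptions; adjust them to your deployment):

kubectl patch statefulset mysql-operator-orchestrator \
  -p '{"spec":{"template":{"metadata":{"annotations":{"restarted-at":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}}}}}'

Because the annotation lives on the pod template, changing it causes the StatefulSet controller to replace the pods one by one (with the default RollingUpdate strategy).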

@mattouille we know about those problems with orchestrator; we will try to fix them in the next version. Sorry for the late response, we are focused on rewriting the operator with kubebuilder. Also, if you have a fix for this issue, we can discuss it on Gitter.

@tuapuikia

Thank you for the reply.

@shlomi-noach

It's worth noting that there is another option that does not require a rolling restart upon node replacement: using RaftAdvertise, where a node can be reached via an "advertised" IP address (e.g. via a load balancer). In that case, you may remove an orchestrator node and provision a new one in its place (with a different IP), and as long as the new node answers on the advertised IP, you should be good to go.

See https://github.com/github/orchestrator/blob/master/docs/configuration-raft.md#nat-firewalls-routing

Also related is a discussion on vitessio/vitess#3665

@calind
Member

calind commented Oct 3, 2018

@shlomi-noach I've set RaftAdvertise to the FQDN of the node. Everything seems to work fine when doing a first deployment, but if an IP changes, raft fails. I think that orchestrator might not re-resolve the advertised node's DNS name.

I'm seeing this error:

2018/10/03 12:13:23 [WARN] raft: Remote peer 10.1.0.155:10008 does not have local node 10.1.0.158:10008 as a peer

@shlomi-noach

@calind could you please provide the following details:

  • orchestrator configuration (just the raft parts, mainly the peers and the advertise address) -- on all boxes
  • which peer did you remove and reprovision (I assume 10.1.0.158 is a new peer; I'm unsure whether this is the advertised address?)
  • What was the address of the previous (deprovisioned) peer?

@calind
Member

calind commented Oct 4, 2018

@shlomi-noach here are the details:

orchestrator.conf.json ({{ .Env.HOSTNAME }} gets replaced accordingly with orchestrator-0...2):

{
  "BackendDB": "sqlite",
  "Debug": false,
  "ListenAddress": ":3000",
  "MySQLTopologyCredentialsConfigFile": "/etc/orchestrator/orc-topology.cnf",
  "RaftBind": "{{ .Env.HOSTNAME }}.orchestrator-headless",
  "RaftAdvertise": "{{ .Env.HOSTNAME }}.orchestrator-headless",
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftEnabled": true,
  "RaftNodes": [
    "orchestrator-0.orchestrator-headless",
    "orchestrator-1.orchestrator-headless",
    "orchestrator-2.orchestrator-headless"
  ],
  "SQLite3DataFile": "/var/lib/orchestrator/orc.db"
}

Initial peer IPs:

  • 10.1.1.61 (orchestrator-0.orchestrator-headless)
  • 10.1.1.62 (orchestrator-1.orchestrator-headless)
  • 10.1.1.63 (orchestrator-2.orchestrator-headless) - This is the elected leader

After killing orchestrator-2.orchestrator-headless (the master):

  • 10.1.1.61 (orchestrator-0.orchestrator-headless) - This is the newly elected leader
  • 10.1.1.62 (orchestrator-1.orchestrator-headless)
  • 10.1.1.65 (orchestrator-2.orchestrator-headless) - This one cannot join the cluster

Error logs after killing the master:

  • orchestrator-0.orchestrator-headless (the newly elected leader):
raft: Failed to contact 10.1.1.63:10008 in 1m18.108505986s
raft: Rejecting vote request from 10.1.1.65:10008 since we have a leader: 10.1.1.61:10008 
  • orchestrator-1.orchestrator-headless:
raft: Rejecting vote request from 10.1.1.65:10008 since we have a leader: 10.1.1.61:10008
  • orchestrator-2.orchestrator-headless:
raft: Election timeout reached, restarting election
raft: Node at 10.1.1.65:10008 [Candidate] entering Candidate state
raft: Remote peer 10.1.1.62:10008 does not have local node 10.1.1.65:10008 as a peer
raft: Remote peer 10.1.1.61:10008 does not have local node 10.1.1.65:10008 as a peer
raft: Votes needed: 2
raft: Vote granted from 10.1.1.65:10008. Tally: 1 

@shlomi-noach

shlomi-noach commented Oct 4, 2018

@calind if I'm reading this right, your RaftBind and RaftAdvertise are the same. This undermines the intent of RaftAdvertise.
The idea is that RaftBind would have your actual box IP. RaftAdvertise would have a "public" IP, typically a load balancer/VIP. RaftNodes would list all the public IPs.

Then, if a box goes down and another takes its place, it would have a different RaftBind, but (courtesy of your own infrastructure, which is your own responsibility) it will be accessible via the same public IP (same RaftAdvertise).

Makes sense?
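
To make that concrete, a minimal sketch of the scheme described above, with purely illustrative addresses (10.1.1.61 stands for this box's actual IP; the 10.0.0.x addresses stand for stable advertised endpoints, such as load balancer/VIP addresses, that survive node replacement):

{
  "RaftEnabled": true,
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftBind": "10.1.1.61",
  "RaftAdvertise": "10.0.0.10",
  "RaftNodes": [
    "10.0.0.10",
    "10.0.0.11",
    "10.0.0.12"
  ]
}

Each box sets its own RaftBind and its own advertised address, while RaftNodes is the same list of advertised addresses on all three boxes. When a box is replaced, only its RaftBind changes; the surviving peers keep talking to the unchanged advertised address.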

@calind
Member

calind commented Oct 4, 2018

Yes, it does for the RaftBind/RaftAdvertise.

But the problem I see is with RaftNodes. If they are passed as a list of FQDNs, they are kept as a list of IPs resolved at orchestrator startup. So if one node changes its IP, the other two, which maintain quorum, will consider it an "outsider" and not let it join the cluster. My take is that if they are passed as a list of FQDNs, their IPs should be re-resolved, taking into account the DNS TTL, for example.

That way, when orchestrator-2 changes its IP, it would be accepted by orchestrator-0 and orchestrator-1 as a cluster member.

Another approach would be to have a shared "RaftID" and consider RaftNodes only for initial bootstrapping (first leader election). Any node providing the same "RaftID" should be able to join the cluster, and the other peers should update their peer list accordingly. I think this is what etcd does.

@shlomi-noach

Yes, it does for the RaftBind/RaftAdvertise.

Sorry, I'm not sure I understand what the "yes" implies.

My take is that if they are passed as a list of FQDNs, their IPs should be re-resolved, taking into account the DNS TTL, for example.

I'm wondering whether, given some time, this resolves itself?

Another approach would be to have a shared "RaftID" and consider RaftNodes only for initial bootstrapping (first leader election). Any node providing the same "RaftID" should be able to join the cluster, and the other peers should update their peer list accordingly. I think this is what etcd does.

The latest Consul code does the same. Unfortunately it also removes support for pre-defined cluster IPs and otherwise breaks other things. I don't plan to upgrade to that as yet.

@calind
Member

calind commented Oct 4, 2018

Sorry, I'm not sure I understand what the "yes" implies.

yes, the mechanics for RaftBind/RaftAdvertise make sense

I'm wondering whether, given some time, this resolves itself?

It doesn't recover after some time. It seems that the peer list gets set in stone in https://github.com/github/orchestrator/blob/eb7a3b642f6e0aa83a4257ae62441571cc14a292/go/raft/raft.go#L138

@shlomi-noach

True, the list is set in stone.

So if one node changes its IP, the other two, which maintain quorum, will consider it an "outsider" and not let it join the cluster. My take is that if they are passed as a list of FQDNs, their IPs should be re-resolved, taking into account the DNS TTL, for example.

I think (sorry if I'm wrong) that this is a continued misunderstanding about how RaftAdvertise works. As an example, let's take orchestrator on Kubernetes. This setup is known to work: Kubernetes load balances on the public IP and routes the communication to the explicit backend orchestrator node. There is no need for orchestrator to re-read its configuration, since the configuration is not expected to change on the two boxes that were not deprovisioned. Also, there is no need for orchestrator to re-resolve the DNS, because the traffic gets routed by the load balancer no matter what.

@calind
Member

calind commented Oct 5, 2018

The setup I'm referring to is on Kubernetes. There are 3 orchestrator pods managed by a StatefulSet and its corresponding headless Service.
Orchestrator HTTP is accessible via a Service (internal load balancer) which routes traffic to the orchestrator pods that pass the /api/raft-health HTTP check.

Initial state:
pod0: hostname: orchestrator-0.orchestrator-headless, ip: 10.1.1.61, follower
pod1: hostname: orchestrator-1.orchestrator-headless, ip: 10.1.1.62, follower
pod2: hostname: orchestrator-2.orchestrator-headless, ip: 10.1.1.63, leader
service: hostname: orchestrator, ip: 10.98.157.50, round-robins between orchestrator-0, orchestrator-1 and orchestrator-2.

As far as I understood, the correct raft configuration for this would be (no need for RaftAdvertise since internal/external IPs are the same):

{
...
  "RaftBind": "{{ .Env.HOSTNAME }}.orchestrator-headless",
  "RaftEnabled": true,
  "RaftNodes": [
    "orchestrator-0.orchestrator-headless",
    "orchestrator-1.orchestrator-headless",
    "orchestrator-2.orchestrator-headless"
  ]
}

@shlomi-noach is this correct?

@shlomi-noach

@calind sorry for the late response. What happens when a pod goes down and a new one takes its place? Say pod0 goes down:

  • Does the new pod grab the very same IP address 10.1.1.61 that the old pod0 had? In this case the config is good. Or,
  • Does the new pod get a new IP address, e.g. 10.1.1.66? That is the situation I'm familiar with, and in this case you need to use RaftAdvertise.

@calind
Member

calind commented Nov 2, 2018

@shlomi-noach the new pod gets a new IP. It's not clear what I should put in RaftNodes and RaftAdvertise in this case.

@OGKevin
Contributor

OGKevin commented Jun 24, 2019

I have submitted a fix for this issue 😊

calind pushed a commit that referenced this issue Jul 1, 2019
Fixes #107

This commit makes a Service for each pod by using the unique StatefulSet
pod name label. These Services ensure that there is a cluster IP
reserved for each pod. Raft uses these cluster IPs.

Orchestrator will proxy/route traffic to its leader, so the main
Service can be used as the entry point and all traffic will be routed to
the leader.

See: https://github.com/github/orchestrator/blob/master/docs/configuration-raft.md
See: presslabs/docker-orchestrator#8
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
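
For illustration, a minimal sketch of one such per-pod Service (one per orchestrator pod), assuming the pods are created by a StatefulSet and named orchestrator-0..2; the Service name and the ports (10008 for raft, 3000 for HTTP, matching the config earlier in this thread) are illustrative, and the actual manifests live in the referenced pull requests:

apiVersion: v1
kind: Service
metadata:
  name: orchestrator-0-svc
spec:
  # statefulset.kubernetes.io/pod-name is set automatically on pods created
  # by a StatefulSet, so this selector matches exactly one pod and gives it
  # a stable cluster IP.
  selector:
    statefulset.kubernetes.io/pod-name: orchestrator-0
  ports:
    - name: raft
      port: 10008
      targetPort: 10008
    - name: web
      port: 3000
      targetPort: 3000

With RaftNodes pointing at these per-pod Service names, a replacement pod with a new pod IP is still reachable at the same cluster IP, so the remaining peers accept it back into the raft cluster.
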
@imriss
Contributor

imriss commented Sep 18, 2019

Will there be an RC (like mysql-operator-0.3.0-rc.4) that includes these fixes in the Helm chart? Thanks

@AMecea
Contributor

AMecea commented Sep 18, 2019

@imriss, the version v0.3.2 is already published and contains the fix.

chapsuk pushed a commit to chapsuk/mysql-operator that referenced this issue Oct 16, 2023
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>