Node unable to rejoin after failure #107

Closed
matt0x6F opened this issue Aug 26, 2018 · 22 comments · Fixed by presslabs/docker-orchestrator#8 or #359

@matt0x6F

matt0x6F commented Aug 26, 2018

For some context, I am using the MySQL Operator by Presslabs on Kubernetes, which uses this application. My Kubernetes nodes are preemptible, which means they can occasionally die (usually once a day).

I'm observing an interesting behavior with a cluster of three orchestrators. They all work well until one of the nodes dies; when a new one comes up, the other two appear to ignore it.

Here are some orchestrator logs:

2018-08-27 18:18:27.000 CDT Successfully pulled image "quay.io/presslabs/orchestrator:v3.0.11-r21"
2018-08-27 18:18:27.000 CDT Created container 
2018-08-27 18:18:27.000 CDT Started container
2018-08-27 18:18:37.000 CDT Readiness probe failed: HTTP probe failed with statuscode: 500

The failing health check goes on perpetually.

This is emitted from the node that restarted:

I  [martini] Completed 500 Internal Server Error in 7.805308ms
I  [martini] Started GET /api/raft-health for 10.8.33.1:48672
E  2018/08/27 23:18:36 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/27 23:18:36 [WARN] raft: Election timeout reached, restarting election
E  2018/08/27 23:18:35 [DEBUG] raft: Vote granted from 10.8.33.10:10008. Tally: 1
E  2018/08/27 23:18:35 [DEBUG] raft: Votes needed: 2
E  2018/08/27 23:18:35 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:35 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:34 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/27 23:18:34 [WARN] raft: Election timeout reached, restarting election
E  2018/08/27 23:18:32 [DEBUG] raft: Vote granted from 10.8.33.10:10008. Tally: 1
E  2018/08/27 23:18:32 [DEBUG] raft: Votes needed: 2
E  2018/08/27 23:18:32 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:32 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/27 23:18:30 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/27 23:18:30 [WARN] raft: Heartbeat timeout from "" reached, starting election
E  2018/08/27 23:18:29 [INFO] raft: Node at 10.8.33.10:10008 [Follower] entering Follower state (Leader: "")
E  2018/08/27 23:18:29 [INFO] raft: Restored from snapshot 15915-17741-1535409376687
E  2018-08-27 23:18:27 FATAL 2018-08-27 23:18:27 ERROR failed to open raft store: lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR failed to open raft store: lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host
E  2018-08-27 23:18:27 ERROR lookup mysql-operator-orchestrator-1.mysql-operator-orchestrator-headless on 10.11.240.10:53: no such host

This is emitted from the other nodes:

E  2018/08/28 00:48:48 [DEBUG] raft: Votes needed: 2
E  2018/08/28 00:48:48 [WARN] raft: Remote peer 10.8.32.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/28 00:48:48 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state
E  2018/08/28 00:48:48 [WARN] raft: Election timeout reached, restarting election
E  2018/08/28 00:48:48 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
E  2018/08/28 00:48:48 [DEBUG] raft: Failed to contact 10.8.30.6:10008 in 1h34m3.397022732s
E  2018/08/28 00:48:48 [DEBUG] raft: Failed to contact 10.8.30.6:10008 in 1h34m2.919864839s
I  [martini] Started GET /api/lb-check for 10.8.31.1:60068
E  2018/08/28 00:48:48 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
I  k8s.io update kube-system:cluster-autoscaler cluster-autoscaler {"@type":"type.googleapis.com/google.cloud.audit.AuditLog","status":{},"authenticationInfo":{"principalEmail":"cluster-autoscaler"},"requestMetadata":{"callerIp":"::1"},"serviceName":"k8s.io","methodName":"io.k8s.core.v1.endpoints.update","authorizationInfo":[{"resource":"core/v1/namespaces/kube-sys… k8s.io update kube-system:cluster-autoscaler cluster-autoscaler 
E  2018/08/28 00:48:47 [WARN] raft: Rejecting vote request from 10.8.33.10:10008 since we have a leader: 10.8.32.3:10008
I  2018-08-28T00:48:47,943449832+00:00 requests.cpu needs updating. Is: '', want: '100m'.
E  Error from server (NotFound): daemonsets.extensions "fluentd-gcp-v3.0.0" not found
I  2018-08-28T00:48:47,791328941+00:00 fluentd-gcp-scaling-policy not found in namespace kube-system, using defaults.
E  Error from server (NotFound): scalingpolicies.scalingpolicy.kope.io "fluentd-gcp-scaling-policy" not found
E  2018/08/28 00:48:47 [DEBUG] raft: Votes needed: 2
E  2018/08/28 00:48:47 [WARN] raft: Remote peer 10.8.31.3:10008 does not have local node 10.8.33.10:10008 as a peer
E  2018/08/28 00:48:47 [INFO] raft: Node at 10.8.33.10:10008 [Candidate] entering Candidate state

It seems like a node should be able to rejoin after failure, even if it's using a different IP address.

@matt0x6F changed the title from "FlushNontrivialResolveCacheToDatabase() called, but HostnameResolveMethod is default" to "Node unable to rejoin after failure" on Aug 28, 2018
@matt0x6F
Author

I opened this issue on the GitHub Orchestrator repo to correlate, as I'm not really sure where the actual failure is occurring here.

@matt0x6F
Author

Based on the response I got from @shlomi-noach, it seems that this is actually an operator issue. He described the recovery steps, which match my experience. Let me know your thoughts.

@matt0x6F
Author

Also, I was actually using a Helm export of the manifests (mainly because I didn't want to run Tiller). It seems that may not have been enough to get things running properly (although things appeared to work). I have not seen this issue recur since, so I'll keep you informed.

@matt0x6F
Author

matt0x6F commented Sep 3, 2018

This is actually still happening, albeit more rarely. I'll start up a dev cluster and see if I can get the operator to notice the failure and rectify it. It should be noted that I regularly kill instances in my cluster, so my situation may be a bit extreme :)

@tuapuikia

May I know how to do a rolling restart of the orchestrator?

@AMecea
Contributor

AMecea commented Sep 28, 2018

Hi @tuapuikia, you can do a rolling restart by setting a new annotation on the orchestrator StatefulSet with kubectl; Kubernetes will then do a rolling update of the pods, for example as sketched below.
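
A minimal sketch of such a restart from the command line (the StatefulSet name mysql-operator-orchestrator and the annotation key restarted-at are assumptions; adjust them to your deployment):

kubectl patch statefulset mysql-operator-orchestrator \
  -p '{"spec":{"template":{"metadata":{"annotations":{"restarted-at":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}}}}}'

Because the annotation lives on the pod template, changing it causes the StatefulSet controller to replace the pods one by one (with the default RollingUpdate strategy).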

@mattouille we know about those problems with orchestrator; we will try to fix them in the next version. Sorry for the late response, we are focused on rewriting the operator with kubebuilder. Also, if you have a fix for this issue, we can discuss it on Gitter.

@tuapuikia

Thank you for the reply.

@shlomi-noach

It's worth noting that there is another option that does not require a rolling restart upon node replacement: using RaftAdvertise, where a node can be reached via an "advertised" IP address (e.g. via a load balancer). In that case, you may remove an orchestrator node and provision a new one in its place (with a different IP), and as long as the new node answers on the advertised IP, you should be good to go.

See https://github.com/github/orchestrator/blob/master/docs/configuration-raft.md#nat-firewalls-routing

Also related is a discussion on vitessio/vitess#3665

@calind
Member

calind commented Oct 3, 2018

@shlomi-noach I've set RaftAdvertise to the FQDN of the node. Everything seems to work fine when doing a first deployment, but if an IP changes, raft fails. I think that orchestrator might not re-resolve the advertised node's DNS name.

I'm seeing this error:

2018/10/03 12:13:23 [WARN] raft: Remote peer 10.1.0.155:10008 does not have local node 10.1.0.158:10008 as a peer

@shlomi-noach

@calind could you please provide the following details:

  • orchestrator configuration (just the raft parts, mainly the peers and the advertise address) -- on all boxes
  • which peer did you remove and reprovision (I assume 10.1.0.158 is a new peer; I'm unsure whether this is the advertised address?)
  • What was the address of the previous (deprovisioned) peer?

@calind
Member

calind commented Oct 4, 2018

@shlomi-noach here are the details:

orchestrator.conf.json ({{ .Env.HOSTNAME }} gets replaced accordingly with orchestrator-0...2):

{
  "BackendDB": "sqlite",
  "Debug": false,
  "ListenAddress": ":3000",
  "MySQLTopologyCredentialsConfigFile": "/etc/orchestrator/orc-topology.cnf",
  "RaftBind": "{{ .Env.HOSTNAME }}.orchestrator-headless",
  "RaftAdvertise": "{{ .Env.HOSTNAME }}.orchestrator-headless",
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftEnabled": true,
  "RaftNodes": [
    "orchestrator-0.orchestrator-headless",
    "orchestrator-1.orchestrator-headless",
    "orchestrator-2.orchestrator-headless"
  ],
  "SQLite3DataFile": "/var/lib/orchestrator/orc.db"
}

Initial peer IPs:

  • 10.1.1.61 (orchestrator-0.orchestrator-headless)
  • 10.1.1.62 (orchestrator-1.orchestrator-headless)
  • 10.1.1.63 (orchestrator-2.orchestrator-headless) - This is the elected leader

After killing orchestrator-2.orchestrator-headless (the master):

  • 10.1.1.61 (orchestrator-0.orchestrator-headless) - This is the newly elected leader
  • 10.1.1.62 (orchestrator-1.orchestrator-headless)
  • 10.1.1.65 (orchestrator-2.orchestrator-headless) - This one cannot join the cluster

Error logs after killing the master:

  • orchestrator-0.orchestrator-headless (the newly elected leader):
raft: Failed to contact 10.1.1.63:10008 in 1m18.108505986s
raft: Rejecting vote request from 10.1.1.65:10008 since we have a leader: 10.1.1.61:10008 
  • orchestrator-1.orchestrator-headless:
raft: Rejecting vote request from 10.1.1.65:10008 since we have a leader: 10.1.1.61:10008
  • orchestrator-2.orchestrator-headless:
raft: Election timeout reached, restarting election
raft: Node at 10.1.1.65:10008 [Candidate] entering Candidate state
raft: Remote peer 10.1.1.62:10008 does not have local node 10.1.1.65:10008 as a peer
raft: Remote peer 10.1.1.61:10008 does not have local node 10.1.1.65:10008 as a peer
raft: Votes needed: 2
raft: Vote granted from 10.1.1.65:10008. Tally: 1 

@shlomi-noach

shlomi-noach commented Oct 4, 2018

@calind if I'm reading this right, your RaftBind and RaftAdvertise are the same. This undermines the intent of RaftAdvertise.
The idea is that RaftBind would have your actual box IP. RaftAdvertise would have a "public" IP, typically a load balancer/VIP. RaftNodes would list all the public IPs.

Then, if a box goes down and another takes its place, it would have a different RaftBind, but (courtesy of your own infrastructure, which is your own responsibility) it will be accessible via the same public IP (same RaftAdvertise).

Makes sense?
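
To make that concrete, a minimal sketch of the scheme described above, with purely illustrative addresses (10.1.1.61 stands for this box's actual IP; the 10.0.0.x addresses stand for stable advertised endpoints, such as load balancer/VIP addresses, that survive node replacement):

{
  "RaftEnabled": true,
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftBind": "10.1.1.61",
  "RaftAdvertise": "10.0.0.10",
  "RaftNodes": [
    "10.0.0.10",
    "10.0.0.11",
    "10.0.0.12"
  ]
}

Each box sets its own RaftBind and its own advertised address, while RaftNodes is the same list of advertised addresses on all three boxes. When a box is replaced, only its RaftBind changes; the surviving peers keep talking to the unchanged advertised address.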

@calind
Member

calind commented Oct 4, 2018

Yes, it does for the RaftBind/RaftAdvertise.

But the problem I see is with RaftNodes. If they are passed as a list of FQDNs, they are kept as a list of IPs resolved at orchestrator startup. So if one node changes its IP, the other two, which maintain quorum, will consider it an "outsider" and not let it join the cluster. My take is that if they are passed as a list of FQDNs, their IPs should be re-resolved, taking into account the DNS TTL, for example.

That way, when orchestrator-2 changes its IP, it would be accepted by orchestrator-0 and orchestrator-1 as a cluster member.

Another approach would be to have a shared "RaftID" and consider RaftNodes only for initial bootstrapping (first leader election). Any node providing the same "RaftID" should be able to join the cluster, and the other peers should update their peer list accordingly. I think this is what etcd does.

@shlomi-noach

Yes, it does for the RaftBind/RaftAdvertise.

Sorry, I'm not sure I understand what the "yes" implies.

My take is that if they are passed as a list of FQDNs, their IPs should be re-resolved, taking into account the DNS TTL, for example.

I'm wondering whether, given some time, this resolves itself?

Another approach would be to have a shared "RaftID" and consider RaftNodes only for initial bootstrapping (first leader election). Any node providing the same "RaftID" should be able to join the cluster, and the other peers should update their peer list accordingly. I think this is what etcd does.

The latest Consul code does the same. Unfortunately it also removes support for pre-defined cluster IPs and otherwise breaks other things. I don't plan to upgrade to that as yet.

@calind
Member

calind commented Oct 4, 2018

Sorry, I'm not sure I understand what the "yes" implies.

yes, the mechanics for RaftBind/RaftAdvertise make sense

I'm wondering whether, given some time, this resolves itself?

It doesn't recover after some time. It seems that the peer list gets set in stone in https://github.com/github/orchestrator/blob/eb7a3b642f6e0aa83a4257ae62441571cc14a292/go/raft/raft.go#L138

@shlomi-noach

True, the list is set in stone.

So if one node changes its IP, the other two, which maintain quorum, will consider it an "outsider" and not let it join the cluster. My take is that if they are passed as a list of FQDNs, their IPs should be re-resolved, taking into account the DNS TTL, for example.

I think (sorry if I'm wrong) that this is a continued misunderstanding about how RaftAdvertise works. As an example, let's take orchestrator on Kubernetes. This setup is known to work: Kubernetes load balances on the public IP and routes the communication to the explicit backend orchestrator node. There is no need for orchestrator to re-read its configuration, since the configuration is not expected to change on the two boxes that were not deprovisioned. Also, there is no need for orchestrator to re-resolve the DNS, because the traffic gets routed by the load balancer no matter what.

@calind
Member

calind commented Oct 5, 2018

The setup I'm referring to is on Kubernetes. There are 3 orchestrator pods managed by a StatefulSet and its corresponding headless Service.
Orchestrator HTTP is accessible via a Service (internal load balancer) which routes traffic to the orchestrator pods that pass the /api/raft-health HTTP check.

Initial state:
pod0: hostname: orchestrator-0.orchestrator-headless, ip: 10.1.1.61, follower
pod1: hostname: orchestrator-1.orchestrator-headless, ip: 10.1.1.62, follower
pod2: hostname: orchestrator-2.orchestrator-headless, ip: 10.1.1.63, leader
service: hostname: orchestrator, ip: 10.98.157.50, round-robins between orchestrator-0, orchestrator-1 and orchestrator-2.

As far as I understood, the correct raft configuration for this would be (no need for RaftAdvertise since internal/external IPs are the same):

{
...
  "RaftBind": "{{ .Env.HOSTNAME }}.orchestrator-headless",
  "RaftEnabled": true,
  "RaftNodes": [
    "orchestrator-0.orchestrator-headless",
    "orchestrator-1.orchestrator-headless",
    "orchestrator-2.orchestrator-headless"
  ]
}

@shlomi-noach is this correct?

@shlomi-noach

@calind sorry for the late response. What happens when a pod goes down and a new one takes its place? Say pod0 goes down:

  • Does the new pod grab the very same IP address 10.1.1.61 that the old pod0 had? In this case the config is good. Or,
  • Does the new pod get a new IP address, e.g. 10.1.1.66? That is the situation I'm familiar with, and in this case you need to use RaftAdvertise.

@calind
Member

calind commented Nov 2, 2018

@shlomi-noach the new pod gets a new IP. It's not clear what I should put in RaftNodes and RaftAdvertise in this case.

@OGKevin
Contributor

OGKevin commented Jun 24, 2019

I have submitted a fix for this issue 😊

calind pushed a commit that referenced this issue Jul 1, 2019
Fixes #107

This commit makes a Service for each pod by using the unique StatefulSet
pod name label. These Services ensure that there is a cluster IP
reserved for each pod. Raft uses these cluster IPs.

Orchestrator will proxy/route traffic to its leader, so the main
Service can be used as the entry point and all traffic will be routed to
the leader.

See: https://github.com/github/orchestrator/blob/master/docs/configuration-raft.md
See: presslabs/docker-orchestrator#8
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
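
For illustration, a minimal sketch of one such per-pod Service (one per orchestrator pod), assuming the pods are created by a StatefulSet and named orchestrator-0..2; the Service name and the ports (10008 for raft, 3000 for HTTP, matching the config earlier in this thread) are illustrative, and the actual manifests live in the referenced pull requests:

apiVersion: v1
kind: Service
metadata:
  name: orchestrator-0-svc
spec:
  # statefulset.kubernetes.io/pod-name is set automatically on pods created
  # by a StatefulSet, so this selector matches exactly one pod and gives it
  # a stable cluster IP.
  selector:
    statefulset.kubernetes.io/pod-name: orchestrator-0
  ports:
    - name: raft
      port: 10008
      targetPort: 10008
    - name: web
      port: 3000
      targetPort: 3000

With RaftNodes pointing at these per-pod Service names, a replacement pod with a new pod IP is still reachable at the same cluster IP, so the remaining peers accept it back into the raft cluster.
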
@imriss
Contributor

imriss commented Sep 18, 2019

Will there be an RC (like mysql-operator-0.3.0-rc.4) that includes these fixes in the Helm chart? Thanks

@AMecea
Contributor

AMecea commented Sep 18, 2019

@imriss, the version v0.3.2 is already published and contains the fix.

chapsuk pushed a commit to chapsuk/mysql-operator that referenced this issue Oct 16, 2023
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>