etcd state stalls control plane bootstrapping on 3rd node #2591
Comments
Taking a look on the CAPV side.
Is it possible to get etcd logs (from all 3 members)?
Is it possible to test with
@vincepri I was testing with
@vincepri Seeing this on rc.3.
From the capi controller-manager:
The sync delays and errors for reads taking too long are a bit worrisome. What type of disks are backing the control plane instances? What does the networking look like between the instances?
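For what it's worth, one way to sanity-check the disks is the fio fsync benchmark the etcd docs point at; a rough sketch below (the test directory is an assumption, point it at whatever volume backs /var/lib/etcd):

```bash
# Rough fdatasync-latency check for the volume backing etcd; path is an assumption.
# etcd wants the 99th percentile of WAL fsync latency to stay well under 10ms.
mkdir -p /var/lib/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-disk-check
rm -rf /var/lib/etcd/fio-test
```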
RAID controllers?
Tried about 5 times today, got it working once where it deploys all 3 masters, but here are the logs for the failures.
First control plane: mycluster-b555r.log
Second control plane: mycluster-xts2n.log
Environment:
@johnlam90 What Kubernetes version are you using, out of curiosity?
@vincepri 1.16.7
This error on the second control plane is a bit concerning:
It looks like either the load balancer or the api-server for the first control plane Machine had an issue where it severed the connection early during that call. Could you provide additional logs from the host? Of particular interest would be the kubelet, api-server, and etcd logs for both the first and second control plane Machines.
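In case it helps collect those, a minimal sketch of what I'd run on each control plane host (assuming a containerd runtime and kubeadm static pods; adjust names and paths for your setup):

```bash
# kubelet runs as a systemd unit
journalctl -u kubelet --no-pager > kubelet.log

# kube-apiserver and etcd run as static pods; grab their container logs via crictl
crictl ps -a --name kube-apiserver -q | xargs -r -n1 crictl logs > kube-apiserver.log 2>&1
crictl ps -a --name etcd -q | xargs -r -n1 crictl logs > etcd.log 2>&1
```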
Checking in CAPV, I'm seeing different errors more frequently, and I'm still suspecting the HAProxyLoadBalancer. However, I can reproduce the original problem, repeatedly, in one situation. The steps for 100% reproducibility are:
and you can get it most of the time by applying Calico at any point. etcd leadership will move around until the etcd cluster is finally dead. These will be logged in the API server and etcd logs. Tests done on rc3, K8s 1.17.3.
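If anyone wants to watch the leadership flapping while it happens, a quick sketch (assuming etcdctl v3.4 on a control plane host and the default kubeadm-managed etcd cert paths):

```bash
# Poll member status every 2s; the IS LEADER column flips as leadership moves.
# Cert paths assume kubeadm-managed etcd; drop --cluster on older etcdctl versions.
watch -n 2 "ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --cluster -w table"
```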
is the error I see more often in CAPV, but I suspect the HAProxy config, as the version of the Dataplane API there does not support scraping. For completeness, the logs will look something like:
etcd logs on initial node:
etcd logs on node 2:
The read-only range request errors appear to be a red herring during sync; I/O stats seem OK:
Tests done on rc3, K8s 1.17.3.
The CAPZ issue seems fixed now (root cause was a misconfiguration in the template).
Can the issue be resolved if the Pod CIDR is modified? This obviously requires changing CALICO_IPV4POOL_CIDR and the kubeadm ClusterConfiguration.
@neolit123 Yeah, we had hard-coded the value in the CAPV template to reduce the number of variables since clusterctl doesn't support default values yet, but we will move to make it mandatory, since a large number of CAPV users will have clusters in 192.168.0.0/16. As for the other issue, that's already tracked in kubernetes-sigs/cluster-api-provider-vsphere#782. cc @johnlam90
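For anyone hitting the overlap, a hedged sketch of checking and avoiding it (the cluster name and 172.22.0.0/16 are only illustrative values; the point is that the pod CIDR in the Cluster object / kubeadm config and Calico's CALICO_IPV4POOL_CIDR must agree with each other and must not overlap the node network):

```bash
# 1. What network are the nodes actually on? (Calico defaults its pool to 192.168.0.0/16.)
#    Run against the workload cluster.
kubectl get nodes -o wide

# 2. What pod CIDR was the cluster created with? ("mycluster" is an example name.)
#    Run against the management cluster.
kubectl get cluster mycluster -o jsonpath='{.spec.clusterNetwork.pods.cidrBlocks}'

# 3. Apply Calico with a pool matching that (non-overlapping) CIDR, against the workload cluster.
#    The manifest URL and sed substitution are illustrative; layout varies by Calico version.
curl -sL https://docs.projectcalico.org/manifests/calico.yaml \
  | sed 's|192.168.0.0/16|172.22.0.0/16|g' \
  | kubectl apply -f -
```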
I actually just got this error again with CAPZ (with a fixed JoinConfiguration this time)
I really feel like there's got to be some kind of chaos test we can run on CAPD to induce this: overlapping IPs, missing kube-proxy, killing etcd nodes randomly. I know @randomvariable said that, in general, nobody has seen it there, but I'd venture to say that if something doesn't happen on CAPD (granted, it might require the correct stressor or scenario to simulate cloud latency), then we can't really call it a CAPI bug, and it has to be a provider triage.
I'll file a separate issue for this idea and see what others think.
This seems to have been fixed by v0.3.0 for me... still unsure what the exact root cause is 😕
We've fixed our HAProxy issue as well in kubernetes-sigs/cluster-api-provider-vsphere#823, so I would be happy to close.
Closing this issue. /close
@randomvariable: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks, with CAPV v6.0 and v0.3.0 this resolved my issue. Tried it about 5 times with no issues.
What steps did you take and what happened:
[A clear and concise description on how to REPRODUCE the bug.]
When specifying a 3-node control plane, on certain infrastructures, bootstrapping of the 3rd machine fails.
First reported by @CecileRobertMichon:
with KCP status:
What did you expect to happen:
3 node cluster bootstraps successfully
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Likely to be related to misdiagnosed kubernetes-sigs/cluster-api-provider-vsphere#782
Environment:
- Kubernetes version (use kubectl version): 1.17.3
- OS (e.g. from /etc/os-release):
/kind bug