Unable to add second control plane node w 1.17.4 #2072

Closed

briantopping opened this issue Mar 24, 2020 · 11 comments

Labels
triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@briantopping
Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version: &version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T21:01:11Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T21:03:42Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T20:55:23Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

docker info:

[root@gw01 cluster-gen]# docker info
Client:
 Debug Mode: false

Server:
 Containers: 22
  Running: 15
  Paused: 0
  Stopped: 7
 Images: 96
 Server Version: 19.03.8
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.20.3-1.el7.elrepo.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 36
 Total Memory: 125.7GiB
 Name: gw01.hndc1.example.com
 ID: N4G6:TUN5:YM2S:C2CL:QT7A:Z4K2:ZBKC:5RTE:UERH:TEEI:JZ2C:WLLK
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: briantopping
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

kubeadm init configuration:

apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: uhkufu.oumto0mu0c6uxlz6
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
localAPIEndpoint:
  advertiseAddress: 10.10.0.41
  bindPort: 6443
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  name: gw01.hndc1.example.com
#  taints:
#  - effect: NoSchedule
#    key: node-role.kubernetes.io/master
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  timeoutForControlPlane: 4m0s
  extraArgs:
    bind-address: 10.10.0.10
  certSANs:
  - "c1-api.hndc1.example.com"
  - "gw01.hndc1.example.com"
  - "gw02.hndc1.example.com"
  - "hndc1-master.kbjc.example.com"
  - "10.10.0.41"
  - "10.10.0.42"
  - "10.10.4.40"
  - "127.0.0.1"
etcd:
  local:
    serverCertSANs:
    - "gw01.hndc1.example.com"
    - "10.10.0.41"
    - "127.0.0.1"
    peerCertSANs:
    - "c1-api.hndc1.example.com"
    - "gw01.hndc1.example.com"
    - "gw02.hndc1.example.com"
    - "hndc1-master.kbjc.example.com"
    - "10.10.0.41"
    - "10.10.0.42"
    - "10.10.4.40"
    - "127.0.0.1"
certificatesDir: /etc/kubernetes/pki
clusterName: "hndc1-c1"
controllerManager: {}
controlPlaneEndpoint: "10.10.0.10:6443"
dns:
  type: CoreDNS
etcd:
  local:
    dataDir: /var/lib/etcd
imageRepository: k8s.gcr.io
kubernetesVersion: v1.17.4
networking:
  dnsDomain: c1.example.com
  podSubnet: 10.16.0.0/14
  serviceSubnet: 10.96.0.0/14
scheduler: {}

kubeadm join configuration:

apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
caCertPath: /etc/kubernetes/pki/ca.crt
discovery:
  bootstrapToken:
    apiServerEndpoint: 10.10.0.10:6443
    token: uhkufu.oumto0mu0c6uxlz6
    caCertHashes: ["sha256:c3ff46a0102c3fcdefad51fc84920a9921ce694f02a6426cda8443e4e2332bcb"]
    unsafeSkipCAVerification: true
  timeout: 5m0s
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  name: gw02.hndc1.example.com
  taints: null
controlPlane:
  localAPIEndpoint:
    advertiseAddress: "10.10.0.42"
  certificateKey: "9ba4916eb04f3cb38477daea6aac7320758254daef12e8b8a7e291479a980391"

What happened?

Installed the first node with the init configuration; after CNI install and taint removal, the node is running fine. Moving on to the second node with the join configuration, the node is created with kube-apiserver set to use etcd on localhost:

    - --etcd-servers=https://127.0.0.1:2379

This value is set in the kubeadm source.
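The rendered flag can be confirmed on the joining node (a quick check, assuming the default manifest directory):

grep etcd-servers /etc/kubernetes/manifests/kube-apiserver.yaml
#   - --etcd-servers=https://127.0.0.1:2379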

In the TLS bootstrap phase, the local etcd instance is contacted, but it isn't created until a later phase.

How did this code ever work?

What you expected to happen?

Since there's no way to start a local stacked etcd without the rest of the control plane being stable, it seems like kube-apiserver needs to be bootstrapped against the first node's etcd, then the local stacked etcd needs to be bootstrapped. Only then can the local kube-apiserver be pointed at etcd on localhost, as the code above does.

How to reproduce it (as minimally and precisely as possible)?

  1. Install master with above config
  2. Install second master with join config
  3. Wait for kubeadm TLS bootstrap phase to timeout

Anything else we need to know?

@neolit123
Member

hi, just to point out that most of the kubeadm e2e tests are multi-control-plane setups with stacked etcd:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm

we have users using kubeadm HA setups in production, so it must be an issue in your setup.

In the TLS bootstrap phase, the local etcd instance is contacted, but it isn't created until a later phase.

https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/#bootstrap-initialization

TLS bootstrap is the same for workers and CP nodes joining the cluster.
the local etcd instance can be started later when this node is promoted as CP.

How did this code ever work?

it still works, because in stacked etcd the CP node api-server only talks to the etcd member on the local node.

Install master with above config
Install second master with join config
Wait for kubeadm TLS bootstrap phase to timeout

can you show some error logs of what you are seeing from the second control-plane node?
kubelet logs or kubeadm logs with --v=5 should reveal more.
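for example, something along these lines (the config file name is just a placeholder):

# on the second control-plane node
journalctl -u kubelet --no-pager | tail -n 200
kubeadm join --config join-config.yaml --v=5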

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Mar 24, 2020
@briantopping
Author

briantopping commented Mar 24, 2020

we have users using kubeadm HA setups in production, so it must be an issue in your setup.

Yes, I should have prefaced with that. I'm also clear that there is some edge case my environment is triggering; it's just not clear what it is, and digging into the code has only produced more questions. I've even gone so far as to change the configuration of kube-apiserver (to point at the existing etcd endpoint) as the TLS bootstrap begins, to see if it would "break through" the logjam, but nothing so simple works (it was worth a shot).

middle stuff

That's all good information, thanks. I have been reading code, but not e2e. I am going to try a few additional things this morning, then start reading the e2e setups to see if I can at least replicate with the same parameters.

can you show some errors logs of what you are seeing from the second control-plane node?
kubelet logs or kubeadm logs with --v=5 should reveal more.

I get a lot of these in the kube-apiserver logs on the new CP node during TLS bootstrap:

W0324 03:27:38.620827       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.10.0.41:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.0.41:2379: connect: connection refused". Reconnecting...

EDIT: This is with me changing the kube-apiserver configuration as noted. It usually says 127.0.0.1. After getting some sleep, I see that this shouldn't be failing, but is. Maybe that's a problem...

Yes, I have been running kubeadm with --v=5, but I just remembered I could install kube-apiserver with the same using extraArgs and maybe get a better picture of why it wants to talk to etcd on localhost before it is set up. I'm wondering if there is an environmental aspect that is triggering a different code path, causing the TLS bootstrap to look locally.

As an aside, I imagine it must be possible to attach Delve to the running processes and get far better clarity on what's happening; could I do this with additional extraArgs? It seems fraught: there are a lot of timeouts on concurrent processes that would probably need to be extended or held as well. For instance, that log entry is in gRPC, so there's no context about how it got there.

I have a lot more information to provide over the next few hours, just wanted to fill in what I could now.

/triage needs-information

@neolit123
Member

10.10.0.41:2379: connect: connection refused"

i think you need to understand why this is happening. you can also try adding a member manually and joining it to the cluster using etcdctl.
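roughly something like this, run from the first control-plane node (the cert paths are the standard kubeadm locations and the member name/peer URL are just examples; adjust to your setup):

ETCDCTL_API=3 etcdctl --endpoints=https://10.10.0.41:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member list
ETCDCTL_API=3 etcdctl --endpoints=https://10.10.0.41:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member add gw02 --peer-urls=https://10.10.0.42:2380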

That's all good information, thanks. I have been reading code, but not e2e. I am going to try a few additional things this morning, then start reading the e2e setups to see if I can at least replicate with the same parameters.

that might not be that easy, because we are using the tool kinder, which does not allow passing a custom cluster configuration.

you might be able to use kind though. googling for "k8s kind" will give you the project page.
it's a local "docker in docker" type of deployment.
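for example (a minimal sketch; the config API version may differ depending on the kind release):

cat <<EOF > kind-ha.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
EOF
kind create cluster --config kind-ha.yaml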

I've even gone so far as to change the configuration of kube-apiserver (with the existing etcd endpoint) as the TLS bootstrap begins to see if it would "break through" the logjam, but nothing so simple works (it was worth a shot).

again, the TLS bootstrap should have already finished by the time you are seeing the etcd failure. it is a step prior to the etcd member join failure and i don't think it's related.

I have a lot more information to provide over the next few hours, just wanted to fill in what I could now.

i do not see a kubeadm related bug in this ticket. if you find one please provide the exact reproduction steps. we usually close support tickets after a few days as the kubeadm maintainers don't have the bandwidth...and usually delegate to the support channels and forums:

https://github.com/kubernetes/kubernetes/blob/master/SUPPORT.md

@briantopping
Author

again, the TLS bootstrap should have already finished when you are seeing the etcd failure. it is pre-step to the etcd member join failure and i don't think it's related.

Ah, I didn't parse that earlier, thanks.

i do not see a kubeadm related bug in this ticket. if you find one please provide the exact reproduction steps. we usually close support tickets after a few days as the kubeadm maintainers don't have the bandwidth...and usually delegate to the support channels and forums:

There's definitely a usability issue. Maybe one could call improving the notification of an install failure a "feature request", but it's still failing to install with an unknown cause. Some would call that a bug; it's just semantics. I'm not asking for support here and certainly didn't want to waste your time! But I am grateful for your explanation of when etcd is called.

I just got back from a workout, so I haven't started digging yet. I will see if I can put up a PR that solves the issue and attach it when I have that together, or otherwise resolve the issue once it's clear what's broken. Thanks again for your input!

@neolit123
Member

There's definitely a usability issue.

kubeadm exposes the means to customize a Kubernetes deployment, but this also allows users to deploy a broken setup in many different ways. piping the error messages from the components to the kubeadm output is a non-goal, so one has to look at what the components are reporting: the local etcd server, kubelet, api-server, etc.
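as a rough sketch, on a dockershim-based node like this one, that means something like:

journalctl -u kubelet --no-pager | tail -n 100
docker ps --filter name=k8s_kube-apiserver --format '{{.ID}}' | xargs -r docker logs --tail 100
docker ps --filter name=k8s_etcd --format '{{.ID}}' | xargs -r docker logs --tail 100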

@briantopping
Author

briantopping commented Mar 26, 2020

The problem I believe I have found is that the failure occurs when the bind/advertise interfaces are non-default. In that case, the init node must have bind-address explicitly specified so that it does not try to use the advertise address, which is managed by HAProxy/Keepalived. Using these two together causes kubeadm-config to store both of these addresses.

When the joining node comes online, it fetches the ClusterConfiguration from kubeadm-config and applies the extraArgs that were saved. This is wrong: either the bind-address should not have made it into the ClusterConfiguration, or the joining node should not import it. Because JoinConfiguration has no extraArgs, there's no way to override this.

I believe if the default interfaces were used, the init node would not require the use of bind-address, it would not be stored in the ClusterConfiguration, and the join node would fall back to the default interface.

Note that even if the init node did use the default interface, no join node could put the apiserver address on a secondary interface (there are no extraArgs, and the default interface is not the correct one). As such, there's no way to test this hypothesis simply by editing the stored kubeadm-config and removing the bind-address that was required to create the initial node on the non-default interface.
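For reference, the stored configuration is easy to inspect (assuming kubectl access to the cluster):

kubectl -n kube-system get configmap kubeadm-config -o yaml
# the ClusterConfiguration key shows the saved apiServer.extraArgs, including bind-address: 10.10.0.10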

@neolit123
Member

neolit123 commented Mar 26, 2020

When the joining node comes online, it fetches the ClusterConfiguration from kubeadm-config, and applies the extraArgs that were saved. This is wrong, the bind-address should either not have made it to ClusterConfiguration or the joining node should not import it. Because JoinConfiguration has no extraArgs, there's no way to override this.

the lack of configuration per joining node is certainly problematic and we have plans to extend that in the future with instance-specific configuration. it is not clear when or how this will happen, as it is difficult.

kubeadm has its ClusterConfiguration as its source of truth for all control plane members and it partially makes the naive assumption that control-plane nodes are replicas.

extraArgs:
  bind-address: 10.10.0.10

this will not work as a multi-node solution. bind-address is an instance-specific flag for the api-server.

you have the following options:

  • let kubeadm deploy its static pod manifests in /etc/kubernetes/manifests, then patch the files manually, which will cause a restart of the components (see the sketch after this list).
  • conform your network interfaces so that kubeadm / kubelet picks the right ones.
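a rough sketch of the first option, using the addresses from this thread (the kubelet recreates the static pod when the manifest file changes):

# on the second control-plane node, once kubeadm has written the manifest:
sed -i 's/--bind-address=10.10.0.10/--bind-address=10.10.0.42/' \
  /etc/kubernetes/manifests/kube-apiserver.yaml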

I believe if the default interfaces were used, the init node would not require the use of bind-address, it would not be stored in the ClusterConfiguration, and the join node would fall back to the default interface.

note that the kubelet always picks an IP from a public interface. this is hardcoded and there is no way to customize it.

closing as we are already tracking the instance specific work.
/close

let me know if you have further questions.

@k8s-ci-robot
Contributor

@neolit123: Closing this issue.

@briantopping
Author

briantopping commented Mar 26, 2020

note that the kubelet always picks an IP from a public interface. this is hardcoded and there is no way to customize it.

Again, if nothing else, "there's definitely a usability issue". Many would say that's bad security practice if the primary interface is on the public internet. It's also not documented anywhere.

closing as we are already tracking the instance specific work.

Where is this being tracked?

@neolit123
Copy link
Member

neolit123 commented Mar 26, 2020

Again, if nothing else "there's definitely a usability issue". Many would say that's bad security practice if the primary interface is on the public internet.

i do not disagree that it is a usability issue, for the kubeadm HA support.
this is one of the things we have to resolve before getting HA to GA (currently still Beta).

It's also not documented anywhere.

in terms of documentation, for the kubelet, the following is not documented and should be logged as a ticket in kubernetes/kubernetes and tagged as /sig node:

https://github.com/kubernetes/kubernetes/blob/562a420d86cbe3845835e89bcda77c1f4c35904d/pkg/kubelet/nodestatus/setters.go#L192-L198

related discussion
kubernetes/kubernetes#87244 (comment)

for the bind address of the API server:

https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/

The IP address on which to listen for the --secure-port port. The associated interface(s) must be reachable by the rest of the cluster, and by CLI/web clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces will be used.

kubeadm does not use --bind-address for the api-server; it uses --advertise-address, which kubeadm documents as:

The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.

Where is this being tracked?

you can watch this ticket:
#1682

it is not directly about ClusterConfiguration, but this is what we have about instance-specific configuration.
it should not apply only to KubeletConfiguration.

@briantopping
Author

Thanks @neolit123, that gets me through what I need to know about addressing. I understand now why it's not worth fixing the docs when ComponentConfig is nearly complete and going to deprecate everything. I apologize if that was frustrating; it's been a couple of weeks of dead ends here, so I get it.
