kube-apiserver 1.13.x refuses to work when first etcd-server is not available. #72102
/sig api-machinery |
/remove-sig api-machinery |
/sig api-machinery apologies, just had another look and it's indeed an api-machinery issue.
we are passing the server list straight into the etcd v3 client, which returns the error you reported. Not sure if it's by design. |
This is an etcdv3 client issue. See etcd-io/etcd#9949 |
/cc @jpbetz |
/assign @timothysc @detiber So live-updating a static pod manifest is typically not recommended; was this triggered via some other operation, or were you editing your static manifests? |
No pod manifest involved here. Just a group of etcd servers and a kube-apiserver. The issue appeared when we rebooted the first etcd node. |
I was able to repro this issue with the repro steps provided by @Cytrian. I also reproduced it with a real etcd cluster. As @JishanXing previously mentioned, the problem is caused by a bug in the etcd v3 client library (or perhaps the grpc library). The Vault project is also running into this: hashicorp/vault#4349. The problem seems to be that the etcd client uses the first node's address as the TLS server name (ServerName) when connecting to the other endpoints, so certificate verification against the remaining members can fail. An important thing to highlight is that when the first etcd server goes down, it also takes the Kubernetes API servers down, because they fail to connect to the remaining etcd servers. With that said, this all depends on what your etcd server certificates look like:
To reproduce the issue with a real etcd cluster:
Versions:
API server crash log: https://gist.github.com/alexbrand/ba86f506e4278ed2ada4504ab44b525b I was unable to reproduce this issue with API server v1.12.5 (n.b. this was a somewhat unscientific test: I tested by updating the image field of the API server static pod produced by kubeadm v1.13.2). |
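To separate a genuinely unhealthy etcd cluster from the client-side certificate problem described above, it can help to query the surviving members directly and to inspect which SANs the serving certificate carries. A minimal sketch, assuming hypothetical member names etcd-2/etcd-3.example.com and the kubeadm certificate layout; adjust endpoints and paths to your cluster:

  # Which SANs does the member's serving certificate actually contain?
  openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -text | grep -A1 'Subject Alternative Name'

  # Are the remaining members healthy on their own, independent of the apiserver?
  ETCDCTL_API=3 etcdctl endpoint health \
    --endpoints=https://etcd-2.example.com:2379,https://etcd-3.example.com:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
    --key=/etc/kubernetes/pki/apiserver-etcd-client.key

If etcd reports healthy here while the apiserver still crash-loops, that is consistent with the client-side ServerName behavior described in this comment.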
@liggitt ^ FYI. |
thank you for the investigation @alexbrand |
I believe I am running into this issue or at least something similar.
I see etcd and the API server going up and down. This cluster was created with … The logs from etcd show the following.
The logs from the api server show the following.
Is this the same? |
There are claims here that the bug is solved, but I am seeing evidence of it not being solved in our cluster:
Are we absolutely sure the etcd client fix made it into the release? I am testing v1.16.2. |
That bug is not fixed yet. The only fix was for IP address-only connections, not those using DNS names like this. We are waiting on #83968 for what will probably be Kubernetes version 1.16.3. The workaround I'm using today is to replace my etcd server certificates with ones that use a wildcard SAN for the members in the subdomain, rather than including the given machine's DNS name as a SAN. So far, it works. |
It was fixed for IP addresses, but not DNS names (DNS name issue is tracked in #83028). Additionally, part of the fix regressed IPv6 address handling (#83550). See https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.16.md#known-issues. These two issues have been resolved in master, and #83968 is open to pick them to 1.16 (targeting 1.16.3) |
@seh Would you please explain how to change the SAN on the etcd certificates?
|
I generate these certificates myself using Terraform's tls provider, so it's a matter of revising the arguments passed for the |
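@seh describes doing this with Terraform's tls provider. For illustration only, here is a rough openssl-based sketch of the same idea, issuing the etcd serving certificate with a wildcard SAN that covers the members' subdomain; the subdomain etcd.example.internal, the CA paths, and the validity period are assumptions, so adapt them to your PKI:

  # Hypothetical layout: members resolve as etcd-1.etcd.example.internal, etcd-2.etcd.example.internal, etc.
  cat > etcd-server.cnf <<'EOF'
  [req]
  distinguished_name = dn
  req_extensions = v3_req
  prompt = no
  [dn]
  CN = etcd-server
  [v3_req]
  keyUsage = critical, digitalSignature, keyEncipherment
  extendedKeyUsage = serverAuth, clientAuth
  subjectAltName = DNS:*.etcd.example.internal, IP:127.0.0.1
  EOF

  # Create the key and CSR, then sign with the existing etcd CA.
  openssl req -new -newkey rsa:2048 -nodes \
    -keyout etcd-server.key -out etcd-server.csr -config etcd-server.cnf

  openssl x509 -req -in etcd-server.csr \
    -CA /etc/kubernetes/pki/etcd/ca.crt -CAkey /etc/kubernetes/pki/etcd/ca.key \
    -CAcreateserial -days 365 \
    -extensions v3_req -extfile etcd-server.cnf \
    -out etcd-server.crt

Because every member presents a certificate matching *.etcd.example.internal, it no longer matters which member's name the client sends during the TLS handshake.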
Tried with Kubernetes 1.15.3 and with 1.16.2, but it's not working with either.
|
I have a similar observation to @yacinelazaar's, but with IP addresses. kube-apiserver log:
|
This fixes a big issue with apiserver <-> etcd interaction and mutual TLS, as described in [1] and [2]. [1]: https://github.com/etcd-io/etcd/releases/tag/v3.3.14 [2]: kubernetes/kubernetes#72102 Fixes #24
You should be fine with 1.16.2 and Etcd 3.3.15 now. I managed to get 3 masters running. |
Issue resolved in >v1.16.3 kubernetes/kubernetes#72102 (comment)
In my case, the apiserver has been repeating warnings about connecting to the external etcd cluster with TLS; log snippets follow.
My environment: … but I am not sure whether my issue is related to grpc. Any answer will be appreciated. |
COMPLETE BACKUP AND RESTORE PROCEDURE FOR ETCD

NOTE: Check that in the file "/etc/kubernetes/etcd.yml" the listen address and port are configured as shown below.

1. Take a snapshot (NOTE: etcdctl is a command normally found on the master):
   ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /tmp/snapshot-pre-boot.db

2. Restore the snapshot into a new data directory:
   ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --name=master --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --data-dir /var/lib/etcd-from-backup --initial-cluster=master=https://127.0.0.1:2380 --initial-cluster-token etcd-cluster-1 --initial-advertise-peer-urls=https://127.0.0.1:2380 snapshot restore /tmp/snapshot-pre-boot.db

3. Update the etcd static pod manifest to use the restored data:
   --data-dir=/var/lib/etcd-from-backup    ## update --data-dir to the new target location (used in the previous restore command)
   --initial-cluster-token=etcd-cluster-1  ## (used in the previous restore command)
   Also update the volumeMounts: and hostPath: entries so they point at the new data directory.

4. Verify:
   See if the container process is back: docker ps -a | grep etcd
   See if the cluster members have been recreated: ETCDCTL_API=3 etcdctl member list --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints=127.0.0.1:2379
   See if pods, deployments and services have been recreated: kubectl get pods,svc,deployments |
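Not part of the original procedure, but a snapshot can be sanity-checked before restoring it; a minimal sketch, assuming the same snapshot path and etcdctl API version used above:

  # Print hash, revision, total keys, and size of the snapshot file before restoring it.
  ETCDCTL_API=3 etcdctl snapshot status /tmp/snapshot-pre-boot.db --write-out=table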
"""" INSTALL KUBERNETES WITH KUBEADM """" !!! CHECK ALL INSTALLATION PREREQUISITES BEFORE INSTALLING kubernetes ----> https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/ !!! prerequisites: Check if it is already installed by running these commands on master nodes: kubectl kubeadm prerequisites: check which version of linux you have: --> cat /etc/os-release prerequisites: Letting iptables see bridged traffic prerequisites: Check required ports prerequisites: install docker on all nodes if not already installed --> https://kubernetes.io/docs/setup/production-environment/container-runtimes/ docker installation: sudo -i (Install Docker CE)Set up the repository:Install packages to allow apt to use a repository over HTTPSapt-get update && apt-get install -y Add Docker’s official GPG key:curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - Add the Docker apt repository:add-apt-repository Install Docker CEapt-get update && apt-get install -y containerd.io=1.2.13-2 docker-ce=5:19.03.113-0ubuntu-$(lsb_release -cs) docker-ce-cli=5:19.03.113-0ubuntu-$(lsb_release -cs) Set up the Docker daemoncat > /etc/docker/daemon.json <<EOF mkdir -p /etc/systemd/system/docker.service.d Restart Dockersystemctl daemon-reload per vedere se docker è attivo: systemctl status docker.service ############################end pre-requisites############################################## INSTALL KUBERNETES Return to the manual --> https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/ and install the following components: kubeadm: the command to bootstrap the cluster. kubelet: the component that runs on all of the machines in your cluster and does things like starting pods and containers. kubectl: the command line util to talk to your cluster. I report below the steps of the manual, do it on each node: sudo apt-get update && sudo apt-get install -y apt-transport-https curl systemctl daemon-reload Vedere la versione di kubeadm --> kubeadm version -o short we won't install "Configure cgroup driver" as you do when you don't have docker installed Go to the bottom of the link page mentioned above "What's next": will take you to the following link --> https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/ Give this command only on the master: do it with normal user kubeadm init Once installed, copy the output that appears and create the directories as follows on the master only: do it with normal user mkdir -p $HOME/.kube generate the token on the master --> "kubeadm token create --print-join-command" and then copy the output to all workers Still on the master, give the following command: kubectl get nodes ## you will see that the nodes are not active since the network for the nodes and pods has not been installed ENABLE THE NETWORK: user root: Enable the following file /proc/sys/net/bridge/bridge-nf-call-iptables to "1" for all CNI PLUGINS !!! 
by running this command below on all nodes: sysctl net.bridge.bridge-nf-call-iptables=1 Install the network on the master with the normal user kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')" Check again if the nodes are in the running state watch kubectl get nodes ################################################## Install the etcd and etcdctl command on master node cat /etc/kubernetes/manifests/etcd.yml | grep -i image ## see the etcd version on master, then download it with the website instructions below https://github.com/etcd-io/etcd/releases/ wget -q --show-progress --https-only --timestamping "https://github.com/etcd-io/etcd/releases/download/v3.4.13/etcd-v3.4.13-linux-amd64.tar.gz" tar -xvf etcd-v3.4.13-linux-amd64.tar.gz sudo mv etcd-v3.4.13-linux-amd64/etcd* /usr/local/bin/ cd /usr/local/bin/ chown root:root etcd* check that everything is working ETCDCTL_API=3 etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key |
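The daemon.json content is cut off in the comment above. For reference, a minimal sketch of the Docker daemon configuration that the upstream container-runtimes guide recommended at the time; treat the exact values as assumptions and compare them against the linked page:

  cat > /etc/docker/daemon.json <<EOF
  {
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
      "max-size": "100m"
    },
    "storage-driver": "overlay2"
  }
  EOF
  # Reload and restart so the settings take effect:
  mkdir -p /etc/systemd/system/docker.service.d
  systemctl daemon-reload
  systemctl restart docker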
Installation of the METRICS SERVER on a master without minikube: copy the link of the zip file. |
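The comment above stops at downloading the metrics-server archive. A minimal sketch of one common way to finish the installation; the release manifest URL and the possible need for --kubelet-insecure-tls are assumptions that depend on your cluster's kubelet certificates:

  # Deploy metrics-server from its published manifest (pick a release that matches your cluster version).
  kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

  # If the kubelet serving certificates are not signed by the cluster CA, the deployment
  # may also need the --kubelet-insecure-tls argument added to its container args.
  kubectl -n kube-system get deployment metrics-server

  # Verify that metrics are being served:
  kubectl top nodes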
Deploying the flannel network manually
Flannel can be added to any existing Kubernetes cluster, though it's simplest to add flannel before any pods using the pod network have been started. For Kubernetes v1.17+:
  kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

#############################################################

Quickstart for the Calico network on Kubernetes
mkdir -p $HOME/.kube
kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
kubectl create -f https://docs.projectcalico.org/manifests/custom-resources.yaml
Note: Before creating this manifest, read its contents and make sure its settings are correct for your environment. For example, you may need to change the default IP pool CIDR to match your pod network CIDR.
watch kubectl get pods -n calico-system
Note: The Tigera operator installs resources in the calico-system namespace. Other install methods may use the kube-system namespace instead.
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl get nodes -o wide
(output columns: NAME, STATUS, ROLES, AGE, VERSION, INTERNAL-IP, EXTERNAL-IP, OS-IMAGE, KERNEL-VERSION, CONTAINER-RUNTIME)

--------------

About installing calicoctl
calicoctl allows you to create, read, update, and delete Calico objects from the command line. Calico objects are stored in one of two datastores, either etcd or Kubernetes. The choice of datastore is determined at the time Calico is installed. Typically, for Kubernetes installations, the Kubernetes datastore is the default. You can run calicoctl on any host with network access to the Calico datastore as either a binary or a container. For step-by-step instructions, refer to the section that corresponds to your desired deployment.

Installing calicoctl as a Kubernetes pod:
- etcd datastore:
  kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl-etcd.yaml
- Kubernetes API datastore:
  kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml

You can then run commands using kubectl as shown below:
  kubectl exec -ti -n kube-system calicoctl -- /calicoctl get profiles -o wide
Or define an alias:
  alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
  calicoctl create -f - < my_manifest.yaml |
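Once the alias above is defined, it can be used much like the standalone calicoctl binary; a small illustrative sketch with read-only queries (resource names as documented by Calico):

  # Read-only queries go through the calicoctl pod via kubectl exec:
  calicoctl get nodes -o wide
  calicoctl get ippool -o wide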
We are using kops. Can someone help me with this? |
3.5 seems to have regressed on kubernetes/kubernetes#72102
How to reproduce the problem:
Set up a new demo cluster with kubeadm 1.13.1.
Create default configuration with
kubeadm config print init-defaults
Initialize cluster as usual with
kubeadm init
Change the --etcd-servers list in the kube-apiserver manifest to --etcd-servers=https://127.0.0.2:2379,https://127.0.0.1:2379, so that the first etcd node is unavailable ("connection refused"). The kube-apiserver is then not able to connect to etcd any more.
Last message:
Unable to create storage backend: config (&{ /registry [https://127.0.0.2:2379 https://127.0.0.1:2379] /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/etcd/ca.crt true 0xc000381dd0 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.2:2379: connect: connection refused)
kube-apiserver does not start.
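Not part of the original report, but a quick check that confirms the surviving member is still reachable while the apiserver fails with the mixed endpoint list; the certificate paths assume the kubeadm layout used above:

  # The live member answers when queried directly with only its own endpoint...
  ETCDCTL_API=3 etcdctl endpoint status \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
    --key=/etc/kubernetes/pki/apiserver-etcd-client.key
  # ...which points at the apiserver's etcd v3 client mishandling an endpoint list
  # whose first entry (127.0.0.2:2379) refuses connections.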
If I upgrade etcd to version 3.3.10, it reports an error
remote error: tls: bad certificate", ServerName ""
Environment:
I also experience this bug in an environment with a real etcd cluster.
/kind bug