Disaster Recovery of Red Hat OpenShift Container Platform 4 from Expired Control-Plane Certificates

Written by Kerem ÇELİKER

  • LinkedIn: linkedin.com/in/keremceliker
  • Twitter: @CloudRss
  • Blog: www.keremceliker.com

Sometimes teams have to shut down an OpenShift cluster for planned work, and if this is not done properly, unexpected problems follow.

However, teams managing many container infrastructures often overlook one detail under their heavy workload: the core certificates. When a DevOps engineer/SRE forgets about them and they expire, the result is long periods of downtime and financial losses across customer-facing online systems.

  • When this happens, the issue must first be confirmed with the following commands.

[keremceliker@bastion ~]$ export KUBECONFIG=/2021GonnaBeGood/auth/kubeconfig

[keremceliker@bastion ~]$ oc get nodes  

Unable to connect to the server: x509: certificate has expired or is not yet valid
  • Check the logs for kubelet.service; you should see a certificate error like the one below:

[core@kcmaster01 ~]$ sudo journalctl -a --no-pager -u kubelet -b0 | grep "E" | head -10

Dec 24 09:35:03 kcmaster01 hyperkube[1680]: E1024 09:35:03.175362 1680 bootstrap.go:264]
Part of the existing bootstrap client certificate is expired: 2020-12-23 03:40:12 +0000 UTC
=====================================================
Dec 24 09:35:03 kcmaster01 hyperkube[1680]: E1024 09:35:03.193155 1680 certificate_manager.go:385]
Failed while requesting a signed certificate from the master: cannot create certificate signing request: Post https://api-int.keremceliker.com:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: EOF
  • Check which kubelet certificates have expired:
[core@kcmaster01 ~]$ sudo cat /var/lib/kubelet/pki/kubelet-client-current.pem | openssl x509 -noout -dates  

notBefore=Oct 22 15:37:00 2020 GMT  
notAfter=Oct 23 15:20:32 2020 GMT 
  • Check the validity of the bootstrap certificate used for kubelet authentication (a node-by-node sweep is sketched after this step):
[core@kcmaster01 ~]$ sudo cat /etc/kubernetes/kubeconfig | grep client-certificate-data | awk '{print $2}' | base64 -d | openssl x509 -noout -dates  

notBefore=Oct 22 15:37:00 2020 GMT 

notAfter=Oct 23 15:20:32 2020 GMT 
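If you want the same overview across every node at once, here is a minimal sketch (assuming passwordless SSH to the core user, as used throughout this article) that prints the kubelet client certificate expiry on each host:

[keremceliker@bastion ~]$ for node in core@kcmaster{01,02,03} core@kcworker{01,02,03}
do
echo "== $node =="
ssh $node sudo openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem
done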

If everything above confirms that the certificates have expired, you must follow these steps in order.

  • Log on to a master node (kcmaster01) as root to run a temporary recovery kube-apiserver, and follow the steps below exactly.
[keremceliker@bastion ~]$  ssh kcmaster01 

[core@kcmaster01 ~]$ sudo su - 
 
[root@kcmaster01 ~]$ RELEASE_IMAGE=quay.io/openshift-release-dev/ocp-release:4.6.0  

[root@kcmaster01 ~]$ KAO_IMAGE=$( oc adm release info --registry-config='/var/lib/kubelet/config.json' "${RELEASE_IMAGE}" --image-for=cluster-kube-apiserver-operator )
 
[root@kcmaster01 ~]$ podman pull --authfile=/var/lib/kubelet/config.json "${KAO_IMAGE}"
 
[root@kcmaster01 ~]$ podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver create
 
[root@kcmaster01 ~]$ export KUBECONFIG=/etc/kubernetes/static-pod-resources/recovery-kube-apiserver-pod/admin.kubeconfig  
 
[root@kcmaster01 ~]$ until oc get namespace kube-system 2>/dev/null 1>&2; do echo 'Waiting for recovery apiserver to come up.'; sleep 1; done 
  • Re-create the control-plane certificates by running this podman command on kcmaster01 as root:
[root@kcmaster01 ~]$ podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" regenerate-certificates
  • Force a redeployment so the new certificates are published to the nodes across all control planes (a way to watch the rollout follows these commands):
[root@kcmaster01 ~]$ oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge  
 
[root@kcmaster01 ~]$ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge  
 
[root@kcmaster01 ~]$ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge 
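To watch the forced redeployments make progress, you can compare each node's current and target static-pod revisions. This is a sketch; the field names are assumed from the OpenShift 4 static-pod operator status, and the same query works for kubecontrollermanager and kubescheduler:

[root@kcmaster01 ~]$ oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{": current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'

When current matches target on every node, that operator's rollout is complete.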
  • Create a new bootstrap kubeconfig by running the recover-kubeconfig.sh script (a quick sanity check follows):

[root@kcmaster01 ~]$ recover-kubeconfig.sh > /2021GonnaBeGood/kubeconfig
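Assuming the script wrote a standard kubeconfig, confirm that the new file points at your internal API endpoint before distributing it; for the cluster in this article that is https://api-int.keremceliker.com:6443:

[root@kcmaster01 ~]$ grep 'server:' /2021GonnaBeGood/kubeconfig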

  • Extract the CA certificate that kubelets use to verify connections from the API server (a way to inspect the bundle follows):
[root@kcmaster01 ~]$ oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' > /2021GonnaBeGood/kubelet-ca.crt  
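To make sure the extracted bundle actually contains certificates, you can list them with a generic openssl idiom (nothing OpenShift-specific here):

[root@kcmaster01 ~]$ openssl crl2pkcs7 -nocrl -certfile /2021GonnaBeGood/kubelet-ca.crt | openssl pkcs7 -print_certs -noout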
  • Distribute the newly created bootstrap kubeconfig and CA certificate to all nodes (a checksum check follows the loops):

[keremceliker@bastion ~]$ scp core@kcmaster01:/2021GonnaBeGood/kubeconfig .

[keremceliker@bastion ~]$ scp core@kcmaster01:/2021GonnaBeGood/kubelet-ca.crt .

[keremceliker@bastion ~]$ for node in core@kcmaster{01,02,03} core@kcworker{01,02,03}  
do  
if [[ ! "$node" =~ "kcmaster01" ]] ; then scp kubeconfig $node:/2021GonnaBeGood; fi  
ssh $node sudo cp /2021GonnaBeGood/kubeconfig /etc/kubernetes/kubeconfig  
done 
[keremceliker@bastion ~]$ for node in core@kcmaster{01,02,03} core@kcworker{01,02,03} 
do 
if [[ ! "$node" =~ "kcmaster01" ]] ; then scp kubelet-ca.crt $node:/2021GonnaBeGood; fi  
ssh $node sudo cp /2021GonnaBeGood/kubelet-ca.crt /etc/kubernetes/kubelet-ca.crt  
done 
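To confirm the files landed intact everywhere, a checksum comparison is a cheap sanity check; this sketch assumes sha256sum is available, which it is by default on RHCOS:

[keremceliker@bastion ~]$ sha256sum kubeconfig kubelet-ca.crt
[keremceliker@bastion ~]$ for node in core@kcmaster{01,02,03} core@kcworker{01,02,03}
do
ssh $node sudo sha256sum /etc/kubernetes/kubeconfig /etc/kubernetes/kubelet-ca.crt
done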
  • Create the machine-config-daemon-force file on all hosts by running the following command, so that the Machine Config Daemon accepts the new certificates on the nodes (an optional existence check follows):
[keremceliker@bastion ~]$ for node in core@kcmaster{01,02,03} core@kcworker{01,02,03}  

do  

ssh $node sudo touch /run/machine-config-daemon-force  

done 
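Optionally, verify that the flag file now exists on every host:

[keremceliker@bastion ~]$ for node in core@kcmaster{01,02,03} core@kcworker{01,02,03}
do
ssh $node ls -l /run/machine-config-daemon-force
done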
  • Restart the "kubelet.service" from Master nodes and remove all previous/former expired certificates:
[keremceliker@bastion ~]$ for node in core@kcmaster{01,02,03}

do  

ssh $node sudo systemctl stop kubelet  

ssh $node sudo rm -rf /var/lib/kubelet/pki /var/lib/kubelet/kubeconfig  

ssh $node sudo systemctl start kubelet  

done 
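Once kubelet starts with an empty /var/lib/kubelet/pki, it should immediately request new client certificates, and you can watch this in the journal. You should see certificate_manager messages like the ones quoted earlier, now without the expiry errors:

[keremceliker@bastion ~]$ ssh core@kcmaster01 "sudo journalctl -u kubelet -b0 --no-pager | grep -i certificate | tail -5"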
  • Restart the "kubelet.service" from Worker nodes and remove all previous/former expired certificates:
[keremceliker@bastion ~]$ for node in core@kcworker{01,02,03}  

do  

ssh $node sudo systemctl stop kubelet  

ssh $node sudo rm -rf /var/lib/kubelet/pki /var/lib/kubelet/kubeconfig  

ssh $node sudo systemctl start kubelet  

done 
  • Approve all new CSRs in Pending state by running the following commands so that the nodes receive their certificates. Once done, all cluster operators will be running perfectly again (:
[root@kcmaster01 ~]$ oc get csr   # or: watch "oc get csr"
[root@kcmaster01 ~]$ for id in `oc get csr | grep Pend | awk '{print $1}'`;  

do  

oc adm certificate approve $id;  

done  
  • You will need to do this several times, so wait a while for all the certificate requests to arrive.

Sometimes you need to run oc adm certificate approve <csr-name> repeatedly; a loop that does this for you is sketched below.
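A minimal sketch that keeps approving until no Pending requests remain (interrupt it once the nodes and operators look healthy):

[root@kcmaster01 ~]$ while oc get csr | grep -q Pending
do
oc get csr | awk '/Pending/ {print $1}' | xargs oc adm certificate approve
sleep 10
done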

  • Verify that the cluster is up and healthy again by running the following:

[root@kcmaster01 ~]# oc get co && oc get nodes -o wide

or, from the bastion:

[keremceliker@bastion ~]$ oc get co && oc get nodes -o wide
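Once everything is healthy, remember that the temporary recovery API server is still running on kcmaster01. Per the Red Hat disaster-recovery documentation referenced below, it can be removed with the same operator image:

[root@kcmaster01 ~]$ podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver destroy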

References

  • How to set up TLS client certificate bootstrapping for kubelets

https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/

  • Recovering from expired control plane certificates

https://access.redhat.com/documentation/en-us/openshift_container_platform/4.2/html/backup_and_restore/disaster-recovery#dr-scenario-3-recovering-expired-certs_dr-recovering-expired-certs

  • Configure Certificate Rotation for the Kubelet

https://kubernetes.io/docs/tasks/tls/certificate-rotation/

Result

See why Kubernetes loves OpenShift, or vice versa! 😉

Hope you learned from this article, even though the level is high, and enjoyed it! 🙏😊