
Unable to upgrade from 1.29.x to 1.30.x #11500

Closed
mrBlackhat opened this issue Sep 3, 2024 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mrBlackhat

mrBlackhat commented Sep 3, 2024

What happened?

Running the upgrade (upgrade-cluster.yaml)
Failing at task:
Upgrade first master (1 retries left)
[upgrade/apply] Kubeadm | Upgrade first master FATAL: failed to create etcd client for external etcd: open /etc/ssl/etcd/ssl/node-master03.pem

What did you expect to happen?

To upgrade the master from 1.29 to 1.30

How can we reproduce it (as minimally and precisely as possible)?

run upgrade-cluster.yaml from 1.29 to 1.30
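
For example (a sketch; the exact kube_version value is an assumption, any 1.30.x target should reproduce it):

# Hypothetical invocation; kube_version is the only thing changed between runs.
ansible-playbook upgrade-cluster.yaml -i /path/to/my/inventory -e ansible_user=root -e kube_version=v1.30.4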

OS

PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Version of Ansible

ansible-playbook [core 2.16.10]
jinja version = 3.0.3

Version of Python

3.10.12

Version of Kubespray (commit)

e744a11

Network plugin used

weave

Full inventory with variables

[all]
etcd01 ansible_host=10.10.20.50 etcd_member_name=etcd01
etcd02 ansible_host=10.10.20.51 etcd_member_name=etcd02
etcd03 ansible_host=10.10.20.52 etcd_member_name=etcd03

master01 ansible_host=10.10.30.10
master02 ansible_host=10.10.30.11
master03 ansible_host=10.10.30.12

worker1 ansible_host=10.10.30.21
worker2 ansible_host=10.10.30.22
worker3 ansible_host=10.10.30.23
worker4 ansible_host=10.10.30.24
worker5 ansible_host=10.10.30.25

[kube_control_plane]
master01
master02
master03

[etcd]
etcd01
etcd02
etcd03

[kube_node]
worker1
worker2
worker3
worker4
worker5

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

Command used to invoke ansible

ansible-playbook upgrade-cluster.yaml -i /path/to/my/inventory -e ansible_user=root

Output of ansible run

Upgrade first master (1 retries left)
[upgrade/apply] FATAL: failed to create etcd client for external etcd: open /etc/ssl/etcd/ssl/node-master03.pem

Anything else we need to know

Not sure if related, but I see that when trying to upgrade master01, it searches for the master03 certificate in /etc/ssl/etcd/ssl. Since this is master01 and only the master01 certificate is present there, it fails. Any help would be appreciated.

@mrBlackhat mrBlackhat added the kind/bug Categorizes issue or PR as related to a bug. label Sep 3, 2024
@bogd
Contributor

bogd commented Sep 9, 2024

I am unable to reproduce this - I just went through a full upgrade of a cluster from 1.29.3 and 1.30.2 to v1.30.4 without any issues.
(the cluster was in a mixed version state due to #11350 , but I had both control plane and worker nodes left on 1.29.3 - all of them were upgraded successfully).

There is one major thing different on my cluster, though - I am running etcd on the control plane nodes (not on dedicated machines).

The step you mentioned actually calls kubeadm upgrade. I think you will need to provide a more complete log of the actions, and some details about the cluster configuration.

Was the cluster created using kubespray? Were there any significant configuration changes to the cluster (and/or etcd) between cluster creation and the upgrade attempt?
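
For example, rerunning the failing play with maximum verbosity should capture the full kubeadm invocation and error (a sketch, reusing the command from the report):

ansible-playbook upgrade-cluster.yaml -i /path/to/my/inventory -e ansible_user=root -vvvvv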

@mrBlackhat
Author

mrBlackhat commented Sep 9, 2024

@bogd Thank you for your answer. Yes, indeed, the cluster was created via Kubespray (we hadn't had any issues until now). The last thing I remember is that we tried to upgrade from 1.29 to 1.30, but there we hit kubernetes/kubeadm#3084 (which was merged recently and is now fixed). Our issue was that when we tried the upgrade, it went from 1.29 to 1.30 on the first master, then failed, and we had to downgrade back to 1.29. Since then we have been getting the error I mentioned above. I read the playbooks and saw that it calls kubeadm upgrade. What more logs shall I provide? I can run ansible with debug verbosity 5 and post the result if anything changes.

Update: I am now running the upgrade on another cluster (same version) which hadn't been impacted by kubernetes/kubeadm#3084. I will soon try the affected one and post more of the output.

@mrBlackhat mrBlackhat reopened this Sep 9, 2024
@bogd
Contributor

bogd commented Sep 9, 2024

@mrBlackhat - that is exactly the same issue I encountered ( #11350 references kubernetes/kubeadm#3084 ), and the reason why my cluster was stuck on two versions (v1.30.2 on the first control plane, and v1.29.3 on the other control plane and worker nodes). However, in my case the upgrade actually completed successfully using the most recent kubespray version.

A debug log from ansible will definitely help - and so will the results of running the upgrade on the other cluster.

[ Edit, because I just noticed - how did you perform the downgrade? In my case, I just left the cluster on multiple versions - a difference of one minor version is fine, as per the K8s version skew documentation ]

@mrBlackhat
Author

mrBlackhat commented Sep 9, 2024

@bogd Hi again. The downgrade was via Kubespray: I just changed kube_version from 1.30 to 1.29 and it downgraded fine. I then tried to run the upgrade from 1.29 to 1.30 (on a cluster which hadn't been updated while --config was a problem). The issue still persists: https://pastebin.com/6gSMjswY . I ran the upgrade (my master nodes are master01-03-test), and in the output of the playbook I still see that it tries to search for master03.pem (while searching a directory located on master01). The only thing I found is that after removing --config from the kubeadm task, someone said it becomes interactive, so --yes has to be passed to it to continue its normal behavior. Do I edit the task in the playbook to add the --yes flag?

Edit: I left the cluster in a weird state. After the failure, master01 was on 1.30 and the others were on 1.29, so I saw a kube version mismatch alarm in Prometheus, but it cleared once I downgraded. Then I tried the upgrade again and it still fails. As I said, I just ran the upgrade on a fresh (1.29) cluster which hadn't been upgraded while the --config option was problematic (so it has never seen a failed upgrade), but it is still failing, as in the task output I posted.

@mrBlackhat
Copy link
Author

mrBlackhat commented Sep 9, 2024

@bogd I am confused by this error: [master01-test] FATAL: failed to create etcd client for external etcd: open /etc/ssl/etcd/ssl/node-master03-test.pem: no such file or directory. Why does it search for the master03 pem file on master01, at /etc/ssl/etcd/ssl/node-master03-test.pem? When I search the directory I can see that there is only node-master01-test.pem (which seems correct), so why is it complaining about master03 when it is currently working on master01?

Edit: -vvvvv output of the task: https://pastebin.com/5WseLz7R

@bogd
Contributor

bogd commented Sep 9, 2024

@bogd I am confused by this error: [master01-test] FATAL: failed to create etcd client for external etcd: open /etc/ssl/etcd/ssl/node-master03-test.pem

There is something strange there indeed. kubeadm seems to be taking the certificate file name from the API server configuration (/etc/kubernetes/manifests/kube-apiserver.yaml). And you are right - even on my cluster, I see this:

root@name-01-master:~# less /etc/kubernetes/manifests/kube-apiserver.yaml 
apiVersion: v1
[...]
    - --etcd-cafile=/etc/ssl/etcd/ssl/ca.pem
    - --etcd-certfile=/etc/ssl/etcd/ssl/node-name-03-master.pem
    - --etcd-compaction-interval=5m0s
    - --etcd-keyfile=/etc/ssl/etcd/ssl/node-name-03-master-key.pem

On another kubespray-provisioned cluster that is still on v1.29, the manifest contains the correct certificate file names - so it looks like this was indeed the botched upgrade messing up some configuration files.

The issue was not visible on my end because my /etc/ssl/etcd/ssl directory contains all certificates for all nodes (not sure why, but that is another topic, for another issue :) ). So kubeadm was able to grab the "correct" -03 file and use it.

⚠️ Read the entire text below, this might be dangerous ⚠️
I would try to work around this by editing kube-apiserver.yaml with the correct certificate and key file names, waiting for the API server to be restarted on the node, and trying the upgrade again. Also, take a look at /etc/kubernetes/kubeadm-config.yaml and check for -03 references.

I can afford to do this, since it is only a test cluster, and worst case scenario is "I just have to delete it and recreate it". Not sure how safe this workaround is for a production cluster, so proceed with caution...
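
Something like this should surface any stale -03 references before and after the edit (a sketch; substitute your own node name):

# Hypothetical check; node-name-03-master is the stale name from my cluster.
grep -n 'node-name-03-master' /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/kubeadm-config.yaml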

(BTW, was the API server on that 01 node actually running? If the manifest had the incorrect file names, I would expect some issues with it...)

@mrBlackhat
Author

mrBlackhat commented Sep 9, 2024

@bogd Thanks for the reply. On my [master01-test] I see the correct files:

less /etc/kubernetes/manifests/kube-apiserver.yaml

apiVersion: v1
[...]
- --etcd-cafile=/etc/ssl/etcd/ssl/ca.pem
- --etcd-certfile=/etc/ssl/etcd/ssl/node-master01-test.pem
- --etcd-compaction-interval=5m0s
- --etcd-keyfile=/etc/ssl/etcd/ssl/node-master01-test-key.pem

The API server on master01 is running, yes, I checked; everything is running well. It is also correct on master02/03 in kube-apiserver.yaml, I see no wrong files there. The only confusion comes when Kubespray is working on master01 but searching for the master03 cert.

In my kube-apiserver.yaml on master01 I can see the files listed, but only for master01 (which I assume is correct, since the same file on master02 lists paths to master02's files). There are no master03 references anywhere on master01, which confuses me.

EDIT: The only difference between my setup and the default is that I have etcd on separate machines (not on the master nodes). When I list /etc/ssl/etcd/ssl (on my etcd01-test) I see all of the files: admin-*, node-*, client-*. Is it somehow possible that Kubespray confuses something and thinks that my master01 is my etcd01, so it searches for the files there instead of on the real etcd01?

Edit 2: Not sure if it is related in this case, but my etcd deployment type is etcd_deployment_type: host (as I understand it, host means etcd is deployed on a different host, not on the master nodes).

@bogd
Contributor

bogd commented Sep 9, 2024

One more possible source for the config (see here for details):

root@name-01-master:/# kubectl -n kube-system get cm kubeadm-config -o yaml
apiVersion: v1
data:
  ClusterConfiguration: |
...
    etcd:
      external:
        caFile: /etc/ssl/etcd/ssl/ca.pem
        certFile: /etc/ssl/etcd/ssl/node-name-03-master.pem
...
        keyFile: /etc/ssl/etcd/ssl/node-name-03-master-key.pem
    imageRepository: registry.k8s.io
...
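
A quick way to pull just the etcd certificate paths out of that ConfigMap (a sketch):

kubectl -n kube-system get cm kubeadm-config -o yaml | grep -E 'caFile|certFile|keyFile'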

@mrBlackhat
Author

mrBlackhat commented Sep 9, 2024

Yep, I just checked and it does seem wrong, maybe? Shouldn't it be master01.pem and master01-key.pem?

kubectl -n kube-system get cm kubeadm-config -o yaml

  external:
    caFile: /etc/ssl/etcd/ssl/ca.pem
    certFile: /etc/ssl/etcd/ssl/node-master03.pem

    keyFile: /etc/ssl/etcd/ssl/node-master03-key.pem 

Edit: After a few more checks, I see the following. The variables in this configmap come from

https://github.com/kubernetes-sigs/kubespray/blob/master/roles/kubernetes/control-plane/defaults/main/main.yml#L26C1-L28C58

kube_etcd_cacert_file: ca.pem
kube_etcd_cert_file: node-{{ inventory_hostname }}.pem
kube_etcd_key_file: node-{{ inventory_hostname }}-key.pem

And in this case I assume there is some loop going through the control-plane group, and since master03 is the last one in the group, that is how it ended up there. If I understand correctly, shouldn't

kube_etcd_cert_file: node-{{ inventory_hostname }}.pem

be changed to something like

kube_etcd_cert_file: node-{{ groups['etcd'][0] }}.pem

or similar, since inventory_hostname comes from the inventory itself? Not sure if it is related, but this caught my eye when following through the playbooks.
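
To illustrate the point, rendering that template once per host shows why the last host processed ends up in the cluster-wide config (a plain-shell stand-in for the Jinja behavior, not Kubespray code):

# Each host renders its own name into the template, so whichever
# control-plane node writes the ConfigMap last "wins".
for inventory_hostname in master01 master02 master03; do
  echo "certFile: /etc/ssl/etcd/ssl/node-${inventory_hostname}.pem"
done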

@bogd
Contributor

bogd commented Sep 9, 2024

Yep, I just checked and it does seem wrong, maybe? Shouldn't it be master01.pem and master01-key.pem?

Not really - unfortunately, the cluster-wide config (stored in the ConfigMap) is... cluster-wide, and as far as I can tell, K8s hasn't yet implemented per-node configs.

You are absolutely right in assuming that the config looks this way because of the last control plane node being processed.

What I cannot figure out is how kubespray normally works around this problem, and why I have all the certs for all the control plane nodes in my /etc/ssl/etcd/ssl (on both kubespray-deployed clusters):

# ls -la /etc/ssl/etcd/ssl
total 88
drwx------ 2 etcd root 4096 Sep  9 09:07 .
drwx------ 3 etcd root 4096 May 13  2023 ..
-rwx------ 1 etcd root 1704 Sep  9 09:06 admin-name-01-master-key.pem
-rwx------ 1 etcd root 1497 Sep  9 09:06 admin-name-01-master.pem
-rwx------ 1 etcd root 1704 Sep  9 09:07 admin-name-02-master-key.pem
-rwx------ 1 etcd root 1497 Sep  9 09:07 admin-name-02-master.pem
-rwx------ 1 etcd root 1704 Sep  9 09:07 admin-name-03-master-key.pem
-rwx------ 1 etcd root 1497 Sep  9 09:07 admin-name-03-master.pem
-rwx------ 1 etcd root 1704 May 13  2023 ca-key.pem
-rwx------ 1 etcd root 1111 May 13  2023 ca.pem
-rwx------ 1 etcd root 1704 Sep  9 09:07 member-name-01-master-key.pem
-rwx------ 1 etcd root 1497 Sep  9 09:07 member-name-01-master.pem
-rwx------ 1 etcd root 1704 Sep  9 09:07 member-name-02-master-key.pem
-rwx------ 1 etcd root 1497 Sep  9 09:07 member-name-02-master.pem
-rwx------ 1 etcd root 1704 Sep  9 09:07 member-name-03-master-key.pem
-rwx------ 1 etcd root 1497 Sep  9 09:07 member-name-03-master.pem
-rwx------ 1 etcd root 1704 Sep  9 09:07 node-name-01-master-key.pem
-rwx------ 1 etcd root 1497 Sep  9 09:07 node-name-01-master.pem
-rwx------ 1 etcd root 1704 Sep  9 09:07 node-name-02-master-key.pem
-rwx------ 1 etcd root 1497 Sep  9 09:07 node-name-02-master.pem
-rwx------ 1 etcd root 1704 Sep  9 09:07 node-name-03-master-key.pem
-rwx------ 1 etcd root 1497 Sep  9 09:07 node-name-03-master.pem

Edit: I think I have the answer to the second question - I am running etcd on the control-plane nodes, and all the certificates are copied to all the etcd nodes here.

Unfortunately, I do not have a cluster with external etcd that I can test with...

@mrBlackhat
Author

I have the exact same files (on my etcd01-test node), which seems fine to me, but etcd is external. I am still not sure why Kubespray is searching for those files on master01 when etcd is external; by my logic it should search for them there.

on my master01-test i have

master01-test #: ls -la /etc/ssl/etcd/ssl
-rwx------ 1 etcd root 1111 Mar 30  2023 ca.pem
-rwx------ 1 etcd root 1704 Sep  3 12:19 node-master01-test-key.pem
-rwx------ 1 etcd root 1468 Sep  3 12:19 node-master01-test.pem

But since etcd is deployed on different machines (not on the master nodes), I am assuming that Kubespray is searching for the file in the wrong place? When I have external etcd, shouldn't it check the etcd group and therefore those hosts' /etc/ssl/etcd/ssl folders?

@bogd
Contributor

bogd commented Sep 9, 2024

I am still not sure why Kubespray is searching for those files on master01 when etcd is external; by my logic it should search for them there.

Not kubespray - this seems to be kubeadm looking in the wrong place. What I have been trying to figure out above is whether this is an issue with kubeadm, or whether it is just relying on some wrong configs that it finds on the cluster.

@mrBlackhat
Author

mrBlackhat commented Sep 9, 2024

So when I inspect all of the masters:

Master01

master01-test:~# ls -la /etc/ssl/etcd/ssl/
total 20
drwx------ 2 etcd root 4096 Sep  3 12:19 .
drwx------ 3 etcd root 4096 Mar 30  2023 ..
-rwx------ 1 etcd root 1111 Mar 30  2023 ca.pem
-rwx------ 1 etcd root 1704 Sep  3 12:19 node-master01-test-key.pem
-rwx------ 1 etcd root 1468 Sep  3 12:19 node-master01-test.pem

Master02

master02-test:~# ls -la /etc/ssl/etcd/ssl/
total 28
drwx------ 2 etcd root 4096 Sep  3 12:19 .
drwx------ 3 etcd root 4096 Mar 30  2023 ..
-rwx------ 1 etcd root 1111 Mar 30  2023 ca.pem
-rwx------ 1 etcd root 1708 Mar 30  2023 node-master01-test-key.pem
-rwx------ 1 etcd root 1480 Mar 30  2023 node-master01-test.pem
-rwx------ 1 etcd root 1704 Sep  3 12:19 node-master02-test-key.pem
-rwx------ 1 etcd root 1468 Sep  3 12:19 node-master02-test.pem

Master03

master03-test~# ls -la /etc/ssl/etcd/ssl/
total 28
drwx------ 2 etcd root 4096 Sep  3 12:19 .
drwx------ 3 etcd root 4096 Mar 30  2023 ..
-rwx------ 1 etcd root 1111 Mar 30  2023 ca.pem
-rwx------ 1 etcd root 1708 Mar 30  2023 node-master01-test-key.pem
-rwx------ 1 etcd root 1480 Mar 30  2023 node-master01-test.pem
-rwx------ 1 etcd root 1704 Sep  3 12:19 node-master03-test-key.pem
-rwx------ 1 etcd root 1468 Sep  3 12:19 node-master03-test.pem

I am not exactly sure why they look like this: master01 has its own certificate, master02 has its own plus master01's, and so on.
I could try to manually copy the master01..03 .pem and -key.pem files into the /etc/ssl/etcd/ssl folder on every other master, and I guess the task would then continue, but I am not sure if it would break something else. I will continue to search through the playbooks. Is there anything I can post to further help the debugging?

EDIT:
So I did a bit more reading and found a few things.

If I understand it correctly, in manifest.go I can see this:

If I am using external etcd (as in my case), kubeadm will use whatever certificate is provided, since I think it comes from the config itself; and since in my case the config contains the last node processed, it tries to use that certificate (which is missing). So isn't this a Kubespray issue, since Kubespray creates the config? My point is: if Kubespray knows that master01 is my first master (I think I saw some set_fact that determines the first master), shouldn't it ensure that the config contains node-first-master.pem and not the last node of the loop?

And in this case, if I edit the configmap to point it to master01 like this:

  external:
    caFile: /etc/ssl/etcd/ssl/ca.pem
-   certFile: /etc/ssl/etcd/ssl/node-master03.pem
+   certFile: /etc/ssl/etcd/ssl/node-master01.pem
-   keyFile: /etc/ssl/etcd/ssl/node-master03-key.pem
+   keyFile: /etc/ssl/etcd/ssl/node-master01-key.pem

I guess it may do the upgrade.

@mrBlackhat
Author

mrBlackhat commented Sep 10, 2024

@bogd Some updates: I've inspected my master nodes and found this:

master01:~# cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep etcd
    - --etcd-certfile=/etc/ssl/etcd/ssl/node-master01.pem
    - --etcd-keyfile=/etc/ssl/etcd/ssl/node-master01-key.pem
master02:~# cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep etcd
    - --etcd-certfile=/etc/ssl/etcd/ssl/node-master02.pem
    - --etcd-keyfile=/etc/ssl/etcd/ssl/node-master02-key.pem
master03:~# cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep etcd
    - --etcd-certfile=/etc/ssl/etcd/ssl/node-master03.pem
    - --etcd-keyfile=/etc/ssl/etcd/ssl/node-master03-key.pem

So in kube-apiserver, every master is using its own certificate (which to me seems right).

But if I do kubectl get cm kubeadm-config -n kube-system -o yaml, I can see:

    etcd:
      external:
        caFile: /etc/ssl/etcd/ssl/ca.pem
        certFile: /etc/ssl/etcd/ssl/node-master03.pem
        endpoints:
        keyFile: /etc/ssl/etcd/ssl/node-master03-key.pem
 

So my obvious solution was to run kubectl edit cm kubeadm-config -n kube-system and change:

certFile: /etc/ssl/etcd/ssl/node-master03.pem
=>
certFile: /etc/ssl/etcd/ssl/node-master01.pem

And since the master01 certificate is present on every other master:

master01:~# ls /etc/ssl/etcd/ssl/
ca.pem  node-master01.pem  node-master01-key.pem
master02:~# ls /etc/ssl/etcd/ssl/
ca.pem  node-master01.pem  node-master01-key.pem node-master02.pem  node-master02-key.pem
master03:~# ls /etc/ssl/etcd/ssl/
ca.pem  node-master01.pem  node-master01-key.pem node-master03.pem  node-master03-key.pem

The upgrade went as expected; I successfully upgraded to 1.30.4.
I also ran reset.yaml and performed a clean install with cluster.yaml, then inspected the configmap, and it was pointing to the right master (master01). So I am only confused now about how (and when) master03 ended up in my config.
My two test cases were:

1. Clean install: the configmap was right, pointing to master01. I edited it by hand to point to master03 and successfully got the error [Fatal] Cannot create external etcd client '/etc/ssl/etcd/ssl/node-master03.pem' not found. Then I edited it back to master01 and the upgrade went through without any issues.
2. I then ran upgrade-cluster again (to downgrade to 1.29), then upgraded to 1.30 again, and there weren't any errors.

So I guess that somehow, during one of the upgrades, master03 ended up in my configmap, and that is where the confusion came from. I will do a few more tests, verify that everything is working, and maybe close the issue.

@mrBlackhat
Author

Indeed, the problem was that the configmap was somehow wrong. Not sure how exactly it happened, but it did: master03 ended up in there, and since that certificate file is missing from the other masters, the upgrade failed.

Copying the cert from master03 to master02 and master01 is not recommended.

The solution that worked was to edit the kubeadm-config configmap in kube-system to point it to master01
(certFile: /etc/ssl/etcd/ssl/node-master01.pem);
since master01's cert is present on every other master, the upgrade runs without issues.
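
For anyone hitting the same thing, a non-interactive equivalent of that kubectl edit would be something like this (a sketch; GNU sed assumed, review the output before applying it to a real cluster):

# Rewrites both certFile and keyFile from master03 to master01 in one pass.
kubectl -n kube-system get cm kubeadm-config -o yaml \
  | sed 's/node-master03\(-key\)\?\.pem/node-master01\1.pem/g' \
  | kubectl apply -f -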
