[Release 1.30] Windows workloads cannot be deleted post upgrade, stuck in Terminating #6534

Closed
rbrtbnfgl opened this issue Aug 12, 2024 · 6 comments

@rbrtbnfgl
Contributor

backport for #5551

@mdrahman-suse
Contributor

mdrahman-suse commented Aug 20, 2024

@rbrtbnfgl I am still seeing the issue with v1.30.4-rc1+rke2r1

Test Steps:

  • Installed rke2 v1.30.3+rke2r1 on 1 Linux server, 1 Linux agent, and 1 Windows agent node
  • Ensured the cluster is up
  • Deployed a Windows workload
  • Exec'd into the Windows pod and ran nslookup
  • Observed that the virtual IP is there
  • Upgraded all the nodes to v1.30.4-rc1+rke2r1 manually
  • Ensured the cluster is up
  • Exec'd into the Windows pod and ran nslookup
  • Observed that the virtual IP is not there
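
The nslookup check in the steps above can be scripted. The following is a minimal sketch, not part of the original report: `classify_nslookup` is a hypothetical helper name, and it assumes the nslookup output has the same shape as the transcripts in this thread (a `Default Server:` line, with `UnKnown` in the failing case).

```shell
# Hypothetical helper: classify nslookup output read from stdin.
# Prints "resolver-ok" when a named default server is reported and
# "resolver-missing" when nslookup fell back to an unknown server.
classify_nslookup() {
  awk '
    { sub(/\r$/, "") }   # Windows tools may emit CRLF line endings
    /Default servers are not available/ { bad = 1 }
    /^Default Server:/ {
      seen = 1
      # "UnKnown" is what nslookup printed in the failing case in this thread
      if (tolower($3) == "unknown") bad = 1
    }
    END {
      if (seen && !bad) print "resolver-ok"
      else print "resolver-missing"
    }
  '
}

# Possible usage against a live pod (the pod name is a placeholder):
#   kubectl exec pod/<windows-pod> -- powershell.exe nslookup | classify_nslookup
```

Running the helper before and after the upgrade gives a one-word result per run, which is easier to compare than the raw output.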

Before upgrade

$ rke2 -v
rke2 version v1.30.3+rke2r1 (e742dc53b463d205773cfb25633309671ce6777c)
go version go1.22.5 X:boringcrypto

$ kubectl get nodes
NAME                                        STATUS   ROLES                       AGE     VERSION
ip-xxx-xx-2-1.us-east-2.compute.internal    Ready    control-plane,etcd,master   26m   v1.30.3+rke2r1
ip-xxx-xx-3-23.us-east-2.compute.internal   Ready    <none>                      24m   v1.30.3+rke2r1
ip-ac1f02eb                                 Ready    <none>                      18m   v1.30.3

$ kubectl get pods -A | grep win
default           pod/win-webserver-6678868fb5-c5bqh                                      1/1     Running     0              19m
default           pod/win-webserver-6678868fb5-n5f6w                                      1/1     Running     0              19m

$ kubectl exec -it pod/win-webserver-6678868fb5-c5bqh -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10

After upgrade

$ rke2 -v
rke2 version v1.30.4-rc1+rke2r1 (ba87c79317a1550fac473208e0b23a3d5fd2eeaf)
go version go1.22.5 X:boringcrypto

$ kgn
NAME                                        STATUS   ROLES                       AGE     VERSION
ip-xxx-xx-2-1.us-east-2.compute.internal    Ready    control-plane,etcd,master   1h20m   v1.30.4+rke2r1
ip-xxx-xx-3-23.us-east-2.compute.internal   Ready    <none>                      1h18m  v1.30.4+rke2r1
ip-ac1f02eb                                 Ready    <none>                      1h13m   v1.30.4

$ kgp | grep win
default           win-webserver-6678868fb5-c5bqh                                      1/1     Running     0             53m
default           win-webserver-6678868fb5-n5f6w                                      1/1     Running     0             53m


$ kubectl exec -it pod/win-webserver-6678868fb5-c5bqh -- powershell.exe
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

PS C:\> nslookup
*** Default servers are not available
Default Server:  UnKnown
Address:  127.0.0.1

Please advise

@mdrahman-suse
Contributor

@rbrtbnfgl It looks like the fix works with the flannel CNI but not with the Calico CNI.

With calico:

#6534 (comment)

With flannel:

Before upgrade

$ rke2 -v
rke2 version v1.30.3+rke2r1 (e742dc53b463d205773cfb25633309671ce6777c)
go version go1.22.5 X:boringcrypto

$ kgn
NAME                                          STATUS   ROLES                       AGE   VERSION
ip-xxx-xx-13-156.us-east-2.compute.internal   Ready    <none>                      30m   v1.30.3+rke2r1
ip-xxx-xx-3-69.us-east-2.compute.internal     Ready    control-plane,etcd,master   32m   v1.30.3+rke2r1
ip-ac1f2610                                   Ready    <none>                      28m   v1.30.3

$ kgp | grep win
default       win-webserver-6778785459-s8587                                       1/1     Running     0          12m

$ kubectl exec -it pod/win-webserver-6778785459-s8587 -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10

After upgrade

$ rke2 -v
rke2 version v1.30.4-rc1+rke2r1 (ba87c79317a1550fac473208e0b23a3d5fd2eeaf)
go version go1.22.5 X:boringcrypto

$ kgn
NAME                                          STATUS   ROLES                       AGE   VERSION
ip-xxx-xx-13-156.us-east-2.compute.internal   Ready    <none>                      41m   v1.30.4+rke2r1
ip-xxx-xx-3-69.us-east-2.compute.internal     Ready    control-plane,etcd,master   43m   v1.30.4+rke2r1
ip-ac1f2610                                   Ready    <none>                      39m   v1.30.4

$ kgp | grep win
default       win-webserver-6778785459-s8587                                       1/1     Running            0               23m

$ kubectl exec -it pod/win-webserver-6778785459-s8587 -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10

However, the upgrade caused the flannel and rke2-ingress-nginx controller pods to go into the Error / CrashLoopBackOff state:

ubuntu@ip-172-31-3-69:~$ kgp
NAMESPACE     NAME                                                                 READY   STATUS             RESTARTS        AGE
default       win-webserver-6778785459-s8587                                       1/1     Running            0               25m
kube-system   cloud-controller-manager-ip-172-31-3-69.us-east-2.compute.internal   1/1     Running            0               10m
kube-system   etcd-ip-172-31-3-69.us-east-2.compute.internal                       1/1     Running            0               44m
kube-system   helm-install-rke2-coredns-gv99b                                      0/1     Completed          0               9m53s
kube-system   helm-install-rke2-flannel-246kd                                      0/1     Completed          0               9m53s
kube-system   helm-install-rke2-ingress-nginx-wqzqx                                0/1     Completed          0               9m53s
kube-system   helm-install-rke2-metrics-server-2hnj2                               0/1     Completed          0               9m53s
kube-system   helm-install-rke2-snapshot-controller-crd-hdcqs                      0/1     Completed          0               9m53s
kube-system   helm-install-rke2-snapshot-controller-hdt2g                          0/1     Completed          0               9m53s
kube-system   helm-install-rke2-snapshot-validation-webhook-7lb8r                  0/1     Completed          0               9m53s
kube-system   kube-apiserver-ip-172-31-3-69.us-east-2.compute.internal             1/1     Running            0               10m
kube-system   kube-controller-manager-ip-172-31-3-69.us-east-2.compute.internal    1/1     Running            1 (10m ago)     10m
kube-system   kube-flannel-ds-sfc6z                                                0/1     CrashLoopBackOff   6 (3m30s ago)   9m41s
kube-system   kube-flannel-ds-zdkvs                                                0/1     CrashLoopBackOff   8 (2m33s ago)   9m19s
kube-system   kube-proxy-ip-172-31-13-156.us-east-2.compute.internal               1/1     Running            0               8m10s
kube-system   kube-proxy-ip-172-31-3-69.us-east-2.compute.internal                 1/1     Running            0               10m
kube-system   kube-scheduler-ip-172-31-3-69.us-east-2.compute.internal             1/1     Running            0               10m
kube-system   rke2-coredns-rke2-coredns-64dcf4f58b-vlt2p                           1/1     Running            0               43m
kube-system   rke2-coredns-rke2-coredns-6bb85f9dd8-22hlc                           0/1     Pending            0               9m42s
kube-system   rke2-coredns-rke2-coredns-6bb85f9dd8-qhvbt                           0/1     Running            0               9m41s
kube-system   rke2-coredns-rke2-coredns-autoscaler-7b9c797d64-glgnm                1/1     Running            0               9m42s
kube-system   rke2-ingress-nginx-controller-bmfnz                                  0/1     CrashLoopBackOff   6 (21s ago)     8m2s
kube-system   rke2-ingress-nginx-controller-thm6d                                  0/1     CrashLoopBackOff   6 (35s ago)     9m15s
kube-system   rke2-metrics-server-868fc8795f-tfhxv                                 1/1     Running            0               43m
kube-system   rke2-snapshot-controller-7dcf5d5b46-sqckq                            1/1     Running            1 (10m ago)     43m
kube-system   rke2-snapshot-validation-webhook-bf7bbd6fc-7lrlr                     1/1     Running            0               44m
$ k logs -n kube-system pod/kube-flannel-ds-sfc6z
Defaulted container "kube-flannel" out of: kube-flannel, install-cni-plugins (init), install-cni (init)
I0821 01:30:47.986053       1 main.go:211] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: version:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[] ifaceRegex:[] ipMasq:true ifaceCanReach: subnetFile:/run/flannel/subnet.env publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
W0821 01:30:47.986258       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0821 01:30:48.009934       1 kube.go:139] Waiting 10m0s for node controller to sync
I0821 01:30:48.010052       1 kube.go:469] Starting kube subnet manager
I0821 01:30:48.016282       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.1.0/24]
I0821 01:30:48.016347       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.0.0/24]
I0821 01:30:48.016359       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.2.0/24]
I0821 01:30:49.011027       1 kube.go:146] Node controller sync successful
I0821 01:30:49.011104       1 main.go:231] Created subnet manager: Kubernetes Subnet Manager - ip-172-31-3-69.us-east-2.compute.internal
I0821 01:30:49.011116       1 main.go:234] Installing signal handlers
I0821 01:30:49.011330       1 main.go:452] Found network config - Backend type: vxlan
...
I0821 01:30:49.021925       1 nftables.go:47] Starting flannel in nftables mode...
E0821 01:30:49.022154       1 main.go:353] no nftables support: could not find nftables binary: exec: "nft": executable file not found in $PATH
I0821 01:30:49.022335       1 main.go:432] Stopping shutdownHandler...
$ k logs -n kube-system rke2-ingress-nginx-controller-bmfnz
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.10.1-hardened2
  Build:         git-6c2923297
  Repository:    https://github.com/rancher/ingress-nginx
  nginx version: nginx/1.25.3

-------------------------------------------------------------------------------

W0821 01:33:01.566466       8 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0821 01:33:01.574103       8 main.go:205] "Creating API client" host="https://10.43.0.1:443"
$ k describe -n kube-system pod/rke2-ingress-nginx-controller-bmfnz
...
Events:
  Type     Reason             Age                   From               Message
  ----     ------             ----                  ----               -------
  Normal   Scheduled          10m                   default-scheduler  Successfully assigned kube-system/rke2-ingress-nginx-controller-bmfnz to ip-172-31-13-156.us-east-2.compute.internal
  Normal   Pulling            10m                   kubelet            Pulling image "rancher/nginx-ingress-controller:v1.10.1-hardened2"
  Normal   Pulled             10m                   kubelet            Successfully pulled image "rancher/nginx-ingress-controller:v1.10.1-hardened2" in 14.856s (14.856s including waiting). Image size: 293192075 bytes.
  Normal   Created            9m38s (x2 over 10m)   kubelet            Created container rke2-ingress-nginx-controller
  Normal   Started            9m38s (x2 over 10m)   kubelet            Started container rke2-ingress-nginx-controller
  Normal   Killing            9m38s                 kubelet            Container rke2-ingress-nginx-controller failed liveness probe, will be restarted
  Warning  FailedPreStopHook  9m38s                 kubelet            PreStopHook failed
  Normal   Pulled             9m38s                 kubelet            Container image "rancher/nginx-ingress-controller:v1.10.1-hardened2" already present on machine
  Warning  Unhealthy          5m38s (x39 over 10m)  kubelet            Readiness probe failed: Get "http://10.42.1.16:10254/healthz": dial tcp 10.42.1.16:10254: connect: connection refused
  Warning  BackOff            46s (x20 over 5m28s)  kubelet            Back-off restarting failed container rke2-ingress-nginx-controller in pod rke2-ingress-nginx-controller-bmfnz_kube-system(e688ab3f-d439-4e28-bd2e-02fc499cbb88)
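
The pod listing above can be filtered down to the failing pods. This is a rough sketch, assuming the default column order of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...); `failing_pods` is a hypothetical helper name.

```shell
# Read `kubectl get pods -A` output on stdin and print "namespace/name"
# for every pod whose STATUS column is CrashLoopBackOff or Error.
# Assumes the default column order: NAMESPACE NAME READY STATUS ...
failing_pods() {
  awk '$4 == "CrashLoopBackOff" || $4 == "Error" { print $1 "/" $2 }'
}

# Possible usage:
#   kubectl get pods -A | failing_pods
```

The header line is skipped implicitly because its fourth field is `STATUS`, which matches neither value.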

@brandond
Member

I0821 01:30:49.021925       1 nftables.go:47] Starting flannel in nftables mode...
E0821 01:30:49.022154       1 main.go:353] no nftables support: could not find nftables binary: exec: "nft": executable file not found in $PATH

This appears to be on the Linux node, not the Windows node? As far as I can tell, it is not related to the Windows changes...
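
The two log lines quoted above use the klog format, where the first character of each line is the severity (I, W, E or F). As a small sketch for pulling only the error and fatal lines out of a longer log (`klog_errors` is a hypothetical helper name):

```shell
# Keep only klog lines with severity E (error) or F (fatal).
# klog lines start with the severity letter followed by the month and day, e.g. "E0821 ".
klog_errors() {
  grep -E '^[EF][0-9]{4} ' || true   # do not fail when nothing matches
}

# Possible usage (pod name taken from the log command earlier in this thread):
#   kubectl logs -n kube-system pod/kube-flannel-ds-sfc6z | klog_errors
```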

@rbrtbnfgl
Contributor Author

rbrtbnfgl commented Aug 21, 2024

The fix only holds when you upgrade from one fixed version to another. If you upgrade from a version without the fix, the issue still occurs, and you need to restart the node to recover. You can easily reproduce the issue by restarting the service, which is similar to what happens when you upgrade the node. This fix is Calico-specific; if there are any issues with flannel, it would be better to open a new issue.
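
The rule described above (a node restart is needed when the version being upgraded from predates the fix) can be sketched as a version comparison. This is only an illustration: the helper names are hypothetical, the first fixed version is an assumption the caller has to supply, and only the numeric major.minor.patch triplet is compared, so pre-release and build suffixes such as `-rc1+rke2r1` are ignored.

```shell
# Reduce a version such as "v1.30.4-rc1+rke2r1" to "1 30 4".
# Pre-release and build suffixes are dropped, which is a simplification.
version_triplet() {
  printf '%s\n' "$1" | sed -E 's/^v?([0-9]+)\.([0-9]+)\.([0-9]+).*/\1 \2 \3/'
}

# Succeed when version $1 is at or above version $2 (numeric triplet compare).
version_at_least() {
  awk -v a="$(version_triplet "$1")" -v b="$(version_triplet "$2")" 'BEGIN {
    split(a, x, " "); split(b, y, " ")
    for (i = 1; i <= 3; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# Hypothetical helper: succeed when the node should be restarted after the upgrade.
# $1 = version upgraded FROM, $2 = first version ASSUMED to contain the fix.
needs_node_restart() {
  ! version_at_least "$1" "$2"
}

# Possible usage, assuming v1.30.4 is the first fixed version on the 1.30 line:
#   needs_node_restart v1.30.3+rke2r1 v1.30.4 && echo "restart the node after upgrading"
```

The comparison is only meaningful within a single release line, since each line would get the fix in its own patch release.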

@rbrtbnfgl
Contributor Author

I created a new issue for this flannel bug: #6601

@mdrahman-suse
Contributor

Validated with RC v1.30.4-rc1+rke2r1

Environment / Config

Ubuntu 22.04 (server and worker nodes)
Windows Server 2022 Datacenter

1 linux server, 1 linux worker, 1 windows worker
Windows workload winapp.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: win-webserver
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      containers:
        - name: windowswebserver
          image: mcr.microsoft.com/windows/servercore:ltsc2022
          command:
            - powershell.exe
            - -command
            - "<#code used from https://gist.github.com/19WAS85/5424431#> ; $$listener = New-Object System.Net.HttpListener ; $$listener.Prefixes.Add('http://*:80/') ; $$listener.Start() ; $$callerCounts = @{} ; Write-Host('Listening at http://*:80/') ; while ($$listener.IsListening) { ;$$context = $$listener.GetContext() ;$$requestUrl = $$context.Request.Url ;$$clientIP = $$context.Request.RemoteEndPoint.Address ;$$response = $$context.Response ;Write-Host '' ;Write-Host('> {0}' -f $$requestUrl) ;  ;$$count = 1 ;$$k=$$callerCounts.Get_Item($$clientIP) ;if ($$k -ne $$null) { $$count += $$k } ;$$callerCounts.Set_Item($$clientIP, $$count) ;$$ip=(Get-NetAdapter | Get-NetIpAddress); $$header='<html><body><H1>Windows Container Web Server</H1>' ;$$callerCountsString='' ;$$callerCounts.Keys | % { $$callerCountsString+='<p>IP {0} callerCount {1} ' -f $$ip[1].IPAddress,$$callerCounts.Item($$_) } ;$$footer='</body></html>' ;$$content='{0}{1}{2}' -f $$header,$$callerCountsString,$$footer ;Write-Output $$content ;$$buffer = [System.Text.Encoding]::UTF8.GetBytes($$content) ;$$response.ContentLength64 = $$buffer.Length ;$$response.OutputStream.Write($$buffer, 0, $$buffer.Length) ;$$response.Close() ;$$responseStatus = $$response.StatusCode ;Write-Host('< {0}' -f $$responseStatus)  } ; "
      nodeSelector:
        kubernetes.io/os: windows

Testing

  • Create an rke2 cluster with the setup mentioned
  • Ensure cluster comes up
  • Deploy Windows workload
  • Wait for the workload to be Running (~12m)
  • Exec into the workload and run nslookup
  • Log in to the Windows worker node and restart the rke2 service
  • Exec into the workload and run nslookup again
  • Validate that the responses before and after the restart match
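
The last step above, comparing the nslookup response before and after the restart, could be automated along these lines. This is a sketch under the assumption that both outputs were saved to files; `resolver_summary` and `resolver_matches` are hypothetical helper names.

```shell
# Reduce nslookup output on stdin to its server name and address,
# normalising CRLF line endings and column spacing.
resolver_summary() {
  tr -d '\r' | awk '
    /^Default Server:/ { print "server=" $NF }
    /^Address:/        { print "address=" $NF }
  '
}

# Succeed when the two captured outputs report the same resolver.
# $1 and $2 are files holding nslookup output from before and after the restart.
resolver_matches() {
  [ "$(resolver_summary < "$1")" = "$(resolver_summary < "$2")" ]
}

# Possible usage (file names are placeholders):
#   resolver_matches before.txt after.txt && echo "DNS unchanged" || echo "DNS changed"
```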

Replication

$ rke2 -v
rke2 version v1.30.3+rke2r1 (e742dc53b463d205773cfb25633309671ce6777c)
go version go1.22.5 X:boringcrypto
  • Nodes are up
$ kgn
NAME                                         STATUS   ROLES                       AGE   VERSION
ip-xxx-31-4-155.us-east-2.compute.internal   Ready    <none>                      37m   v1.30.3+rke2r1
ip-xxx-31-5-117.us-east-2.compute.internal   Ready    control-plane,etcd,master   39m   v1.30.3+rke2r1
ip-ac1f2610                                  Ready    <none>                      35m   v1.30.3
  • Windows pod is up
$ kgp | grep win
default           win-webserver-6778785459-csk4x                                        1/1     Running     0          10m
  • Exec into pod and nslookup (Before restarting rke2 in Windows)
$ kubectl exec -it pod/win-webserver-6778785459-csk4x -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10
  • Restart rke2 service in Windows
C:\usr\local\bin\rke2.exe --version
rke2.exe version v1.30.3+rke2r1 (e742dc53b463d205773cfb25633309671ce6777c)
go version go1.22.5

PS C:\Users\Administrator> Get-Service -Name rke2

Status   Name               DisplayName
------   ----               -----------
Running  rke2               rke2


PS C:\Users\Administrator> Restart-Service -Name rke2
  • Exec into pod and nslookup (After restarting rke2 in Windows)
$ kubectl exec -it pod/win-webserver-6778785459-csk4x -- powershell.exe nslookup
*** Default servers are not available
Default Server:  UnKnown
Address:  127.0.0.1

Validation

$ rke2 -v
rke2 version v1.30.4-rc1+rke2r1 (ba87c79317a1550fac473208e0b23a3d5fd2eeaf)
go version go1.22.5 X:boringcrypto
  • Nodes are up
$ kgn
NAME                                         STATUS   ROLES                       AGE   VERSION
ip-xxx-xx-13-66.us-east-2.compute.internal   Ready    control-plane,etcd,master   43m   v1.30.4+rke2r1
ip-xxx-xx-8-164.us-east-2.compute.internal   Ready    <none>                      41m   v1.30.4+rke2r1
ip-ac1f2610                                  Ready    <none>                      38m   v1.30.4
  • Windows pod is up
$ kgp | grep win
default           win-webserver-6778785459-g4rrk                                        1/1     Running     0             13m
  • Exec into pod and nslookup (Before restarting rke2 in Windows)
$ kubectl exec -it pod/win-webserver-6778785459-g4rrk -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10
  • Restart rke2 service in Windows
C:\usr\local\bin\rke2.exe --version
rke2.exe version v1.30.4-rc1+rke2r1 (ba87c79317a1550fac473208e0b23a3d5fd2eeaf)
go version go1.22.5
PS C:\Users\Administrator> Get-Service -Name rke2

Status   Name               DisplayName
------   ----               -----------
Running  rke2               rke2


PS C:\Users\Administrator> Restart-Service -Name rke2
  • Exec into pod and nslookup (After restarting rke2 in Windows)
$ kubectl exec -it pod/win-webserver-6778785459-g4rrk -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10

NOTE: The fix does not take effect when upgrading from a version that does not already include it.
