Windows workloads cannot be deleted post upgrade, stuck in `Terminating` #5551

HarrisonWAffel · 2024-03-05T16:19:59Z

Environmental Info:
RKE2 Version: v1.25.16, upgrading to v1.26 - seems to reproduce for upgrades between any version

rke2 version v1.25.16+rke2r1 (3fe54b9)
go version go1.20.11 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Server: Linux haffel-testing-linux-server-0 5.4.0-1109-azure #115~18.04.1-Ubuntu SMP Mon May 22 20:06:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Windows Worker: Microsoft Windows [Version 10.0.20348.2322] (2022 Azure datacenter)

Cluster Configuration:

1 server, all roles
1 windows worker

Describe the bug:

Windows workloads do not delete properly after an rke2 upgrade. For instance, deploying a simple IIS web server via a deployment on 1.25, upgrading to 1.26, and then deleting a pod spawned from that deployment results in the pod never completely terminating. kubectl describe pod shows the following error message:

error killing pod: failed to "KillPodSandbox" for "18d4a5fe-e570-4257-a3f4-0c6cb54f3a79" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to remove network namespace for sandbox \"8a0c4392b09910f8a944ab51a7639185e829c884a568e4a3d9bacae43db63d97\": hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}

This issue was encountered while debugging rancher/rancher#42414, but it reproduces for workloads other than the Rancher monitoring chart. It doesn't seem to be version specific, and has been reproduced on a number of different rke2 upgrade paths.

Steps To Reproduce:

Create 1 linux server node running 1.25 (though this is not specific to 1.25)
Create 1 windows worker node running 1.25
Create a windows IIS web server deployment and wait for the pods to be created
Upgrade the linux server to 1.26, wait for upgrade to complete
Upgrade the windows worker to 1.26
Attempt to delete a pod created by the IIS web server deployment

Expected behavior:
The pod is fully deleted

Actual behavior:
The pod is stuck in Terminating
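
To confirm a cluster is in this state, one option is to look for pods whose deletion deadline has long passed. The sketch below works on the parsed output of `kubectl get pods -o json` and relies only on the standard `metadata.deletionTimestamp` field. The function name `stuck_terminating`, the 5 minute slack, and the sample pod names are assumptions made for illustration; none of them come from rke2 or kubectl.

```python
import json
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes UTC timestamp such as 2024-03-05T16:19:59Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_terminating(pod_list, now, slack=timedelta(minutes=5)):
    """Return names of pods whose deletion deadline passed more than `slack` ago.

    `pod_list` is the parsed JSON from `kubectl get pods -o json`.
    The default slack is an arbitrary threshold chosen for this sketch.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        deadline = meta.get("deletionTimestamp")
        if deadline is None:
            continue  # the pod is not being deleted
        if now - parse_k8s_time(deadline) > slack:
            stuck.append(meta.get("name"))
    return stuck


# Hypothetical sample data standing in for real kubectl output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "win-webserver-abc12",
                "deletionTimestamp": "2024-08-20T13:00:00Z"}},
  {"metadata": {"name": "win-webserver-def34"}}
]}
""")
now = parse_k8s_time("2024-08-20T13:55:01Z")
print(stuck_terminating(sample, now))
```

Only the first sample pod has a deletion timestamp, and it is well past the slack window, so it is the only one reported.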

Additional context / logs:

HarrisonWAffel · 2024-07-18T16:46:22Z

From what I can determine, the root cause of this issue seems to stem from RKE2 deleting and recreating all calico networks each time it restarts. Calico/Felix tracks HNS endpoints using their IP addresses, and when networks are recreated endpoints are also recreated and are assigned new IP addresses. This results in an inability to delete those endpoints, which prevents references to containers from being removed from HNS namespaces.

The error shown when running kubectl describe on a pod stuck in terminating ({\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}) indicates a failure to delete an HNS namespace because containers are still running within that namespace (as described in microsoft/Windows-Containers#485).
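
As a quick sanity check that the decimal `ErrorCode` in the JSON payload and the hex value in the Win32 message are the same number, here is a small sketch that splits it into the usual HRESULT bit fields. The helper name `describe_hresult` is made up for this example, and the payload is copied from the error above with the shell escaping removed.

```python
import json

# JSON fragment from the end of the KillPodSandbox error message.
payload = '{"Success":false,"Error":"The specified request is unsupported. ","ErrorCode":2151350293}'


def describe_hresult(code):
    """Split an unsigned 32-bit HRESULT into its standard bit fields."""
    return {
        "hex": f"0x{code:08x}",
        "failure": bool(code >> 31),       # bit 31 is the severity bit
        "facility": (code >> 16) & 0x7FF,  # bits 16-26
        "code": code & 0xFFFF,             # bits 0-15
    }


result = describe_hresult(json.loads(payload)["ErrorCode"])
print(result)
```

The `hex` field comes out as `0x803b0015`, matching the value reported by hcnDeleteNamespace.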

It looks like upstream Calico scripts which handle starting the calico-node service on Windows only delete and recreate networks when starting the node after a reboot. RKE2 may need to adopt the same behavior.
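
A minimal sketch of the reboot-gated behavior described above, assuming the agent can read the OS boot time and persist the boot time it last saw when it created the networks. This is not the rke2 or Calico implementation; `should_reset_networks` and both parameters are hypothetical names, and treating a missing record as "reset" is a design choice made only for this example.

```python
from datetime import datetime, timezone


def should_reset_networks(current_boot, recorded_boot):
    """Decide whether the HNS networks should be deleted and recreated.

    current_boot:  boot time reported by the OS right now
    recorded_boot: boot time saved when the networks were last created,
                   or None if nothing has been recorded yet
    """
    if recorded_boot is None:
        return True  # no record yet: assume a fresh setup is needed
    # Only a boot newer than the recorded one counts as a reboot.
    return current_boot > recorded_boot


boot = datetime(2024, 7, 18, 8, 0, tzinfo=timezone.utc)
reboot = datetime(2024, 7, 19, 9, 30, tzinfo=timezone.utc)

# Service restart without a reboot: keep the existing networks and endpoints.
print(should_reset_networks(boot, boot))
# Node rebooted since the networks were created: recreate them.
print(should_reset_networks(reboot, boot))
```

With a check like this, a plain service restart or an in-place upgrade would leave existing endpoints and their IP addresses alone, which is the case that currently leaves pods stuck.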

brandond · 2024-07-18T17:36:32Z

It looks like prior to #3615 that function was called deleteAllNetworksOnNodeRestart, so clearly that was the intent at some point - however @manuelbuil noticed that it didn't actually contain any code to detect a reboot and just unconditionally deleted all the networks whenever called, so he corrected the function name.

The incorrect behavior goes back to the original rushed windows implementation from Jamie in #1268

brandond · 2024-07-18T19:13:19Z

@caroline-suse-rancher can we get one of the network team on this?

rbrtbnfgl · 2024-07-30T16:31:34Z

I think we're already doing that here in the code: https://github.com/rancher/rke2/blob/master/pkg/windows/calico.go#L305

Edit: I see Brad's comment was already saying that.

mdrahman-suse · 2024-08-19T15:40:17Z

@HarrisonWAffel Do you have an example Windows IIS workload that I can use? I am not able to replicate this issue when I used this deployment for testing: https://raw.githubusercontent.com/rancher/distros-test-framework/main/workloads/amd64/windows_app_deployment.yaml
Also followed the steps below to upgrade rke2 with Windows agent:

Setup:

1 linux server, 1 linux worker, 1 Windows (2019) worker

Steps:

Installed rke2 v1.30.2+rke2r1 on all the nodes
Ensured the cluster is up and running
Deployed the workload mentioned above (pod/windows-app-deployment-*)

Installed v1.30.3+rke2r1 on the server / worker nodes
Restarted services on the nodes
On Windows node
- Stopped rke2 service
- Installed v1.30.3+rke2r1
- Restarted rke2 service
Ensured the cluster is upgraded

Deleted the deployment successfully

Please advise if any changes are needed to the steps above.

HarrisonWAffel · 2024-08-20T13:55:01Z

@mdrahman-suse

I was able to reproduce this on v1.25 using the following workload

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: win-webserver
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      containers:
        - name: windowswebserver
          image: mcr.microsoft.com/windows/servercore:ltsc2022
          command:
            - powershell.exe
            - -command
            - "<#code used from https://gist.github.com/19WAS85/5424431#> ; $$listener = New-Object System.Net.HttpListener ; $$listener.Prefixes.Add('http://*:80/') ; $$listener.Start() ; $$callerCounts = @{} ; Write-Host('Listening at http://*:80/') ; while ($$listener.IsListening) { ;$$context = $$listener.GetContext() ;$$requestUrl = $$context.Request.Url ;$$clientIP = $$context.Request.RemoteEndPoint.Address ;$$response = $$context.Response ;Write-Host '' ;Write-Host('> {0}' -f $$requestUrl) ;  ;$$count = 1 ;$$k=$$callerCounts.Get_Item($$clientIP) ;if ($$k -ne $$null) { $$count += $$k } ;$$callerCounts.Set_Item($$clientIP, $$count) ;$$ip=(Get-NetAdapter | Get-NetIpAddress); $$header='<html><body><H1>Windows Container Web Server</H1>' ;$$callerCountsString='' ;$$callerCounts.Keys | % { $$callerCountsString+='<p>IP {0} callerCount {1} ' -f $$ip[1].IPAddress,$$callerCounts.Item($$_) } ;$$footer='</body></html>' ;$$content='{0}{1}{2}' -f $$header,$$callerCountsString,$$footer ;Write-Output $$content ;$$buffer = [System.Text.Encoding]::UTF8.GetBytes($$content) ;$$response.ContentLength64 = $$buffer.Length ;$$response.OutputStream.Write($$buffer, 0, $$buffer.Length) ;$$response.Close() ;$$responseStatus = $$response.StatusCode ;Write-Host('< {0}' -f $$responseStatus)  } ; "
      nodeSelector:
        kubernetes.io/os: windows

This should be reproducible both after an upgrade of the rke2 version and by simply restarting the rke2 service with powershell. The image in the above yaml can be changed to use ltsc2019 depending on your setup, but ideally this would be tested against both 2019 and 2022.

mdrahman-suse · 2024-08-20T17:10:36Z

Update: I was able to replicate the issue. Will validate and update in the respective release branches.
As per @rbrtbnfgl:

The issue is that the RKE2 agent deletes the Windows virtual network when it restarts, so pods that are already running can no longer communicate. To check for it, try nslookup on the Windows pod instead of deleting the pod.

mdrahman-suse · 2024-08-20T17:24:02Z

Also FYI 1.25 is EOL so this fix won't be available in that version. It will be in 1.27+.

HarrisonWAffel · 2024-08-20T17:30:10Z

Yep, totally understand 1.25 is super EOL. Just wanted to note that the workload I provided was last used to reproduce the issue on 1.25. After adding the comment I retested it on 1.29 and reproduced it there as well.

mdrahman-suse · 2024-08-21T21:42:21Z

Validated on all the release branches with the latest RCs, except v1.27 (on commit). Closing this issue

HarrisonWAffel added kind/bug area/windows labels Mar 5, 2024

tzifudzi mentioned this issue Apr 15, 2024

Requesting elaboration for HCN error 2151350293 (0x803b0015) microsoft/Windows-Containers#485

Open

brandond added this to the August 2024 Release Cycle milestone Jul 18, 2024

thomasferrandiz self-assigned this Jul 19, 2024

snasovich mentioned this issue Jul 19, 2024

[BUG]Windows rke2 custom cluster - nodes stop displaying metrics in Monitoring after upgrading kubernetes version rancher/rancher#42414

Closed

HarrisonWAffel mentioned this issue Jul 25, 2024

[Windows] System Agent and rke2 startup may conflict with one another rancher/rancher#46344

Closed

rbrtbnfgl mentioned this issue Jul 31, 2024

Added check if the node is rebooted before the networks is deleted on windows #6437

Merged

rbrtbnfgl self-assigned this Aug 7, 2024

HarrisonWAffel mentioned this issue Aug 9, 2024

[BUG] system-agent does not respect maxFailures if running the same plan post restart rancher/rancher#46620

Closed

endawkins assigned mdrahman-suse Aug 13, 2024

mdrahman-suse closed this as completed Aug 21, 2024
