Windows workloads cannot be deleted post upgrade, stuck in Terminating
#5551
Comments
From what I can determine, the root cause of this issue seems to stem from RKE2 deleting and recreating all Calico networks each time it restarts. Calico/Felix tracks HNS endpoints using their IP addresses, and when the networks are recreated the endpoints are also recreated and assigned new IP addresses. This results in an inability to delete those endpoints, which prevents references to containers from being removed from HNS namespaces. The error shown when running …
It looks like the upstream Calico scripts which handle starting the …
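As a rough way to see this on an affected Windows worker, a sketch along the following lines can list the HNS networks and endpoints described above and compare them against what Kubernetes reports. It assumes the HNS cmdlets (`Get-HnsNetwork` / `Get-HnsEndpoint`) are available on the node (built-in HostNetworkingService module or Microsoft's hns.psm1) and that `kubectl` is on the PATH; the node-name handling is a guess based on rke2 defaults.

```powershell
# Sketch only: inspect HNS state on the Windows worker and compare it
# against the pod IPs Kubernetes currently reports for this node.

# Calico networks that rke2 deletes and recreates on restart
Get-HnsNetwork | Select-Object Name, Id, Type

# Endpoints HNS still holds, with the IP address that Felix keys on
Get-HnsEndpoint | Select-Object Name, Id, IPAddress, VirtualNetworkName, State

# Pod IPs Kubernetes reports for this node (rke2 normally registers the
# node under its lowercase hostname; adjust if your node name differs)
kubectl get pods -A -o wide --field-selector "spec.nodeName=$($env:COMPUTERNAME.ToLower())"
```

Endpoints left over from the pre-restart networks will show IPs that no longer appear in the pod listing.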
It looks like prior to #3615 that function was called …
The incorrect behavior goes back to the original rushed Windows implementation from Jamie in #1268.
@caroline-suse-rancher can we get one of the network team on this?
I think we are already doing that here in the code: https://github.com/rancher/rke2/blob/master/pkg/windows/calico.go#L305
Edit: I read Brad's comment, which already says that.
@HarrisonWAffel Do you have an example Windows IIS workload that I can use? I am not able to replicate this issue using this deployment for testing: https://raw.githubusercontent.com/rancher/distros-test-framework/main/workloads/amd64/windows_app_deployment.yaml
Setup: …
Steps: …
Please advise if any changes are needed in the above steps.
I was able to reproduce this on v1.25 using the following workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: win-webserver
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      containers:
      - name: windowswebserver
        image: mcr.microsoft.com/windows/servercore:ltsc2022
        command:
        - powershell.exe
        - -command
        - "<#code used from https://gist.github.com/19WAS85/5424431#> ; $listener = New-Object System.Net.HttpListener ; $listener.Prefixes.Add('http://*:80/') ; $listener.Start() ; $callerCounts = @{} ; Write-Host('Listening at http://*:80/') ; while ($listener.IsListening) { ;$context = $listener.GetContext() ;$requestUrl = $context.Request.Url ;$clientIP = $context.Request.RemoteEndPoint.Address ;$response = $context.Response ;Write-Host '' ;Write-Host('> {0}' -f $requestUrl) ; ;$count = 1 ;$k=$callerCounts.Get_Item($clientIP) ;if ($k -ne $null) { $count += $k } ;$callerCounts.Set_Item($clientIP, $count) ;$ip=(Get-NetAdapter | Get-NetIpAddress); $header='<html><body><H1>Windows Container Web Server</H1>' ;$callerCountsString='' ;$callerCounts.Keys | % { $callerCountsString+='<p>IP {0} callerCount {1} ' -f $ip[1].IPAddress,$callerCounts.Item($_) } ;$footer='</body></html>' ;$content='{0}{1}{2}' -f $header,$callerCountsString,$footer ;Write-Output $content ;$buffer = [System.Text.Encoding]::UTF8.GetBytes($content) ;$response.ContentLength64 = $buffer.Length ;$response.OutputStream.Write($buffer, 0, $buffer.Length) ;$response.Close() ;$responseStatus = $response.StatusCode ;Write-Host('< {0}' -f $responseStatus) } ; "
      nodeSelector:
        kubernetes.io/os: windows
```

This should be reproducible both after an upgrade of the rke2 version and by simply restarting the rke2 service with PowerShell. The image in the above yaml can be changed to use …
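A rough repro sketch based on the above (not an official test script): it assumes `kubectl` is configured against the cluster, the YAML above has been saved as `win-webserver.yaml`, and the rke2 agent on the Windows worker runs as a service named `rke2` (the default install).

```powershell
# Repro sketch: deploy the workload, restart rke2 on the Windows worker,
# then try to delete the pod and watch it hang in Terminating.

# 1. Deploy the Windows web server and wait for it to become available
kubectl apply -f .\win-webserver.yaml
kubectl rollout status deployment/win-webserver --timeout=10m

# 2. On the Windows worker, restart the rke2 service (an in-place upgrade
#    has the same effect); this recreates the Calico HNS networks
Restart-Service rke2

# 3. Once the node is Ready again, delete the pod without waiting
kubectl delete pod -l app=win-webserver --wait=false

# 4. The pod is expected to stay stuck in Terminating
kubectl get pods -l app=win-webserver
```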
Update: I was able to replicate the issue. Will validate and update in the respective release branches
Also, FYI: 1.25 is EOL, so this fix won't be available in that version. It will be in 1.27+.
Yep, totally understand 1.25 is super EOL. Just wanted to preface that the workload I provided was last used to produce the issue on 1.25. I retested it after adding the comment on 1.29 and also repro'd it there.
Validated on all the release branches with the latest RCs, except v1.27 (on commit). Closing this issue.
Environmental Info:
RKE2 Version: v1.25.16, upgrading to v1.26 (seems to reproduce for upgrades between any versions)
rke2 version v1.25.16+rke2r1 (3fe54b9)
go version go1.20.11 X:boringcrypto
Node(s) CPU architecture, OS, and Version:
Server:
Linux haffel-testing-linux-server-0 5.4.0-1109-azure #115~18.04.1-Ubuntu SMP Mon May 22 20:06:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Windows Worker:
Microsoft Windows [Version 10.0.20348.2322] (2022 Azure datacenter)
Cluster Configuration:
1 server, all roles
1 windows worker
Describe the bug:
Windows workloads will not properly delete post rke2 upgrade. For instance, deploying a simple IIS web server via a deployment on 1.25 and then attempting to delete a pod spawned from that deployment post upgrade to 1.26 will result in the pod never completely terminating.
Running `kubectl describe pod` shows the following error message: …
This issue was encountered while debugging rancher/rancher#42414, but it reproduces for workloads other than the Rancher monitoring chart. This issue doesn't seem to be version specific and has been reproduced for a number of different rke2 upgrade paths.
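For reference, a short sketch of how the stuck state shows up. The label is taken from the example workload above, and the kubelet log path assumes a default rke2 Windows install under C:\var\lib\rancher\rke2; adjust both to match your environment.

```powershell
# Sketch: observe the stuck pod and the error it reports.
kubectl get pods -l app=win-webserver -o wide   # STATUS remains "Terminating"
kubectl describe pod -l app=win-webserver       # the Events section carries the HNS/endpoint error

# On the Windows worker, the kubelet log usually repeats the same failure
# (default rke2 path; adjust if rke2 was installed elsewhere)
Get-Content 'C:\var\lib\rancher\rke2\agent\logs\kubelet.log' -Tail 50
```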
Steps To Reproduce:
Expected behavior:
The pod is fully deleted
Actual behavior:
The pod is stuck in Terminating
Additional context / logs: