fix(kwok): prevent quitting when scaling down node group #6336
Conversation
Force-pushed from 3a67a32 to ca0bb24.
	nodeGroup.targetSize += 1
}

nodeGroup.targetSize = newSize
This is for a case in which some nodes are created successfully and some fail.
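In other words (a minimal standalone sketch of the pattern, not the kwok provider's actual code; the function signature and the makeNode callback are assumptions for illustration), incrementing the counter only after each successful create keeps the tracked size consistent with the nodes that really exist:

package kwoksketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// increaseSize creates delta fake nodes one at a time and bumps the tracked
// target size only after each successful create, so that a partial failure
// leaves the counter matching the nodes that actually exist in the cluster.
func increaseSize(client kubernetes.Interface, targetSize *int, delta int,
	makeNode func(i int) *corev1.Node) error {
	if delta <= 0 {
		return fmt.Errorf("size increase must be positive, got %d", delta)
	}
	for i := 0; i < delta; i++ {
		if _, err := client.CoreV1().Nodes().Create(
			context.Background(), makeNode(i), metav1.CreateOptions{}); err != nil {
			// targetSize already reflects only the nodes created so far.
			return err
		}
		*targetSize += 1
	}
	return nil
}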
Makes sense. Should we add a test case around this (for both IncreaseSize and DeleteNodes)?
OK
/assign @vadasambar
Thank you for the PR!
I can't reproduce the issue. I used the same commands and configmap you used. Here's what I did: the kwok provider created 2 fake nodes, and then scaled down the 2 fake nodes. Logs for reference: https://gist.github.com/vadasambar/56ac07f2eedbd97e5d8aaa1424df3481
Maybe you can share the error you saw?
@vadasambar Sorry for the misunderstanding in the title. CA does not panic; it simply quits without any stack information. The last line in your log shows the reason:
	no.GetName())
}

ngName = fmt.Sprintf("%s-%v", ngName, time.Now().Unix())
I think it might be better to keep the node group name unchanged, especially in cases where nodes still remain in the cluster and CA is restarted.
Line 275 is clearly a bug. The original intention was to ensure the targetSize of a nodegroup correctly reflects the number of nodes with a matching annotation/label; making sure the nodegroup has a unique name on every pod restart achieves this. If we already have nodes in the cluster with matching nodegroup annotations/labels and we remove the time.Now().Unix() suffix, then when the nodegroups are created (imagine the CA pod got restarted), the target size won't accurately reflect the actual nodegroup size, because there are now more nodes matching the nodegroup label in the cluster.
One solution could be to implement Refresh and update the nodegroup target size based on the actual matching nodes in the cluster.
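A rough sketch of that suggestion (not the implementation that ended up in the PR; the function shape and the annotation-key parameter are assumptions): count the nodes carrying each node group's annotation and reset every target size to that count on Refresh.

package kwoksketch

import corev1 "k8s.io/api/core/v1"

// refreshTargetSizes recomputes each node group's target size from the nodes
// that actually exist in the cluster, keyed by a node group annotation, so a
// restarted CA starts from reality rather than a stale in-memory counter.
// nodeGroupAnnotation stands in for whatever key the provider actually uses.
func refreshTargetSizes(allNodes []*corev1.Node, targetSizes map[string]int,
	nodeGroupAnnotation string) {
	counts := map[string]int{}
	for _, node := range allNodes {
		ngName := node.GetAnnotations()[nodeGroupAnnotation]
		if ngName == "" {
			continue // not a kwok-managed node
		}
		counts[ngName]++
	}
	// Groups with no matching nodes in the cluster are reset to zero.
	for ngName := range targetSizes {
		targetSizes[ngName] = counts[ngName]
	}
}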
Let me know what you think.
The Kwok provider will calculate the target sizes during startup, and the Cluster Autoscaler will scale these nodegroups to the appropriate size. I don't think it's necessary to calculate the target size in the Refresh function; leaving this work to CA may be the better choice.
The kwok provider will clean up all the fake nodes when CA quits:
// Cleanup cleans up all resources before the cloud provider is removed
func (kwok *KwokCloudProvider) Cleanup() error {
	for _, ng := range kwok.nodeGroups {
		nodeNames, err := ng.getNodeNamesForNodeGroup()
		if err != nil {
			return fmt.Errorf("error cleaning up: %v", err)
		}
		for _, node := range nodeNames {
			err := kwok.kubeClient.CoreV1().Nodes().Delete(context.Background(), node, v1.DeleteOptions{})
			if err != nil {
				klog.Errorf("error cleaning up kwok provider nodes '%v'", node)
			}
		}
	}
	return nil
}
True. I was considering a case when Cleanup doesn't clean up all the nodes (it doesn't work correctly for whatever reason).
> The Kwok provider will calculate the target sizes during startup, and the Cluster Autoscaler will scale these nodegroups to the appropriate size. I think it's not necessary to calculate the target size in the Refresh function, just leave this work to CA may be a better choice.
I think we might have to try this out to confirm if CA can handle such a situation. If you can try it out as a part of this PR, great. If not, we can take care of it in another issue.
Hi @vadasambar, I have implemented the Refresh() function. Please take a look.
This is clearly a bug. Thank you for the explanation!
Also, I think we should add a test case which fails with the current code and passes with the fix.
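For illustration, a hedged sketch of what such a test could look like using client-go's fake clientset (this is not the PR's test code; the reactor-based failure injection and the inline create loop stand in for the provider's IncreaseSize):

package kwoksketch

import (
	"context"
	"errors"
	"fmt"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes/fake"
	k8stesting "k8s.io/client-go/testing"
)

// TestIncreaseSizePartialFailure simulates a scale-up where the second node
// create fails and checks that the tracked target size counts only the nodes
// that were actually created.
func TestIncreaseSizePartialFailure(t *testing.T) {
	client := fake.NewSimpleClientset()
	creates := 0
	client.PrependReactor("create", "nodes",
		func(action k8stesting.Action) (bool, runtime.Object, error) {
			creates++
			if creates == 2 {
				return true, nil, errors.New("simulated create failure")
			}
			return false, nil, nil // let the fake clientset handle the create
		})

	targetSize := 0
	for i := 0; i < 3; i++ {
		node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("fake-node-%d", i)}}
		if _, err := client.CoreV1().Nodes().Create(
			context.Background(), node, metav1.CreateOptions{}); err != nil {
			break // stop on failure, as the provider's IncreaseSize would
		}
		targetSize++
	}

	if targetSize != 1 {
		t.Fatalf("expected targetSize 1 after partial failure, got %d", targetSize)
	}
}

The idea is that such a test fails if the counter is set to the requested size up front and passes once it is incremented per successful create.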
Thanks for your advice; it will be done in a few days. /hold
Force-pushed from ca0bb24 to 3feb6ce.
Force-pushed from 3feb6ce to 1b8f1cc.
Done. /unhold
switch opts.CloudProviderName {
case cloudprovider.KwokProviderName:
	return kwok.BuildKwokCloudProvider(opts, do, rl)(opts, do, rl)
	return kwok.BuildKwok(opts, do, rl, informerFactory)
👍
@qianlei90 apologies for the delay (I was out on vacation). I plan to review this PR this week.
// 	ngs = append(ngs, ng)
// }
for _, node := range allNodes {
	ngName := getNGName(node, kwok.config)
This will lead to klog.Fatal for cases where the node doesn't have the nodegroup label or annotation. You might have to use a filter function to keep only nodes which have the nodegroup label/annotation, OR convert klog.Fatal into a non-fatal error log.
OK. I converted klog.Fatal to klog.Warning and added comments to getNGName.
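As a standalone sketch of that pattern (the annotation-only lookup and the function shape are assumptions, not the exact getNGName in the PR): log a warning and return an empty name instead of calling klog.Fatal, and let callers skip such nodes.

package kwoksketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// getNGNameOrSkip returns the node group name stored in the given annotation,
// or "" (with a warning instead of a fatal log) when the node doesn't carry
// it. Callers are expected to skip nodes that return an empty name.
func getNGNameOrSkip(node *corev1.Node, nodeGroupAnnotation string) string {
	ngName := node.GetAnnotations()[nodeGroupAnnotation]
	if ngName == "" {
		klog.Warningf("node %q has no node group annotation %q; skipping it",
			node.GetName(), nodeGroupAnnotation)
	}
	return ngName
}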
}

for _, ng := range kwok.nodeGroups {
	ng.targetSize = targetSizeInCluster[ng.Id()]
Maybe something for the future: I wonder if we should delete nodes whose ng is not a part of kwok.nodeGroups.
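If that future cleanup were ever added, it could look roughly like this (purely a sketch of the idea, not part of this PR; knownGroups and the annotation-key parameter are assumptions):

package kwoksketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// deleteOrphanNodes removes fake nodes whose node group annotation doesn't
// match any node group the provider currently knows about.
func deleteOrphanNodes(client kubernetes.Interface, allNodes []*corev1.Node,
	knownGroups map[string]bool, nodeGroupAnnotation string) {
	for _, node := range allNodes {
		ngName := node.GetAnnotations()[nodeGroupAnnotation]
		if ngName == "" || knownGroups[ngName] {
			continue // unmanaged node, or it belongs to a known node group
		}
		if err := client.CoreV1().Nodes().Delete(
			context.Background(), node.GetName(), metav1.DeleteOptions{}); err != nil {
			klog.Errorf("error deleting orphan node %q: %v", node.GetName(), err)
		}
	}
}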
assert.NoError(t, err)
assert.NotNil(t, p)

err = p.Refresh()
kwokConfig.status is coming out nil here for some reason, which is why the node4 test case is not throwing an error.
Sorry, it was a misunderstanding on my end. I ran the test again. Looks good to me.
Force-pushed from 1b8f1cc to e71a123.
/lgtm
/unhold
Thank you @qianlei90. LGTM. @BigDarkClown can you please merge the PR 🙏
/assign @towca |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: qianlei90, towca, vadasambar
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
When using the Kwok provider, CA quits when scaling down a node group because the Kwok provider cannot retrieve the node group name from a fake node. This PR primarily aims to fix this issue.
Additionally, I have fixed the target size when scaling up and down the node group.
Which issue(s) this PR fixes:
- kwok-provider-config
- kwok-provider-templates
- starting CA
- scale this deployment to test scale up and down
- CA log
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: