
Karpenter is recreating the same type of node over a period of time and deleting it in parallel #1851

Open
bparamjeet opened this issue Nov 29, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

bparamjeet commented Nov 29, 2024

Description

Observed Behavior:
When replicas are increased, Karpenter repeatedly creates nodes of the same type and deletes them in parallel over a period of time.

Expected Behavior:
Karpenter should not consolidate this aggressively when replicas are increased; nodes should not be repeatedly created and deleted over a period of time, forcing pods to be shifted onto the new nodes.

Reproduction Steps (Please include YAML):

  • Create multiple NodePools with the same weight, one per AZ (a minimal sketch of the manifests follows this list).
  • Create a deployment with 1 replica and no node selectors.
  • Increase the replica count to 10.
  • Karpenter will create multiple nodeclaims, and the pods will be reshuffled over a period of time.
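
For concreteness, here is a minimal sketch of the kind of manifests involved; the NodePool name, zone, EC2NodeClass reference, and container image are placeholders rather than the exact manifests from this cluster:

```yaml
# One NodePool per AZ, all sharing the same weight (repeat per zone).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: nodepool-us-east-1a        # placeholder name
spec:
  weight: 10                       # identical weight on every per-AZ NodePool
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # placeholder EC2NodeClass
      requirements:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]   # one zone per NodePool
---
# Deployment that starts at 1 replica, later scaled to 10, with no node selectors.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app                 # placeholder name
spec:
  replicas: 1                      # later: kubectl scale deployment sample-app --replicas=10
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/nginx:stable   # placeholder image
          resources:
            requests:
              cpu: "1"             # placeholder request
```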

Versions:

  • Karpenter Version: 1.0.5
  • Kubernetes Version (kubectl version): v1.31
(Three screenshots attached: 2024-12-10 at 10:48:11 AM, 10:46:43 AM, and 10:46:50 AM.)

Attached logs:
karpenter-dec07.log

Questions:

  1. What could be the reason for a node repeatedly entering a recreate loop even though there hasn't been any scale-up or scale-down activity in the cluster? The issue can be seen between 16:15 and 16:45 in the attached logs.
  2. Why are deleting pods included in the "found provisionable pod" category, especially pods that are still on a terminating node?
@bparamjeet bparamjeet added the kind/bug Categorizes issue or PR as related to a bug. label Nov 29, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 29, 2024

sekar-saravanan commented Dec 10, 2024

We were able to reproduce the issue. It's not a bug in Karpenter itself but rather an inefficiency in how Karpenter's consolidation scheduling interacts with the kube-scheduler. The kube-scheduler has full control over pod placement, which can conflict with the placement Karpenter calculated during consolidation.

How to Reproduce:

Deployments Configuration:

  • deploy1: 2 CPU
  • deploy2: 3 CPU
  • deploy3: 4 CPU
  • deploy4: 12 CPU
  • DaemonSet: 1.2 CPU per node
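
As a reference, a minimal sketch of one of these test deployments (deploy1, 2 CPU); the nodeSelector label, toleration, and image are placeholders standing in for the attached manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deploy1
  template:
    metadata:
      labels:
        app: deploy1
    spec:
      nodeSelector:
        workload-type: karpenter-test        # placeholder label matching the test NodePool
      tolerations:
        - key: workload-type
          value: karpenter-test              # placeholder taint on the test NodePool
          effect: NoSchedule
      containers:
        - name: app
          image: public.ecr.aws/docker/library/nginx:stable   # placeholder image
          resources:
            requests:
              cpu: "2"                       # deploy2: "3", deploy3: "4", deploy4: "12"
```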

Node Pool Configuration:

  • Created a separate node pool for testing using specific labels and taints.
  • Instance Type: c5.4xlarge (16 CPU, 32 Gi memory)
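
A minimal sketch of such a NodePool, again with placeholder label/taint values and a placeholder EC2NodeClass name:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: karpenter-test
spec:
  template:
    metadata:
      labels:
        workload-type: karpenter-test        # placeholder label selected by the test deployments
    spec:
      taints:
        - key: workload-type
          value: karpenter-test              # placeholder taint tolerated by the test deployments
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                        # placeholder
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c5.4xlarge"]             # 16 vCPU / 32 Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 0s
```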

Scale the Deployment:

  • Scale the deployments and steer the pods onto the nodes shown in the table below using kubectl cordon/uncordon (a sketch of the commands follows the table).

| resource | node 1 pods | node 1 CPU req | node 2 pods | node 2 CPU req | node 3 pods | node 3 CPU req |
| --- | --- | --- | --- | --- | --- | --- |
| deploy1 (2 CPU) | 0 | 0 | 1 | 2 | 0 | 0 |
| deploy2 (3 CPU) | 0 | 0 | 2 | 6 | 0 | 0 |
| deploy3 (4 CPU) | 0 | 0 | 0 | 0 | 2 | 8 |
| deploy4 (12 CPU) | 1 | 12 | 0 | 0 | 0 | 0 |
| daemonset (1.2 CPU) | - | 1.2 | - | 1.2 | - | 1.2 |
| total CPU requested | | 13.2 | | 9.2 | | 9.2 |
| free CPU | | 2.8 | | 6.8 | | 6.8 |
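
One way to force this layout, assuming the three c5.4xlarge nodes already exist and using placeholder node names, is to cordon everything except the intended target before each scale-up:

```sh
# Only node-2 accepts new pods: place deploy1 (1 pod) and deploy2 (2 pods) there.
kubectl cordon node-1
kubectl cordon node-3
kubectl scale deployment deploy1 --replicas=1
kubectl scale deployment deploy2 --replicas=2

# Only node-3 accepts new pods: place deploy3 (2 pods) there.
kubectl cordon node-2
kubectl uncordon node-3
kubectl scale deployment deploy3 --replicas=2

# Only node-1 accepts new pods: place deploy4 (1 pod) there.
kubectl cordon node-3
kubectl uncordon node-1
kubectl scale deployment deploy4 --replicas=1
```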

Setup Complete:

  • Once pods are placed on the nodes, uncordon all nodes and ensure consolidation is enabled for the node pool.

What will happen next?

  • Karpenter will now perform consolidation for node2, aiming to move the deploy1 pod (1 pod) to node1 and the deploy2 pods (2 pods) to node3, so that node2 can be removed.

Expected Scheduling by Karpenter:

| resource | node 1 pods | node 1 CPU req | node 2 pods | node 2 CPU req | node 3 pods | node 3 CPU req |
| --- | --- | --- | --- | --- | --- | --- |
| deploy1 (2 CPU) | 1 | 2 | 0 | 0 | 0 | 0 |
| deploy2 (3 CPU) | 0 | 0 | 0 | 0 | 2 | 6 |
| deploy3 (4 CPU) | 0 | 0 | 0 | 0 | 2 | 8 |
| deploy4 (12 CPU) | 1 | 12 | 0 | 0 | 0 | 0 |
| daemonset (1.2 CPU) | - | 1.2 | - | 1.2 | - | 1.2 |
| total CPU requested | | 15.2 | | 1.2 | | 15.2 |
| free CPU | | 0.8 | | 14.8 | | 0.8 |

Actual Scheduling by the Kube-Scheduler:

| resource | node 1 pods | node 1 CPU req | node 2 pods | node 2 CPU req | node 3 pods | node 3 CPU req |
| --- | --- | --- | --- | --- | --- | --- |
| deploy1 (2 CPU) | 0 | 0 | 0 | 0 | 1 | 2 |
| deploy2 (3 CPU) | 0 | 0 | 1 | 3 | 1 | 3 |
| deploy3 (4 CPU) | 0 | 0 | 0 | 0 | 2 | 8 |
| deploy4 (12 CPU) | 1 | 12 | 0 | 0 | 0 | 0 |
| daemonset (1.2 CPU) | - | 1.2 | - | 1.2 | - | 1.2 |
| total CPU requested | | 13.2 | | 4.2 | | 14.2 |
| free CPU | | 2.8 | | 11.8 | | 1.8 |

Observed Issue:

  • One of the deploy2 pods enters the Pending state, which triggers Karpenter to provision a new node.
  • This behavior repeats in a loop, continuously recreating nodes.

Root Cause:

  • Karpenter performs a scheduling simulation during consolidation to decide which nodes the evicted pods should land on. However, the kube-scheduler is unaware of this simulation and places the pods according to its own scoring (which by default favors less-allocated nodes), so the actual placement can differ.
  • This discrepancy leads to inefficient pod placement, causing some pods to remain pending and triggering unnecessary node provisioning.

The Deployment, PDB, and NodePool manifests are attached below.
manifest.zip

@jonathan-innis (Member) commented:

Agree with @sekar-saravanan that this is an unfortunate interaction between Karpenter and the kube-scheduler being separate entities, with drain ordering also playing a factor in which pods end up where. Short of Karpenter acting as the scheduler itself, we've talked about some mitigations for this issue, most notably the use of consolidateAfter, which prevents a node from being consolidated if a pod has just scheduled to it. You may be able to extend this timing to at least prevent the continual nature of this churn; a sketch follows.
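
For example, a partial NodePool spec extending this window (the 10m value is only illustrative):

```yaml
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 10m   # wait 10 minutes after the last pod is added to or removed
                            # from a node before considering it for consolidation
```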

@jonathan-innis (Member) commented:

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 10, 2024

sekar-saravanan commented Dec 18, 2024

Increasing the consolidateAfter time helps reduce the node churn cycle, but it won't resolve the underlying issue. We hope that configuring the node scoring strategy as MostAllocated in the kube-scheduler could help in this scenario; however, EKS currently does not support customizing the kube-scheduler configuration (aws/containers-roadmap#1468). A sketch of such a configuration follows.
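
For reference, the scheduler change being described would look roughly like the configuration below; this is a sketch of the upstream KubeSchedulerConfiguration API and cannot currently be applied to an EKS-managed control plane:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated          # bin-pack instead of the default LeastAllocated spreading
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```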
