
Karpenter is recreating the same type of node over a period of time and deleting it in parallel #1851

Open
bparamjeet opened this issue Nov 29, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

bparamjeet commented Nov 29, 2024

Description

Observed Behavior:
When replicas are increased, Karpenter repeatedly creates nodes of the same type and deletes them in parallel over a period of time.

Expected Behavior:
Karpenter should not consolidate this aggressively when replicas are increased; nodes should not be repeatedly created and deleted over a period of time, forcing pods to be shifted onto the new nodes.

Reproduction Steps (Please include YAML):

  • Create multiple NodePools with the same weight, one per AZ (a minimal sketch of the manifests follows this list).
  • Create a deployment with 1 replica and no node selectors.
  • Increase the replica count to 10.
  • Karpenter will create multiple nodeclaims, and the pods will be reshuffled over a period of time.
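
For concreteness, here is a minimal sketch of the kind of manifests involved; the NodePool name, zone, EC2NodeClass reference, and container image are placeholders rather than the exact manifests from this cluster:

```yaml
# One NodePool per AZ, all sharing the same weight (repeat per zone).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: nodepool-us-east-1a        # placeholder name
spec:
  weight: 10                       # identical weight on every per-AZ NodePool
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # placeholder EC2NodeClass
      requirements:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]   # one zone per NodePool
---
# Deployment that starts at 1 replica, later scaled to 10, with no node selectors.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app                 # placeholder name
spec:
  replicas: 1                      # later: kubectl scale deployment sample-app --replicas=10
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/nginx:stable   # placeholder image
          resources:
            requests:
              cpu: "1"             # placeholder request
```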

Versions:

  • Karpenter Version: 1.0.5
  • Kubernetes Version (kubectl version): v1.31
(Three screenshots attached: 2024-12-10 at 10:48:11 AM, 10:46:43 AM, and 10:46:50 AM.)

Attached logs:
karpenter-dec07.log

Questions:

  1. What could be the reason for a node repeatedly entering a recreate loop even though there hasn't been any scale-up or scale-down activity in the cluster? The issue can be seen between 16:15 and 16:45 in the attached logs.
  2. Why are deleting pods included in the "found provisionable pod" category, especially pods that are still on a terminating node?
@bparamjeet bparamjeet added the kind/bug Categorizes issue or PR as related to a bug. label Nov 29, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 29, 2024

sekar-saravanan commented Dec 10, 2024

We were able to reproduce the issue. It's not a bug in Karpenter itself but rather an inefficiency in how Karpenter's consolidation scheduling interacts with the kube-scheduler. The kube-scheduler has full control over pod placement, which can conflict with the placement Karpenter calculated during consolidation.

How to Reproduce:

Deployments Configuration:

  • deploy1: 2 CPU
  • deploy2: 3 CPU
  • deploy3: 4 CPU
  • deploy4: 12 CPU
  • DaemonSet: 1.2 CPU per node
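
As a reference, a minimal sketch of one of these test deployments (deploy1, 2 CPU); the nodeSelector label, toleration, and image are placeholders standing in for the attached manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deploy1
  template:
    metadata:
      labels:
        app: deploy1
    spec:
      nodeSelector:
        workload-type: karpenter-test        # placeholder label matching the test NodePool
      tolerations:
        - key: workload-type
          value: karpenter-test              # placeholder taint on the test NodePool
          effect: NoSchedule
      containers:
        - name: app
          image: public.ecr.aws/docker/library/nginx:stable   # placeholder image
          resources:
            requests:
              cpu: "2"                       # deploy2: "3", deploy3: "4", deploy4: "12"
```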

Node Pool Configuration:

  • Created a separate node pool for testing using specific labels and taints.
  • Instance Type: c5.4xlarge (16 CPU, 32 Gi memory)
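
A minimal sketch of such a NodePool, again with placeholder label/taint values and a placeholder EC2NodeClass name:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: karpenter-test
spec:
  template:
    metadata:
      labels:
        workload-type: karpenter-test        # placeholder label selected by the test deployments
    spec:
      taints:
        - key: workload-type
          value: karpenter-test              # placeholder taint tolerated by the test deployments
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                        # placeholder
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c5.4xlarge"]             # 16 vCPU / 32 Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 0s
```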

Scale the Deployment:

  • Scale the deployments and steer the pods onto the nodes shown in the table below using kubectl cordon/uncordon (a sketch of the commands follows the table).

| resource | node 1 pods | node 1 CPU req | node 2 pods | node 2 CPU req | node 3 pods | node 3 CPU req |
| --- | --- | --- | --- | --- | --- | --- |
| deploy1 (2 CPU) | 0 | 0 | 1 | 2 | 0 | 0 |
| deploy2 (3 CPU) | 0 | 0 | 2 | 6 | 0 | 0 |
| deploy3 (4 CPU) | 0 | 0 | 0 | 0 | 2 | 8 |
| deploy4 (12 CPU) | 1 | 12 | 0 | 0 | 0 | 0 |
| daemonset (1.2 CPU) | - | 1.2 | - | 1.2 | - | 1.2 |
| total CPU requested | | 13.2 | | 9.2 | | 9.2 |
| free CPU | | 2.8 | | 6.8 | | 6.8 |
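
One way to force this layout, assuming the three c5.4xlarge nodes already exist and using placeholder node names, is to cordon everything except the intended target before each scale-up:

```sh
# Only node-2 accepts new pods: place deploy1 (1 pod) and deploy2 (2 pods) there.
kubectl cordon node-1
kubectl cordon node-3
kubectl scale deployment deploy1 --replicas=1
kubectl scale deployment deploy2 --replicas=2

# Only node-3 accepts new pods: place deploy3 (2 pods) there.
kubectl cordon node-2
kubectl uncordon node-3
kubectl scale deployment deploy3 --replicas=2

# Only node-1 accepts new pods: place deploy4 (1 pod) there.
kubectl cordon node-3
kubectl uncordon node-1
kubectl scale deployment deploy4 --replicas=1
```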

Setup Complete:

  • Once pods are placed on the nodes, uncordon all nodes and ensure consolidation is enabled for the node pool.

What will happen next?

  • Karpenter will now perform consolidation for node2, aiming to move the deploy1 pod (1 pod) to node1 and the deploy2 pods (2 pods) to node3, so that node2 can be removed.

Expected Scheduling by Karpenter:

| resource | node 1 pods | node 1 CPU req | node 2 pods | node 2 CPU req | node 3 pods | node 3 CPU req |
| --- | --- | --- | --- | --- | --- | --- |
| deploy1 (2 CPU) | 1 | 2 | 0 | 0 | 0 | 0 |
| deploy2 (3 CPU) | 0 | 0 | 0 | 0 | 2 | 6 |
| deploy3 (4 CPU) | 0 | 0 | 0 | 0 | 2 | 8 |
| deploy4 (12 CPU) | 1 | 12 | 0 | 0 | 0 | 0 |
| daemonset (1.2 CPU) | - | 1.2 | - | 1.2 | - | 1.2 |
| total CPU requested | | 15.2 | | 1.2 | | 15.2 |
| free CPU | | 0.8 | | 14.8 | | 0.8 |

Actual Scheduling by the Kube-Scheduler:

| resource | node 1 pods | node 1 CPU req | node 2 pods | node 2 CPU req | node 3 pods | node 3 CPU req |
| --- | --- | --- | --- | --- | --- | --- |
| deploy1 (2 CPU) | 0 | 0 | 0 | 0 | 1 | 2 |
| deploy2 (3 CPU) | 0 | 0 | 1 | 3 | 1 | 3 |
| deploy3 (4 CPU) | 0 | 0 | 0 | 0 | 2 | 8 |
| deploy4 (12 CPU) | 1 | 12 | 0 | 0 | 0 | 0 |
| daemonset (1.2 CPU) | - | 1.2 | - | 1.2 | - | 1.2 |
| total CPU requested | | 13.2 | | 4.2 | | 14.2 |
| free CPU | | 2.8 | | 11.8 | | 1.8 |

Observed Issue:

  • One of the deploy2 pods enters the Pending state, which triggers Karpenter to provision a new node.
  • This behavior repeats in a loop, continuously recreating nodes.

Root Cause:

  • Karpenter performs a scheduling simulation during consolidation to decide which nodes the evicted pods should land on. However, the kube-scheduler is unaware of this simulation and places the pods according to its own scoring (which by default favors less-allocated nodes), so the actual placement can differ.
  • This discrepancy leads to inefficient pod placement, causing some pods to remain pending and triggering unnecessary node provisioning.

The Deployment, PDB, and NodePool manifests are attached below.
manifest.zip

@jonathan-innis (Member) commented:

Agree with @sekar-saravanan that this is an unfortunate interaction between Karpenter and the kube-scheduler being separate entities, with drain ordering also playing a factor in which pods end up where. Short of Karpenter acting as the scheduler itself, we've talked about some mitigations for this issue, most notably the use of consolidateAfter, which prevents a node from being consolidated if a pod has just scheduled to it. You may be able to extend this timing to at least prevent the continual nature of this churn; a sketch follows.
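
For example, a partial NodePool spec extending this window (the 10m value is only illustrative):

```yaml
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 10m   # wait 10 minutes after the last pod is added to or removed
                            # from a node before considering it for consolidation
```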

@jonathan-innis (Member) commented:

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 10, 2024

sekar-saravanan commented Dec 18, 2024

Increasing the consolidateAfter time helps reduce the node churn cycle, but it won't resolve the underlying issue. We hope that configuring the node scoring strategy as MostAllocated in the kube-scheduler could help in this scenario; however, EKS currently does not support customizing the kube-scheduler configuration (aws/containers-roadmap#1468). A sketch of such a configuration follows.
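
For reference, the scheduler change being described would look roughly like the configuration below; this is a sketch of the upstream KubeSchedulerConfiguration API and cannot currently be applied to an EKS-managed control plane:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated          # bin-pack instead of the default LeastAllocated spreading
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```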
