This repository has been archived by the owner on May 25, 2023. It is now read-only.

Add Pod Condition and unblock cluster autoscaler #526

Closed
Jeffwan opened this issue Jan 1, 2019 · 13 comments · Fixed by #535
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Milestone

Comments

@Jeffwan
Contributor

Jeffwan commented Jan 1, 2019

Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature

What happened:
The cluster autoscaler cannot scale up nodes when the pending pods are scheduled by kube-batch.

After some investigation, I noticed that the cluster autoscaler uses the following logic to filter pending pods. Pending pods scheduled by kube-batch won't trigger autoscaling and have to wait for other pods to release resources. The root cause is that these pending pods carry no PodCondition, so the autoscaler skips them.

	// Only pods whose PodScheduled condition is False with reason Unschedulable
	// are collected; pods with no PodScheduled condition are skipped entirely.
	for _, pod := range allPods {
		_, condition := podv1.GetPodCondition(&pod.Status, apiv1.PodScheduled)
		if condition != nil && condition.Status == apiv1.ConditionFalse && condition.Reason == apiv1.PodReasonUnschedulable {
			unschedulablePods = append(unschedulablePods, pod)
		}
	}

A pending pod scheduled by kube-batch has a status like this:

status:
  phase: Pending

Compare this with a normal pending, unschedulable pod handled by the default Kubernetes scheduler:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-01-01T00:39:13Z
    message: '0/2 nodes are available: 2 Insufficient cpu.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
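
For reference, here is a minimal Go sketch of what writing that condition from a custom scheduler could look like with client-go. This is illustrative only, not the actual change that went into #535; the package and function names (condition, markPodUnschedulable) are made up, and the UpdateStatus call uses the older client-go signature without a context argument.

package condition // hypothetical package name, for illustration only

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markPodUnschedulable records PodScheduled=False with reason Unschedulable on a
// pending pod, which is exactly the condition the autoscaler filter above looks for.
func markPodUnschedulable(client kubernetes.Interface, pod *v1.Pod, message string) error {
	newCond := v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionFalse,
		Reason:             v1.PodReasonUnschedulable,
		Message:            message,
		LastTransitionTime: metav1.Now(),
	}

	// Replace an existing PodScheduled condition, or append one if none is present.
	replaced := false
	for i := range pod.Status.Conditions {
		if pod.Status.Conditions[i].Type == v1.PodScheduled {
			pod.Status.Conditions[i] = newCond
			replaced = true
			break
		}
	}
	if !replaced {
		pod.Status.Conditions = append(pod.Status.Conditions, newCond)
	}

	// Persist through the status subresource (pre-1.18 client-go signature;
	// newer versions also take a context and metav1.UpdateOptions).
	_, err := client.CoreV1().Pods(pod.Namespace).UpdateStatus(pod)
	return err
}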

What you expected to happen:
The cluster autoscaler scales up nodes for pending pods scheduled by kube-batch, just as it does for pods handled by the default scheduler.

How to reproduce it (as minimally and precisely as possible):
Try to schedule a job using kube-batch and rely on the cluster autoscaler for node scaling.

Anything else we need to know?:
I notice #521 will add PodGroupStatus, but I don't think that will work with the autoscaler either.

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.5-eks-6bad6d", GitCommit:"6bad6d9c768dc0864dab48a11653aa53b5a47043", GitTreeState:"clean", BuildDate:"2018-12-06T23:13:14Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: aws

  • OS (e.g. from /etc/os-release):
    NAME="Amazon Linux"
    VERSION="2"
    ID="amzn"
    ID_LIKE="centos rhel fedora"
    VERSION_ID="2"
    PRETTY_NAME="Amazon Linux 2"
    ANSI_COLOR="0;33"
    CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
    HOME_URL="https://amazonlinux.com/"

  • Kernel (e.g. uname -a): 4.14.77-81.59.amzn2.x86_64 #1 SMP Mon Nov 12 21:32:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: eksctl

  • Others:

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 1, 2019
@Jeffwan
Contributor Author

Jeffwan commented Jan 1, 2019

I have not checked the code yet. I think the pod status needs to be updated during the scheduling process. Correct me if I am wrong. If any code change is needed, I can take it. Thanks!

@Jeffwan
Contributor Author

Jeffwan commented Jan 1, 2019

A follow-up question: assume this is addressed and we have a job consisting of 5 pods, each requiring 1 CPU, but only 4 nodes (1 CPU, 1 GB memory each). Ideally the pending job only needs 1 more CPU, but the autoscaler will detect 5 pending pods and end up scaling up 5 nodes. Do you think we can make some improvement on either the autoscaler side or the kube-batch side?

@k82cn k82cn added this to the v0.4 milestone Jan 1, 2019
@k82cn
Contributor

k82cn commented Jan 1, 2019

I think the pod status needs to be updated during the scheduling process.

Yes, kube-batch should also update the Pod's status in this case; a code change/PR is necessary :)

Ideally the pending job only needs 1 more CPU, but the autoscaler will detect 5 pending pods and end up scaling up 5 nodes.

Oh, for this case, maybe the PodGroup's status can help; we need a detailed solution for that :)

@Jeffwan
Contributor Author

Jeffwan commented Jan 2, 2019

@k82cn Thanks! I will submit a PR to reflect pod condition changes.

@MaciekPytel

Marking the pod as unschedulable will make CA notice the pending pod, but it's by no means enough to make it work with kube-batch. CA has been built specifically for the Kubernetes default scheduler, and the whole logic is built around scheduler code imported from the kubernetes/kubernetes repo. More details in #533 (review).

@k82cn
Contributor

k82cn commented Jan 3, 2019

CA has been built specifically for the Kubernetes default scheduler

I think CA is only built on predicates, the same as kube-batch; but after the Scheduler Framework, that will no longer be entirely true for CA either, e.g. coscheduling in the default scheduler will use a reserve callback instead of a predicate.

@MaciekPytel

coscheduling in the default scheduler will use a reserve callback instead of a predicate.

This is why I initially commented on kubernetes/enhancements#639. CA is part of Kubernetes; any new feature in the default scheduler should either be compatible with CA or have a design for how the feature would be added to CA, agreed between sig-scheduling and sig-autoscaling.
cc: @bsalamat

@k82cn
Contributor

k82cn commented Jan 3, 2019

how the feature would be added to CA, agreed between sig-scheduling and sig-autoscaling.

Can we decouple that? For the scheduler, we cannot include all algorithms upstream; instead, I'd suggest users build customized algorithms on the scheduler framework or an HTTP extender. In that case, CA cannot work :(

@MaciekPytel

Sorry, I'm not sure I understand your comment. If the user builds a customized algorithm using an extender, they can no longer use autoscaling. That's how it's always been.

If there is a plan to include a feature in the default scheduler (i.e. it will run if you use the default scheduler that ships in the Kubernetes tarball, without installing any custom schedulers or extenders and/or recompiling anything), then I think working with the autoscaler should be a prerequisite.

@Jeffwan
Contributor Author

Jeffwan commented Jan 4, 2019

@k82cn @MaciekPytel It took some time to finish PR #535 to add the PodCondition, but right now this doesn't trigger a ScaleUp action, based on my test. CA doesn't trigger a scale-up directly from the resource requests of pending pods; it first runs FilterOutSchedulable on them one by one (https://github.com/kubernetes/autoscaler/blob/4002559a4c69c5624ee685dbb2f9dd2e6240b896/cluster-autoscaler/core/utils.go#L112-L147). So in this case:

I have two nodes, NodeA with 3.7 CPUs and NodeB with 2.3 CPUs, and 4 pods in a PodGroup, each requesting 2 CPUs. 2 of the 4 tasks in the gang are unschedulable. The simulation loops over the pending pods one by one, and for each individual pod there is definitely enough room to schedule it, so none of them is reported as unschedulable. I think one improvement here would be to subtract the resources already allocated to earlier pods during the simulation.
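
To make the idea concrete, here is a rough, self-contained Go sketch of that improvement. The types and function below are hypothetical, not the real cluster-autoscaler code: the point is only that each pod the simulation considers schedulable consumes capacity from the in-memory node, so later pods in the same gang come out unschedulable and can trigger a scale-up.

package simulation // hypothetical; the real CA works on scheduler NodeInfo objects

// nodeCapacity is a toy stand-in for an in-memory node snapshot.
type nodeCapacity struct {
	name         string
	cpuMilliFree int64
}

// filterOutSchedulable walks the pending pods (represented here only by their CPU
// requests, in millicores) and deducts each placed pod's request from the node it
// landed on, so the remaining headroom shrinks as the simulation proceeds.
func filterOutSchedulable(podRequests []int64, nodes []nodeCapacity) (schedulable, unschedulable []int64) {
	for _, req := range podRequests {
		placed := false
		for i := range nodes {
			if nodes[i].cpuMilliFree >= req {
				nodes[i].cpuMilliFree -= req // subtract what the simulation just allocated
				placed = true
				break
			}
		}
		if placed {
			schedulable = append(schedulable, req)
		} else {
			unschedulable = append(unschedulable, req)
		}
	}
	return schedulable, unschedulable
}

With NodeA at 3700m and NodeB at 2300m free and four pods of 2000m each, this reports two pods as unschedulable instead of zero, which is what the autoscaler would need to see in order to scale up.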

@MaciekPytel

There is no resource calculation at all in CA. It uses the default scheduler logic for binpacking pods (it puts them one by one onto in-memory fake node objects to see how many nodes would be needed). And, as you noticed, it will not scale up before all space in the cluster is used up.

@Jeffwan
Contributor Author

Jeffwan commented Jan 13, 2019

Right now we've added generic support for the pod condition, so this issue can be closed. For the cluster autoscaling part, we definitely have to take more things into consideration for the kube-batch use case. I will try to figure them all out and open a separate issue/doc for you to review.

@k82cn
Contributor

k82cn commented Jan 14, 2019

I will try to figure them all out and open a separate issue/doc for you to review.

+1
