This repository has been archived by the owner on May 25, 2023. It is now read-only.

Add Pod Condition and unblock cluster autoscaler #526

Closed
Jeffwan opened this issue Jan 1, 2019 · 13 comments · Fixed by #535
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Milestone

Comments

@Jeffwan
Contributor

Jeffwan commented Jan 1, 2019

Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature

What happened:
The cluster autoscaler cannot scale up nodes when the pending pods are scheduled by kube-batch.

After some investigation, I noticed that the cluster autoscaler uses the following logic to filter pending pods. Pending pods scheduled by kube-batch won't trigger autoscaling and have to wait for other pods to release resources. The root cause is that these pending pods carry no PodCondition, so the autoscaler skips them.

	// Only pods whose PodScheduled condition is False with reason Unschedulable
	// are collected; pods with no PodScheduled condition are skipped entirely.
	for _, pod := range allPods {
		_, condition := podv1.GetPodCondition(&pod.Status, apiv1.PodScheduled)
		if condition != nil && condition.Status == apiv1.ConditionFalse && condition.Reason == apiv1.PodReasonUnschedulable {
			unschedulablePods = append(unschedulablePods, pod)
		}
	}

A pending pod scheduled by kube-batch has a status like this:

status:
  phase: Pending

Compare this with a normal pending, unschedulable pod handled by the default Kubernetes scheduler:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-01-01T00:39:13Z
    message: '0/2 nodes are available: 2 Insufficient cpu.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
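
For reference, here is a minimal Go sketch of what writing that condition from a custom scheduler could look like with client-go. This is illustrative only, not the actual change that went into #535; the package and function names (condition, markPodUnschedulable) are made up, and the UpdateStatus call uses the older client-go signature without a context argument.

package condition // hypothetical package name, for illustration only

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markPodUnschedulable records PodScheduled=False with reason Unschedulable on a
// pending pod, which is exactly the condition the autoscaler filter above looks for.
func markPodUnschedulable(client kubernetes.Interface, pod *v1.Pod, message string) error {
	newCond := v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionFalse,
		Reason:             v1.PodReasonUnschedulable,
		Message:            message,
		LastTransitionTime: metav1.Now(),
	}

	// Replace an existing PodScheduled condition, or append one if none is present.
	replaced := false
	for i := range pod.Status.Conditions {
		if pod.Status.Conditions[i].Type == v1.PodScheduled {
			pod.Status.Conditions[i] = newCond
			replaced = true
			break
		}
	}
	if !replaced {
		pod.Status.Conditions = append(pod.Status.Conditions, newCond)
	}

	// Persist through the status subresource (pre-1.18 client-go signature;
	// newer versions also take a context and metav1.UpdateOptions).
	_, err := client.CoreV1().Pods(pod.Namespace).UpdateStatus(pod)
	return err
}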

What you expected to happen:
The cluster autoscaler scales up nodes for pending pods scheduled by kube-batch, just as it does for pods handled by the default scheduler.

How to reproduce it (as minimally and precisely as possible):
Try to schedule a job using kube-batch and rely on the cluster autoscaler for node scaling.

Anything else we need to know?:
I notice #521 will add PodGroupStatus, but I don't think that will work with the autoscaler either.

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.5-eks-6bad6d", GitCommit:"6bad6d9c768dc0864dab48a11653aa53b5a47043", GitTreeState:"clean", BuildDate:"2018-12-06T23:13:14Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: aws

  • OS (e.g. from /etc/os-release):
    NAME="Amazon Linux"
    VERSION="2"
    ID="amzn"
    ID_LIKE="centos rhel fedora"
    VERSION_ID="2"
    PRETTY_NAME="Amazon Linux 2"
    ANSI_COLOR="0;33"
    CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
    HOME_URL="https://amazonlinux.com/"

  • Kernel (e.g. uname -a): 4.14.77-81.59.amzn2.x86_64 #1 SMP Mon Nov 12 21:32:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: eksctl

  • Others:

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 1, 2019
@Jeffwan
Contributor Author

Jeffwan commented Jan 1, 2019

I have not checked the code yet. I think the pod status needs to be updated during the scheduling process. Correct me if I am wrong. If any code change is needed, I can take it. Thanks!

@Jeffwan
Contributor Author

Jeffwan commented Jan 1, 2019

A follow-up question: assume this is addressed and we have a job consisting of 5 pods, each requiring 1 CPU, but only 4 nodes (1 CPU, 1 GB memory each). Ideally the pending job only needs 1 more CPU, but the autoscaler will detect 5 pending pods and end up scaling up 5 nodes. Do you think we can make some improvement on either the autoscaler side or the kube-batch side?

@k82cn k82cn added this to the v0.4 milestone Jan 1, 2019
@k82cn
Contributor

k82cn commented Jan 1, 2019

I think the pod status needs to be updated during the scheduling process.

Yes, kube-batch should also update the Pod's status in this case; a code change/PR is necessary :)

Ideally the pending job only needs 1 more CPU, but the autoscaler will detect 5 pending pods and end up scaling up 5 nodes.

Oh, for this case, maybe the PodGroup's status can help; we need a detailed solution for that :)

@Jeffwan
Contributor Author

Jeffwan commented Jan 2, 2019

@k82cn Thanks! I will submit a PR to reflect pod condition changes.

@MaciekPytel

Marking the pod as unschedulable will make CA notice the pending pod, but it's by no means enough to make it work with kube-batch. CA has been built specifically for the Kubernetes default scheduler, and the whole logic is built around scheduler code imported from the kubernetes/kubernetes repo. More details in #533 (review).

@k82cn
Contributor

k82cn commented Jan 3, 2019

CA has been built specifically for the Kubernetes default scheduler

I think CA is only built on predicates, the same as kube-batch; but after the Scheduler Framework, that will no longer be entirely true for CA either, e.g. coscheduling in the default scheduler will use a reserve callback instead of a predicate.

@MaciekPytel

coscheduling in the default scheduler will use a reserve callback instead of a predicate.

This is why I initially commented on kubernetes/enhancements#639. CA is part of Kubernetes; any new feature in the default scheduler should either be compatible with CA or have a design for how the feature would be added to CA, agreed between sig-scheduling and sig-autoscaling.
cc: @bsalamat

@k82cn
Contributor

k82cn commented Jan 3, 2019

how the feature would be added to CA, agreed between sig-scheduling and sig-autoscaling.

Can we decouple that? For the scheduler, we cannot include all algorithms upstream; instead, I'd suggest users build customized algorithms on the scheduler framework or an HTTP extender. In that case, CA cannot work :(

@MaciekPytel

Sorry, I'm not sure I understand your comment. If the user builds a customized algorithm using an extender, they can no longer use autoscaling. That's how it's always been.

If there is a plan to include a feature in the default scheduler (i.e. it will run if you use the default scheduler that ships in the Kubernetes tarball, without installing any custom schedulers or extenders and/or recompiling anything), then I think working with the autoscaler should be a prerequisite.

@Jeffwan
Contributor Author

Jeffwan commented Jan 4, 2019

@k82cn @MaciekPytel It took some time to finish PR #535 to add the PodCondition, but right now this doesn't trigger a ScaleUp action, based on my test. CA doesn't trigger a scale-up directly from the resource requests of pending pods; it first runs FilterOutSchedulable on them one by one (https://github.com/kubernetes/autoscaler/blob/4002559a4c69c5624ee685dbb2f9dd2e6240b896/cluster-autoscaler/core/utils.go#L112-L147). So in this case:

I have two nodes, NodeA with 3.7 CPUs and NodeB with 2.3 CPUs, and 4 pods in a PodGroup, each requesting 2 CPUs. 2 of the 4 tasks in the gang are unschedulable. The simulation loops over the pending pods one by one, and for each individual pod there is definitely enough room to schedule it, so none of them is reported as unschedulable. I think one improvement here would be to subtract the resources already allocated to earlier pods during the simulation.
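
To make the idea concrete, here is a rough, self-contained Go sketch of that improvement. The types and function below are hypothetical, not the real cluster-autoscaler code: the point is only that each pod the simulation considers schedulable consumes capacity from the in-memory node, so later pods in the same gang come out unschedulable and can trigger a scale-up.

package simulation // hypothetical; the real CA works on scheduler NodeInfo objects

// nodeCapacity is a toy stand-in for an in-memory node snapshot.
type nodeCapacity struct {
	name         string
	cpuMilliFree int64
}

// filterOutSchedulable walks the pending pods (represented here only by their CPU
// requests, in millicores) and deducts each placed pod's request from the node it
// landed on, so the remaining headroom shrinks as the simulation proceeds.
func filterOutSchedulable(podRequests []int64, nodes []nodeCapacity) (schedulable, unschedulable []int64) {
	for _, req := range podRequests {
		placed := false
		for i := range nodes {
			if nodes[i].cpuMilliFree >= req {
				nodes[i].cpuMilliFree -= req // subtract what the simulation just allocated
				placed = true
				break
			}
		}
		if placed {
			schedulable = append(schedulable, req)
		} else {
			unschedulable = append(unschedulable, req)
		}
	}
	return schedulable, unschedulable
}

With NodeA at 3700m and NodeB at 2300m free and four pods of 2000m each, this reports two pods as unschedulable instead of zero, which is what the autoscaler would need to see in order to scale up.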

@MaciekPytel

There is no resource calculation at all in CA. It uses the default scheduler logic for binpacking pods (it puts them one by one onto in-memory fake node objects to see how many nodes would be needed). And, as you noticed, it will not scale up before all space in the cluster is used up.

@Jeffwan
Contributor Author

Jeffwan commented Jan 13, 2019

Right now we've added generic support for the pod condition, so this issue can be closed. For the cluster autoscaling part, we definitely have to take more things into consideration for the kube-batch use case. I will try to figure them all out and open a separate issue/doc for you to review.

@k82cn
Contributor

k82cn commented Jan 14, 2019

I will try to figure them all out and open a separate issue/doc for you to review.

+1
