feat: support custom scheduler config (without extenders) #5708

vadasambar · 2023-04-25T06:30:17Z

Signed-off-by: vadasambar surajrbanakar@gmail.com

What type of PR is this?

/kind feature

What this PR does / why we need it:

Problem

We don't have a mechanism to to tweak configuration for default plugins acting on PreFilter and Filter extension points (which is used in predicate checker).
We don't have a way to support simulating custom scheduler logic without re-compiling the CA

This PR introduces --scheduler-config-file flag to

tweak configuration for default plugins
~~2. add support for scheduler extenders so that the predicate checker can call an external filter URL.~~ DECIDED AGAINST GOING FORWARD WITH EXTENDERS. Check feat: support custom scheduler config (without extenders) #5708 (comment) and feat: support custom scheduler config (without extenders) #5708 (comment)

Which issue(s) this PR fixes:

Fixes #5106

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added: `--scheduler-config-file` flag which supports configuring `PreFilter` and `Filter` extension points for in-tree scheduler plugins used to run scheduler simulations in CA

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

vadasambar · 2023-04-28T14:22:48Z

cluster-autoscaler/main.go

+		return &schedconfig.KubeSchedulerConfiguration{}
+	}
+
+	obj, gvk, err := schedscheme.Codecs.UniversalDecoder().Decode(data, nil, nil)


This line creates a default scheduler config if none is specified.
TODO: return output of this call as default. If it errors, fall back to the old behavior of getting a default config.

Not doing the TODO above leads to a problem where the scheduler config ends up as empty and it shows up in the error log like this:

$ ./cluster-autoscaler --kubeconfig=/home/suraj/.kube/config --namespace=default I0428 19:48:18.545718 1865631 leaderelection.go:245] attempting to acquire leader lease default/cluster-autoscaler... I0428 19:48:40.443736 1865631 leaderelection.go:255] successfully acquired lease default/cluster-autoscaler E0428 19:48:40.444001 1865631 main.go:385] couldn't load scheduler config (falling back to default scheduler config): open : no such file or directory F0428 19:48:40.446465 1865631 main.go:504] Failed to create autoscaler: unexpected scheduler config: expected one scheduler profile only (found 0 profiles)

vadasambar · 2023-05-02T07:06:24Z

Deleted my comment around test results because the screenshots have sensitive info. Will post another comment with sensitive details redacted.

vadasambar · 2023-05-03T06:11:35Z

Note that Go tests are probably failing because of #5708 (comment)

Testing for scheduler extenders

Created a GKE cluster
Deployed a custom scheduler called custom-scheduler in kube-system namespace with custom scheduler config.

Custom scheduler manifests

apiVersion: v1
kind: ServiceAccount
metadata:
  name: custom-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-scheduler-as-kube-scheduler
subjects:
- kind: ServiceAccount
  name: custom-scheduler
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin # only for testing
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-scheduler-config
  namespace: kube-system
data:
  custom-scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: custom-scheduler
    leaderElection:
      leaderElect: false
    # I want scheduler to call an external http extender to 
    # filter schedulable nodes
    extenders:
    - urlPrefix: http://sample-scheduler-extender-service.default
      filterVerb: apiv1/filter
      nodeCacheCapable: false
      ignorable: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: custom-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: custom-scheduler
      volumes:
        - name: config-volume
          configMap:
            name: custom-scheduler-config
      containers:
      - command:
        - /usr/local/bin/kube-scheduler
        - --config=/etc/kubernetes/custom-scheduler/custom-scheduler-config.yaml
        - --v=5
        image: k8s.gcr.io/kube-scheduler:v1.25.3
        volumeMounts:
          - name: config-volume
            mountPath: /etc/kubernetes/custom-scheduler
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
          initialDelaySeconds: 15
        name: kube-second-scheduler
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
        resources:
          requests:
            cpu: '0.1'
        securityContext:
          privileged: false

      hostNetwork: false
      hostPID: false

Based on https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/#define-a-kubernetes-deployment-for-the-scheduler

Deployed a simple scheduler extender called sample-scheduler-extender
Deployed a custom CA (built a custom image for this PR)

When nodes don't have `extender='true'` label

Pod (which has spec.schedulerName: custom-scheduler) gets stuck in Pending and CA is not able to scale-up as expected.

Pod manifest

apiVersion: v1
kind: Pod
metadata:
  name: annotation-second-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulerName: custom-scheduler
  containers:
  - name: pod-with-second-annotation-container
    image: registry.k8s.io/pause:2.0

Based on https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/#specify-schedulers-for-pods

As expected, CA throws error when it sees that the pod cannot be scheduled on the default nodepool (whose nodes don't have extender: 'true' label)

When nodes have `extender: 'true'` label

I created a new nodepool in GKE which brings a node with extender: 'true' label.

CA correctly thinks that the pod can be scheduled on the upcoming node.

Node has the label

Pod gets scheduled

x13n

/assign

x13n · 2023-05-05T13:30:58Z

cluster-autoscaler/main.go

@@ -368,6 +376,36 @@ func getKubeConfig() *rest.Config {
 	return kubeConfig
 }

+func getSchedulerConfig(path string) *schedconfig.KubeSchedulerConfiguration {


This function represents non-trivial logic which I think should live in a separate package, not main. (Same goes for a lot of the existing code, but I think least we can do is not making things worse.)

I like the idea. Thanks for the feedback.

P.S.:
I created this PR as a PoC to evaluate the idea and get an OK from the community before proceeding ahead (one of the items on my agenda for today i.e., 8th May 2023's SIG meeting since we didn't have 2 meetings before). I had planned on refactoring this function and the PR as a whole (+ tests) but stopped myself so that I don't end up spending too much time in case this feature doesn't sound like a good idea to the community.

x13n · 2023-05-05T13:41:57Z

cluster-autoscaler/simulator/predicatechecker/schedulerbased.go

+func (p *SchedulerBasedPredicateChecker) runExtendersForNode(pod *apiv1.Pod, nodeInfo *schedulerframework.NodeInfo) *PredicateError {
+	for _, extender := range p.framework.Extenders() {
+		if extender.IsInterested(pod) {
+			filteredNodes, failedNodesMap, unresolvableNodesMap, err := extender.Filter(pod, []*apiv1.Node{nodeInfo.Node()})


nit: Looks like extender.Filter accepts a list of nodes - wouldn't it make more sense to run this once after PreFilter, but before Filter and pass all the nodes that came out of PreFilter?

Passing all nodes extender.Filter at once so that we don't call it multiple times makes sense to me but same could be said of running filter plugins (even though cost of calling in-tree filter plugins per node won't be as high). I just took a safe bet here and went with the same approach as running filter plugins (might not be the best approach as you rightfully pointed out). I guess the question here would be, why didn't we do the same for running filter plugins?

extender.Filter is called in the upstream scheduler code after all PreFilter and Filter plugins are called. I don't see much benefit in changing the order.

File: pkg/scheduler/schedule_one.go 385: // Filters the nodes to find the ones that fit the pod based on the framework 386: // filter plugins and filter extenders. 387: func (sched *Scheduler) findNodesThatFitPod(ctx context.Context, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod) ([]*v1.Node, framework.Diagnosis, error) { ... 393: allNodes, err := sched.nodeInfoSnapshot.NodeInfos().List() 394: if err != nil { 395: return nil, diagnosis, err 396: } 397: // Run "prefilter" plugins. 398: preRes, s := fwk.RunPreFilterPlugins(ctx, state, pod) 399: ... 413: 414: ... 438: feasibleNodes, err := sched.findNodesThatPassFilters(ctx, fwk, state, pod, diagnosis, nodes) ... 446: 447: feasibleNodes, err = findNodesThatPassExtenders(sched.Extenders, pod, feasibleNodes, diagnosis.NodeToStatusMap) 448: if err != nil { 449: return nil, diagnosis, err 450: }

https://github.com/kubernetes/kubernetes/blob/8d18ae6fc245deb68534cab608056c5aae9160bd/pkg/scheduler/schedule_one.go#L385-L452

File: pkg/scheduler/schedule_one.go 576: func findNodesThatPassExtenders(extenders []framework.Extender, pod *v1.Pod, feasibleNodes []*v1.Node, statuses framework.NodeToStatusMap) ([]*v1.Node, error) { 577: // Extenders are called sequentially. 578: // Nodes in original feasibleNodes can be excluded in one extender, and pass on to the next 579: // extender in a decreasing manner. ... 587: 588: // Status of failed nodes in failedAndUnresolvableMap will be added or overwritten in <statuses>, 589: // so that the scheduler framework can respect the UnschedulableAndUnresolvable status for 590: // particular nodes, and this may eventually improve preemption efficiency. 591: // Note: users are recommended to configure the extenders that may return UnschedulableAndUnresolvable 592: // status ahead of others. 593: feasibleList, failedMap, failedAndUnresolvableMap, err := extender.Filter(pod, feasibleNodes) ...

https://github.com/kubernetes/kubernetes/blob/8d18ae6fc245deb68534cab608056c5aae9160bd/pkg/scheduler/schedule_one.go#L576-L627

x13n · 2023-05-05T13:42:53Z

cluster-autoscaler/simulator/predicatechecker/schedulerbased.go

 		}
+
+		if err := p.runExtendersForNode(pod, nodeInfo); err != nil {


Can we not ignore err here? I think it has some useful information when not nil.

Fair enough. I will think something around this. Thanks for the comment.

MaciekPytel · 2023-05-08T15:59:28Z

Supporting configuration of existing PreFilters/Filters seems non-controversial to me. I'm worried about extenders though. The 2 specific problems I see are:

What exactly is being sent to HTTP extender? I suspect it's just node names, with extender taking the actual Node and Pod objects from informer / api-server? If so, that doesn't really work with CA simulation that operate on hypothetical cluster state and not real one (CA simulation has in-memory template nodes and even existing nodes can run an entirely different set of pods in CA simulation than what is really scheduled on them). Unless we serialize the entire ClusterSnapshot and send it over to extender, the extender is operating on wrong data and CA will make incorrect decisions as a result.
Is there a separate HTTP request for each Filter run? If so, I don't think it will work on scales supported by CA (ex. simulating a scale-up of 100 nodes on 100 different nodegroups to make a scale-up decision).

The problem with those 2 issues is that they are subtle and a simple enough test (an extender that only looks at the pending pod and is independent from any other pods in a small cluster) will work just fine. Even a more complex extender will not fail every time and the probability of problems likely depends on scale and complexity of the cluster (ie. my tests may pass just fine and it will only blow up once I hit prod).
My personal opinion is that it's better to fail early and explicitly, than to have features that work in some circumstances, but fail unexpectedly. So unless we have a way of addressing my 2 concerns above, I'd be against including the extender support.

vadasambar · 2023-05-09T06:38:48Z

Supporting configuration of existing PreFilters/Filters seems non-controversial to me. I'm worried about extenders though. The 2 specific problems I see are:

What exactly is being sent to HTTP extender? I suspect it's just node names, with extender taking the actual Node and Pod objects from informer / api-server? If so, that doesn't really work with CA simulation that operate on hypothetical cluster state and not real one (CA simulation has in-memory template nodes and even existing nodes can run an entirely different set of pods in CA simulation than what is really scheduled on them). Unless we serialize the entire ClusterSnapshot and send it over to extender, the extender is operating on wrong data and CA will make incorrect decisions as a result.

Is there a separate HTTP request for each Filter run? If so, I don't think it will work on scales supported by CA (ex. simulating a scale-up of 100 nodes on 100 different nodegroups to make a scale-up decision).

The problem with those 2 issues is that they are subtle and a simple enough test (an extender that only looks at the pending pod and is independent from any other pods in a small cluster) will work just fine. Even a more complex extender will not fail every time and the probability of problems likely depends on scale and complexity of the cluster (ie. my tests may pass just fine and it will only blow up once I hit prod). My personal opinion is that it's better to fail early and explicitly, than to have features that work in some circumstances, but fail unexpectedly. So unless we have a way of addressing my 2 concerns above, I'd be against including the extender support.

Thank you for the comment.

This is configurable.
- If nodeCacheCapable field in extenders[].nodeCacheCapable in scheduler config is set to true, scheduler sends node names to HTTP extender
- If the field is set to false scheduler sends nodes to the HTTP extender
- References:
  - Code ref in scheduler: https://github.com/kubernetes/kubernetes/blob/8d18ae6fc245deb68534cab608056c5aae9160bd/pkg/scheduler/extender.go#L267-L278
  - HTTP extender I used to test: https://github.com/vadasambar/sample-scheduler-extender
If you mean is there a separate HTTP request for every time we want to check a pod against Filter predicates in our CA code, then yes.

I had a discussion around 2 with @mwielgus in yesterday's SIG meeting (8th May 2023) and he had similar concerns (summary of the discussion). There are concerns around performance of HTTP extender in scheduler as well (ref).
I was hoping we could provide this as an optional feature for small to medium clusters since we don't support any kind of customization for scheduler simulation at the moment (and maybe ask for GRPC support for extenders?).

That being said,

The problem with those 2 issues is that they are subtle and a simple enough test (an extender that only looks at the pending pod and is independent from any other pods in a small cluster) will work just fine. Even a more complex extender will not fail every time and the probability of problems likely depends on scale and complexity of the cluster (ie. my tests may pass just fine and it will only blow up once I hit prod).
My personal opinion is that it's better to fail early and explicitly, than to have features that work in some circumstances, but fail unexpectedly. So unless we have a way of addressing my 2 concerns above, I'd be against including the extender support.

I understand your concern. Personally, I think we can add some checks to disable the extender when the cluster size becomes too big but there are problems with that approach too since it would surprise our users (e.g., I expect the extender to be working but it has suddenly stopped working) and the size would depend on the HTTP extender being used (some might degrade in performance at 20 nodes while others might degrade at 80 nodes).

I will drop the idea of supporting external extenders and proceed ahead with supporting custom scheduler config for in-tree scheduler plugins.

Thanks for taking the time to write all of this. Really helps understand the why of things.

x13n · 2023-05-11T11:47:06Z

cluster-autoscaler/simulator/predicatechecker/schedulerbased.go

@@ -27,7 +27,7 @@ import (
 	kube_client "k8s.io/client-go/kubernetes"
 	v1listers "k8s.io/client-go/listers/core/v1"
 	klog "k8s.io/klog/v2"
-	scheduler_config "k8s.io/kubernetes/pkg/scheduler/apis/config/latest"
+	config "k8s.io/kubernetes/pkg/scheduler/apis/config"


nit: This alias is a no-op, the package is called config already.

x13n · 2023-05-11T11:56:01Z

cluster-autoscaler/main.go

@@ -132,6 +133,7 @@ var (
 			"for scale down when some candidates from previous iteration are no longer valid."+
 			"When calculating the pool size for additional candidates we take"+
 			"max(#nodes * scale-down-candidates-pool-ratio, scale-down-candidates-pool-min-count).")
+	schedulerConfig             = flag.String("scheduler-config-file", "", "scheduler-config allows adding extenders and changing configuration of default scheduler plugins acting on PreFilter and Filter extension points")


I think the comment is not accurate anymore as extenders won't be used.

x13n · 2023-05-11T11:58:45Z

cluster-autoscaler/utils/scheduler/scheduler.go

+// GetSchedulerConfig loads scheduler config from a path
+// (falls back to default config in case of error)
+func GetSchedulerConfig(path string) (*schedconfig.KubeSchedulerConfiguration, error) {
+	defaultFallbackMsg := "(falling back to default scheduler config)"


Is the fallback really the desired behavior here? This can hide some errors - e.g. user making a typo in config path will expect custom config to be used, but CA instead of crashing right away will silently fall back to the default configuration.

Using default config instead of failing fast might surprise our users. I will change this. Thanks for the suggestion.

x13n · 2023-05-11T12:00:03Z

cluster-autoscaler/utils/scheduler/scheduler.go

+
+// GetSchedulerConfig loads scheduler config from a path
+// (falls back to default config in case of error)
+func GetSchedulerConfig(path string) (*schedconfig.KubeSchedulerConfiguration, error) {


This function is actually the main code change in this PR now. Maybe worth adding some unit tests?

Sorry, I should have added a comment after #5708 (comment). Since we have moved on from the PoC phase, I was planning to do another pass on PR (cleanup, add unit tests etc.,) and the change the PR from draft to Ready for Review. Maybe I can mention you again here once I am done?

Yup, sounds good!

x13n · 2023-05-11T12:00:51Z

cluster-autoscaler/simulator/predicatechecker/testchecker.go

 	clientsetfake "k8s.io/client-go/kubernetes/fake"
 )

 // NewTestPredicateChecker builds test version of PredicateChecker.
 func NewTestPredicateChecker() (PredicateChecker, error) {
+
+	cfg, err := scheduler.GetSchedulerConfig("")


nit: Why not use the default config directly?

Let me think about this 🙏

This has been updated to use the default scheduler config in case no scheduler config is passed through the function parameter: https://github.com/kubernetes/autoscaler/pull/5708/files#diff-5818325464670a5cd54280fe253b0841a163da5de2764491a3065805785ffb1bR51-R56

vadasambar · 2023-05-17T06:29:38Z

cluster-autoscaler/utils/scheduler/scheduler.go

+
+	// this needs to be set explicitly because config's api version is empty after decoding
+	// check kubernetes/cmd/kube-scheduler/app/options/configfile.go for more info
+	cfgObj.TypeMeta.APIVersion = gvk.GroupVersion().String()


This function is based on upstream scheduler's private function called loadConfigFromFile

File: cmd/kube-scheduler/app/options/configfile.go 34: func loadConfigFromFile(logger klog.Logger, file string) (*config.KubeSchedulerConfiguration, error) { 35: data, err := os.ReadFile(file) 36: if err != nil { 37: return nil, err 38: } 39: 40: return loadConfig(logger, data) 41: } 42: 43: func loadConfig(logger klog.Logger, data []byte) (*config.KubeSchedulerConfiguration, error) { 44: // The UniversalDecoder runs defaulting and returns the internal type by default. 45: obj, gvk, err := scheme.Codecs.UniversalDecoder().Decode(data, nil, nil) 46: if err != nil { 47: return nil, err 48: } 49: if cfgObj, ok := obj.(*config.KubeSchedulerConfiguration); ok { 50: // We don't set this field in pkg/scheduler/apis/config/{version}/conversion.go 51: // because the field will be cleared later by API machinery during 52: // conversion. See KubeSchedulerConfiguration internal type definition for 53: // more details. 54: cfgObj.TypeMeta.APIVersion = gvk.GroupVersion().String() 55: switch cfgObj.TypeMeta.APIVersion { 56: case configv1beta2.SchemeGroupVersion.String(): 57: logger.Info("KubeSchedulerConfiguration v1beta2 is deprecated in v1.25, will be removed in v1.28") 58: case configv1beta3.SchemeGroupVersion.String(): 59: logger.Info("KubeSchedulerConfiguration v1beta3 is deprecated in v1.26, will be removed in v1.29") 60: } 61: return cfgObj, nil 62: } 63: return nil, fmt.Errorf("couldn't decode as KubeSchedulerConfiguration, got %s: ", gvk) 64: }

https://github.com/kubernetes/kubernetes/blob/6b34fafdaf5998039c7e01fa33920a641b216d3e/cmd/kube-scheduler/app/options/configfile.go#L34-L64

Note that I haven't included the deprecated log messages because they might change and we'd have to remember updating them every time. I don't see a lot of value in printing these messages because we'd have to support the scheduler config even if it is deprecated anyway (open to other opinions here).

Note that if the user uses an old scheduler api version which isn't supported anymore like say v1alpha1, UniversalDecoder().Decode throws an error like this:

panic: no kind "KubeSchedulerConfiguration" is registered for version "kubescheduler.config.k8s.io/v1alpha1" in scheme "pkg/runtime/scheme.go:100"

Could we make this function public in scheduler codebase instead?

Thanks for the comment. I thought about this but I was also on the fence because scheduler has a similar function to write the config to a file

File: cmd/kube-scheduler/app/options/configfile.go 091: // LogOrWriteConfig logs the completed component config and writes it into the given file name as YAML, if either is enabled 092: func LogOrWriteConfig(logger klog.Logger, fileName string, cfg *config.KubeSchedulerConfiguration, completedProfiles []config.KubeSchedulerProfile) error { 093: loggerV := logger.V(2) 094: if !loggerV.Enabled() && len(fileName) == 0 { 095: return nil 096: } 097: cfg.Profiles = completedProfiles 098: 099: buf, err := encodeConfig(cfg) 100: if err != nil { 101: return err 102: } 103: 104: if loggerV.Enabled() { 105: loggerV.Info("Using component config", "config", buf.String()) 106: } 107: 108: if len(fileName) > 0 { 109: configFile, err := os.Create(fileName) 110: if err != nil { 111: return err 112: } 113: defer configFile.Close() 114: if _, err := io.Copy(configFile, buf); err != nil { 115: return err 116: } 117: logger.Info("Wrote configuration", "file", fileName) 118: os.Exit(0) 119: } 120: return nil 121: }

https://github.com/kubernetes/kubernetes/blob/6b34fafdaf5998039c7e01fa33920a641b216d3e/cmd/kube-scheduler/app/options/configfile.go#L91-L121

Notice how it is calling os.Exit(0) on line 118. If you import and use this function in your code, the program is going to exit when you call this function. My concern is, if we don't implement the function to read config file on our own we might make ourselves dependent on the implementation of LoadConfigFromFile (if loadConfigFromFile is converted to a public function) whose implementation is dependent on use-case of scheduler and not CA. If there's a change in the LoadConfigFromFile in the future, it will break things for us. That being said, same thing can be argued for Filter and PostFilter extension points (public functions which scheduler framework exposes) but I think there's an important difference between Filter/PostFilter extension points and LoadConfigFromFile or LogOrWriteConfig. Former is similar to a public API and it is unlikely to change without a strong deprecation policy or notice to users but that might not be the case with latter.

If we decide to go with LoadConfigFromFile way, we have 2 possible options:

Option 1:

Get this PR in to CA codebase

Talk with sig-scheduling about setting expectation about LoadConfigFromFile

Create another PR to replace our implementation with sig-scheduling's implementation (might have to wait until they release 1.28 so that we can import their code OR ask for a patch to 1.27)

Option 2:

Put this PR on hold

Talk with sig-scheduling about setting expectation about LoadConfigFromFile

Create another PR to replace our implementation with sig-scheduling's implementation (might have to wait until they release 1.28 so that we can import their code or ask for a patch to 1.27)

If we decide to go with asking for a public implementation of LoadConfigFromFile and we want to go with Option 1, I will create a separate issue to co-ordinate with sig-scheduling and create another PR to use the public implementation. If we go with Option 2, we will have to put this PR on hold until we get LoadConfigFromFile.

Let me know what you think.

I'm a bit worried changes in scheduler will never happen if we merge here first. OTOH, waiting for k8s release is a bit too long, so I'd:

Send a PR to update scheduler code

Add TODO here to switch to the upstream function and merge as is

PR to update scheduler code: kubernetes/kubernetes#119057

Added TODO here

vadasambar · 2023-05-17T06:34:22Z

cluster-autoscaler/utils/scheduler/scheduler_test.go

+	// Note that even if we are passing minimal config like below
+	// `GetSchedulerConfig` will set the rest of the default fields
+	// on its own (including default profile and default plugins)
+	correctConfigFile := filepath.Join(tmpDir, "correct_config.yaml")


Here's a simple program to test this. Just put the scheduler config in test-config.yaml

import ( "fmt" "os" "gopkg.in/yaml.v2" config "k8s.io/kubernetes/pkg/scheduler/apis/config" "k8s.io/kubernetes/pkg/scheduler/apis/config/scheme" "k8s.io/kubernetes/pkg/scheduler/apis/config/validation" ) func main() { file := "test-config.yaml" data, err := os.ReadFile(file) if err != nil { panic(err) } obj, gvk, err := scheme.Codecs.UniversalDecoder().Decode(data, nil, nil) if err != nil { panic(err) } cfgObj, ok := obj.(*config.KubeSchedulerConfiguration) if !ok { panic(fmt.Errorf("couldn't decode as KubeSchedulerConfiguration, got %s: ", gvk)) } cfgObj.TypeMeta.APIVersion = gvk.GroupVersion().String() if err := validation.ValidateKubeSchedulerConfiguration(cfgObj); err != nil { panic(err) } data, err = yaml.Marshal(cfgObj) if err != nil { panic(err) } fmt.Println(string(data)) }

test-config.yaml

apiVersion: kubescheduler.config.k8s.io/v1 kind: KubeSchedulerConfiguration

Click here to see the output of the program

$ go run main.go typemeta: kind: "" apiversion: kubescheduler.config.k8s.io/v1 parallelism: 16 leaderelection: leaderelect: true leaseduration: duration: 15s renewdeadline: duration: 10s retryperiod: duration: 2s resourcelock: leases resourcename: kube-scheduler resourcenamespace: kube-system clientconnection: kubeconfig: "" acceptcontenttypes: "" contenttype: application/vnd.kubernetes.protobuf qps: 50 burst: 100 healthzbindaddress: "" metricsbindaddress: "" debuggingconfiguration: enableprofiling: true enablecontentionprofiling: true percentageofnodestoscore: 0 podinitialbackoffseconds: 1 podmaxbackoffseconds: 10 profiles: - schedulername: default-scheduler percentageofnodestoscore: null plugins: preenqueue: enabled: [] disabled: [] queuesort: enabled: [] disabled: [] prefilter: enabled: [] disabled: [] filter: enabled: [] disabled: [] postfilter: enabled: [] disabled: [] prescore: enabled: [] disabled: [] score: enabled: [] disabled: [] reserve: enabled: [] disabled: [] permit: enabled: [] disabled: [] prebind: enabled: [] disabled: [] bind: enabled: [] disabled: [] postbind: enabled: [] disabled: [] multipoint: enabled: - name: PrioritySort weight: 0 - name: NodeUnschedulable weight: 0 - name: NodeName weight: 0 - name: TaintToleration weight: 3 - name: NodeAffinity weight: 2 - name: NodePorts weight: 0 - name: NodeResourcesFit weight: 1 - name: VolumeRestrictions weight: 0 - name: EBSLimits weight: 0 - name: GCEPDLimits weight: 0 - name: NodeVolumeLimits weight: 0 - name: AzureDiskLimits weight: 0 - name: VolumeBinding weight: 0 - name: VolumeZone weight: 0 - name: PodTopologySpread weight: 2 - name: InterPodAffinity weight: 2 - name: DefaultPreemption weight: 0 - name: NodeResourcesBalancedAllocation weight: 1 - name: ImageLocality weight: 1 - name: DefaultBinder weight: 0 disabled: [] pluginconfig: - name: DefaultPreemption args: typemeta: kind: "" apiversion: "" mincandidatenodespercentage: 10 mincandidatenodesabsolute: 100 - name: InterPodAffinity args: typemeta: kind: "" apiversion: "" hardpodaffinityweight: 1 ignorepreferredtermsofexistingpods: false - name: NodeAffinity args: typemeta: kind: "" apiversion: "" addedaffinity: null - name: NodeResourcesBalancedAllocation args: typemeta: kind: "" apiversion: "" resources: - name: cpu weight: 1 - name: memory weight: 1 - name: NodeResourcesFit args: typemeta: kind: "" apiversion: "" ignoredresources: [] ignoredresourcegroups: [] scoringstrategy: type: LeastAllocated resources: - name: cpu weight: 1 - name: memory weight: 1 requestedtocapacityratio: null - name: PodTopologySpread args: typemeta: kind: "" apiversion: "" defaultconstraints: [] defaultingtype: System - name: VolumeBinding args: typemeta: kind: "" apiversion: "" bindtimeoutseconds: 600 shape: [] extenders: []

Note that direct marshalling to yaml is not the right way to serialize the scheduler config but I have used to for the sake of simplicity.

vadasambar · 2023-05-18T04:23:45Z

cluster-autoscaler/main.go

+
+	var parsedSchedConfig *scheduler_config.KubeSchedulerConfiguration
+	// if scheduler config flag was set by the user
+	if pflag.CommandLine.Changed(config.SchedulerConfigFileFlag) {


This is so that if the --scheduler-config-file flag is not used, we don't try to look for the scheduler config in the default path set by the flag i.e., "". We want to use the default scheduler config. Setting the scheduler config to nil ensures that the default scheduler config is used.

Why is it better than just comparing with ""? If someone specifies an empty value for the flag, shouldn't they get the default scheduler config, rather than an error from trying to open a filename of empty string?

I think there are two lines of thought here:

User has set --scheduler-config-file="" (note that this is a file path) which means they have made a mistake. Hence, fail.

User has set --scheduler-config-file="" which means they don't want to set a scheduler config by themselves. Fall back to using a default one.

Falling back to default in case of 2 doesn't seem as valuable to me because if the user wants to use the default config, they can just skip the flag altogether because the flag specifies path to a config file (and not scheduler config itself) and if they have set it to "", I think it's more likely that they have made a mistake (which can happen if say this value comes from a helm chart file and it's set to empty there by mistake) because they can just skip specifying the flag if they want to use the default config.

A counter-argument could be, if the user sets only one or two fields in the custom scheduler config we anyway fill rest of the required fields by default (in this PR). It can be argued that 2 is inline with this behavior.

I think failing here makes more sense than using default config but I am open to change it if you think there's a strong reason for not doing so.

I was thinking about someone templating the manifest and using empty string to mean "default", but I guess such templating can also just skip the flag in such cases, so maybe it is fine to just leave as is.

vadasambar · 2023-05-18T04:24:52Z

cluster-autoscaler/simulator/predicatechecker/schedulerbased_test.go

-			scheduledPods: []*apiv1.Pod{p450},
-			testPod:       p500,
-			expectError:   false,
+			name:             "custom - other pod - ok",


Test cases for custom predicate checker are a copy of the default predicate checker 's test cases except we don't expect an error for any of the custom predicate checker test cases.

vadasambar · 2023-05-18T04:25:55Z

cluster-autoscaler/simulator/predicatechecker/schedulerbased_test.go

-	assert.True(t, nodeName == "n1000" || nodeName == "n2000")
+		// custom predicate checker test cases
+		{
+			name:             "custom - small pod - no error",


Test cases for custom predicate checker are a copy of the default predicate checker's test cases except we don't expect an error for any of the custom predicate checker test cases.

x13n · 2023-06-26T12:45:00Z

cluster-autoscaler/config/const.go

@@ -32,4 +36,53 @@ const (
 	DefaultScaleDownUnreadyTimeKey = "scaledownunreadytime"
 	// DefaultMaxNodeProvisionTimeKey identifies MaxNodeProvisionTime autoscaling option
 	DefaultMaxNodeProvisionTimeKey = "maxnodeprovisiontime"
+
+	// Custom scheduler configs for testing


I don't think this file is a good home for test data.

I guess putting test consts in const.go feels a little out of place. But I don't have a strong opinion on where this should go. I will take another look at this (P.S.: open to suggestions).

I have moved the test config here to a separate package under at cluster-autoscaler/config/test/config.go
This test config is used at two different places:

simulator/predicatechecker/schedulerbased_test.go

utils/scheduler/scheduler_test.go

Other place this can go into is the existing cluster-autoscaler/utils/test/test_utils.go. Since this is config for testing, putting it under a separate test package under cluster-autoscaler/config seems like a better place.

Let me know what you think.

x13n · 2023-06-26T13:00:31Z

cluster-autoscaler/utils/scheduler/scheduler.go

+
+	// this needs to be set explicitly because config's api version is empty after decoding
+	// check kubernetes/cmd/kube-scheduler/app/options/configfile.go for more info
+	cfgObj.TypeMeta.APIVersion = gvk.GroupVersion().String()


I'm a bit worried changes in scheduler will never happen if we merge here first. OTOH, waiting for k8s release is a bit too long, so I'd:

Send a PR to update scheduler code

Add TODO here to switch to the upstream function and merge as is

x13n · 2023-06-26T13:01:22Z

cluster-autoscaler/utils/scheduler/scheduler.go

@@ -106,3 +117,31 @@ func ResourceToResourceList(r *schedulerframework.Resource) apiv1.ResourceList {
 	}
 	return result
 }
+
+// GetSchedulerConfig loads scheduler config from a path.
+func GetSchedulerConfig(path string) (*scheduler_config.KubeSchedulerConfiguration, error) {


nit: ConfigFromPath? (So it becomes scheduler.ConfigFromPath(path) outside of this package.)

Thanks for the suggestion. I have changed the name to ConfigFromPath. scheduler.ConfigFromPath feels less vague than scheduler.GetSchedulerConfig.

vadasambar · 2023-07-04T07:50:17Z

@x13n I have addressed all the review comments. Waiting for your approval.

vadasambar · 2023-07-10T06:11:24Z

@x13n I have addressed all the review comments. Waiting for your approval.

@x13n waiting for your review.

x13n · 2023-07-10T08:21:49Z

Apologies for the delay. The change LGTM, but there's a ton of commits here. Can you squash them before merging?

vadasambar · 2023-07-11T05:42:42Z

Apologies for the delay. The change LGTM, but there's a ton of commits here. Can you squash them before merging?

@x13n I have squashed the commits.

…ithout extenders) Signed-off-by: vadasambar <surajrbanakar@gmail.com> refactor: rename `--scheduler-config` -> `--scheduler-config-file` to avoid confusion Signed-off-by: vadasambar <surajrbanakar@gmail.com> fix: `goto` causing infinite loop - abstract out running extenders in a separate function Signed-off-by: vadasambar <surajrbanakar@gmail.com> refactor: remove code around extenders - we decided not to use scheduler extenders for checking if a pod would fit on a node Signed-off-by: vadasambar <surajrbanakar@gmail.com> refactor: move scheduler config to a `utils/scheduler` package` - use default config as a fallback Signed-off-by: vadasambar <surajrbanakar@gmail.com> test: fix static_autoscaler test Signed-off-by: vadasambar <surajrbanakar@gmail.com> refactor: `GetSchedulerConfiguration` fn - remove falling back - add mechanism to detect if the scheduler config file flag was set - Signed-off-by: vadasambar <surajrbanakar@gmail.com> test: wip add tests for `GetSchedulerConfig` - tests are failing now Signed-off-by: vadasambar <surajrbanakar@gmail.com> test: add tests for `GetSchedulerConfig` - abstract error messages so that we can use them in the tests - set api version explicitly (this is what upstream does as well) Signed-off-by: vadasambar <surajrbanakar@gmail.com> refactor: do a round of cleanup to make PR ready for review - make import names consistent Signed-off-by: vadasambar <surajrbanakar@gmail.com> fix: use `pflag` to check if the `--scheduler-config-file` flag was set Signed-off-by: vadasambar <surajrbanakar@gmail.com> docs: add comments for exported error constants Signed-off-by: vadasambar <surajrbanakar@gmail.com> refactor: don't export error messages - exporting is not needed Signed-off-by: vadasambar <surajrbanakar@gmail.com> fix: add underscore in test file name Signed-off-by: vadasambar <surajrbanakar@gmail.com> test: fix test failing because of no comment on exported `SchedulerConfigFileFlag` Signed-off-by: vadasambar <surajrbanakar@gmail.com> refacotr: change name of flag variable `schedulerConfig` -> `schedulerConfigFile` - avoids confusion Signed-off-by: vadasambar <surajrbanakar@gmail.com> test: add extra test cases for predicate checker - where the predicate checker uses custom scheduler config Signed-off-by: vadasambar <surajrbanakar@gmail.com> refactor: remove `setFlags` variable - not needed anymore Signed-off-by: vadasambar <surajrbanakar@gmail.com> refactor: abstract custom scheduler configs into `conifg` package - make them constants Signed-off-by: vadasambar <surajrbanakar@gmail.com> test: fix linting error Signed-off-by: vadasambar <surajrbanakar@gmail.com> refactor: introduce a new custom test predicate checker - instead of adding a param to the current one - this is so that we don't have to pass `nil` to the existing test predicate checker in many places Signed-off-by: vadasambar <surajrbanakar@gmail.com> refactor: rename `NewCustomPredicateChecker` -> `NewTestPredicateCheckerWithCustomConfig` - latter narrows down meaning of the function better than former Signed-off-by: vadasambar <surajrbanakar@gmail.com> refactor: rename `GetSchedulerConfig` -> `ConfigFromPath` - `scheduler.ConfigFromPath` is shorter and feels less vague than `scheduler.GetSchedulerConfig` - move test config to a new package `test` under `config` package Signed-off-by: vadasambar <surajrbanakar@gmail.com> docs: add `TODO` for replacing code to parse scheduler config - with upstream function Signed-off-by: vadasambar <surajrbanakar@gmail.com>

vadasambar · 2023-07-17T04:17:34Z

Apologies for the delay. The change LGTM, but there's a ton of commits here. Can you squash them before merging?

@x13n I have squashed the commits.

@x13n waiting for your approval.

x13n · 2023-07-20T08:53:11Z

/lgtm
/approve

k8s-ci-robot · 2023-07-20T08:53:34Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vadasambar, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [x13n]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 25, 2023

k8s-ci-robot requested a review from BigDarkClown April 25, 2023 06:31

k8s-ci-robot added the area/cluster-autoscaler label Apr 25, 2023

k8s-ci-robot requested a review from feiskyer April 25, 2023 06:31

vadasambar mentioned this pull request Apr 25, 2023

Apr 2023 vadafoss/daily-updates#8

Closed

vadasambar commented Apr 28, 2023

View reviewed changes

vadasambar mentioned this pull request May 2, 2023

May 2023 vadafoss/daily-updates#9

Closed

x13n requested changes May 5, 2023

View reviewed changes

k8s-ci-robot assigned x13n May 5, 2023

vadasambar mentioned this pull request May 9, 2023

REQUEST: New membership for @vadasambar kubernetes/org#4203

Closed

9 tasks

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 10, 2023

x13n requested changes May 11, 2023

View reviewed changes

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 16, 2023

vadasambar commented May 17, 2023

View reviewed changes

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 17, 2023

vadasambar force-pushed the feat/5106/support-custom-scheduler-config-poc branch from 07313a7 to db8507e Compare May 17, 2023 10:29

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 17, 2023

vadasambar commented May 18, 2023

View reviewed changes

vadasambar marked this pull request as ready for review May 18, 2023 04:27

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 23, 2023

vadasambar requested a review from x13n May 24, 2023 05:21

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 17, 2023

x13n reviewed Jun 26, 2023

View reviewed changes

vadasambar mentioned this pull request Jun 28, 2023

Jun 2023 vadafoss/daily-updates#10

Closed

vadasambar force-pushed the feat/5106/support-custom-scheduler-config-poc branch from 08d8af8 to ea715bb Compare June 29, 2023 17:09

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 29, 2023

This was referenced Jul 4, 2023

Jul 2023 vadafoss/daily-updates#11

Closed

Export loadConfigFromFile as a public function LoadConfigFromFile kubernetes/kubernetes#119056

Closed

vadasambar requested a review from x13n July 4, 2023 07:50

vadasambar force-pushed the feat/5106/support-custom-scheduler-config-poc branch from 0966967 to aba1df8 Compare July 11, 2023 05:25

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 12, 2023

vadasambar force-pushed the feat/5106/support-custom-scheduler-config-poc branch from aba1df8 to 23f03e1 Compare July 13, 2023 04:21

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 13, 2023

alculquicondor mentioned this pull request Jul 14, 2023

refactor: export loadConfigFile as a public function LoadConfigFile kubernetes/kubernetes#119057

Merged

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 20, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 20, 2023

k8s-ci-robot merged commit 171021c into kubernetes:master Jul 20, 2023

		}

		if err := p.runExtendersForNode(pod, nodeInfo); err != nil {

feat: support custom scheduler config (without extenders) #5708

feat: support custom scheduler config (without extenders) #5708

Conversation

vadasambar commented Apr 25, 2023 • edited Loading

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

vadasambar Apr 28, 2023 • edited Loading

Choose a reason for hiding this comment

vadasambar commented May 2, 2023

vadasambar commented May 3, 2023

Testing for scheduler extenders

When nodes don't have extender='true' label

When nodes have extender: 'true' label

x13n left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MaciekPytel commented May 8, 2023

vadasambar commented May 9, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vadasambar May 22, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vadasambar May 17, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vadasambar May 18, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vadasambar commented Jul 4, 2023

vadasambar commented Jul 10, 2023

x13n commented Jul 10, 2023

vadasambar commented Jul 11, 2023

vadasambar commented Jul 17, 2023

x13n commented Jul 20, 2023

k8s-ci-robot commented Jul 20, 2023

vadasambar commented Apr 25, 2023 •

edited

Loading

vadasambar Apr 28, 2023 •

edited

Loading

When nodes don't have `extender='true'` label

When nodes have `extender: 'true'` label

vadasambar May 22, 2023 •

edited

Loading

vadasambar May 17, 2023 •

edited

Loading

vadasambar May 18, 2023 •

edited

Loading