
feat: support custom scheduler config (without extenders) #5708

Conversation


@vadasambar vadasambar commented Apr 25, 2023

Signed-off-by: vadasambar <surajrbanakar@gmail.com>

What type of PR is this?

/kind feature

What this PR does / why we need it:

Problem

  1. We don't have a mechanism to tweak the configuration of default plugins acting on the PreFilter and Filter extension points (which are used in the predicate checker).
  2. We don't have a way to simulate custom scheduler logic without recompiling the CA.

This PR introduces a --scheduler-config-file flag to

  1. tweak the configuration of default plugins
  2. add support for scheduler extenders so that the predicate checker can call an external filter URL. DECIDED AGAINST GOING FORWARD WITH EXTENDERS. See #5708 (comment) and #5708 (comment).

Which issue(s) this PR fixes:

Fixes #5106

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added: `--scheduler-config-file` flag which supports configuring `PreFilter` and `Filter` extension points for in-tree scheduler plugins used to run scheduler simulations in CA 
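For illustration, a config file passed via this flag might look like the sketch below. This is a hedged example, not taken from the PR: the disabled plugin and the ignored resource name are placeholders, and the API version has to match the scheduler code vendored into CA.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      filter:
        disabled:
          - name: TaintToleration      # example: ignore taints in CA's scheduling simulations
    pluginConfig:
      - name: NodeResourcesFit
        args:
          ignoredResources:
            - example.com/hypothetical-resource   # placeholder resource to exclude from fit checks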

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 25, 2023
return &schedconfig.KubeSchedulerConfiguration{}
}

obj, gvk, err := schedscheme.Codecs.UniversalDecoder().Decode(data, nil, nil)
@vadasambar vadasambar (Member Author) commented Apr 28, 2023

This line creates a default scheduler config if none is specified.
TODO: return the output of this call as the default. If it errors, fall back to the old behavior of getting a default config.

Skipping the TODO above leaves the scheduler config empty, which shows up in the error log like this:

$ ./cluster-autoscaler --kubeconfig=/home/suraj/.kube/config --namespace=default
I0428 19:48:18.545718 1865631 leaderelection.go:245] attempting to acquire leader lease default/cluster-autoscaler...
I0428 19:48:40.443736 1865631 leaderelection.go:255] successfully acquired lease default/cluster-autoscaler
E0428 19:48:40.444001 1865631 main.go:385] couldn't load scheduler config (falling back to default scheduler config): open : no such file or directory
F0428 19:48:40.446465 1865631 main.go:504] Failed to create autoscaler: unexpected scheduler config: expected one scheduler profile only (found 0 profiles)
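A minimal sketch of the TODO above (editor's illustration, not the PR's final code; schedconfig and schedscheme follow the PR's import aliases, and latest.Default() is the upstream helper that builds the default config):

import (
	"os"

	schedconfig "k8s.io/kubernetes/pkg/scheduler/apis/config"
	latest "k8s.io/kubernetes/pkg/scheduler/apis/config/latest"
	schedscheme "k8s.io/kubernetes/pkg/scheduler/apis/config/scheme"
)

// getSchedulerConfig decodes the file at path and falls back to the upstream
// default config when reading or decoding fails, so the config is never empty.
func getSchedulerConfig(path string) (*schedconfig.KubeSchedulerConfiguration, error) {
	if data, err := os.ReadFile(path); err == nil {
		if obj, _, decodeErr := schedscheme.Codecs.UniversalDecoder().Decode(data, nil, nil); decodeErr == nil {
			if cfg, ok := obj.(*schedconfig.KubeSchedulerConfiguration); ok {
				return cfg, nil
			}
		}
	}
	// Fall back to the default config, which contains exactly one profile.
	return latest.Default()
}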

@vadasambar (Member Author):

Deleted my comment around test results because the screenshots have sensitive info. Will post another comment with sensitive details redacted.

@vadasambar (Member Author):

Note that Go tests are probably failing because of #5708 (comment)

Testing for scheduler extenders

  1. Created a GKE cluster
  2. Deployed a custom scheduler called custom-scheduler in kube-system namespace with custom scheduler config.
Custom scheduler manifests
apiVersion: v1
kind: ServiceAccount
metadata:
  name: custom-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-scheduler-as-kube-scheduler
subjects:
- kind: ServiceAccount
  name: custom-scheduler
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin # only for testing
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-scheduler-config
  namespace: kube-system
data:
  custom-scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: custom-scheduler
    leaderElection:
      leaderElect: false
    # I want scheduler to call an external http extender to 
    # filter schedulable nodes
    extenders:
    - urlPrefix: http://sample-scheduler-extender-service.default
      filterVerb: apiv1/filter
      nodeCacheCapable: false
      ignorable: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: custom-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: custom-scheduler
      volumes:
        - name: config-volume
          configMap:
            name: custom-scheduler-config
      containers:
      - command:
        - /usr/local/bin/kube-scheduler
        - --config=/etc/kubernetes/custom-scheduler/custom-scheduler-config.yaml
        - --v=5
        image: k8s.gcr.io/kube-scheduler:v1.25.3
        volumeMounts:
          - name: config-volume
            mountPath: /etc/kubernetes/custom-scheduler
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
          initialDelaySeconds: 15
        name: kube-second-scheduler
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
        resources:
          requests:
            cpu: '0.1'
        securityContext:
          privileged: false

      hostNetwork: false
      hostPID: false

Based on https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/#define-a-kubernetes-deployment-for-the-scheduler

  3. Deployed a simple scheduler extender called sample-scheduler-extender
  4. Deployed a custom CA (built a custom image for this PR)
When nodes don't have the extender: 'true' label

The pod (which has spec.schedulerName: custom-scheduler) gets stuck in Pending and, as expected, CA does not scale up.

Pod manifest
apiVersion: v1
kind: Pod
metadata:
  name: annotation-second-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulerName: custom-scheduler
  containers:
  - name: pod-with-second-annotation-container
    image: registry.k8s.io/pause:2.0

Based on https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/#specify-schedulers-for-pods

redacted1

As expected, CA throws an error when it sees that the pod cannot be scheduled on the default nodepool (whose nodes don't have the extender: 'true' label).

When nodes have the extender: 'true' label

I created a new nodepool in GKE that brings up a node with the extender: 'true' label.

CA correctly thinks that the pod can be scheduled on the upcoming node.
redacted2


Node has the label
redacted3

Pod gets scheduled

@x13n x13n (Member) left a comment

/assign

@@ -368,6 +376,36 @@ func getKubeConfig() *rest.Config {
return kubeConfig
}

func getSchedulerConfig(path string) *schedconfig.KubeSchedulerConfiguration {
Member:

This function represents non-trivial logic which I think should live in a separate package, not main. (Same goes for a lot of the existing code, but the least we can do is not make things worse.)

Member Author:

I like the idea. Thanks for the feedback.

P.S.:
I created this PR as a PoC to evaluate the idea and get an OK from the community before proceeding (one of the items on my agenda for today's SIG meeting, i.e., 8th May 2023, since the previous 2 meetings didn't happen). I had planned on refactoring this function and the PR as a whole (+ tests) but stopped myself so that I don't end up spending too much time in case this feature doesn't sound like a good idea to the community.

func (p *SchedulerBasedPredicateChecker) runExtendersForNode(pod *apiv1.Pod, nodeInfo *schedulerframework.NodeInfo) *PredicateError {
for _, extender := range p.framework.Extenders() {
if extender.IsInterested(pod) {
filteredNodes, failedNodesMap, unresolvableNodesMap, err := extender.Filter(pod, []*apiv1.Node{nodeInfo.Node()})
Member:

nit: Looks like extender.Filter accepts a list of nodes - wouldn't it make more sense to run this once after PreFilter, but before Filter and pass all the nodes that came out of PreFilter?

Member Author:

Passing all nodes to extender.Filter at once so that we don't call it multiple times makes sense to me, but the same could be said of running the filter plugins (even though the cost of calling in-tree filter plugins per node isn't as high). I just took a safe bet here and went with the same approach as running the filter plugins (which might not be the best approach, as you rightly pointed out). I guess the question here would be: why didn't we do the same for running the filter plugins?

extender.Filter is called in the upstream scheduler code after all PreFilter and Filter plugins are called. I don't see much benefit in changing the order.

File: pkg/scheduler/schedule_one.go
385: // Filters the nodes to find the ones that fit the pod based on the framework
386: // filter plugins and filter extenders.
387: func (sched *Scheduler) findNodesThatFitPod(ctx context.Context, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod) ([]*v1.Node, framework.Diagnosis, error) {
... 
393: 	allNodes, err := sched.nodeInfoSnapshot.NodeInfos().List()
394: 	if err != nil {
395: 		return nil, diagnosis, err
396: 	}
397: 	// Run "prefilter" plugins.
398: 	preRes, s := fwk.RunPreFilterPlugins(ctx, state, pod)
399: 	...
413: 
414: 	...
438: 	feasibleNodes, err := sched.findNodesThatPassFilters(ctx, fwk, state, pod, diagnosis, nodes)
...
446: 
447: 	feasibleNodes, err = findNodesThatPassExtenders(sched.Extenders, pod, feasibleNodes, diagnosis.NodeToStatusMap)
448: 	if err != nil {
449: 		return nil, diagnosis, err
450: 	}

https://github.com/kubernetes/kubernetes/blob/8d18ae6fc245deb68534cab608056c5aae9160bd/pkg/scheduler/schedule_one.go#L385-L452

File: pkg/scheduler/schedule_one.go
576: func findNodesThatPassExtenders(extenders []framework.Extender, pod *v1.Pod, feasibleNodes []*v1.Node, statuses framework.NodeToStatusMap) ([]*v1.Node, error) {
577: 	// Extenders are called sequentially.
578: 	// Nodes in original feasibleNodes can be excluded in one extender, and pass on to the next
579: 	// extender in a decreasing manner.
...
587: 
588: 		// Status of failed nodes in failedAndUnresolvableMap will be added or overwritten in <statuses>,
589: 		// so that the scheduler framework can respect the UnschedulableAndUnresolvable status for
590: 		// particular nodes, and this may eventually improve preemption efficiency.
591: 		// Note: users are recommended to configure the extenders that may return UnschedulableAndUnresolvable
592: 		// status ahead of others.
593: 		feasibleList, failedMap, failedAndUnresolvableMap, err := extender.Filter(pod, feasibleNodes)
...

https://github.com/kubernetes/kubernetes/blob/8d18ae6fc245deb68534cab608056c5aae9160bd/pkg/scheduler/schedule_one.go#L576-L627

}

if err := p.runExtendersForNode(pod, nodeInfo); err != nil {
Member:

Can we not ignore err here? I think it has some useful information when not nil.

Member Author:

Fair enough. I will think of something for this. Thanks for the comment.

@MaciekPytel (Contributor):

Supporting configuration of existing PreFilters/Filters seems non-controversial to me. I'm worried about extenders though. The 2 specific problems I see are:

  1. What exactly is being sent to the HTTP extender? I suspect it's just node names, with the extender taking the actual Node and Pod objects from an informer / the api-server? If so, that doesn't really work with CA simulations, which operate on a hypothetical cluster state and not the real one (a CA simulation has in-memory template nodes, and even existing nodes can run an entirely different set of pods in the simulation than what is really scheduled on them). Unless we serialize the entire ClusterSnapshot and send it over to the extender, the extender is operating on wrong data and CA will make incorrect decisions as a result.
  2. Is there a separate HTTP request for each Filter run? If so, I don't think it will work on scales supported by CA (ex. simulating a scale-up of 100 nodes on 100 different nodegroups to make a scale-up decision).

The problem with those 2 issues is that they are subtle, and a simple enough test (an extender that only looks at the pending pod and is independent from any other pods in a small cluster) will work just fine. Even a more complex extender will not fail every time, and the probability of problems likely depends on the scale and complexity of the cluster (i.e., my tests may pass just fine and it will only blow up once I hit prod).
My personal opinion is that it's better to fail early and explicitly than to have features that work in some circumstances but fail unexpectedly. So unless we have a way of addressing my 2 concerns above, I'd be against including the extender support.

@vadasambar (Member Author):

Supporting configuration of existing PreFilters/Filters seems non-controversial to me. I'm worried about extenders though. The 2 specific problems I see are:

  1. What exactly is being sent to the HTTP extender? I suspect it's just node names, with the extender taking the actual Node and Pod objects from an informer / the api-server? If so, that doesn't really work with CA simulations, which operate on a hypothetical cluster state and not the real one (a CA simulation has in-memory template nodes, and even existing nodes can run an entirely different set of pods in the simulation than what is really scheduled on them). Unless we serialize the entire ClusterSnapshot and send it over to the extender, the extender is operating on wrong data and CA will make incorrect decisions as a result.

  2. Is there a separate HTTP request for each Filter run? If so, I don't think it will work on scales supported by CA (ex. simulating a scale-up of 100 nodes on 100 different nodegroups to make a scale-up decision).

The problem with those 2 issues is that they are subtle, and a simple enough test (an extender that only looks at the pending pod and is independent from any other pods in a small cluster) will work just fine. Even a more complex extender will not fail every time, and the probability of problems likely depends on the scale and complexity of the cluster (i.e., my tests may pass just fine and it will only blow up once I hit prod). My personal opinion is that it's better to fail early and explicitly than to have features that work in some circumstances but fail unexpectedly. So unless we have a way of addressing my 2 concerns above, I'd be against including the extender support.

Thank you for the comment.

  1. This is configurable.
  2. If you mean is there a separate HTTP request for every time we want to check a pod against Filter predicates in our CA code, then yes.

I had a discussion around point 2 with @mwielgus in yesterday's SIG meeting (8th May 2023) and he had similar concerns (summary of the discussion). There are concerns around the performance of the HTTP extender in the scheduler as well (ref).
I was hoping we could provide this as an optional feature for small to medium clusters, since we don't support any kind of customization for scheduler simulation at the moment (and maybe ask for gRPC support for extenders?).

That being said,

The problem with those 2 issues is that they are subtle, and a simple enough test (an extender that only looks at the pending pod and is independent from any other pods in a small cluster) will work just fine. Even a more complex extender will not fail every time, and the probability of problems likely depends on the scale and complexity of the cluster (i.e., my tests may pass just fine and it will only blow up once I hit prod).
My personal opinion is that it's better to fail early and explicitly than to have features that work in some circumstances but fail unexpectedly. So unless we have a way of addressing my 2 concerns above, I'd be against including the extender support.

I understand your concern. Personally, I think we could add checks to disable the extender when the cluster grows too big, but there are problems with that approach too: it would surprise our users (e.g., "I expect the extender to be working but it has suddenly stopped"), and the threshold would depend on the HTTP extender being used (some might degrade in performance at 20 nodes while others might degrade at 80 nodes).

I will drop the idea of supporting external extenders and proceed with supporting custom scheduler config for in-tree scheduler plugins.

Thanks for taking the time to write all of this. It really helps to understand the why of things.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 10, 2023
@@ -27,7 +27,7 @@ import (
kube_client "k8s.io/client-go/kubernetes"
v1listers "k8s.io/client-go/listers/core/v1"
klog "k8s.io/klog/v2"
scheduler_config "k8s.io/kubernetes/pkg/scheduler/apis/config/latest"
config "k8s.io/kubernetes/pkg/scheduler/apis/config"
Member:

nit: This alias is a no-op, the package is called config already.

@@ -132,6 +133,7 @@ var (
"for scale down when some candidates from previous iteration are no longer valid."+
"When calculating the pool size for additional candidates we take"+
"max(#nodes * scale-down-candidates-pool-ratio, scale-down-candidates-pool-min-count).")
schedulerConfig = flag.String("scheduler-config-file", "", "scheduler-config allows adding extenders and changing configuration of default scheduler plugins acting on PreFilter and Filter extension points")
Member:

I think the comment is not accurate anymore as extenders won't be used.

// GetSchedulerConfig loads scheduler config from a path
// (falls back to default config in case of error)
func GetSchedulerConfig(path string) (*schedconfig.KubeSchedulerConfiguration, error) {
defaultFallbackMsg := "(falling back to default scheduler config)"
Member:

Is the fallback really the desired behavior here? This can hide some errors - e.g. a user making a typo in the config path will expect the custom config to be used, but CA, instead of crashing right away, will silently fall back to the default configuration.

Member Author:

Using the default config instead of failing fast might surprise our users. I will change this. Thanks for the suggestion.
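A sketch of the fail-fast behavior being agreed on here (illustrative only; the flag variable and helper names follow the PR as it stood at this point):

// In main: abort startup instead of silently continuing with the default
// config when the user-provided scheduler config cannot be loaded.
cfg, err := scheduler.GetSchedulerConfig(*schedulerConfig)
if err != nil {
	klog.Fatalf("Failed to load scheduler config: %v", err)
}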


// GetSchedulerConfig loads scheduler config from a path
// (falls back to default config in case of error)
func GetSchedulerConfig(path string) (*schedconfig.KubeSchedulerConfiguration, error) {
Member:

This function is actually the main code change in this PR now. Maybe worth adding some unit tests?

Member Author:

Sorry, I should have added a comment after #5708 (comment). Since we have moved on from the PoC phase, I was planning to do another pass on the PR (cleanup, add unit tests, etc.) and then change the PR from draft to Ready for Review. Maybe I can mention you again here once I am done?

Member:

Yup, sounds good!

clientsetfake "k8s.io/client-go/kubernetes/fake"
)

// NewTestPredicateChecker builds test version of PredicateChecker.
func NewTestPredicateChecker() (PredicateChecker, error) {

cfg, err := scheduler.GetSchedulerConfig("")
Member:

nit: Why not use the default config directly?

Member Author:

Let me think about this 🙏

Member Author:

This has been updated to use the default scheduler config in case no scheduler config is passed through the function parameter: https://github.com/kubernetes/autoscaler/pull/5708/files#diff-5818325464670a5cd54280fe253b0841a163da5de2764491a3065805785ffb1bR51-R56

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 16, 2023

// this needs to be set explicitly because config's api version is empty after decoding
// check kubernetes/cmd/kube-scheduler/app/options/configfile.go for more info
cfgObj.TypeMeta.APIVersion = gvk.GroupVersion().String()
Member Author:

This function is based on the upstream scheduler's private function loadConfigFromFile:

File: cmd/kube-scheduler/app/options/configfile.go
34: func loadConfigFromFile(logger klog.Logger, file string) (*config.KubeSchedulerConfiguration, error) {
35: 	data, err := os.ReadFile(file)
36: 	if err != nil {
37: 		return nil, err
38: 	}
39: 
40: 	return loadConfig(logger, data)
41: }
42: 
43: func loadConfig(logger klog.Logger, data []byte) (*config.KubeSchedulerConfiguration, error) {
44: 	// The UniversalDecoder runs defaulting and returns the internal type by default.
45: 	obj, gvk, err := scheme.Codecs.UniversalDecoder().Decode(data, nil, nil)
46: 	if err != nil {
47: 		return nil, err
48: 	}
49: 	if cfgObj, ok := obj.(*config.KubeSchedulerConfiguration); ok {
50: 		// We don't set this field in pkg/scheduler/apis/config/{version}/conversion.go
51: 		// because the field will be cleared later by API machinery during
52: 		// conversion. See KubeSchedulerConfiguration internal type definition for
53: 		// more details.
54: 		cfgObj.TypeMeta.APIVersion = gvk.GroupVersion().String()
55: 		switch cfgObj.TypeMeta.APIVersion {
56: 		case configv1beta2.SchemeGroupVersion.String():
57: 			logger.Info("KubeSchedulerConfiguration v1beta2 is deprecated in v1.25, will be removed in v1.28")
58: 		case configv1beta3.SchemeGroupVersion.String():
59: 			logger.Info("KubeSchedulerConfiguration v1beta3 is deprecated in v1.26, will be removed in v1.29")
60: 		}
61: 		return cfgObj, nil
62: 	}
63: 	return nil, fmt.Errorf("couldn't decode as KubeSchedulerConfiguration, got %s: ", gvk)
64: }

https://github.com/kubernetes/kubernetes/blob/6b34fafdaf5998039c7e01fa33920a641b216d3e/cmd/kube-scheduler/app/options/configfile.go#L34-L64

Note that I haven't included the deprecated-version log messages because they might change and we'd have to remember to update them every time. I don't see a lot of value in printing these messages because we'd have to support the scheduler config even if it is deprecated anyway (open to other opinions here).

Note that if the user uses an old scheduler API version which isn't supported anymore, say v1alpha1, UniversalDecoder().Decode throws an error like this:

panic: no kind "KubeSchedulerConfiguration" is registered for version "kubescheduler.config.k8s.io/v1alpha1" in scheme "pkg/runtime/scheme.go:100"

Member:

Could we make this function public in the scheduler codebase instead?

@vadasambar vadasambar (Member Author) commented May 22, 2023

Thanks for the comment. I thought about this, but I was also on the fence because the scheduler has a similar function to write the config to a file:

File: cmd/kube-scheduler/app/options/configfile.go
091: // LogOrWriteConfig logs the completed component config and writes it into the given file name as YAML, if either is enabled
092: func LogOrWriteConfig(logger klog.Logger, fileName string, cfg *config.KubeSchedulerConfiguration, completedProfiles []config.KubeSchedulerProfile) error {
093: 	loggerV := logger.V(2)
094: 	if !loggerV.Enabled() && len(fileName) == 0 {
095: 		return nil
096: 	}
097: 	cfg.Profiles = completedProfiles
098: 
099: 	buf, err := encodeConfig(cfg)
100: 	if err != nil {
101: 		return err
102: 	}
103: 
104: 	if loggerV.Enabled() {
105: 		loggerV.Info("Using component config", "config", buf.String())
106: 	}
107: 
108: 	if len(fileName) > 0 {
109: 		configFile, err := os.Create(fileName)
110: 		if err != nil {
111: 			return err
112: 		}
113: 		defer configFile.Close()
114: 		if _, err := io.Copy(configFile, buf); err != nil {
115: 			return err
116: 		}
117: 		logger.Info("Wrote configuration", "file", fileName)
118: 		os.Exit(0)
119: 	}
120: 	return nil
121: }

https://github.com/kubernetes/kubernetes/blob/6b34fafdaf5998039c7e01fa33920a641b216d3e/cmd/kube-scheduler/app/options/configfile.go#L91-L121

Notice how it calls os.Exit(0) on line 118. If you import and use this function in your code, the program is going to exit when you call it. My concern is that if we don't implement the config-file reading function on our own, we make ourselves dependent on the implementation of LoadConfigFromFile (if loadConfigFromFile is converted to a public function), whose implementation is driven by the scheduler's use-case and not the CA's. If LoadConfigFromFile changes in the future, it will break things for us. That being said, the same can be argued for the Filter and PostFilter extension points (public functions which the scheduler framework exposes), but I think there's an important difference between Filter/PostFilter and LoadConfigFromFile or LogOrWriteConfig: the former is similar to a public API and is unlikely to change without a strong deprecation policy or notice to users, but that might not be the case with the latter.

If we decide to go with LoadConfigFromFile way, we have 2 possible options:

Option 1:

  • Get this PR into the CA codebase
  • Talk with sig-scheduling about setting expectations around LoadConfigFromFile
  • Create another PR to replace our implementation with sig-scheduling's implementation (might have to wait until they release 1.28 so that we can import their code, OR ask for a patch to 1.27)

Option 2:

  • Put this PR on hold
  • Talk with sig-scheduling about setting expectations around LoadConfigFromFile
  • Create another PR to replace our implementation with sig-scheduling's implementation (might have to wait until they release 1.28 so that we can import their code, or ask for a patch to 1.27)

If we decide to ask for a public implementation of LoadConfigFromFile and we want to go with Option 1, I will create a separate issue to coordinate with sig-scheduling and create another PR to use the public implementation. If we go with Option 2, we will have to put this PR on hold until we get LoadConfigFromFile.

Let me know what you think.

Member:

I'm a bit worried that changes in the scheduler will never happen if we merge here first. OTOH, waiting for a k8s release is a bit too long, so I'd:

  1. Send a PR to update scheduler code
  2. Add TODO here to switch to the upstream function and merge as is

Member Author:

PR to update scheduler code: kubernetes/kubernetes#119057

Member Author:

Added TODO here

// Note that even if we are passing minimal config like below
// `GetSchedulerConfig` will set the rest of the default fields
// on its own (including default profile and default plugins)
correctConfigFile := filepath.Join(tmpDir, "correct_config.yaml")
@vadasambar vadasambar (Member Author) commented May 17, 2023

Here's a simple program to test this. Just put the scheduler config in test-config.yaml:

package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v2"
	config "k8s.io/kubernetes/pkg/scheduler/apis/config"
	"k8s.io/kubernetes/pkg/scheduler/apis/config/scheme"
	"k8s.io/kubernetes/pkg/scheduler/apis/config/validation"
)

func main() {
	file := "test-config.yaml"
	data, err := os.ReadFile(file)
	if err != nil {
		panic(err)
	}

	obj, gvk, err := scheme.Codecs.UniversalDecoder().Decode(data, nil, nil)
	if err != nil {
		panic(err)
	}
	cfgObj, ok := obj.(*config.KubeSchedulerConfiguration)
	if !ok {
		panic(fmt.Errorf("couldn't decode as KubeSchedulerConfiguration, got %s: ", gvk))
	}

	cfgObj.TypeMeta.APIVersion = gvk.GroupVersion().String()

	if err := validation.ValidateKubeSchedulerConfiguration(cfgObj); err != nil {
		panic(err)
	}

	data, err = yaml.Marshal(cfgObj)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))

}

test-config.yaml

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
Output of the program:
$ go run main.go
typemeta:
  kind: ""
  apiversion: kubescheduler.config.k8s.io/v1
parallelism: 16
leaderelection:
  leaderelect: true
  leaseduration:
    duration: 15s
  renewdeadline:
    duration: 10s
  retryperiod:
    duration: 2s
  resourcelock: leases
  resourcename: kube-scheduler
  resourcenamespace: kube-system
clientconnection:
  kubeconfig: ""
  acceptcontenttypes: ""
  contenttype: application/vnd.kubernetes.protobuf
  qps: 50
  burst: 100
healthzbindaddress: ""
metricsbindaddress: ""
debuggingconfiguration:
  enableprofiling: true
  enablecontentionprofiling: true
percentageofnodestoscore: 0
podinitialbackoffseconds: 1
podmaxbackoffseconds: 10
profiles:
- schedulername: default-scheduler
  percentageofnodestoscore: null
  plugins:
    preenqueue:
      enabled: []
      disabled: []
    queuesort:
      enabled: []
      disabled: []
    prefilter:
      enabled: []
      disabled: []
    filter:
      enabled: []
      disabled: []
    postfilter:
      enabled: []
      disabled: []
    prescore:
      enabled: []
      disabled: []
    score:
      enabled: []
      disabled: []
    reserve:
      enabled: []
      disabled: []
    permit:
      enabled: []
      disabled: []
    prebind:
      enabled: []
      disabled: []
    bind:
      enabled: []
      disabled: []
    postbind:
      enabled: []
      disabled: []
    multipoint:
      enabled:
      - name: PrioritySort
        weight: 0
      - name: NodeUnschedulable
        weight: 0
      - name: NodeName
        weight: 0
      - name: TaintToleration
        weight: 3
      - name: NodeAffinity
        weight: 2
      - name: NodePorts
        weight: 0
      - name: NodeResourcesFit
        weight: 1
      - name: VolumeRestrictions
        weight: 0
      - name: EBSLimits
        weight: 0
      - name: GCEPDLimits
        weight: 0
      - name: NodeVolumeLimits
        weight: 0
      - name: AzureDiskLimits
        weight: 0
      - name: VolumeBinding
        weight: 0
      - name: VolumeZone
        weight: 0
      - name: PodTopologySpread
        weight: 2
      - name: InterPodAffinity
        weight: 2
      - name: DefaultPreemption
        weight: 0
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: ImageLocality
        weight: 1
      - name: DefaultBinder
        weight: 0
      disabled: []
  pluginconfig:
  - name: DefaultPreemption
    args:
      typemeta:
        kind: ""
        apiversion: ""
      mincandidatenodespercentage: 10
      mincandidatenodesabsolute: 100
  - name: InterPodAffinity
    args:
      typemeta:
        kind: ""
        apiversion: ""
      hardpodaffinityweight: 1
      ignorepreferredtermsofexistingpods: false
  - name: NodeAffinity
    args:
      typemeta:
        kind: ""
        apiversion: ""
      addedaffinity: null
  - name: NodeResourcesBalancedAllocation
    args:
      typemeta:
        kind: ""
        apiversion: ""
      resources:
      - name: cpu
        weight: 1
      - name: memory
        weight: 1
  - name: NodeResourcesFit
    args:
      typemeta:
        kind: ""
        apiversion: ""
      ignoredresources: []
      ignoredresourcegroups: []
      scoringstrategy:
        type: LeastAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
        requestedtocapacityratio: null
  - name: PodTopologySpread
    args:
      typemeta:
        kind: ""
        apiversion: ""
      defaultconstraints: []
      defaultingtype: System
  - name: VolumeBinding
    args:
      typemeta:
        kind: ""
        apiversion: ""
      bindtimeoutseconds: 600
      shape: []
extenders: []

Note that marshalling directly to YAML is not the right way to serialize the scheduler config, but I have used it here for the sake of simplicity.
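For reference, a hedged sketch of serializing through the scheduler's registered codecs instead (this mirrors the pattern upstream's encodeConfig uses; the package paths come from the vendored scheduler and may drift between versions):

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	configv1 "k8s.io/kube-scheduler/config/v1"
	config "k8s.io/kubernetes/pkg/scheduler/apis/config"
	"k8s.io/kubernetes/pkg/scheduler/apis/config/scheme"
)

// encodeConfigYAML converts the internal config to the external v1 API
// version and serializes it with the registered YAML serializer.
func encodeConfigYAML(cfg *config.KubeSchedulerConfiguration) (string, error) {
	info, ok := runtime.SerializerInfoForMediaType(scheme.Codecs.SupportedMediaTypes(), runtime.ContentTypeYAML)
	if !ok {
		return "", fmt.Errorf("no YAML serializer registered")
	}
	encoder := scheme.Codecs.EncoderForVersion(info.Serializer, configv1.SchemeGroupVersion)
	data, err := runtime.Encode(encoder, cfg)
	if err != nil {
		return "", err
	}
	return string(data), nil
}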

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 17, 2023
@vadasambar vadasambar force-pushed the feat/5106/support-custom-scheduler-config-poc branch from 07313a7 to db8507e Compare May 17, 2023 10:29
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 17, 2023

var parsedSchedConfig *scheduler_config.KubeSchedulerConfiguration
// if scheduler config flag was set by the user
if pflag.CommandLine.Changed(config.SchedulerConfigFileFlag) {
Member Author:

This is so that if the --scheduler-config-file flag is not used, we don't try to look for the scheduler config at the default path set by the flag, i.e., "". In that case we want to use the default scheduler config, and setting the parsed scheduler config to nil ensures that the default is used.
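To spell out the difference (editor's note; the values are what pflag reports in each case):

// pflag.CommandLine.Changed reports whether the flag was explicitly set,
// which a plain value comparison cannot distinguish:
//
//   flag omitted                => Changed("scheduler-config-file") == false, value == ""
//   --scheduler-config-file=""  => Changed("scheduler-config-file") == true,  value == ""
//
// So parsedSchedConfig stays nil (and the default scheduler config is used)
// only when the flag was genuinely not set.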

Member:

Why is it better than just comparing with ""? If someone specifies an empty value for the flag, shouldn't they get the default scheduler config, rather than an error from trying to open a filename of empty string?

Member Author:

I think there are two lines of thought here:

  1. User has set --scheduler-config-file="" (note that this is a file path) which means they have made a mistake. Hence, fail.
  2. User has set --scheduler-config-file="" which means they don't want to set a scheduler config by themselves. Fall back to using a default one.

Falling back to the default in case of 2 doesn't seem as valuable to me: if the user wants the default config, they can just skip the flag altogether, since the flag specifies the path to a config file (and not the scheduler config itself). If they have set it to "", I think it's more likely that they have made a mistake (which can happen if, say, this value comes from a Helm chart file and is set to empty there by mistake).

A counter-argument could be: if the user sets only one or two fields in the custom scheduler config, we fill in the rest of the required fields by default anyway (in this PR). It can be argued that 2 is in line with this behavior.

I think failing here makes more sense than using the default config, but I am open to changing it if you think there's a strong reason not to.

Member:

I was thinking about someone templating the manifest and using the empty string to mean "default", but I guess such templating can also just skip the flag in those cases, so maybe it is fine to leave this as is.

scheduledPods: []*apiv1.Pod{p450},
testPod: p500,
expectError: false,
name: "custom - other pod - ok",
@vadasambar vadasambar (Member Author) commented May 18, 2023

Test cases for the custom predicate checker are a copy of the default predicate checker's test cases, except we don't expect an error for any of the custom predicate checker test cases.

assert.True(t, nodeName == "n1000" || nodeName == "n2000")
// custom predicate checker test cases
{
name: "custom - small pod - no error",
Member Author:

Test cases for the custom predicate checker are a copy of the default predicate checker's test cases, except we don't expect an error for any of the custom predicate checker test cases.

@vadasambar vadasambar marked this pull request as ready for review May 18, 2023 04:27
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 23, 2023
@vadasambar vadasambar requested a review from x13n May 24, 2023 05:21
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 17, 2023
@@ -32,4 +36,53 @@ const (
DefaultScaleDownUnreadyTimeKey = "scaledownunreadytime"
// DefaultMaxNodeProvisionTimeKey identifies MaxNodeProvisionTime autoscaling option
DefaultMaxNodeProvisionTimeKey = "maxnodeprovisiontime"

// Custom scheduler configs for testing
Member:

I don't think this file is a good home for test data.

Member Author:

I guess putting test consts in const.go feels a little out of place, but I don't have a strong opinion on where this should go. I will take another look at this (P.S.: open to suggestions).

Member Author:

I have moved the test config to a separate package at cluster-autoscaler/config/test/config.go.
This test config is used in two different places:

  1. simulator/predicatechecker/schedulerbased_test.go
  2. utils/scheduler/scheduler_test.go

The other place this could go is the existing cluster-autoscaler/utils/test/test_utils.go. Since this is config for testing, putting it under a separate test package under cluster-autoscaler/config seems like a better place.

Let me know what you think.


@@ -106,3 +117,31 @@ func ResourceToResourceList(r *schedulerframework.Resource) apiv1.ResourceList {
}
return result
}

// GetSchedulerConfig loads scheduler config from a path.
func GetSchedulerConfig(path string) (*scheduler_config.KubeSchedulerConfiguration, error) {
Member:

nit: ConfigFromPath? (So it becomes scheduler.ConfigFromPath(path) outside of this package.)

Member Author:

Thanks for the suggestion. I have changed the name to ConfigFromPath. scheduler.ConfigFromPath feels less vague than scheduler.GetSchedulerConfig.

@vadasambar vadasambar force-pushed the feat/5106/support-custom-scheduler-config-poc branch from 08d8af8 to ea715bb Compare June 29, 2023 17:09
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 29, 2023
@vadasambar (Member Author):

@x13n I have addressed all the review comments. Waiting for your approval.

@vadasambar vadasambar requested a review from x13n July 4, 2023 07:50
@vadasambar (Member Author):

@x13n I have addressed all the review comments. Waiting for your approval.

@x13n waiting for your review.

@x13n x13n (Member) commented Jul 10, 2023

Apologies for the delay. The change LGTM, but there's a ton of commits here. Can you squash them before merging?

@vadasambar vadasambar force-pushed the feat/5106/support-custom-scheduler-config-poc branch from 0966967 to aba1df8 Compare July 11, 2023 05:25
@vadasambar (Member Author):

Apologies for the delay. The change LGTM, but there's a ton of commits here. Can you squash them before merging?

@x13n I have squashed the commits.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 12, 2023
feat: support custom scheduler config (without extenders)

Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `--scheduler-config` -> `--scheduler-config-file` to avoid confusion
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: `goto` causing infinite loop
- abstract out running extenders in a separate function
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove code around extenders
- we decided not to use scheduler extenders for checking if a pod would fit on a node
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: move scheduler config to a `utils/scheduler` package
- use default config as a fallback
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix static_autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: `GetSchedulerConfiguration` fn
- remove falling back
- add mechanism to detect if the scheduler config file flag was set
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip add tests for `GetSchedulerConfig`
- tests are failing now
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add tests for `GetSchedulerConfig`
- abstract error messages so that we can use them in the tests
- set api version explicitly (this is what upstream does as well)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: do a round of cleanup to make PR ready for review
- make import names consistent
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: use `pflag` to check if the `--scheduler-config-file` flag was set
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add comments for exported error constants
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't export error messages
- exporting is not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: add underscore in test file name
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix test failing because of no comment on exported `SchedulerConfigFileFlag`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change name of flag variable `schedulerConfig` -> `schedulerConfigFile`
- avoids confusion
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add extra test cases for predicate checker
- where the predicate checker uses custom scheduler config
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `setFlags` variable
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: abstract custom scheduler configs into `config` package
- make them constants
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix linting error
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: introduce a new custom test predicate checker
- instead of adding a param to the current one
- this is so that we don't have to pass `nil` to the existing test predicate checker in many places
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `NewCustomPredicateChecker` -> `NewTestPredicateCheckerWithCustomConfig`
- latter narrows down meaning of the function better than former
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `GetSchedulerConfig` -> `ConfigFromPath`
- `scheduler.ConfigFromPath` is shorter and feels less vague than `scheduler.GetSchedulerConfig`
- move test config to a new package `test` under `config` package
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add `TODO` for replacing code to parse scheduler config
- with upstream function
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@vadasambar vadasambar force-pushed the feat/5106/support-custom-scheduler-config-poc branch from aba1df8 to 23f03e1 Compare July 13, 2023 04:21
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 13, 2023
@vadasambar (Member Author):

Apologies for the delay. The change LGTM, but there's a ton of commits here. Can you squash them before merging?

@x13n I have squashed the commits.

@x13n waiting for your approval.

@x13n x13n (Member) commented Jul 20, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 20, 2023
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vadasambar, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 20, 2023
@k8s-ci-robot k8s-ci-robot merged commit 171021c into kubernetes:master Jul 20, 2023
Successfully merging this pull request may close these issues.

Cluster autoscaler to support custom schedulers' Filter logic