scheduler: Use schedulable masters if no compute hosts defined. #2004
Conversation
/approve
I'm not sure where it should live, but I would also like to see an info-level alert configured which tells you, when this is enabled and you do have worker nodes, that security could be increased by unsetting it.
We should also fix the docs so UPI users stop explicitly setting replicas=0. As I understand it, if we left it unset it would harmlessly default to 3 (and this code would not trigger).
Compute zero is not a great indication that the control plane should be made schedulable; there are cases where that's not true.
The heuristic is too vague to turn on a potential security hole for users by default.
There is not a known security hole. This is making the installer do what it should: give users working clusters. And since this is a day-2 tunable, they can then tune their cluster however they want. But on day 1 you get something that works. That's the whole point of the installer.
But the user didn't ask for the control plane to be schedulable, only said that they will take care of the compute. And running workloads on masters is definitely a security hole (maybe not a CVE): master instances have elevated cloud API access.
Customers didn't ask to keep their workloads away from masters. That's just something we did, and something we are trying to unroll as quickly as we can. This gives people working clusters, and with proper alerting, no surprises.
The primary alternative I considered was adding an explicit config item under
Shouldn't this be specific to baremetal installations? It's true that on AWS and other cloud providers having workloads on master nodes is probably less ideal.
Yes, that's another option. We could make this
No. If a customer made a 3-node cluster out of m5.metal instances, it should "just work". Same on large VMware or OpenStack instances. A 3-node cluster should just work.
Yeah, I agree, but that's the only use case we have currently, isn't it? If someone says it's needed on another cloud provider, then perhaps we can think of adding another option to the install-config.
How does your comment address users (like UPI) who didn't want their masters to be schedulable but only wanted to create their own compute nodes? How do you differentiate them?
I think everyone agrees if the cluster size is 3 nodes (during and after the install). The ambiguity comes in situations where the installer is not sure of the user's intention: whether they're trying to do UPI or want to add workers later.
And people who add workers afterwards will get an alert and can then change the config. But they get a working cluster, which is the goal.
What alert are we talking about? An INFO that masters will be scheduled... that's not an alert.
But it's a working cluster at the expense of making their control plane schedulable, when they were completely ready and capable of bringing up the compute nodes themselves.
I've asked in #2004 (review) for an alert, not just an install-time message. I expect it'll live with the scheduling operator, since that's where the tunable lives. The point of the installer is to give people a working cluster. We fully support 3-node clusters and support people running workloads on masters. That is not an expense. It is the system doing what it should. @crawford @smarterclayton we already talked this through and both of you agreed. Would you mind commenting here as well?
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cgwalters, crawford, russellb, stbenjam The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
Please review the full test history for this PR and help us cut down flakes.
allErrs = append(allErrs, ValidateMachinePool(platform, &p, poolFldPath)...)
}
if !foundPositiveReplicas {
	logrus.Warnf("There are no compute nodes specified. The cluster will not fully initialize without compute nodes.")
Can we really remove this warning? Does this PR allow the ingress controller to be scheduled on control plane nodes (e.g. see here and kubernetes/kubernetes#65618)? Are we running CI on zero-compute clusters?
Yes, when you set schedulableMasters to true, the worker role gets added to the master nodes as well.
There's nothing in OpenShift CI for this yet.
On cloud platforms, while adding a worker role to masters will allow ingress controllers to be scheduled on the masters, the master nodes will remain excluded from load balancer target pools, which breaks the feature in practice.
As to whether Kube will support including masters in LB target pools, the jury is still out and there is still no KEP AFAICT.
kubernetes/kubernetes#65618
https://groups.google.com/d/msg/kubernetes-sig-network/erULVdDkLNw/hvCukZ3CBQAJ
@russellb, can you restore the warning?
Also, with a single compute node, we won't trigger this schedulable-control-plane logic, so won't ingress hang with "I can't get replicas up to two"?
The warning wouldn't be appropriate for a 3-node bare metal cluster. I guess it could be restored and only emitted if the platform type is not baremetal or none, because of the load balancer issue mentioned by @ironcladlou?
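For illustration only, a platform-gated version of the removed warning might look roughly like the sketch below; the platform-name handling and string constants here are assumptions, not the installer's actual API:

    // Sketch: only warn about zero compute replicas on platforms where a
    // compute-less, schedulable-masters install is not the expected topology.
    // platformName is assumed to be the platform's string name, e.g. "baremetal".
    if !foundPositiveReplicas {
        switch platformName {
        case "baremetal", "none":
            // 3-node and UPI-style installs: schedulable masters are intended here.
        default:
            logrus.Warnf("There are no compute nodes specified. The cluster will not fully initialize without compute nodes.")
        }
    }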
Prior to this change, our install config listed a number of workers. In practice though, we don't support deploying workers at install time. We require scripts post-install to set up the baremetal-operator before workers can be deployed. This change makes it so we explicitly install a 3 node cluster, and then scale it out as a post-install step if dev-scripts was configured to do so. A side effect of this change is that the installer will automatically adjust the scheduler configuration to make master nodes schedulable. For more details on that change, see: openshift/installer#2004 This also means that we can drop the custom IngressController manifest. When the cluster is configured with "mastersSchedulable: true" in the scheduler config, the master Nodes will have both the master and worker roles set on them, making them candidates for running the default Ingress controller, without any changes. Closes issue openshift-metal3#705
This is no longer required, as we install a 3 node cluster by default. The installer will automatically adjust the scheduler configuration to make master nodes schedulable. For more details on that change, see: openshift/installer#2004 Closes issue openshift-metal3#705
Has anyone worked out the commands to run to tune this on day 2? I didn't see them when rereading the PR thread.
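One likely day-2 knob, based on the mastersSchedulable field discussed elsewhere in this thread, is the cluster-scoped Scheduler config resource; something like the following should turn the behavior back off (verify the resource and field names against your cluster before relying on this):

    oc patch schedulers.config.openshift.io cluster --type=merge \
      -p '{"spec":{"mastersSchedulable":false}}'

Note that, as mentioned below, pod nodeSelectors only take effect at scheduling time, so workloads already running on the masters would still need to be evicted or rescheduled after flipping the flag.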
We grew this in c22d042 (docs/user/aws/install_upi: Add 'sed' call to zero compute replicas, 2019-05-02, openshift#1649) to set the stage for changing the 'replicas: 0' semantics from "we'll make you some dummy MachineSets" to "we won't make you MachineSets". But that hasn't happened yet, and since 64f96df (scheduler: Use schedulable masters if no compute hosts defined, 2019-07-16, openshift#2004) 'replicas: 0' for compute has also meant "add the 'worker' role to control-plane nodes". That leads to racy problems when ingress comes through a load balancer, because Kubernetes load balancers exclude control-plane nodes from their target set [1,2] (although this may get relaxed soonish [3]). If the router pods get scheduled on the control plane machines due to the 'worker' role, they are not reachable from the load balancer and ingress routing breaks [4]. Seth says: > pod nodeSelectors are not like taints/tolerations. They only have > effect at scheduling time. They are not continually enforced. which means that attempting to address this issue as a day-2 operation would mean removing the 'worker' role from the control-plane nodes and then manually evicting the router pods to force rescheduling. So until we get the changes from [3], it's easier to just drop this section and keep the 'worker' role off the control-plane machines entirely. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1671136#c1 [2]: kubernetes/kubernetes#65618 [3]: https://bugzilla.redhat.com/show_bug.cgi?id=1744370#c6 [4]: https://bugzilla.redhat.com/show_bug.cgi?id=1755073
We grew replicas-zeroing in c22d042 (docs/user/aws/install_upi: Add 'sed' call to zero compute replicas, 2019-05-02, openshift#1649) to set the stage for changing the 'replicas: 0' semantics from "we'll make you some dummy MachineSets" to "we won't make you MachineSets". But that hasn't happened yet, and since 64f96df (scheduler: Use schedulable masters if no compute hosts defined, 2019-07-16, openshift#2004) 'replicas: 0' for compute has also meant "add the 'worker' role to control-plane nodes". That leads to racy problems when ingress comes through a load balancer, because Kubernetes load balancers exclude control-plane nodes from their target set [1,2] (although this may get relaxed soonish [3]). If the router pods get scheduled on the control plane machines due to the 'worker' role, they are not reachable from the load balancer and ingress routing breaks [4]. Seth says: > pod nodeSelectors are not like taints/tolerations. They only have > effect at scheduling time. They are not continually enforced. which means that attempting to address this issue as a day-2 operation would mean removing the 'worker' role from the control-plane nodes and then manually evicting the router pods to force rescheduling. So until we get the changes from [3], we can either drop the zeroing [5] or adjust the scheduler configuration to remove the effect of the zeroing. In both cases, this is a change we'll want to revert later once we bump Kubernetes to pick up a fix for the service load-balancer targets. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1671136#c1 [2]: kubernetes/kubernetes#65618 [3]: https://bugzilla.redhat.com/show_bug.cgi?id=1744370#c6 [4]: https://bugzilla.redhat.com/show_bug.cgi?id=1755073 [5]: openshift#2402
The purpose of this change was to make masters able to run workloads by default. This is needed to complete a successful deployment of a 3-node bare metal install. This particular approach was only short term, while better interfaces were developed to control this behavior. The scheduler configuration resource now includes a "mastersSchedulable" boolean, enabled here: openshift#937 This installer PR made it the default behavior if no workers were defined at install time: openshift/installer#2004 With these changes in place, the custom kubelet config for the baremetal platform is no longer necessary.
When generating the Ignition files, the installer already sets schedulableMasters to true when there are no worker nodes (i.e. in the SNO and compact cluster topologies; see openshift/installer#2004). Therefore it is unnecessary to override it here (though it may be preferred to avoid a warning log from the installer). Since openshift/installer#6247, attempting to override the schedulableMasters setting causes installation to fail, because there are two manifests of the same type and name that conflict. Since we don't need to set this override when the installer would already do it, avoid doing so and triggering the error when the value is determined by the number of hosts rather than explicitly set by the user. The conflict still needs to be resolved so that the user can enable schedulableMasters, but this at least allows the SNO and compact topologies to install OpenShift 4.12 again. This partially reverts commit c45f369.
This change makes use of a new configuration item on the scheduler CR
that specifies that control plane hosts should be able to run
workloads. This option is off by default, but will now be turned on
if there are no compute machine pools with non-zero replicas defined.
This change also removes a validation and warning when no compute
hosts are defined, as an install with this configuration will now
complete successfully.
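As a rough illustration of the behavior described above (not the exact installer code), the decision reduces to: enable schedulable masters on the scheduler CR unless some compute machine pool asks for a non-zero replica count. The field and type names below are assumptions made for the sketch:

    // Sketch: derive the scheduler setting from the compute pools in the
    // install config. Replicas is assumed to be a *int64 pointer, as in the
    // installer's machine-pool types.
    mastersSchedulable := true
    for _, pool := range installConfig.Compute {
        if pool.Replicas != nil && *pool.Replicas > 0 {
            mastersSchedulable = false
            break
        }
    }
    schedulerConfig.Spec.MastersSchedulable = mastersSchedulable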