scheduler: Use schedulable masters if no compute hosts defined. #2004
Conversation
/approve
I'm not sure where it should live, but I would also like to see an info-level alert configured which tells you, when this is enabled and you do have worker nodes, that security could be increased by unsetting it.
We should also fix the docs so UPI users stop explicitly setting replicas=0. As I understand it, if we left it unset it would harmlessly default to 3 (and this code would not trigger).
Compute zero is not a great indication that the control plane should be made schedulable; there are cases where that's not true.
The heuristic is too vague to turn on a potential security hole for users by default.
There is not a known security hole. This is making the installer do what it should: give users working clusters. And since this is a day-2 tunable, they can then tune their cluster however they want. But on day 1 you get something that works. That's the whole point of the installer.
But the user didn't ask for the control plane to be schedulable, only said that they will take care of the compute. And running workloads on masters is definitely a security hole (maybe not a CVE): master instances have elevated cloud API access.
Customers didn't ask to keep their workloads away from masters. That's just something we did, and something we are trying to unroll as quickly as we can. This gives people working clusters, and with proper alerting, no surprises.
The primary alternative I considered was adding an explicit config item under
Shouldn't this be specific to baremetal installations? It's true that on AWS and other cloud providers having workloads on master nodes is probably less ideal.
Yes, that's another option. We could make this
No. If a customer made a 3-node cluster out of m5.metal instances, it should "just work". Same on large VMware or OpenStack instances. A 3-node cluster should just work.
Yeah, I agree, but that's the only use case we have currently, isn't it? If someone says it's needed on another cloud provider, then perhaps we can think of adding another option to the install-config.
How does your comment address users (like UPI) who didn't want their masters to be schedulable but only wanted to create their own compute nodes? How do you differentiate them?
I think everyone agrees if the cluster size is 3 nodes (during and after the install). The ambiguity comes in situations where the installer is not sure of the user's intention: whether they're trying to do UPI or want to add workers later.
And people who add workers afterwards will get an alert and can then change the config. But they get a working cluster, which is the goal.
What alert are we talking about? An INFO that masters will be scheduled... that's not an alert.
But it's a working cluster at the expense of making their control plane schedulable, when they were completely ready and capable of bringing up the compute nodes themselves.
I've asked in #2004 (review) for an alert, not just an install-time message. I expect it'll live with the scheduling operator, since that's where the tunable lives. The point of the installer is to give people a working cluster. We fully support 3-node clusters and support people running workloads on masters. That is not an expense. It is the system doing what it should. @crawford @smarterclayton we already talked this through and both of you agreed. Would you mind commenting here as well?
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cgwalters, crawford, russellb, stbenjam The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
Please review the full test history for this PR and help us cut down flakes.
allErrs = append(allErrs, ValidateMachinePool(platform, &p, poolFldPath)...)
}
if !foundPositiveReplicas {
	logrus.Warnf("There are no compute nodes specified. The cluster will not fully initialize without compute nodes.")
Can we really remove this warning? Does this PR allow the ingress controller to be scheduled on control plane nodes (e.g. see here and kubernetes/kubernetes#65618)? Are we running CI on zero-compute clusters?
Yes, when you set schedulableMasters to true, the worker role gets added to the master nodes as well.
There's nothing in OpenShift CI for this yet.
On cloud platforms, while adding a worker role to masters will allow ingress controllers to be scheduled on the masters, the master nodes will remain excluded from load balancer target pools, which breaks the feature in practice.
As to whether Kube will support including masters in LB target pools, the jury is still out and there is still no KEP AFAICT.
kubernetes/kubernetes#65618
https://groups.google.com/d/msg/kubernetes-sig-network/erULVdDkLNw/hvCukZ3CBQAJ
@russellb, can you restore the warning?
Also, with a single compute node, we won't trigger this schedulable-control-plane logic, so won't ingress hang with "I can't get replicas up to two"?
The warning wouldn't be appropriate for a 3-node bare metal cluster. I guess it could be restored and only emitted if the platform type is not baremetal or none, because of the load balancer issue mentioned by @ironcladlou?
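For illustration only, a platform-gated version of the removed warning might look roughly like the sketch below; the platform-name handling and string constants here are assumptions, not the installer's actual API:

    // Sketch: only warn about zero compute replicas on platforms where a
    // compute-less, schedulable-masters install is not the expected topology.
    // platformName is assumed to be the platform's string name, e.g. "baremetal".
    if !foundPositiveReplicas {
        switch platformName {
        case "baremetal", "none":
            // 3-node and UPI-style installs: schedulable masters are intended here.
        default:
            logrus.Warnf("There are no compute nodes specified. The cluster will not fully initialize without compute nodes.")
        }
    }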
Prior to this change, our install config listed a number of workers. In practice though, we don't support deploying workers at install time. We require scripts post-install to set up the baremetal-operator before workers can be deployed. This change makes it so we explicitly install a 3 node cluster, and then scale it out as a post-install step if dev-scripts was configured to do so. A side effect of this change is that the installer will automatically adjust the scheduler configuration to make master nodes schedulable. For more details on that change, see: openshift/installer#2004 This also means that we can drop the custom IngressController manifest. When the cluster is configured with "mastersSchedulable: true" in the scheduler config, the master Nodes will have both the master and worker roles set on them, making them candidates for running the default Ingress controller, without any changes. Closes issue openshift-metal3#705
This is no longer required, as we install a 3 node cluster by default. The installer will automatically adjust the scheduler configuration to make master nodes schedulable. For more details on that change, see: openshift/installer#2004 Closes issue openshift-metal3#705
Has anyone worked out the commands to run to tune this on day 2? I didn't see them when rereading the PR thread.
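One likely day-2 knob, based on the mastersSchedulable field discussed elsewhere in this thread, is the cluster-scoped Scheduler config resource; something like the following should turn the behavior back off (verify the resource and field names against your cluster before relying on this):

    oc patch schedulers.config.openshift.io cluster --type=merge \
      -p '{"spec":{"mastersSchedulable":false}}'

Note that, as mentioned below, pod nodeSelectors only take effect at scheduling time, so workloads already running on the masters would still need to be evicted or rescheduled after flipping the flag.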
We grew this in c22d042 (docs/user/aws/install_upi: Add 'sed' call to zero compute replicas, 2019-05-02, openshift#1649) to set the stage for changing the 'replicas: 0' semantics from "we'll make you some dummy MachineSets" to "we won't make you MachineSets". But that hasn't happened yet, and since 64f96df (scheduler: Use schedulable masters if no compute hosts defined, 2019-07-16, openshift#2004) 'replicas: 0' for compute has also meant "add the 'worker' role to control-plane nodes". That leads to racy problems when ingress comes through a load balancer, because Kubernetes load balancers exclude control-plane nodes from their target set [1,2] (although this may get relaxed soonish [3]). If the router pods get scheduled on the control plane machines due to the 'worker' role, they are not reachable from the load balancer and ingress routing breaks [4]. Seth says: > pod nodeSelectors are not like taints/tolerations. They only have > effect at scheduling time. They are not continually enforced. which means that attempting to address this issue as a day-2 operation would mean removing the 'worker' role from the control-plane nodes and then manually evicting the router pods to force rescheduling. So until we get the changes from [3], it's easier to just drop this section and keep the 'worker' role off the control-plane machines entirely. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1671136#c1 [2]: kubernetes/kubernetes#65618 [3]: https://bugzilla.redhat.com/show_bug.cgi?id=1744370#c6 [4]: https://bugzilla.redhat.com/show_bug.cgi?id=1755073
We grew replicas-zeroing in c22d042 (docs/user/aws/install_upi: Add 'sed' call to zero compute replicas, 2019-05-02, openshift#1649) to set the stage for changing the 'replicas: 0' semantics from "we'll make you some dummy MachineSets" to "we won't make you MachineSets". But that hasn't happened yet, and since 64f96df (scheduler: Use schedulable masters if no compute hosts defined, 2019-07-16, openshift#2004) 'replicas: 0' for compute has also meant "add the 'worker' role to control-plane nodes". That leads to racy problems when ingress comes through a load balancer, because Kubernetes load balancers exclude control-plane nodes from their target set [1,2] (although this may get relaxed soonish [3]). If the router pods get scheduled on the control plane machines due to the 'worker' role, they are not reachable from the load balancer and ingress routing breaks [4]. Seth says: > pod nodeSelectors are not like taints/tolerations. They only have > effect at scheduling time. They are not continually enforced. which means that attempting to address this issue as a day-2 operation would mean removing the 'worker' role from the control-plane nodes and then manually evicting the router pods to force rescheduling. So until we get the changes from [3], we can either drop the zeroing [5] or adjust the scheduler configuration to remove the effect of the zeroing. In both cases, this is a change we'll want to revert later once we bump Kubernetes to pick up a fix for the service load-balancer targets. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1671136#c1 [2]: kubernetes/kubernetes#65618 [3]: https://bugzilla.redhat.com/show_bug.cgi?id=1744370#c6 [4]: https://bugzilla.redhat.com/show_bug.cgi?id=1755073 [5]: openshift#2402
The purpose of this change was to make masters able to run workloads by default. This is needed to complete a successful deployment of a 3-node bare metal install. This particular approach was only short term, while better interfaces were developed to control this behavior. The scheduler configuration resource now includes a "mastersSchedulable" boolean, enabled here: openshift#937 This installer PR made it the default behavior if no workers were defined at install time: openshift/installer#2004 With these changes in place, the custom kubelet config for the baremetal platform is no longer necessary.
When generating the Ignition files, the installer already sets schedulableMasters to true when there are no worker nodes (i.e. in the SNO and compact cluster topologies; see openshift/installer#2004). Therefore it is unnecessary to override it here (though it may be preferred to avoid a warning log from the installer). Since openshift/installer#6247, attempting to override the schedulableMasters setting causes installation to fail, because there are two manifests of the same type and name that conflict. Since we don't need to set this override when the installer would already do it, avoid doing so and triggering the error when the value is determined by the number of hosts rather than explicitly set by the user. The conflict still needs to be resolved so that the user can enable schedulableMasters, but this at least allows the SNO and compact topologies to install OpenShift 4.12 again. This partially reverts commit c45f369.
This change makes use of a new configuration item on the scheduler CR
that specifies that control plane hosts should be able to run
workloads. This option is off by default, but will now be turned on
if there are no compute machine pools with non-zero replicas defined.
This change also removes a validation and warning when no compute
hosts are defined, as an install with this configuration will now
complete successfully.
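As a rough illustration of the behavior described above (not the exact installer code), the decision reduces to: enable schedulable masters on the scheduler CR unless some compute machine pool asks for a non-zero replica count. The field and type names below are assumptions made for the sketch:

    // Sketch: derive the scheduler setting from the compute pools in the
    // install config. Replicas is assumed to be a *int64 pointer, as in the
    // installer's machine-pool types.
    mastersSchedulable := true
    for _, pool := range installConfig.Compute {
        if pool.Replicas != nil && *pool.Replicas > 0 {
            mastersSchedulable = false
            break
        }
    }
    schedulerConfig.Spec.MastersSchedulable = mastersSchedulable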