
Bug 1755073: docs/user/*/install_upi: Drop compute replicas zeroing #2402

Closed · wants to merge 1 commit

Conversation

wking (Member) commented Sep 24, 2019

We grew this in c22d042 (docs/user/aws/install_upi: Add 'sed' call
to zero compute replicas, 2019-05-02, openshift#1649) to set the stage for
changing the 'replicas: 0' semantics from "we'll make you some dummy
MachineSets" to "we won't make you MachineSets".  But that hasn't
happened yet, and since 64f96df (scheduler: Use schedulable masters
if no compute hosts defined, 2019-07-16, openshift#2004) 'replicas: 0' for
compute has also meant "add the 'worker' role to control-plane nodes".
That leads to racy problems when ingress comes through a load
balancer, because Kubernetes load balancers exclude control-plane
nodes from their target set [1,2] (although this may get relaxed
soonish [3]).  If the router pods get scheduled on the control plane
machines due to the 'worker' role, they are not reachable from the
load balancer and ingress routing breaks [4].  Seth says:

> pod nodeSelectors are not like taints/tolerations.  They only have
> effect at scheduling time.  They are not continually enforced.

which means that attempting to address this issue as a day-2 operation
would mean removing the 'worker' role from the control-plane nodes and
then manually evicting the router pods to force rescheduling.  So
until we get the changes from [3], it's easier to just drop this
section and keep the 'worker' role off the control-plane machines
entirely.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1671136#c1
[2]: kubernetes/kubernetes#65618
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1744370#c6
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1755073
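
For concreteness, the zeroing step being dropped, and the day-2 cleanup it can force, look roughly like the following sketch. The exact sed expression from c22d042, the placeholder node name, and the router-pod label selector are assumptions, and GNU sed's `0,/addr/` address form is assumed.

```sh
# The step this PR drops from the UPI docs: zero the compute pool in
# install-config.yaml.  'compute' sorts before 'controlPlane' in the
# generated YAML, so the first 'replicas:' match is the compute pool.
sed -i '0,/replicas:/ s/replicas: .*/replicas: 0/' install-config.yaml

# Day-2 cleanup once the 'worker' role has already landed on the control
# plane: nodeSelectors are only checked at scheduling time, so after
# removing the role the router pods must be evicted by hand.
oc label node <control-plane-node> node-role.kubernetes.io/worker-
oc -n openshift-ingress delete pods \
  -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
```
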
openshift-ci-robot (Contributor) commented:

@wking: This pull request references Bugzilla bug 1755073, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

> Bug 1755073: docs/user/*/install_upi: Drop compute replicas zeroing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the bugzilla/valid-bug and size/S labels Sep 24, 2019
openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved label Sep 24, 2019
wking (Member, Author) commented Sep 24, 2019

I dunno what to do about this CI code vs. this PR. Are we ok setting replicas: 2 in our install-config.yaml on all versions, or do we need to have version-specific CI logic to run our 4.1.z recommendations on 4.1.z code, 4.2.z recommendations on 4.2.z code, etc.?
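
A version-agnostic variant of that CI tweak might look like the following sketch (the sed expression is an assumption, reusing GNU sed's `0,/addr/` form; only the compute pool's count changes):

```sh
# Hypothetical CI step: give the compute pool two replicas instead of
# zeroing it.  The first 'replicas:' match in install-config.yaml is
# the compute pool, since 'compute' sorts before 'controlPlane'.
sed -i '0,/replicas:/ s/replicas: .*/replicas: 2/' install-config.yaml
```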

wking (Member, Author) commented Sep 24, 2019

@kalexand-rh: if this lands, we'll want to update openshift-docs for 4.2 as well.

abhinavdahiya (Contributor) commented:

I don't think this is the right path forward for BZ 1755073. I would rather see us document the extra step of setting .spec.mastersSchedulable to false in the manifests/cluster-scheduler-02-config.yml file in the AWS and GCP UPI docs, as we can't have the control plane running all of the workload.
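
A sketch of that alternative, assuming the default manifest content and a hypothetical asset directory name (the sed expression is mine, not from the thread):

```sh
# Generate the manifests, then mark the control plane unschedulable so
# 'replicas: 0' no longer puts the 'worker' role on control-plane nodes.
openshift-install create manifests --dir=upi-assets
sed -i 's/mastersSchedulable: true/mastersSchedulable: false/' \
  upi-assets/manifests/cluster-scheduler-02-config.yml
```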

openshift-ci-robot (Contributor) commented:

@wking: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-scaleup-rhel7 | cb31b68 | link | `/test e2e-aws-scaleup-rhel7` |
| ci/prow/e2e-aws-upgrade | cb31b68 | link | `/test e2e-aws-upgrade` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


wking added a commit to wking/openshift-installer that referenced this pull request Oct 1, 2019
We grew replicas-zeroing in c22d042 (docs/user/aws/install_upi: Add
'sed' call to zero compute replicas, 2019-05-02, openshift#1649) to set the
stage for changing the 'replicas: 0' semantics from "we'll make you
some dummy MachineSets" to "we won't make you MachineSets".  But that
hasn't happened yet, and since 64f96df (scheduler: Use schedulable
masters if no compute hosts defined, 2019-07-16, openshift#2004) 'replicas: 0'
for compute has also meant "add the 'worker' role to control-plane
nodes".  That leads to racy problems when ingress comes through a load
balancer, because Kubernetes load balancers exclude control-plane
nodes from their target set [1,2] (although this may get relaxed
soonish [3]).  If the router pods get scheduled on the control plane
machines due to the 'worker' role, they are not reachable from the
load balancer and ingress routing breaks [4].  Seth says:

> pod nodeSelectors are not like taints/tolerations.  They only have
> effect at scheduling time.  They are not continually enforced.

which means that attempting to address this issue as a day-2 operation
would mean removing the 'worker' role from the control-plane nodes and
then manually evicting the router pods to force rescheduling.  So
until we get the changes from [3], we can either drop the zeroing [5]
or adjust the scheduler configuration to remove the effect of the
zeroing.  In both cases, this is a change we'll want to revert later
once we bump Kubernetes to pick up a fix for the service load-balancer
targets.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1671136#c1
[2]: kubernetes/kubernetes#65618
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1744370#c6
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1755073
[5]: openshift#2402
wking (Member, Author) commented Oct 1, 2019

> I would rather see us document the extra step of setting .spec.mastersSchedulable to false in the manifests/cluster-scheduler-02-config.yml file in the AWS and GCP UPI docs, as we can't have the control plane running all of the workload.

I've filed #2440 with that approach, but I prefer this one because:

Still, either way should work for 4.2, so land whichever.

sdodson (Member) commented Oct 2, 2019

/close
We've gone with #2440

openshift-ci-robot (Contributor) commented:

@sdodson: Closed this PR.

In response to this:

> /close
> We've gone with #2440


wking added a commit to wking/openshift-installer that referenced this pull request Oct 2, 2019

wking added a commit to wking/openshift-installer that referenced this pull request Oct 2, 2019

alaypatel07 pushed a commit to alaypatel07/installer that referenced this pull request Nov 13, 2019

jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019