
Masters should not be excluded from service load balancers #65618

Closed
ljani opened this issue Jun 29, 2018 · 51 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/cloud-provider: Categorizes an issue or PR as relevant to SIG Cloud Provider.
  • sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • sig/network: Categorizes an issue or PR as relevant to SIG Network.

Comments

@ljani commented Jun 29, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
I'm running a single-node cluster on AWS EC2, i.e. scheduling work on the master as well. When I try to add an ELB for my service, the EC2 instance is not associated with the ELB: the ELB gets created, but there are 0 EC2 instances behind it. (I'm using the ELB for SSL termination, in case you wonder why I'd want to load balance a single-node cluster.)

The service controller logs this message:

I0628 17:54:48.853175       1 event.go:221] Event(v1.ObjectReference{Kind:"Service", Namespace:"default", Name:"nginx", UID:"59980933-7afc-11e8-9a8c-0621b993d602", APIVersion:"v1", ResourceVersion:"2052", FieldPath:""}): type: 'Warning' reason: 'UnAvailableLoadBalancer' There are no available nodes for LoadBalancer service default/nginx

What you expected to happen:
The EC2 instance is associated with the ELB.

How to reproduce it (as minimally and precisely as possible):

  • Boot an EC2 instance with correct IAM permissions
  • Install a single node cluster using kubeadm and cloud-provider=aws
  • Untaint the node
  • Schedule e.g. nginx
  • Create an external load balancer for nginx (a minimal sketch of these two steps follows this list)
  • View the ELB in AWS console and note the Status: 0 of 0 instances in service
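
For reference, a minimal sketch of the last two steps, assuming a plain nginx Deployment (the image and names are illustrative, not from the original report):

# Schedule an example workload (nginx).
kubectl create deployment nginx --image=nginx

# Ask the cloud provider to provision an external load balancer (an ELB on AWS).
kubectl expose deployment nginx --port=80 --type=LoadBalancer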

Anything else we need to know?:
The reason for this seems to be the node-role.kubernetes.io/master label, which blocks associating a load balancer with the node. On the other hand, this changed what is included, because includeNodeFromNodeList previously did not check whether a node is a master. I'm not sure what the correct fix would be. I could try to submit a PR if you give me some guidance on how this should behave. Is my scenario even a supported one?

I think this bug should be reproducible on other clouds as well.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:08:34Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
Linux ip-x.eu-central-1.compute.internal 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux
  • Install tools: kubeadm
@k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Jun 29, 2018
@ljani (Author) commented Jun 29, 2018

#33884 has area/nodecontroller, so I think the correct sig is @kubernetes/sig-node-bugs.

@k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 29, 2018
@k8s-ci-robot (Contributor)

@ljani: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

In response to this:

#33884 has area/nodecontroller, so I think the correct sig is @kubernetes/sig-node-bugs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jhorwit2 (Contributor) commented Jul 5, 2018

@ljani This is by design: load balancers do not include master nodes in the available pool of backend servers. You should remove the node-role.kubernetes.io/master label from your node and ensure it's schedulable and ready in order for it to be a backend. (exact filter logic lives here)
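
A rough sketch of what that looks like on a kubeadm single-node cluster (the node name is a placeholder; removing the role label is the non-standard part):

# Allow regular workloads on the control-plane node (the usual kubeadm step).
kubectl taint nodes <node-name> node-role.kubernetes.io/master:NoSchedule-

# Additionally drop the master role label so the service controller stops
# excluding this node from load-balancer backends.
kubectl label nodes <node-name> node-role.kubernetes.io/master-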

I'm going to close this issue as it's not a bug. If you would like the ability to use masters as available backends for LBs, please create another ticket with a feature request.

/close

@ljani (Author) commented Jul 5, 2018

@jhorwit2 Thanks for the response. So it's okay to remove the node-role.kubernetes.io/master label even if there's only one node in the cluster? Are there any drawbacks? I was a bit wary of that, because I thought each cluster needs at least one master and that you should only remove the NoSchedule taint from it. The kubeadm guide only refers to removing the taint from the master, not the label itself, but then the guide is for a single-master cluster, so removing the label there would be a little contradictory.

Searching for masterless does not really yield any related results.

Anyhow, if that's a supported scenario, then I'm very happy with it. Otherwise I should open the ticket.

@jhorwit2 (Contributor) commented Jul 5, 2018

It all depends on how the cluster is set up and whether anything depends on those labels. Single-node clusters will probably always have some quirks.

@ljani (Author) commented Jul 5, 2018

I've been following the Creating a single master cluster with kubeadm guide and thus I'm using kubeadm. I'll try and see how it goes.

@jhorwit2 (Contributor) commented Jul 5, 2018

The docs are probably incorrect when it comes to this because it wasn't tested. They should be updated.

/reopen
/remove-sig node
/sig cluster-lifecycle

@k8s-ci-robot reopened this Jul 5, 2018
@k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jul 5, 2018
@k8s-ci-robot (Contributor)

@jhorwit2: Those labels are not set on the issue: sig/node

In response to this:

The docs are probably incorrect when it comes to this because it wasn't tested. They should be updated.

/reopen
/remove-sig node
/sig cluster-lifecycle

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neolit123 (Member) commented Jul 7, 2018

Please note that the kubeadm documentation will stop adding cloud-provider-specific documentation in the future and will instead link to external resources.

/cc @kubernetes/sig-cluster-lifecycle-bugs
/cc @kubernetes/sig-cloud-provider-bugs

@k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Jul 7, 2018
@smarterclayton (Contributor)

My comment in the other issue:

I'm not sure this is correct, as we run more things on the master: for example a service I want to expose like an API proxy, an aggregated API server that wants its own LB, or a workload that's not quite control plane but not quite end user either. We shouldn't bias towards the master (i.e. a lot of node-port services for regular workloads should bypass the master), but this PR as it stands prevents using service load balancers with services that run on masters, which limits options like self-hosting.

Now that node load balancing is more nuanced (with health checking endpoints), do we even need this? Masters should not be marked as healthy when pods aren't run on them, which means no traffic would go to masters?

I think excluding masters from service load balancers is a bug and needs a more nuanced design. I'm going to bump priority here because I've got users trying to run pods on masters behind a service LB, and they can't use service LBs at all.

@smarterclayton added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 21, 2018
@smarterclayton changed the title A single node cluster cannot use external load balancer Masters should not be excluded from service load balancers Aug 21, 2018
@smarterclayton (Contributor)

Changing title to be more accurate.

@smarterclayton (Contributor)

I think the correct fix here was the service LB health-check support that narrows the serving pool to the nodes that actually hold the pods; now that we have that, we don't need this hack.

@andrewsykim (Member)

To be clear, this would only apply for Services with externalTrafficPolicy: Local right? Is it okay to assume that users will always use "Local" for this case?

@andrewsykim (Member) commented Aug 21, 2018

I'm sure we could add logic in the service controller to include masters if externalTrafficPolicy == Local and exclude them otherwise, but that doesn't seem like an elegant solution 🤔

@smarterclayton (Contributor)

For masters, it seems likely. I mean, in general I expect externalTrafficPolicy: Local to be the correct setting for the vast majority of LB services - is there a reason I would not want that policy in general use?

@andrewsykim (Member) commented Aug 21, 2018

In general I would agree that LBs would use externalTrafficPolicy: Local. One example where I wouldn't, though, is a service that needs to ingress traffic both from an LB and from another service in the same cluster. In that case I would consider using externalTrafficPolicy: Cluster, and then I may not want the master to be in the pool of LB backends. Though to be fair, this may be a rare case not worth looking into.
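
For context, a minimal sketch of the kind of Service being discussed, assuming an nginx workload (names and ports are illustrative):

# With externalTrafficPolicy: Local, the cloud LB health-checks each node's
# healthCheckNodePort and only keeps nodes that run a ready endpoint, so
# traffic never lands on a node without the pod.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
EOF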

@jhorwit2 (Contributor)

@smarterclayton this should perhaps be merged with #65013

The service controller today does some filtering based on masters in interesting ways. Historically it seems that masters were marked unschedulable to exclude them from LB services; then, once role labels became popular, filtering on those got added as well.

@jhorwit2 (Contributor)

I'm sure we could add logic in the service controller to include masters if externalTrafficPolicy == Local and exclude them otherwise, but that doesn't seem like an elegant solution 🤔

This gets a little hacky with how backends are updated today. We would need to keep track of two different backend sets while updating each service.

@k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 5, 2019
@thockin removed their assignment Aug 9, 2019
wking added a commit to wking/openshift-installer that referenced this issue Sep 24, 2019
We grew this in c22d042 (docs/user/aws/install_upi: Add 'sed' call
to zero compute replicas, 2019-05-02, openshift#1649) to set the stage for
changing the 'replicas: 0' semantics from "we'll make you some dummy
MachineSets" to "we won't make you MachineSets".  But that hasn't
happened yet, and since 64f96df (scheduler: Use schedulable masters
if no compute hosts defined, 2019-07-16, openshift#2004) 'replicas: 0' for
compute has also meant "add the 'worker' role to control-plane nodes".
That leads to racy problems when ingress comes through a load
balancer, because Kubernetes load balancers exclude control-plane
nodes from their target set [1,2] (although this may get relaxed
soonish [3]).  If the router pods get scheduled on the control plane
machines due to the 'worker' role, they are not reachable from the
load balancer and ingress routing breaks [4].  Seth says:

> pod nodeSelectors are not like taints/tolerations.  They only have
> effect at scheduling time.  They are not continually enforced.

which means that attempting to address this issue as a day-2 operation
would mean removing the 'worker' role from the control-plane nodes and
then manually evicting the router pods to force rescheduling.  So
until we get the changes from [3], it's easier to just drop this
section and keep the 'worker' role off the control-plane machines
entirely.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1671136#c1
[2]: kubernetes/kubernetes#65618
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1744370#c6
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1755073
wking added a commit to wking/openshift-installer that referenced this issue Oct 1, 2019
We grew replicas-zeroing in c22d042 (docs/user/aws/install_upi: Add
'sed' call to zero compute replicas, 2019-05-02, openshift#1649) to set the
stage for changing the 'replicas: 0' semantics from "we'll make you
some dummy MachineSets" to "we won't make you MachineSets".  But that
hasn't happened yet, and since 64f96df (scheduler: Use schedulable
masters if no compute hosts defined, 2019-07-16, openshift#2004) 'replicas: 0'
for compute has also meant "add the 'worker' role to control-plane
nodes".  That leads to racy problems when ingress comes through a load
balancer, because Kubernetes load balancers exclude control-plane
nodes from their target set [1,2] (although this may get relaxed
soonish [3]).  If the router pods get scheduled on the control plane
machines due to the 'worker' role, they are not reachable from the
load balancer and ingress routing breaks [4].  Seth says:

> pod nodeSelectors are not like taints/tolerations.  They only have
> effect at scheduling time.  They are not continually enforced.

which means that attempting to address this issue as a day-2 operation
would mean removing the 'worker' role from the control-plane nodes and
then manually evicting the router pods to force rescheduling.  So
until we get the changes from [3], we can either drop the zeroing [5]
or adjust the scheduler configuration to remove the effect of the
zeroing.  In both cases, this is a change we'll want to revert later
once we bump Kubernetes to pick up a fix for the service load-balancer
targets.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1671136#c1
[2]: kubernetes/kubernetes#65618
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1744370#c6
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1755073
[5]: openshift#2402
wking added a commit to wking/openshift-installer that referenced this issue Oct 2, 2019
wking added a commit to wking/openshift-installer that referenced this issue Oct 2, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 7, 2019
alaypatel07 pushed a commit to alaypatel07/installer that referenced this issue Nov 13, 2019
jhixson74 pushed a commit to jhixson74/installer that referenced this issue Dec 6, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 7, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking (Contributor) commented May 14, 2020

This was addressed via kubernetes/enhancements#1144 (see here) and #90126.
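
For anyone landing here later, a rough sketch of the mechanism those changes introduce, as I understand it (the node name is a placeholder): exclusion from external load balancers becomes opt-in via a dedicated label instead of being implied by the master role label.

# Explicitly exclude a node from external load-balancer backends.
kubectl label nodes <node-name> node.kubernetes.io/exclude-from-external-load-balancers=true

# Remove the label to put the node back into the candidate backend set.
kubectl label nodes <node-name> node.kubernetes.io/exclude-from-external-load-balancers-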

stephenfin added a commit to shiftstack/installer that referenced this issue Jun 13, 2022
There should no longer be any issues running router pods on control
plane nodes (i.e. kubernetes/kubernetes#65618
which was resolved in
kubernetes/enhancements#1144). Remove this
limitation from the docs.

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
mandre pushed a commit to shiftstack/installer that referenced this issue Sep 23, 2022