Adding functionality to cordon the node before destroying it. #3649
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Welcome @atulaggarwal!
Signed now
Can someone review the PR and let me know if this kind of change makes sense in the autoscaler?
FWIW I still believe the right solution is to make the ingress controller aware of the autoscaler taint (ex. ingress-gce). spec.Unschedulable has the problem of not having a clear owner, though the idea of pairing it with a taint as implemented here somewhat mitigates this issue. (Edit: also there is no way for certain pods to tolerate it, which is potentially useful for things like logging daemonsets.) At a glance I'm not sure why we need 2 separate taints? Why not just skip the default tainting logic and apply the ToBeDeletedTaint in the cordoning logic? Having 2 separate writes on the node is a big downside, as mutating api-server calls can be a big bottleneck when trying to scale down a large cluster rapidly. I'm focusing on preparing patch releases now; I can do a more detailed review later this week.
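The single-write idea in the comment above can be sketched as follows. This is a hypothetical helper, not the autoscaler's actual code: the `Node`/`Taint` types are simplified stand-ins for the real `k8s.io/api/core/v1` types, and `markForDeletion` illustrates mutating both the taints and `Unschedulable` on the same object so one api-server update suffices instead of two.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the Kubernetes API types (the real code uses
// k8s.io/api/core/v1). Field names mirror the upstream types.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

type NodeSpec struct {
	Unschedulable bool
	Taints        []Taint
}

type Node struct {
	Name string
	Spec NodeSpec
}

// ToBeDeletedTaint is the taint key the autoscaler places on nodes
// selected for scale-down.
const ToBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// markForDeletion applies the deletion taint and, optionally, cordons the
// node on the same object, so a single Update call covers both mutations
// (one mutating api-server write instead of two).
func markForDeletion(node *Node, cordon bool) {
	node.Spec.Taints = append(node.Spec.Taints, Taint{
		Key:    ToBeDeletedTaint,
		Value:  fmt.Sprint(time.Now().Unix()),
		Effect: "NoSchedule",
	})
	if cordon {
		node.Spec.Unschedulable = true
	}
}

func main() {
	n := &Node{Name: "node-1"}
	markForDeletion(n, true)
	fmt.Printf("unschedulable=%v taints=%d\n", n.Spec.Unschedulable, len(n.Spec.Taints))
}
```

With this shape, the caller issues one node update after `markForDeletion` returns, rather than separate writes for tainting and cordoning.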
Thanks for the brief review.
Edit - Added PR link and fixed typos
@MaciekPytel - Can you review the PR once and let me know if everything looks fine?
Folks, why aren't we working on this? Can one of the admins review?
I'm +1 for adding this
/approve
@feiskyer thank you for the review and the approval. |
@Jeffwan do you mind review this PR as well? |
@@ -205,6 +209,10 @@ func cleanTaint(node *apiv1.Node, client kube_client.Interface, taintKey string)
	}

	freshNode.Spec.Taints = newTaints
	if cordonNode {
		klog.V(1).Infof("Successfully uncordoned node %v by Cluster Autoscaler", freshNode.Name)
Given that the update has not yet happened, this message is premature.
cluster-autoscaler/main.go
Outdated
@@ -174,6 +174,7 @@ var (
	awsUseStaticInstanceList = flag.Bool("aws-use-static-instance-list", false, "Should CA fetch instance types in runtime or use a static list. AWS only")
	enableProfiling = flag.Bool("profiling", false, "Is debug/pprof endpoint enabled")
	clusterAPICloudConfigAuthoritative = flag.Bool("clusterapi-cloud-config-authoritative", false, "Treat the cloud-config flag authoritatively (do not fallback to using kubeconfig flag). ClusterAPI only")
	cordonNodeBeforeTerminate = flag.Bool("cordon-node-before-terminating", true, "Should CA cordon nodes before terminating during downscale process")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please launch new features initially disabled.
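The review point is that the flag in the hunk above defaults to true; a new feature should ship disabled so existing deployments are unaffected until operators opt in. A minimal sketch of the disabled-by-default registration, using the standard library `flag` package and the same flag name as the PR:

```go
package main

import (
	"flag"
	"fmt"
)

// New behavior is gated behind a flag that defaults to false, so the
// existing scale-down behavior is retained unless operators opt in.
var cordonNodeBeforeTerminate = flag.Bool("cordon-node-before-terminating",
	false, "Should CA cordon nodes before terminating during downscale process")

func main() {
	flag.Parse()
	fmt.Println("cordon before terminate:", *cordonNodeBeforeTerminate)
}
```

Run without arguments this prints the safe default; passing `-cordon-node-before-terminating=true` opts in explicitly.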
@@ -117,6 +117,10 @@ func addTaintToSpec(node *apiv1.Node, taintKey string, effect apiv1.TaintEffect)
		Value: fmt.Sprint(time.Now().Unix()),
		Effect: effect,
	})
	if cordonNode {
		klog.V(1).Infof("Successfully cordoned node %v by Cluster Autoscaler", node.Name)
The update has not yet completed. The message is premature.
@mwielgus - Thanks for reviewing the PR. I have made the changes as per the review comments. Please review it once more.
Please squash the commits into just 1.
…helps the load balancer remove the node from healthy hosts (ALB does have this support). This won't fix the issue of 502s completely, as the node has to live for some time even after cordoning to serve in-flight requests, but the load balancer can be configured to remove cordoned nodes from its healthy host list. This feature is enabled by the cordon-node-before-terminating flag, with a default value of false to retain existing behavior.
Force-pushed from 839811f to 7670d7b.
@mwielgus - Done
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: atulaggarwal, feiskyer, mwielgus The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
For future reference - there is a growing number of requests for CA to take additional custom actions as part of the drain (ex. #3792), and I'd like to propose extracting the drain logic into a processor as a long-term solution.
There's a release cycle; you will need to cherry-pick this merge to the release branches, and the next release will pick them up.
@Jeffwan thank you for getting back to me. Could you tell me when the next release cycle is going to happen?
I don't think this would solve issue #2045. This PR just adds the Unschedulable flag to the node being evicted so that it can be removed from the load balancer. It does not change the scheduling of pods in any way before deleting the node.
@Jeffwan I've created a dev docker release for my staging testing; still waiting on the official release for prod.
@atulaggarwal I want to build a docker image for k8s. Would you be able to provide the fixes in other branches as well, instead of just master?
@ltagliamonte-dd Please cherry-pick this change to the release branch. @kubernetes/autoscaler-maintainers will help cut a release later.
cherry pick #3649 - Adding functionality to cordon the node before destroying it.
This feature was added to the k8s autoscaler in Oct 2020 (kubernetes/autoscaler#3649), but kops didn't provide support for adding it via the autoscaler addon. This PR adds it.
This helps the load balancer remove the node from healthy hosts (ALB does have this support).
This won't fix the issue of 502s completely, as the node has to live for some time even after cordoning to serve in-flight requests.
Shamelessly copied from https://github.com/kubernetes/autoscaler/pull/3014/files