
Adding functionality to cordon the node before destroying it. #3649

Merged
1 commit merged into kubernetes:master from atulaggarwal:cordon-node-issue-3648 on Jan 14, 2021

Conversation

atulaggarwal
Contributor

This helps the load balancer remove the node from its healthy hosts (ALB does have this support).

This won't fix the 502 issue completely, as the node still has to live for some time even after cordoning in order to serve in-flight requests.
Shamelessly copied from https://github.com/kubernetes/autoscaler/pull/3014/files
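For context, cordoning just sets the node's spec.Unschedulable field, which load balancer controllers can watch to drop the node from their target groups. A minimal client-go sketch of the operation (illustrative only, not this PR's actual code):

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cordon marks a node unschedulable so that load balancers watching this
// field can remove it from their healthy target list before termination.
func cordon(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```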

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Oct 28, 2020
@k8s-ci-robot
Contributor

Welcome @atulaggarwal!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 28, 2020
@atulaggarwal
Contributor Author

Signed now

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 28, 2020
@atulaggarwal
Contributor Author

Can someone review the PR and let me know if this kind of change makes sense in the autoscaler?

@MaciekPytel
Contributor

MaciekPytel commented Nov 3, 2020

FWIW I still believe the right solution is to make the ingress controller aware of the autoscaler taint (e.g. ingress-gce). spec.Unschedulable has the problem of not having a clear owner, though the idea of pairing it with a taint as implemented here somewhat mitigates this issue. (edit: Also, there is no way for certain pods to tolerate it, which is potentially useful for things like logging daemonsets.)

At a glance I'm not sure why we need 2 separate taints. Why not just skip the default tainting logic and apply the ToBeDeletedTaint in the cordoning logic? Having 2 separate writes on the node is a big downside, as mutating api-server calls can be a big bottleneck when trying to scale down a large cluster rapidly.
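Illustratively, the suggestion amounts to a single mutating call that both taints and cordons the node. A fragment sketch, assuming client-go (not the code in this PR):

```go
// One write instead of two: add the autoscaler taint and set
// spec.Unschedulable in the same node update.
node.Spec.Taints = append(node.Spec.Taints, apiv1.Taint{
	Key:    "ToBeDeletedByClusterAutoscaler",
	Value:  fmt.Sprint(time.Now().Unix()),
	Effect: apiv1.TaintEffectNoSchedule,
})
node.Spec.Unschedulable = true
_, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
```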

I'm focusing on preparing patch releases now, I can do a more detailed review later this week.

@atulaggarwal
Contributor Author

atulaggarwal commented Nov 4, 2020

Thanks for the brief review.

  1. I am raising a PR with the aws-load-balancer-controller to handle the taint. PR link: kubernetes-sigs/aws-load-balancer-controller#1610 ("Issue #1609. Checking for taint ToBeDeletedByClusterAutoscaler while …"); see the sketch at the end of this comment for the kind of check involved.
  2. I agree we should consolidate the taints in one place. I will work on it and make the changes. I still believe we should also mark spec.Unschedulable, so that any ingress controller implementation (unaware of the autoscaler-specific taint) can remove the node from its target group on finding spec.Unschedulable.

Edit - Added PR link and fixed typos
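For reference, the controller-side check being proposed in that PR is roughly the following (a sketch, not the actual aws-load-balancer-controller code):

```go
const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// isNodeBeingDrained reports whether the cluster autoscaler has marked
// the node for deletion, either via its taint or via cordoning.
func isNodeBeingDrained(node *apiv1.Node) bool {
	if node.Spec.Unschedulable {
		return true
	}
	for _, taint := range node.Spec.Taints {
		if taint.Key == toBeDeletedTaint {
			return true
		}
	}
	return false
}
```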

@atulaggarwal
Contributor Author

@MaciekPytel - Can you review the PR once and let me know if everything looks fine?

@ltagliamonte-dd

Folks, why aren't we working on this? Can one of the admins review?
This could easily solve a huge problem for tons of users.

@feiskyer feiskyer (Member) left a comment

I'm +1 for adding this
/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 12, 2021
@ltagliamonte-dd

ltagliamonte-dd commented Jan 12, 2021

@feiskyer thank you for the review and the approval.
What are the remaining steps to see this upstream and released?

@ltagliamonte-dd

@Jeffwan do you mind reviewing this PR as well?

```diff
@@ -205,6 +209,10 @@ func cleanTaint(node *apiv1.Node, client kube_client.Interface, taintKey string)
 	}
 
 	freshNode.Spec.Taints = newTaints
+	if cordonNode {
+		klog.V(1).Infof("Successfully uncordoned node %v by Cluster Autoscaler", freshNode.Name)
```
Contributor

Given that the update has not yet happened, this message is premature.
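In other words, something like this ordering (a fragment sketch with the modern client-go signature, not the PR's exact code):

```go
// Perform the API update first; log success only after it returns
// without error.
freshNode.Spec.Taints = newTaints
freshNode.Spec.Unschedulable = false
_, err := client.CoreV1().Nodes().Update(ctx, freshNode, metav1.UpdateOptions{})
if err == nil && cordonNode {
	klog.V(1).Infof("Successfully uncordoned node %v by Cluster Autoscaler", freshNode.Name)
}
```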

```diff
@@ -174,6 +174,7 @@ var (
 	awsUseStaticInstanceList           = flag.Bool("aws-use-static-instance-list", false, "Should CA fetch instance types in runtime or use a static list. AWS only")
 	enableProfiling                    = flag.Bool("profiling", false, "Is debug/pprof endpoint enabled")
 	clusterAPICloudConfigAuthoritative = flag.Bool("clusterapi-cloud-config-authoritative", false, "Treat the cloud-config flag authoritatively (do not fallback to using kubeconfig flag). ClusterAPI only")
+	cordonNodeBeforeTerminate          = flag.Bool("cordon-node-before-terminating", true, "Should CA cordon nodes before terminating during downscale process")
```
Contributor

Please launch new features initially disabled.
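The merged version does flip the default (see the squashed commit message further down), i.e. roughly:

```go
// Final form: the feature ships disabled so existing behavior is retained.
cordonNodeBeforeTerminate = flag.Bool("cordon-node-before-terminating", false, "Should CA cordon nodes before terminating during downscale process")
```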

```diff
@@ -117,6 +117,10 @@ func addTaintToSpec(node *apiv1.Node, taintKey string, effect apiv1.TaintEffect)
 		Value:  fmt.Sprint(time.Now().Unix()),
 		Effect: effect,
 	})
+	if cordonNode {
+		klog.V(1).Infof("Successfully cordoned node %v by Cluster Autoscaler", node.Name)
```
Contributor

The update has not yet completed. The message is premature.

@atulaggarwal
Contributor Author

@mwielgus - Thanks for reviewing the PR. I have made the changes as per the review comments. Please review it once more.

@mwielgus
Contributor

Please squash commits to just 1.

…lps load balancer to remove the node from healthy hosts (ALB does have this support).

This won't fix the 502 issue completely, as the node has to live for some time even after cordoning in order to serve in-flight requests, but the load balancer can be configured to remove cordoned nodes from its healthy host list.
This feature is enabled by the cordon-node-before-terminating flag, whose default value is false to retain existing behavior.
@atulaggarwal atulaggarwal force-pushed the cordon-node-issue-3648 branch from 839811f to 7670d7b on January 14, 2021 11:54
@atulaggarwal
Contributor Author

@mwielgus - Done

@mwielgus mwielgus (Contributor) left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 14, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: atulaggarwal, feiskyer, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 58be2b7 into kubernetes:master Jan 14, 2021
@MaciekPytel
Contributor

For future reference: there is a growing number of requests for CA to take additional custom actions as part of the drain (e.g. #3792), and I'd like to propose extracting the drain logic into a processor as a long-term solution.
This will keep any additional logic encapsulated and ensure that no code related to opt-in features is executed when those features are disabled (which should make it easier to merge such changes, as any regression risk is limited to users who opt in).
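Illustratively (hypothetical names, not an actual autoscaler interface), such a drain processor could look like:

```go
// NodeDrainProcessor is a hypothetical extension point for custom drain
// actions. Implementations stay encapsulated, and a no-op default means
// disabled features cost nothing at scale-down time.
type NodeDrainProcessor interface {
	// PreDrain runs before pods are evicted (e.g. cordoning the node).
	PreDrain(node *apiv1.Node) error
	// PostDrain runs after eviction, just before the node is terminated.
	PostDrain(node *apiv1.Node) error
}
```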

@ltagliamonte

@mwielgus @feiskyer thank you for looking into this.
When can we expect a docker release?

@Jeffwan
Contributor

Jeffwan commented Jan 14, 2021

> @mwielgus @feiskyer thank you for looking into this.
> When can we expect a docker release?

There's a release cycle; you will need to cherry-pick this merge to the release branches, and the next release will pick it up.
In the meantime, you can build your own image and don't need to wait for an official release if you want it immediately.

@ltagliamonte-dd

ltagliamonte-dd commented Jan 14, 2021

@Jeffwan thank you for getting back to me. Could you please tell me when the next release cycle is going to happen?
Is the project still supporting k8s 1.18, or do I have to cherry-pick this myself?

@spanky-medal

Does this PR resolve the issue found in #2045? It isn't clear to me from following this thread and #3792.

@atulaggarwal
Contributor Author

I don't think this would solve issue #2045. This PR just marks the node being evicted as unschedulable (spec.Unschedulable) so that it can be removed from the load balancer. It does not change in any way how pods are scheduled before the node is deleted.

@ltagliamonte-dd

@Jeffwan I've created a dev docker release for my staging testing; still waiting on the official release for prod.
Do you know when it is going to happen? I'm waiting for a release on k8s 1.18.
Thanks

@atulaggarwal atulaggarwal deleted the cordon-node-issue-3648 branch February 23, 2021 06:06
@infa-ddeore

@atulaggarwal I want to build a docker image for k8s 1.15, 1.16, 1.17, and 1.18, so I am trying to cherry-pick your commit 7670d7b6af6cf98e29f7f328d8de38072f7a0d9a into the respective branches (cluster-autoscaler-release-1.15, cluster-autoscaler-release-1.16, and so on), but there are conflicts in several files, such as cluster-autoscaler/utils/deletetaint/delete.go, cluster-autoscaler/main.go, and cluster-autoscaler/config/autoscaling_options.go.

Would you be able to provide the fix in the other branches as well, not just master?

@Jeffwan
Contributor

Jeffwan commented Apr 5, 2021

> @Jeffwan I've created a dev docker release for my staging testing; still waiting on the official release for prod.
> Do you know when it is going to happen? I'm waiting for a release on k8s 1.18.
> Thanks

@ltagliamonte-dd Please cherry-pick this change to the release branches. @kubernetes/autoscaler-maintainers will help cut a release later.

k8s-ci-robot added a commit that referenced this pull request Aug 16, 2021
cherry pick #3649 - Adding functionality to cordon the node before destroying it.
dcfranca added a commit to dcfranca/kops that referenced this pull request Sep 5, 2022
This feature was added to the k8s autoscaler in Oct 2020 (kubernetes/autoscaler#3649), but kops didn't provide support for enabling it via the autoscaler addon.
This PR adds it.
dcfranca added a commit to dcfranca/kops that referenced this pull request Sep 6, 2022
This feature was added to the k8s autoscaler in Oct 2020 (kubernetes/autoscaler#3649), but kops didn't provide support for enabling it via the autoscaler addon.
This PR adds it.
Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.