
Graceful shutdown of machinepool nodes #3256

Conversation

akash-gautam
Contributor

What type of PR is this?
/kind feature

What this PR does / why we need it:
Ensures graceful shutdown of nodes in AWSMachinePools (AWS ASGs):
- watch the lifecycle transitions of nodes in AWSMachinePools
- as soon as a node enters the termination wait state, cordon it and evict its pods
- after pod eviction completes, complete the lifecycle hook that prevents immediate deletion of the node on a scale-in event (a rough sketch of the cordon-and-evict step follows below)
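
For illustration, a minimal sketch of the cordon-and-evict step using client-go; the function and its wiring are hypothetical and not taken from this PR, and PodDisruptionBudget retries and daemonset filtering are omitted:

```go
package drain

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cordonAndEvict marks a node unschedulable and evicts every pod scheduled on
// it. This only illustrates the cordon-and-evict step described above.
func cordonAndEvict(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	// Cordon: set spec.unschedulable so no new pods land on the node.
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	if _, err := cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// Evict: list the pods on the node and issue an Eviction for each one.
	pods, err := cs.CoreV1().Pods(corev1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := pods.Items[i]
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := cs.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			return err
		}
	}
	return nil
}
```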

Which issue(s) this PR fixes
Fixes #2574

Release note:

None.

@k8s-ci-robot k8s-ci-robot added release-note-none, kind/feature, cncf-cla: yes, and needs-priority labels Feb 25, 2022
@k8s-ci-robot
Contributor

@akash-gautam: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage label Feb 25, 2022
@k8s-ci-robot
Contributor

Hi @akash-gautam. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test and size/XXL labels Feb 25, 2022
@sedefsavas
Contributor

Hi @akash-gautam, thanks for the PR.

A few considerations here:
1- What would be the benefit of using MachinePool Machines for graceful shutdowns like CAPZ?
2- Can this implementation be extended to be used when spot instances are supported in MachinePools?
3- We already have an eventbridge implementation, but it is in an experimental stage. We could graduate it and reuse it here.

I think a proposal or at least an ADR would be good for this issue considering there could be multiple ways to achieve this.

cc @richardcase

clusterScope.Error(err, "non-fatal: failed to receive messages from instance state queue")
return
}
for _, msg := range resp.Messages {
Contributor

If I understand correctly, at each reconcile this checks if there are any messages. We might be too late for some termination notifications or delay scale down if this is only checked at each reconcile.

Instead, we can trigger an immediate reconcile when a message is received by checking messages in a separate goroutine, like the existing EventBridge implementation does:

for range time.Tick(1 * time.Second) {

WDYT?
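
Roughly along these lines, assuming an aws-sdk-go SQS client, a hypothetical queue URL, and a placeholder handler; this is only a sketch of the separate-goroutine idea, not the code in this PR:

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// pollLifecycleQueue runs in its own goroutine and polls the lifecycle-hook SQS
// queue every second, so termination notifications are handled promptly instead
// of only at the next reconcile. handle stands in for whatever triggers
// reconciliation (or draining) of the affected machine pool.
func pollLifecycleQueue(client *sqs.SQS, queueURL string, handle func(body string)) {
	for range time.Tick(1 * time.Second) {
		resp, err := client.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(10),
		})
		if err != nil {
			fmt.Println("non-fatal: failed to receive messages from instance state queue:", err)
			continue
		}
		for _, msg := range resp.Messages {
			handle(aws.StringValue(msg.Body))
			// Delete the message once handled so it is not redelivered.
			if _, err := client.DeleteMessage(&sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			}); err != nil {
				fmt.Println("non-fatal: failed to delete message:", err)
			}
		}
	}
}

func main() {
	sess := session.Must(session.NewSession())
	go pollLifecycleQueue(sqs.New(sess), "https://sqs.example/instance-state-queue", func(body string) {
		fmt.Println("received lifecycle event:", body)
	})
	select {} // in the real controller the manager would own this lifetime
}
```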

Contributor Author

Yes, we can try doing something similar here as well.

Member

If we go with the separate goroutine, do we actually have to patch the machine pool, or is there a way to kick off reconciliation some other way?

@akash-gautam
Contributor Author

akash-gautam commented Mar 7, 2022

@sedefsavas
1- What would be the benefit of using MachinePool Machines for graceful shutdowns like CAPZ? - Sorry, I didn't get this question, can you please explain?
2- Can this implementation be extended to be used when spot instances are supported in MachinePools? - We can, but it might be a bit tricky, as spot instances won't follow the lifecycle states that on-demand instances follow in an ASG. We are creating a pre-deletion hook on the machine pool ASG, which prevents the instance from getting terminated for 15 minutes in case of a scale-in event, but for a spot instance we will get only a 2-minute notice (even less if the interruption action is hibernate).
3- We already have an eventbridge implementation, but it is in an experimental stage. We could graduate it and reuse it here. - Our current implementation handles lifecycle events of EC2 instances that are managed by EC2 directly, but here we need lifecycle events of EC2 instances that are managed by an ASG; the lifecycle of an EC2 instance differs depending on which of the two manages it. I first tried using the instancestate package that we already have, but its event source is aws.ec2, which doesn't capture lifecycle states like terminating:wait. That is why I had to create a new package, quite similar to it, that captures events from aws.autoscaling (see the sketch below).
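
For context, a minimal sketch of the kind of EventBridge rule this implies, using aws-sdk-go; the rule name is hypothetical, and the point is simply the aws.autoscaling source with the terminate lifecycle-action detail type rather than aws.ec2 state-change events:

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/eventbridge"
)

func main() {
	sess := session.Must(session.NewSession())
	client := eventbridge.New(sess)

	// Match ASG lifecycle actions (instances entering the Terminating:Wait
	// state) from the aws.autoscaling source, instead of plain EC2
	// state-change events from aws.ec2.
	pattern := `{
	  "source": ["aws.autoscaling"],
	  "detail-type": ["EC2 Instance-terminate Lifecycle Action"]
	}`

	_, err := client.PutRule(&eventbridge.PutRuleInput{
		Name:         aws.String("asg-lifecycle-events"), // hypothetical rule name
		EventPattern: aws.String(pattern),
	})
	if err != nil {
		panic(err)
	}
}
```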

@k8s-ci-robot k8s-ci-robot added the needs-rebase label Mar 8, 2022
@dntosas
Contributor

dntosas commented Mar 20, 2022

this is a very nice feature btw 💯

@dntosas
Contributor

dntosas commented Apr 5, 2022

@sedefsavas any news on this?

@sedefsavas
Contributor

How about graduating the eventbridge experimental feature and repurposing it to get event notifications for ASG scale-in as well?
This will also be useful when we want to implement spot instance draining utilizing Spot Instance Termination Notice events.

@Ankitasw do you want to start a document or ADR for redesigning eventbridge to be flexible enough to reuse for instance status updates + ASG scale-in + spot instance termination notices? We can collaborate there, and once that work is done, the draining logic will just do the draining every time a relevant event is observed.

In the current eventbridge implementation, when an event is received, the relevant machine reconciliation is triggered. That was enough for that use case, because the machine controller can see the instance status during reconciliation, so it does not need to know the nature of the event. But for scale-in events there is probably no instance status change, so the events need to be processed in the controller itself.

@Ankitasw
Member

Ankitasw commented Apr 12, 2022

@sedefsavas yes, I am already working on a plan for graduating EventBridge to cater to instance status updates, ASG scale-in, and spot instance termination notices, so that those can be applied to the draining logic for MachinePools and spot instances.

@richardcase
Member

How about graduating the eventbridge experimental feature and repurposing it to get event notifications for ASG scale-in as well?
This will also be useful when we want to implement spot instance draining utilizing Spot Instance Termination Notice events.

Yes, I think that's a great idea @sedefsavas 👍

@akash-gautam
Contributor Author

@Ankitasw If there are any tasks/issues in the EventBridge graduation plan that I can take up, please let me know.

@sedefsavas
Contributor

Created an issue for the eventbridge graduation #3414

@richardcase
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Jun 17, 2022
@akash-gautam akash-gautam force-pushed the graceful-shutdown-of-machinepool branch from ab9ae2a to 97df6d4 Compare August 30, 2022 19:38
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Aug 30, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cecilerobertmichon for approval by writing /assign @cecilerobertmichon in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-rebase label Oct 12, 2022
…for works for instances launched by an ASG (awsmachinepool)

Signed-off-by: akash-gautam <gautamakash04@gmail.com>
add a method named CompleteLifeCycleEvent to ASGInterface interface
this method is used to complete pre-deletion lifecycle hook for ec2 instances in ASG (awsmachinepool)

Signed-off-by: akash-gautam <gautamakash04@gmail.com>
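
For illustration only, a hedged sketch of what completing a pre-deletion lifecycle hook looks like with the aws-sdk-go autoscaling API; the ASG, hook, and instance identifiers here are hypothetical and not taken from the PR:

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// completeLifecycleAction tells the ASG that draining has finished, so the
// instance can continue with termination instead of waiting out the hook
// timeout.
func completeLifecycleAction(client *autoscaling.AutoScaling, asgName, hookName, instanceID string) error {
	_, err := client.CompleteLifecycleAction(&autoscaling.CompleteLifecycleActionInput{
		AutoScalingGroupName:  aws.String(asgName),
		LifecycleHookName:     aws.String(hookName),
		InstanceId:            aws.String(instanceID),
		LifecycleActionResult: aws.String("CONTINUE"),
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	if err := completeLifecycleAction(autoscaling.New(sess),
		"example-machinepool-asg", "pre-delete-hook", "i-0123456789abcdef0"); err != nil {
		panic(err)
	}
}
```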
@akash-gautam akash-gautam force-pushed the graceful-shutdown-of-machinepool branch from 97df6d4 to 9d5bcab Compare October 13, 2022 13:10
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Oct 13, 2022
@akash-gautam akash-gautam force-pushed the graceful-shutdown-of-machinepool branch 3 times, most recently from 960496f to a65acae Compare October 13, 2022 14:12
watch lifecycle transition of nodes in awsmachinepools
as soon as they enter the termination wait condition, cordon it and evict the pods
post eviction of pods complete the lifecycle hook that prevents immediate deletion of node on scale in event

Signed-off-by: akash-gautam <gautamakash04@gmail.com>
@akash-gautam akash-gautam force-pushed the graceful-shutdown-of-machinepool branch from a65acae to 2e63b89 Compare October 13, 2022 14:24
@k8s-ci-robot
Contributor

k8s-ci-robot commented Oct 13, 2022

@akash-gautam: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Required | Rerun command
pull-cluster-api-provider-aws-build-release-1-5 | ab9ae2a11e633550e6f668f5dbecaf81ca1ab106 | true | /test pull-cluster-api-provider-aws-build-release-1-5
pull-cluster-api-provider-aws-test | 2e63b89 | true | /test pull-cluster-api-provider-aws-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@Ankitasw
Member

@akash-gautam apologies for replying late, but we have now planned a phased approach: enable this first in EventBridge, and then write a separate proposal for node draining. This is the ADR for graduating EventBridge to accommodate different types of events. I will let you know once we start that feature and see if you could contribute to it.

For now, closing this PR to minimize duplicate effort; we can always revisit it if we want to reuse some bits.
/close

@k8s-ci-robot
Contributor

@Ankitasw: Closed this PR.

In response to this:

@akash-gautam apologies for replying late, but we have now planned a phased approach: enable this first in EventBridge, and then write a separate proposal for node draining. This is the ADR for graduating EventBridge to accommodate different types of events. I will let you know once we start that feature and see if you could contribute to it.

For now, closing this PR to minimize duplicate effort; we can always revisit it if we want to reuse some bits.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@akash-gautam
Contributor Author

@Ankitasw Okay, thanks for the info.

Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/feature (Categorizes issue or PR as related to a new feature.)
needs-priority
needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
release-note-none (Denotes a PR that doesn't merit a release note.)
size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)
Development

Successfully merging this pull request may close these issues.

AWSMachinePool graceful scale down