
Graceful shutdown of machinepool nodes #3256

Conversation

akash-gautam
Contributor

What type of PR is this?
/kind feature

What this PR does / why we need it:
Ensures graceful shutdown of nodes in AWSMachinePools (AWS ASGs):
- watch the lifecycle transitions of nodes in AWSMachinePools
- as soon as a node enters the termination wait state, cordon it and evict its pods
- after pod eviction completes, complete the lifecycle hook that prevents immediate deletion of the node on a scale-in event (a rough sketch of the cordon-and-evict step follows below)
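
For illustration, a minimal sketch of the cordon-and-evict step using client-go; the function and its wiring are hypothetical and not taken from this PR, and PodDisruptionBudget retries and daemonset filtering are omitted:

```go
package drain

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cordonAndEvict marks a node unschedulable and evicts every pod scheduled on
// it. This only illustrates the cordon-and-evict step described above.
func cordonAndEvict(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	// Cordon: set spec.unschedulable so no new pods land on the node.
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	if _, err := cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// Evict: list the pods on the node and issue an Eviction for each one.
	pods, err := cs.CoreV1().Pods(corev1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := pods.Items[i]
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := cs.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			return err
		}
	}
	return nil
}
```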

Which issue(s) this PR fixes
Fixes #2574

Release note:

None.

@k8s-ci-robot k8s-ci-robot added release-note-none, kind/feature, cncf-cla: yes, and needs-priority labels Feb 25, 2022
@k8s-ci-robot
Contributor

@akash-gautam: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage label Feb 25, 2022
@k8s-ci-robot
Contributor

Hi @akash-gautam. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test and size/XXL labels Feb 25, 2022
@sedefsavas
Contributor

Hi @akash-gautam, thanks for the PR.

A few considerations here:
1- What would be the benefit of using MachinePool Machines for graceful shutdowns like CAPZ?
2- Can this implementation be extended to be used when spot instances are supported in MachinePools?
3- We already have an eventbridge implementation, but it is in an experimental stage. We could graduate it and reuse it here.

I think a proposal or at least an ADR would be good for this issue considering there could be multiple ways to achieve this.

cc @richardcase

clusterScope.Error(err, "non-fatal: failed to receive messages from instance state queue")
return
}
for _, msg := range resp.Messages {
Contributor

If I understand correctly, at each reconcile this checks if there are any messages. We might be too late for some termination notifications or delay scale down if this is only checked at each reconcile.

Instead, we can trigger an immediate reconcile when a message is received by checking messages in a separate goroutine, like the existing EventBridge implementation does:

for range time.Tick(1 * time.Second) {

WDYT?
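
Roughly along these lines, assuming an aws-sdk-go SQS client, a hypothetical queue URL, and a placeholder handler; this is only a sketch of the separate-goroutine idea, not the code in this PR:

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// pollLifecycleQueue runs in its own goroutine and polls the lifecycle-hook SQS
// queue every second, so termination notifications are handled promptly instead
// of only at the next reconcile. handle stands in for whatever triggers
// reconciliation (or draining) of the affected machine pool.
func pollLifecycleQueue(client *sqs.SQS, queueURL string, handle func(body string)) {
	for range time.Tick(1 * time.Second) {
		resp, err := client.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(10),
		})
		if err != nil {
			fmt.Println("non-fatal: failed to receive messages from instance state queue:", err)
			continue
		}
		for _, msg := range resp.Messages {
			handle(aws.StringValue(msg.Body))
			// Delete the message once handled so it is not redelivered.
			if _, err := client.DeleteMessage(&sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			}); err != nil {
				fmt.Println("non-fatal: failed to delete message:", err)
			}
		}
	}
}

func main() {
	sess := session.Must(session.NewSession())
	go pollLifecycleQueue(sqs.New(sess), "https://sqs.example/instance-state-queue", func(body string) {
		fmt.Println("received lifecycle event:", body)
	})
	select {} // in the real controller the manager would own this lifetime
}
```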

Contributor Author

Yes, we can try doing something similar here as well.

Member

If we go with the separate goroutine, do we actually have to patch the machine pool, or is there a way to kick off reconciliation some other way?

@akash-gautam
Contributor Author

akash-gautam commented Mar 7, 2022

@sedefsavas
1- What would be the benefit of using MachinePool Machines for graceful shutdowns like CAPZ? - Sorry, I didn't get this question, can you please explain?
2- Can this implementation be extended to be used when spot instances are supported in MachinePools? - We can, but it might be a bit tricky, as spot instances won't follow the lifecycle states that on-demand instances follow in an ASG. We are creating a pre-deletion hook on the machine pool ASG, which prevents the instance from getting terminated for 15 minutes in case of a scale-in event, but for a spot instance we will get only a 2-minute notice (even less if the interruption action is hibernate).
3- We already have an eventbridge implementation, but it is in an experimental stage. We could graduate it and reuse it here. - Our current implementation handles lifecycle events of EC2 instances that are managed by EC2 directly, but here we need lifecycle events of EC2 instances that are managed by an ASG; the lifecycle of an EC2 instance differs depending on which of the two manages it. I first tried using the instancestate package that we already have, but its event source is aws.ec2, which doesn't capture lifecycle states like terminating:wait. That is why I had to create a new package, quite similar to it, that captures events from aws.autoscaling (see the sketch below).
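
For context, a minimal sketch of the kind of EventBridge rule this implies, using aws-sdk-go; the rule name is hypothetical, and the point is simply the aws.autoscaling source with the terminate lifecycle-action detail type rather than aws.ec2 state-change events:

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/eventbridge"
)

func main() {
	sess := session.Must(session.NewSession())
	client := eventbridge.New(sess)

	// Match ASG lifecycle actions (instances entering the Terminating:Wait
	// state) from the aws.autoscaling source, instead of plain EC2
	// state-change events from aws.ec2.
	pattern := `{
	  "source": ["aws.autoscaling"],
	  "detail-type": ["EC2 Instance-terminate Lifecycle Action"]
	}`

	_, err := client.PutRule(&eventbridge.PutRuleInput{
		Name:         aws.String("asg-lifecycle-events"), // hypothetical rule name
		EventPattern: aws.String(pattern),
	})
	if err != nil {
		panic(err)
	}
}
```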

@k8s-ci-robot k8s-ci-robot added the needs-rebase label Mar 8, 2022
@dntosas
Contributor

dntosas commented Mar 20, 2022

this is a very nice feature btw 💯

@dntosas
Contributor

dntosas commented Apr 5, 2022

@sedefsavas any news on this?

@sedefsavas
Contributor

How about graduating the eventbridge experimental feature and repurposing it to get event notifications for ASG scale-in as well?
This will also be useful when we want to implement spot instance draining utilizing Spot Instance Termination Notice events.

@Ankitasw do you want to start a document or ADR for redesigning eventbridge to be flexible enough to reuse for instance status updates + ASG scale-in + spot instance termination notices? We can collaborate there, and once that work is done, the draining logic will just do the draining every time a relevant event is observed.

In the current eventbridge implementation, when an event is received, the relevant machine reconciliation is triggered. That was enough for that use case, because the machine controller can see the instance status during reconciliation, so it does not need to know the nature of the event. But for scale-in events there is probably no instance status change, so the events need to be processed in the controller itself.

@Ankitasw
Member

Ankitasw commented Apr 12, 2022

@sedefsavas yes, I am already working on a plan for graduating EventBridge to cater to instance status updates, ASG scale-in, and spot instance termination notices, so that those can be applied to the draining logic for MachinePools and spot instances.

@richardcase
Member

How about graduating the eventbridge experimental feature and repurposing it to get event notifications for ASG scale-in as well?
This will also be useful when we want to implement spot instance draining utilizing Spot Instance Termination Notice events.

Yes, I think that's a great idea @sedefsavas 👍

@akash-gautam
Contributor Author

@Ankitasw If there are any tasks/issues in the EventBridge graduation plan that I can take up, please let me know.

@sedefsavas
Contributor

Created an issue for the eventbridge graduation #3414

@richardcase
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Jun 17, 2022
@akash-gautam akash-gautam force-pushed the graceful-shutdown-of-machinepool branch from ab9ae2a to 97df6d4 Compare August 30, 2022 19:38
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Aug 30, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cecilerobertmichon for approval by writing /assign @cecilerobertmichon in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-rebase label Oct 12, 2022
…for works for instances launched by an ASG (awsmachinepool)

Signed-off-by: akash-gautam <gautamakash04@gmail.com>
add a method named CompleteLifeCycleEvent to ASGInterface interface
this method is used to complete pre-deletion lifecycle hook for ec2 instances in ASG (awsmachinepool)

Signed-off-by: akash-gautam <gautamakash04@gmail.com>
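
For illustration only, a hedged sketch of what completing a pre-deletion lifecycle hook looks like with the aws-sdk-go autoscaling API; the ASG, hook, and instance identifiers here are hypothetical and not taken from the PR:

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// completeLifecycleAction tells the ASG that draining has finished, so the
// instance can continue with termination instead of waiting out the hook
// timeout.
func completeLifecycleAction(client *autoscaling.AutoScaling, asgName, hookName, instanceID string) error {
	_, err := client.CompleteLifecycleAction(&autoscaling.CompleteLifecycleActionInput{
		AutoScalingGroupName:  aws.String(asgName),
		LifecycleHookName:     aws.String(hookName),
		InstanceId:            aws.String(instanceID),
		LifecycleActionResult: aws.String("CONTINUE"),
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	if err := completeLifecycleAction(autoscaling.New(sess),
		"example-machinepool-asg", "pre-delete-hook", "i-0123456789abcdef0"); err != nil {
		panic(err)
	}
}
```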
@akash-gautam akash-gautam force-pushed the graceful-shutdown-of-machinepool branch from 97df6d4 to 9d5bcab Compare October 13, 2022 13:10
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Oct 13, 2022
@akash-gautam akash-gautam force-pushed the graceful-shutdown-of-machinepool branch 3 times, most recently from 960496f to a65acae Compare October 13, 2022 14:12
watch lifecycle transition of nodes in awsmachinepools
as soon as they enter the termination wait condition, cordon it and evict the pods
post eviction of pods complete the lifecycle hook that prevents immediate deletion of node on scale in event

Signed-off-by: akash-gautam <gautamakash04@gmail.com>
@akash-gautam akash-gautam force-pushed the graceful-shutdown-of-machinepool branch from a65acae to 2e63b89 Compare October 13, 2022 14:24
@k8s-ci-robot
Contributor

k8s-ci-robot commented Oct 13, 2022

@akash-gautam: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Required | Rerun command
pull-cluster-api-provider-aws-build-release-1-5 | ab9ae2a11e633550e6f668f5dbecaf81ca1ab106 | true | /test pull-cluster-api-provider-aws-build-release-1-5
pull-cluster-api-provider-aws-test | 2e63b89 | true | /test pull-cluster-api-provider-aws-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@Ankitasw
Member

@akash-gautam apologies for replying late, but we have now planned a phased approach: enable this first in EventBridge, and then write a separate proposal for node draining. This is the ADR for graduating EventBridge to accommodate different types of events. I will let you know once we start that feature and see if you could contribute to it.

For now, closing this PR to minimize duplicate effort; we can always revisit it if we want to reuse some bits.
/close

@k8s-ci-robot
Contributor

@Ankitasw: Closed this PR.

In response to this:

@akash-gautam apologies for replying late, but we have now planned a phased approach: enable this first in EventBridge, and then write a separate proposal for node draining. This is the ADR for graduating EventBridge to accommodate different types of events. I will let you know once we start that feature and see if you could contribute to it.

For now, closing this PR to minimize duplicate effort; we can always revisit it if we want to reuse some bits.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@akash-gautam
Contributor Author

@Ankitasw Okay, thanks for the info.

Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/feature (Categorizes issue or PR as related to a new feature.)
needs-priority
needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
release-note-none (Denotes a PR that doesn't merit a release note.)
size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)
Development

Successfully merging this pull request may close these issues.

AWSMachinePool graceful scale down