Graceful shutdown of machinepool nodes #3256
Conversation
@akash-gautam: This issue is currently awaiting triage. If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the appropriate triage label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @akash-gautam. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @akash-gautam, thanks for the PR. A few considerations here: I think a proposal, or at least an ADR, would be good for this issue, considering there could be multiple ways to achieve this. cc @richardcase
clusterScope.Error(err, "non-fatal: failed to receive messages from instance state queue")
	return
}
for _, msg := range resp.Messages {
If I understand correctly, at each reconcile this checks whether there are any messages. If the queue is only checked at each reconcile, we might be too late for some termination notifications, or delay scale-down.
Instead, we can trigger an immediate reconcile when a message is received by checking messages in a separate goroutine, like the existing EventBridge implementation:
for range time.Tick(1 * time.Second) {
WDYT?
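For illustration, here is a minimal sketch of what such a background polling goroutine could look like with the AWS SDK for Go. It is not the PR's actual code; the queue URL, function names, and one-second interval are assumptions, and a real controller would enqueue a reconcile request instead of logging.

```go
package main

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// pollInstanceStateQueue loops forever, reading messages from queueURL and
// passing each message body to onMessage (e.g. a func that enqueues an
// immediate reconcile for the owning AWSMachinePool).
func pollInstanceStateQueue(client *sqs.SQS, queueURL string, onMessage func(body string)) {
	for range time.Tick(1 * time.Second) { // mirrors the EventBridge-style polling loop
		resp, err := client.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(10),
		})
		if err != nil {
			log.Printf("non-fatal: failed to receive messages from instance state queue: %v", err)
			continue
		}
		for _, msg := range resp.Messages {
			onMessage(aws.StringValue(msg.Body))
			// Delete the message once handled so it is not redelivered.
			if _, err := client.DeleteMessage(&sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			}); err != nil {
				log.Printf("failed to delete queue message: %v", err)
			}
		}
	}
}

func main() {
	sess := session.Must(session.NewSession())
	go pollInstanceStateQueue(sqs.New(sess), "https://sqs.example/queue", func(body string) {
		// A real controller would trigger reconciliation here, e.g. via a
		// channel-based watch or by patching the AWSMachinePool.
		log.Printf("received instance state message: %s", body)
	})
	select {} // keep the demo running
}
```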
Yes, we can try doing something similar here as well.
If we go with the separate goroutine, do we actually have to patch the machine pool, or is there a way to kick off reconciliation some other way?
@sedefsavas
This is a very nice feature btw 💯
@sedefsavas any news here?
How about graduating the EventBridge experimental feature and repurposing it to get event notifications for ASG scale-in as well? @Ankitasw do you want to start a document or ADR for redesigning EventBridge to be flexible enough to reuse for instance status updates, ASG scale-in, and spot instance termination notices? We can collaborate there, and once that work is done, the draining logic will just do the draining every time a relevant event is observed. In the current EventBridge implementation, when an event is received, the relevant machine reconciliation is triggered. That was enough for this use case, because the machine controller can see the instance status during reconciliation, so it does not need to know the nature of the event. But for scale-in events there is probably no instance status change, so those events need to be processed by the controller itself.
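As a rough sketch of that branching (hypothetical function and type names; the EventBridge detail-type strings are assumed from AWS conventions, not taken from this repository), a graduated handler might look like this: instance state-change events just trigger the owning machine's reconciliation, while scale-in lifecycle events and spot interruption warnings are handled in the controller itself.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// eventBridgeEvent mirrors the top-level fields of an EventBridge event.
type eventBridgeEvent struct {
	Source     string          `json:"source"`
	DetailType string          `json:"detail-type"`
	Detail     json.RawMessage `json:"detail"`
}

func handleEvent(raw []byte) error {
	var evt eventBridgeEvent
	if err := json.Unmarshal(raw, &evt); err != nil {
		return fmt.Errorf("unmarshalling event: %w", err)
	}
	switch evt.DetailType {
	case "EC2 Instance State-change Notification":
		// Existing behaviour: enqueue a reconcile for the machine owning the
		// instance; the controller reads the new status during reconcile.
		fmt.Println("trigger machine reconciliation")
	case "EC2 Instance-terminate Lifecycle Action":
		// ASG scale-in: no instance status change to observe, so drain the
		// node and complete the lifecycle hook directly from the controller.
		fmt.Println("drain node, then complete lifecycle hook")
	case "EC2 Spot Instance Interruption Warning":
		// Spot termination notice: start draining before the instance is reclaimed.
		fmt.Println("start draining immediately")
	default:
		fmt.Printf("ignoring event %q from %q\n", evt.DetailType, evt.Source)
	}
	return nil
}

func main() {
	_ = handleEvent([]byte(`{"source":"aws.autoscaling","detail-type":"EC2 Instance-terminate Lifecycle Action","detail":{}}`))
}
```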
@sedefsavas yes, I am already working on a plan for graduating EventBridge to cater for instance status updates, ASG scale-in, and spot instance termination notices, so that these can feed the draining logic in MachinePools and Spot Instances.
Yes, I think that's a great idea @sedefsavas 👍
@Ankitasw If there are any tasks/issues in the EventBridge graduation plan that I can take up, please let me know.
Created an issue for the EventBridge graduation: #3414
/ok-to-test
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
…for works for instances launched by an ASG (awsmachinepool) Signed-off-by: akash-gautam <gautamakash04@gmail.com>
Add a method named CompleteLifeCycleEvent to the ASGInterface interface. This method is used to complete the pre-deletion lifecycle hook for EC2 instances in an ASG (awsmachinepool). Signed-off-by: akash-gautam <gautamakash04@gmail.com>
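For context, a rough, hypothetical sketch of what such a helper could look like using the AWS SDK's CompleteLifecycleAction call, which lets the ASG proceed with terminating an instance once draining has finished. The function name and parameters here are illustrative, not the PR's actual signature.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// completeLifecycleEvent signals the pre-termination lifecycle hook for one
// instance, allowing the scale-in to continue.
func completeLifecycleEvent(client *autoscaling.AutoScaling, asgName, hookName, instanceID string) error {
	_, err := client.CompleteLifecycleAction(&autoscaling.CompleteLifecycleActionInput{
		AutoScalingGroupName:  aws.String(asgName),
		LifecycleHookName:     aws.String(hookName),
		InstanceId:            aws.String(instanceID),
		LifecycleActionResult: aws.String("CONTINUE"), // or "ABANDON" to fail the hook
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	if err := completeLifecycleEvent(autoscaling.New(sess), "my-machinepool-asg", "pre-terminate-hook", "i-0123456789abcdef0"); err != nil {
		fmt.Println("failed to complete lifecycle action:", err)
	}
}
```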
Watch the lifecycle transitions of nodes in awsmachinepools; as soon as a node enters the termination-wait state, cordon it and evict its pods. After pod eviction, complete the lifecycle hook that prevents immediate deletion of the node on a scale-in event. Signed-off-by: akash-gautam <gautamakash04@gmail.com>
@akash-gautam: The following tests failed; say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@akash-gautam apologies for replying late, but we have now planned a phased approach: enable this first in EventBridge, and then write a separate proposal for node draining. This is the ADR for graduating EventBridge to accommodate different types of events. I will let you know once we start that feature and see if you could contribute to it. For now, I am closing this PR to minimize duplicate effort; we can always revisit it if we want to reuse some bits.
@Ankitasw: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@Ankitasw Okay, thanks for the info.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Ensures graceful shutdown of nodes in awsmachinepools (AWS ASG).
Watch the lifecycle transitions of nodes in awsmachinepools.
As soon as a node enters the termination-wait state, cordon it and evict its pods.
Once pod eviction is done, complete the lifecycle hook that prevents immediate deletion of the node on a scale-in event (a rough sketch of this drain flow follows below).
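The sketch below illustrates the drain step described above with client-go: cordon the node, evict its pods, and only then complete the ASG lifecycle hook (as in the CompleteLifecycleAction sketch earlier) so the instance can be terminated. Function and variable names are illustrative and not taken from the PR.

```go
package drain

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// drainNode cordons nodeName and evicts every pod scheduled on it. Once it
// returns without error, the caller can complete the ASG lifecycle hook.
func drainNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// Cordon: mark the node unschedulable so no new pods land on it.
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("getting node: %w", err)
	}
	node.Spec.Unschedulable = true
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return fmt.Errorf("cordoning node: %w", err)
	}

	// Evict all pods running on the node; evictions respect PodDisruptionBudgets.
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return fmt.Errorf("listing pods: %w", err)
	}
	for _, pod := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			return fmt.Errorf("evicting pod %s/%s: %w", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}
```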
Which issue(s) this PR fixes
Fixes #2574
Release note: