(aws-ecs): hanging on deleting a stack with ASG capacity provider #18179
Comments
I see the same thing. It hangs for a LONG time, then finally fails. To work around it, I have to go manually terminate the ECS EC2 instance. If I had to guess, it seems related to not being able to shut down all (or "the last"?) instances properly (note I only had one at the time).
Hi everyone,
@fschollmeyer I have this same issue; did you manage to find a workaround?
Hello. Attempting to remove a cluster through CloudFormation while there are still EC2 instances running results in a failure, with the instances left running perpetually. The stack sets up fine due to the dependencies (and deletes in reverse order).
On removal, deleting the capacity provider associations and the capacity providers means that termination protection is no longer managed, so currently running instances stay perpetually protected, preventing the stack from being removed. Our current workaround is to put
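For anyone stuck in this state, here is a minimal cleanup sketch (not the workaround referenced above, which is cut off; the ASG name and the use of the AWS SDK for JavaScript v3 are assumptions of mine). It removes scale-in protection from the remaining instances so a retried stack deletion can proceed:

```ts
// One-off script: clear scale-in protection on every instance in the orphaned
// ASG, since the capacity provider that used to manage it is already gone.
import {
  AutoScalingClient,
  DescribeAutoScalingGroupsCommand,
  SetInstanceProtectionCommand,
} from '@aws-sdk/client-auto-scaling';

const client = new AutoScalingClient({});
const asgName = 'my-stuck-asg'; // placeholder: the ASG left behind by the stack

async function unprotectInstances(): Promise<void> {
  const { AutoScalingGroups } = await client.send(
    new DescribeAutoScalingGroupsCommand({ AutoScalingGroupNames: [asgName] }),
  );
  const instanceIds = (AutoScalingGroups?.[0]?.Instances ?? [])
    .map((i) => i.InstanceId)
    .filter((id): id is string => !!id);
  if (instanceIds.length === 0) return;

  await client.send(
    new SetInstanceProtectionCommand({
      AutoScalingGroupName: asgName,
      InstanceIds: instanceIds,
      ProtectedFromScaleIn: false,
    }),
  );
}

unprotectInstances().catch(console.error);
```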
A solution that seems to work for me is to create a custom resource that calls deleteAutoScalingGroup with ForceDelete: true (the full snippet is in a later comment below).
Shouldn't it be the opposite, with that resource depending on the capacity provider association, so that the ASG gets removed before the capacity provider is disassociated from the cluster?
The ASG can't be removed if it's still in use by a capacity provider attached to a cluster. The service gets deleted, then the capacity provider, then the association, and then the CR force-deletes the ASG.
Hmm, that's weird: both the issue and the solution are the opposite way for me. That looks like the natural order CloudFormation would use without a custom resource. In my case, removing the ASG before the capacity providers works, and is even what enables the stack to be removed properly despite instance termination protection. (#18179 (comment))
Yes, the order is not the issue. The issue is that CloudFormation doesn't force-delete the ASG, so it fails to delete it if there are instances protected from scale-in. The custom resource force-deletes the ASG, terminating all instances, including protected ones. My solution doesn't require retrying the deletion; it works in a single pass.
FWIW, I've just noticed that my cluster deletion has failed quickly with
The solution suggested by @gshpychka works great for us. In our case, we were experiencing the same problem, not with a capacity provider but with a custom termination policy Lambda. Normally, the CDK wants to delete the ASG, which triggers a scale-in that waits for instances to terminate; but while that happens, the CDK is dismantling the roles and permissions of the custom termination policy Lambda, so it can no longer tell the ASG that any instances are safe to terminate. In this case you can create the custom resource and then make it depend on the ASG. That forces your CR to be deleted before the ASG; the CR then force-deletes the ASG, preventing it from ever calling the custom termination policy.

const asgForceDelete = new cr.AwsCustomResource(this, 'AsgForceDelete', {
  // On deletion, force-delete the ASG so protected instances are terminated too.
  onDelete: {
    service: 'AutoScaling',
    action: 'deleteAutoScalingGroup',
    parameters: {
      AutoScalingGroupName: this.autoScalingGroup.autoScalingGroupName,
      ForceDelete: true
    }
  },
  policy: cr.AwsCustomResourcePolicy.fromSdkCalls({
    resources: cr.AwsCustomResourcePolicy.ANY_RESOURCE
  })
});

// The dependency makes CloudFormation delete the custom resource (and run the
// force-delete) before it attempts to delete the ASG itself.
asgForceDelete.node.addDependency(this.autoScalingGroup);
The solution above mostly works. Note that if any change is made to the custom resource that causes it to be deleted and recreated, it will also delete the cluster, which will of course not be recreated, leaving the stack drifting.
After digging into this and reading through the mentioned CloudFormation issue, it seems to me like this is a situation that CloudFormation is working to fix and improve. At the least, we should be getting a relatively quick error from CloudFormation rather than having to wait for the timeout.
From my research it wasn't clear to me if CloudFormation intends for ASGs configured with
In the meantime I've created a PR that improves some of our documentation for the
Hey all, I've created a reference CloudFormation template that demonstrates how to avoid this issue. The end-to-end solution for the capacity provider with a working teardown can be found here: https://containersonaws.com/pattern/ecs-ec2-capacity-provider-scaling. You can also refer directly to the sample code for the Lambda function here: https://github.com/aws-samples/container-patterns/blob/main/pattern/ecs-ec2-capacity-provider-scaling/files/cluster-capacity-provider.yml#L48-L123. In short, this solution implements a custom ASG destroyer resource, which is used to force-delete the ASG so that it does not block the CloudFormation stack teardown.
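For reference, a rough TypeScript sketch of the idea behind that pattern (the linked sample is a CloudFormation template with an inline Lambda; the handler shape, names, and error handling here are assumptions of mine, targeting a Node.js 18+ Lambda runtime where fetch is global):

```ts
// Custom-resource handler: on stack deletion, force-delete the ASG named in
// the resource properties, then report the result back to CloudFormation.
import {
  AutoScalingClient,
  DeleteAutoScalingGroupCommand,
} from '@aws-sdk/client-auto-scaling';

const autoscaling = new AutoScalingClient({});

export const handler = async (event: any): Promise<void> => {
  let status = 'SUCCESS';
  let reason = '';
  try {
    if (event.RequestType === 'Delete') {
      // ForceDelete terminates all instances, including scale-in-protected ones.
      // Note: the call returns before termination finishes; a production version
      // might also wait for the ASG to disappear, and should treat an
      // already-deleted ASG as success.
      await autoscaling.send(
        new DeleteAutoScalingGroupCommand({
          AutoScalingGroupName: event.ResourceProperties.AutoScalingGroupName,
          ForceDelete: true,
        }),
      );
    }
  } catch (err: any) {
    status = 'FAILED';
    reason = err?.message ?? String(err);
  }

  // Signal CloudFormation so the stack operation can continue.
  await fetch(event.ResponseURL, {
    method: 'PUT',
    body: JSON.stringify({
      Status: status,
      Reason: reason,
      PhysicalResourceId: event.PhysicalResourceId ?? 'asg-destroyer',
      StackId: event.StackId,
      RequestId: event.RequestId,
      LogicalResourceId: event.LogicalResourceId,
    }),
  });
};
```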
Hello there, fellas. I was using the workaround of force-deleting the ASG with a custom resource for some time and it worked great. Lately (the last few weeks), I have started to get the following error:
How is this possible? The ASG is fully deleted before the cluster delete is initiated (I can see it in the CloudFormation events, and the ASG resource depends on the cluster). If the ASG is deleted, all of its instances should be deleted as well. See the attached screenshot of the CF events. What is with the sudden behaviour change?
Thank you @nathanpeck. Is this solution also meant to solve the problem of CloudFormation getting stuck when making updates (e.g., changing the AMI, as @fschollmeyer pointed out)? From looking at the custom resource, it does not look like it would have an effect on an ASG swapping instances during a rolling update. Perhaps it would for a replacing update? Thanks!
What is the problem?
The deletion of a stack with AsgCapacityProvider hangs unexpectedly. This is surprising, as we didn't have such an issue with the now-deprecated addCapacity, and we have no ECS tasks in the ASG when we delete the stack. The behaviour seems to be caused by the default enableManagedTerminationProtection = true.
See the discussion in the original closed issue and my unaddressed comment: #14732 (comment).
Reproduction Steps
Please see #14732.
In short, try to delete a stack with an ECS cluster that uses AsgCapacityProvider defaults; a minimal setup is sketched below.
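A minimal sketch of a setup that exercises these defaults (the VPC, instance type, and construct IDs are assumptions of mine; only the AsgCapacityProvider usage mirrors the report):

```ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';

export class ReproStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

    const asg = new autoscaling.AutoScalingGroup(this, 'Asg', {
      vpc,
      instanceType: new ec2.InstanceType('t3.micro'),
      machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
      minCapacity: 1,
    });

    // Defaults apply here: enableManagedTerminationProtection is true, so ECS
    // manages scale-in protection on the instances.
    const capacityProvider = new ecs.AsgCapacityProvider(this, 'CapacityProvider', {
      autoScalingGroup: asg,
    });
    cluster.addAsgCapacityProvider(capacityProvider);
  }
}
```

Deleting this stack is the scenario that, per the report, gets stuck in DELETE_IN_PROGRESS.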
What did you expect to happen?
Either:
What actually happened?
The CF stack got stuck in DELETE_IN_PROGRESS.
CDK CLI Version
2.3.0
Framework Version
2.3.0
Node.js Version
v16.8.0
OS
macOS
Language
Java
Language Version
11.0.8
Other information
Workaround
My current workaround: set AsgCapacityProvider's enableManagedTerminationProtection = false; see the sketch below.
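A minimal sketch of this workaround, assuming an existing cluster and ASG (the variable and construct names are illustrative):

```ts
// Keep managed scaling but turn off managed termination protection, so
// CloudFormation can scale the ASG in and delete it during stack teardown.
const capacityProvider = new ecs.AsgCapacityProvider(this, 'CapacityProvider', {
  autoScalingGroup: asg,
  enableManagedScaling: true,                // the default, shown for clarity
  enableManagedTerminationProtection: false, // avoids the hang on stack delete
});
cluster.addAsgCapacityProvider(capacityProvider);
```

This is also the enableManagedScaling=true plus enableManagedTerminationProtection=false combination asked about in the documentation question below.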
Documentation questions/enhancement requests
From https://docs.aws.amazon.com/cdk/api/latest/docs/aws-ecs-readme.html (emphasis mine):
Is it valid to combine enableManagedTerminationProtection=false with enableManagedScaling=true? It seems to work, but it is against the documentation ("If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.").