
(aws-ecs): hanging on deleting a stack with ASG capacity provider #18179

Open
metametadata opened this issue Dec 25, 2021 · 18 comments · Fixed by #23729
Labels: @aws-cdk/aws-ecs · bug · effort/large · p2

Comments

@metametadata

metametadata commented Dec 25, 2021

What is the problem?

The deletion of a stack with an AsgCapacityProvider hangs unexpectedly.

This is surprising, as we didn't have such an issue with the now-deprecated addCapacity, and we have no ECS tasks in the ASG when we delete the stack.

The behaviour seems to be caused by the default enableManagedTerminationProtection = true.

See the discussion in the original closed issue and my unaddressed comment: #14732 (comment).

Reproduction Steps

Please see #14732.

In short, try to delete a stack with an ECS cluster that uses the AsgCapacityProvider defaults.
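
For completeness, a minimal TypeScript sketch of such a stack (the construct names, VPC settings, and instance type are illustrative; the capacity provider is left at its defaults, i.e. managed scaling and managed termination protection both enabled):

    import { App, Stack } from 'aws-cdk-lib';
    import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import * as ecs from 'aws-cdk-lib/aws-ecs';

    const app = new App();
    const stack = new Stack(app, 'ReproStack');

    const vpc = new ec2.Vpc(stack, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(stack, 'Cluster', { vpc });

    const asg = new autoscaling.AutoScalingGroup(stack, 'Asg', {
      vpc,
      instanceType: new ec2.InstanceType('t3.micro'),
      machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
      minCapacity: 1,
    });

    // Defaults apply: enableManagedScaling and enableManagedTerminationProtection are both true.
    const capacityProvider = new ecs.AsgCapacityProvider(stack, 'CapacityProvider', {
      autoScalingGroup: asg,
    });
    cluster.addAsgCapacityProvider(capacityProvider);

Deploying this and then running `cdk destroy` should reproduce the hang.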

What did you expect to happen?

Either:

  • CloudFormation does not hang but fails as fast as possible with an error message about the termination protection.
  • The stack is successfully deleted as there are no running ECS tasks anymore.

What actually happened?

The CF stack got stuck in DELETE_IN_PROGRESS.

CDK CLI Version

2.3.0

Framework Version

2.3.0

Node.js Version

v16.8.0

OS

macOS

Language

Java

Language Version

11.0.8

Other information

Workaround

My current workaround: set AsgCapacityProvider enableManagedTerminationProtection = false.
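
A minimal TypeScript sketch of this workaround, reusing the names from the reproduction sketch above (the only change from the defaults is the flagged property):

    // Disable managed termination protection so instances are not protected from
    // scale-in, allowing the ASG (and therefore the stack) to be deleted.
    const capacityProvider = new ecs.AsgCapacityProvider(stack, 'CapacityProvider', {
      autoScalingGroup: asg,
      enableManagedTerminationProtection: false,
    });
    cluster.addAsgCapacityProvider(capacityProvider);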

Documentation questions/enhancement requests

From https://docs.aws.amazon.com/cdk/api/latest/docs/aws-ecs-readme.html (emphasis mine):

By default, an Auto Scaling Group Capacity Provider will manage the Auto Scaling Group's size for you. It will also enable managed termination protection, in order to prevent EC2 Auto Scaling from terminating EC2 instances that have tasks running on them. If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.

  1. It's not fully clear from the description that the flag simply blocks deletion of the ASG. I got the incorrect impression that it somehow cleverly understands when there are no ECS tasks running and allows deletion in that case.
  2. What are the risks of turning this protection off? E.g. we don't want ECS tasks to shut down at random times.
  3. Is it OK to set enableManagedTerminationProtection=false + enableManagedScaling=true? It seems to work but is against the documentation ("If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.").
@metametadata metametadata added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Dec 25, 2021
@github-actions github-actions bot added the @aws-cdk/aws-ecs Related to Amazon Elastic Container label Dec 25, 2021
@metametadata metametadata changed the title (aws-ecs): hanging on deleting a stack with ASG Capacity provider (aws-ecs): hanging on deleting a stack with ASG capacity provider Dec 25, 2021
@ryparker ryparker added the p2 label Dec 28, 2021
@jcpage

jcpage commented Jan 8, 2022

I see the same thing. It hangs for a LONG time, then finally fails. To work around it, I have to manually terminate the ECS EC2 instance. If I had to guess, it seems related to not being able to shut down all (or "the last"?) instances properly (note I only had one at the time).

@madeline-k madeline-k added effort/large Large work item – several weeks of effort and removed needs-triage This issue or PR still needs to be triaged. labels Jan 25, 2022
@madeline-k madeline-k removed their assignment Jan 25, 2022
@fschollmeyer

Hi everyone,
we have the same issue, not just when deleting a cluster, but also when trying to update the AMI ID used for the cluster.
Updating the MachineImage in the ASG leads to a new LaunchConfiguration and therefore a new Auto Scaling group.
Is there any way around this? Or do we have to write a custom resource to enable and disable termination protection on demand?

@A-Mckinlay

A-Mckinlay commented Mar 8, 2022

Hi everyone, we have the same issue, not just when deleting a cluster, but also when trying to update the AMI ID used for the cluster. Updating the MachineImage in the ASG leads to a new LaunchConfiguration and therefore a new Auto Scaling group. Is there any way around this? Or do we have to write a custom resource to enable and disable termination protection on demand?

@fschollmeyer I have this same issue, did you manage to find a workaround?

@Ten0

Ten0 commented Mar 8, 2022

Hello.
The issue for us comes from the following:

Attempting to remove a cluster through CloudFormation while there are still EC2 instances running results in a failure, with the instances left running perpetually.

Due to the dependencies, the stack sets up (and deletes in reverse order):

  • Cluster
  • EC2 LaunchTemplate
  • Auto scaling group
  • Capacity provider
  • Capacity provider associations with the cluster

On removal, deleting the capacity provider associations and capacity providers means losing the managed termination protection, so currently running instances stay perpetually protected, preventing the stack from being removed.

Our current workaround is to put a DeletionPolicy of Retain on the AWS::ECS::ClusterCapacityProviderAssociations resource. This makes the deletion fail the first time, because the capacity providers can't be removed while they are still referenced by the cluster; the ASG deletion then succeeds because managed termination protection still works, and after that the cluster deletion works (effectively removing the ClusterCapacityProviderAssociations). A subsequent attempt to remove the stack removes the leftover unbound capacity providers successfully.
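
In CDK terms, this could be applied through the escape hatch by putting a retain policy on the generated L1 resource; a rough sketch (the lookup over the cluster's children is an assumption about where the Cluster construct creates the associations resource):

    import { RemovalPolicy } from 'aws-cdk-lib';
    import * as ecs from 'aws-cdk-lib/aws-ecs';

    // Find the AWS::ECS::ClusterCapacityProviderAssociations resource created by
    // the Cluster construct and keep it on stack deletion (DeletionPolicy: Retain).
    for (const child of cluster.node.children) {
      if (child instanceof ecs.CfnClusterCapacityProviderAssociations) {
        child.applyRemovalPolicy(RemovalPolicy.RETAIN);
      }
    }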

@gshpychka
Contributor

A solution that seems to work for me is to create a custom resource that calls deleteAutoScalingGroup on delete (noop on create or on update), and make the capacity provider depend on the custom resource.

@Ten0

Ten0 commented Mar 17, 2022

A solution that seems to work for me is to create a custom resource that calls deleteAutoScalingGroup on delete (noop on create or on update), and make the capacity provider depend on the custom resource.

Shouldn't it be the opposite, that resource depending on the capacity provider association, so that the ASG gets removed before the capacity provider is de-associated from the cluster?

@gshpychka
Contributor

A solution that seems to work for me is to create a custom resource that calls deleteAutoScalingGroup on delete (noop on create or on update), and make the capacity provider depend on the custom resource.

Shouldn't it be the opposite, that resource depending on the capacity provider association, so that the ASG gets removed before the capacity provider is de-associated from the cluster?

The ASG can't be removed if it's still in use by a capacity provider attached to a cluster. The service gets deleted, then the capacity provider, then the association, then the CR force-deletes the ASG.

@Ten0

Ten0 commented Mar 18, 2022

The ASG can't be removed if it's still in use by a capacity provider attached to a cluster. The service gets deleted, then the capacity provider, then the association, then the CR force-deletes the ASG.

Hmm that's weird, both the issue and solution are the opposite way for me. Looks like this order would be the natural order CF would do without a custom resource. In my case, removing the ASG before the capacity providers works, and is even what enables it to properly remove despite instance termination protection. (#18179 (comment))

@gshpychka
Contributor

gshpychka commented Mar 18, 2022

Yes, the order is not the issue. The issue is that Cloudformation doesn't force-delete the ASG, so it fails to delete it if there are instances protected from scale-in. The custom resource force-deletes the ASG, terminating all instances, including protected ones.

My solution doesn't require retrying the deletion, it works in a single pass.

@metametadata
Author

FWIW, I've just noticed that my cluster deletion has failed quickly with DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. That is a bit better than hanging. But it also means that the workaround I described in the first message apparently does not help.

  • My ASG has enableManagedTerminationProtection = false.

  • Versions:

ᐅ cdk --version
2.22.0 (build 1db4b16)
software.amazon.awssdk/ecs "2.17.181"
  • Log:
yuri-cluster |   0 | 6:20:41 PM | DELETE_IN_PROGRESS   | AWS::CloudFormation::Stack                    | yuri-cluster User Initiated
yuri-cluster |   0 | 6:20:44 PM | DELETE_IN_PROGRESS   | AWS::SNS::Subscription                        | asg/DrainECSHook/Function/Topic (asgDrainECSHookFunctionTopicFFE1E612) 
yuri-cluster |   0 | 6:20:44 PM | DELETE_IN_PROGRESS   | AWS::Lambda::Permission                       | asg/DrainECSHook/Function/AllowInvoke:yuriclusterasgLifecycleHookDrainHookTopic7A175731 (asgDrainECSHookFunctionAllowInvokeyuriclusterasgLifecycleHookDrainHookTopic7A175731F8622528) 
yuri-cluster |   0 | 6:20:44 PM | DELETE_IN_PROGRESS   | AWS::ECS::ClusterCapacityProviderAssociations | cluster/cluster (clusterA4C38409) 
yuri-cluster |   0 | 6:20:44 PM | DELETE_IN_PROGRESS   | AWS::AutoScaling::LifecycleHook               | asg/LifecycleHookDrainHook (asgLifecycleHookDrainHook7D987AD1) 
yuri-cluster |   0 | 6:20:44 PM | DELETE_IN_PROGRESS   | AWS::EC2::SecurityGroupIngress                | asg/InstanceSecurityGroup/from yurinetworkbastionsg3D24DB3C:ALL TRAFFIC (asgInstanceSecurityGroupfromyurinetworkbastionsg3D24DB3CALLTRAFFIC8742D7F0) 
yuri-cluster |   1 | 6:20:44 PM | DELETE_COMPLETE      | AWS::SNS::Subscription                        | asg/DrainECSHook/Function/Topic (asgDrainECSHookFunctionTopicFFE1E612) 
yuri-cluster |   2 | 6:20:45 PM | DELETE_COMPLETE      | AWS::EC2::SecurityGroupIngress                | asg/InstanceSecurityGroup/from yurinetworkbastionsg3D24DB3C:ALL TRAFFIC (asgInstanceSecurityGroupfromyurinetworkbastionsg3D24DB3CALLTRAFFIC8742D7F0) 
yuri-cluster |   3 | 6:20:46 PM | DELETE_COMPLETE      | AWS::AutoScaling::LifecycleHook               | asg/LifecycleHookDrainHook (asgLifecycleHookDrainHook7D987AD1) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |   3 | 6:20:46 PM | DELETE_IN_PROGRESS   | AWS::IAM::Policy                              | asg/LifecycleHookDrainHook/Role/DefaultPolicy (asgLifecycleHookDrainHookRoleDefaultPolicy0B1C44ED) 
yuri-cluster |   4 | 6:20:47 PM | DELETE_COMPLETE      | AWS::IAM::Policy                              | asg/LifecycleHookDrainHook/Role/DefaultPolicy (asgLifecycleHookDrainHookRoleDefaultPolicy0B1C44ED) 
yuri-cluster |   4 | 6:20:48 PM | DELETE_IN_PROGRESS   | AWS::IAM::Role                                | asg/LifecycleHookDrainHook/Role (asgLifecycleHookDrainHookRole3C1C981B) 
yuri-cluster |   5 | 6:20:49 PM | DELETE_COMPLETE      | AWS::IAM::Role                                | asg/LifecycleHookDrainHook/Role (asgLifecycleHookDrainHookRole3C1C981B) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |   6 | 6:20:54 PM | DELETE_COMPLETE      | AWS::Lambda::Permission                       | asg/DrainECSHook/Function/AllowInvoke:yuriclusterasgLifecycleHookDrainHookTopic7A175731 (asgDrainECSHookFunctionAllowInvokeyuriclusterasgLifecycleHookDrainHookTopic7A175731F8622528) 
yuri-cluster |   6 | 6:20:55 PM | DELETE_IN_PROGRESS   | AWS::SNS::Topic                               | asg/LifecycleHookDrainHook/Topic (asgLifecycleHookDrainHookTopicC6CABF48) 
yuri-cluster |   6 | 6:20:55 PM | DELETE_IN_PROGRESS   | AWS::Lambda::Function                         | asg/DrainECSHook/Function (asgDrainECSHookFunction4A673AE9) 
yuri-cluster |   7 | 6:20:55 PM | DELETE_COMPLETE      | AWS::SNS::Topic                               | asg/LifecycleHookDrainHook/Topic (asgLifecycleHookDrainHookTopicC6CABF48) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |   8 | 6:21:02 PM | DELETE_COMPLETE      | AWS::Lambda::Function                         | asg/DrainECSHook/Function (asgDrainECSHookFunction4A673AE9) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |   8 | 6:21:03 PM | DELETE_IN_PROGRESS   | AWS::IAM::Policy                              | asg/DrainECSHook/Function/ServiceRole/DefaultPolicy (asgDrainECSHookFunctionServiceRoleDefaultPolicy4BFB0871) 
yuri-cluster |   9 | 6:21:04 PM | DELETE_COMPLETE      | AWS::IAM::Policy                              | asg/DrainECSHook/Function/ServiceRole/DefaultPolicy (asgDrainECSHookFunctionServiceRoleDefaultPolicy4BFB0871) 
yuri-cluster |   9 | 6:21:05 PM | DELETE_IN_PROGRESS   | AWS::IAM::Role                                | asg/DrainECSHook/Function/ServiceRole (asgDrainECSHookFunctionServiceRoleC052B966) 
yuri-cluster |  10 | 6:21:07 PM | DELETE_COMPLETE      | AWS::IAM::Role                                | asg/DrainECSHook/Function/ServiceRole (asgDrainECSHookFunctionServiceRoleC052B966) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |  11 | 6:21:15 PM | DELETE_COMPLETE      | AWS::ECS::ClusterCapacityProviderAssociations | cluster/cluster (clusterA4C38409) 
yuri-cluster |  11 | 6:21:16 PM | DELETE_IN_PROGRESS   | AWS::ECS::CapacityProvider                    | asg-capacity-provider/asg-capacity-provider (asgcapacityprovider23F38F59) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |  12 | 6:21:39 PM | DELETE_COMPLETE      | AWS::ECS::CapacityProvider                    | asg-capacity-provider/asg-capacity-provider (asgcapacityprovider23F38F59) 
yuri-cluster |  12 | 6:21:39 PM | DELETE_IN_PROGRESS   | AWS::AutoScaling::AutoScalingGroup            | asg/ASG (asgASG4D014670) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
 12 Currently in progress: yuri-cluster, asgASG4D014670
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |  13 | 6:23:13 PM | DELETE_COMPLETE      | AWS::AutoScaling::AutoScalingGroup            | asg/ASG (asgASG4D014670) 
yuri-cluster |  13 | 6:23:14 PM | DELETE_IN_PROGRESS   | AWS::AutoScaling::LaunchConfiguration         | asg/LaunchConfig (asgLaunchConfig37FDE42B) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |  14 | 6:23:16 PM | DELETE_COMPLETE      | AWS::AutoScaling::LaunchConfiguration         | asg/LaunchConfig (asgLaunchConfig37FDE42B) 
yuri-cluster |  14 | 6:23:17 PM | DELETE_IN_PROGRESS   | AWS::IAM::Policy                              | asg/InstanceRole/DefaultPolicy (asgInstanceRoleDefaultPolicyFF611E81) 
yuri-cluster |  14 | 6:23:17 PM | DELETE_IN_PROGRESS   | AWS::IAM::InstanceProfile                     | asg/InstanceProfile (asgInstanceProfile4E44E320) 
yuri-cluster |  14 | 6:23:17 PM | DELETE_IN_PROGRESS   | AWS::EC2::SecurityGroup                       | asg/InstanceSecurityGroup (asgInstanceSecurityGroup5CEB2975) 
yuri-cluster |  15 | 6:23:18 PM | DELETE_COMPLETE      | AWS::IAM::Policy                              | asg/InstanceRole/DefaultPolicy (asgInstanceRoleDefaultPolicyFF611E81) 
yuri-cluster |  16 | 6:23:18 PM | DELETE_COMPLETE      | AWS::EC2::SecurityGroup                       | asg/InstanceSecurityGroup (asgInstanceSecurityGroup5CEB2975) 
yuri-cluster |  16 | 6:23:19 PM | DELETE_IN_PROGRESS   | AWS::ECS::Cluster                             | cluster (cluster611F8AFF) 
yuri-cluster |  17 | 6:23:19 PM | DELETE_COMPLETE      | AWS::IAM::InstanceProfile                     | asg/InstanceProfile (asgInstanceProfile4E44E320) 
yuri-cluster |  17 | 6:23:19 PM | DELETE_IN_PROGRESS   | AWS::IAM::Role                                | asg/InstanceRole (asgInstanceRole8AC4201C) 
yuri-cluster |  17 | 6:23:20 PM | DELETE_FAILED        | AWS::ECS::Cluster                             | cluster (cluster611F8AFF) Resource handler returned message: "Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsContainerInstancesException; Request ID: da1c2bd7-3b77-439e-bc7c-2df846bb453d; Proxy: null)'." (RequestToken: 911f3d22-37dc-31f0-036f-586d3c982188, HandlerErrorCode: GeneralServiceException)
yuri-cluster |  18 | 6:23:21 PM | DELETE_COMPLETE      | AWS::IAM::Role                                | asg/InstanceRole (asgInstanceRole8AC4201C) 
yuri-cluster |  18 | 6:23:21 PM | DELETE_FAILED        | AWS::CloudFormation::Stack                    | yuri-cluster The following resource(s) failed to delete: [cluster611F8AFF]. 

Failed resources:
yuri-cluster | 6:23:20 PM | DELETE_FAILED        | AWS::ECS::Cluster                             | cluster (cluster611F8AFF) Resource handler returned message: "Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsContainerInstancesException; Request ID: da1c2bd7-3b77-439e-bc7c-2df846bb453d; Proxy: null)'." (RequestToken: 911f3d22-37dc-31f0-036f-586d3c982188, HandlerErrorCode: GeneralServiceException)

@elliot-nelson

The solution suggested by @gshpychka works great for us. In our case, we were experiencing the same problem, not with a capacity provider but with a custom termination policy lambda.

Normally, the CDK wants to delete the ASG, which triggers a scale-in that waits for instances to terminate, but while that happens the CDK is dismantling the roles and permissions of the custom termination policy lambda, so it can no longer tell the ASG that any instances are safe to terminate.

In this case you can create the custom resource, then make it depend on the ASG. That forces your CR to be deleted before the ASG; the CR's delete call force-deletes the ASG, preventing it from calling the custom termination policy.

    // 'cr' here is the custom resources module: aws-cdk-lib/custom-resources.
    // On stack deletion, force-delete the ASG so that instances protected from
    // scale-in do not block the teardown. No-op on create and update.
    const asgForceDelete = new cr.AwsCustomResource(this, 'AsgForceDelete', {
      onDelete: {
        service: 'AutoScaling',
        action: 'deleteAutoScalingGroup',
        parameters: {
          AutoScalingGroupName: this.autoScalingGroup.autoScalingGroupName,
          ForceDelete: true
        }
      },
      policy: cr.AwsCustomResourcePolicy.fromSdkCalls({
        resources: cr.AwsCustomResourcePolicy.ANY_RESOURCE
      })
    });
    // Depend on the ASG so this custom resource is deleted (and runs its onDelete
    // call) before CloudFormation tries to delete the ASG itself.
    asgForceDelete.node.addDependency(this.autoScalingGroup);

@frjonsen

frjonsen commented Nov 8, 2022

The solution above mostly works. Note that if any changes are made to the resource that cause it to be deleted and recreated, it will also delete the cluster, which will of course not be recreated, leaving the stack in a drifted state.

@ryparker
Contributor

ryparker commented Jan 18, 2023

After digging into this and reading through the mentioned CloudFormation issue, it seems to me like this is a situation that CloudFormation is working to fix and improve. At the least, we should be getting a relatively quick error from CloudFormation rather than having to wait for the timeout. From my research, it wasn't clear to me whether CloudFormation intends for ASGs configured with managedTerminationProtection: 'ENABLED' to be automatically cleaned up by CloudFormation. It may turn out that they decide to require manually disabling the instances' scale-in protection, similar to how non-empty S3 buckets are handled by CloudFormation today. If we can get a definitive answer on this and that turns out to be the case, then we should probably look into adding an opt-in custom resource for ASG cleanup (similar to how we handle auto-deleting objects in a Bucket via autoDeleteObjects).

In the meantime, I've created a PR that improves some of our documentation for the enableManaged* options. I've also added a note about the delete behavior to the ECS README (the ECS overview doc page), along with a link to this issue for anyone interested in workarounds such as the custom resource solution that @elliot-nelson suggested (thanks for sharing!).

@mergify mergify bot closed this as completed in 23634fd Jan 18, 2023
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@Ten0

Ten0 commented Jan 18, 2023

@ryparker I think in "Related to but does not fix: #18179" the bot may have captured "fix: #18179" ^^
Issue should probably be reopened.

@nathanpeck
Member

Hey all, I've created a reference CloudFormation template that demonstrates how to avoid this issue. The end-to-end solution for the capacity provider with working teardown can be found here: https://containersonaws.com/pattern/ecs-ec2-capacity-provider-scaling

You can also refer directly to the sample code for the Lambda function here: https://github.com/aws-samples/container-patterns/blob/main/pattern/ecs-ec2-capacity-provider-scaling/files/cluster-capacity-provider.yml#L48-L123

In short, this solution implements a custom ASG destroyer resource, which is used to force-delete the ASG so that it does not block the CloudFormation stack teardown.
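
For reference, the core of such a destroyer boils down to a single force-delete call; a minimal TypeScript sketch of that call using the AWS SDK for JavaScript v3 (the Lambda in the reference template may be implemented differently):

    import {
      AutoScalingClient,
      DeleteAutoScalingGroupCommand,
    } from '@aws-sdk/client-auto-scaling';

    const client = new AutoScalingClient({});

    // ForceDelete terminates all remaining instances, including ones that are
    // protected from scale-in, instead of waiting for them to drain.
    async function destroyAsg(autoScalingGroupName: string): Promise<void> {
      await client.send(
        new DeleteAutoScalingGroupCommand({
          AutoScalingGroupName: autoScalingGroupName,
          ForceDelete: true,
        }),
      );
    }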

@simi-obs

Hello there fellas, I had been using the workaround of force-deleting the ASG with a custom resource for some time, and it worked great.

Lately (in the last few weeks), I have started to get the following error:

Resource handler returned message: "Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsContainerInstancesException

How is this possible? The ASG is fully deleted before the cluster deletion is initiated (I can see it in the CloudFormation events, and the ASG resource depends on the cluster). If the ASG is deleted, all of its instances should be deleted as well.

See the attached screenshot of CF events as well


What is with the sudden behavior change?

@waissbluth

waissbluth commented Oct 3, 2024

In short, this solution implements a custom ASG destroyer resource, which is used to force kill the ASG so that it does not block the CloudFormation stack teardown.

Thank you @nathanpeck. Is this solution also meant to solve the problem of CloudFormation getting stuck when making updates (e.g., changing the AMI, as @fschollmeyer pointed out)? From looking at the custom resource, it does not look like it would have an effect on an ASG swapping instances during a rolling update. Perhaps it would for a replacing update?

Thanks!
