(aws-ecs): Can't delete a stack with ASG Capacity providers #14732

Closed
hnrc opened this issue May 17, 2021 · 7 comments
Labels: @aws-cdk/aws-ecs (Related to Amazon Elastic Container), bug (This issue is a bug.), p2

Comments

@hnrc

hnrc commented May 17, 2021

It seems impossible to gracefully delete an ECS cluster that is associated with an ASG capacity provider.
CloudFormation hangs and never finishes unless one manually deletes the ASG.

Reproduction Steps

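  1. Create an ECS cluster with:

```typescript
// Imports assumed for CDK v1 (the version listed under Environment).
import { AutoScalingGroup } from '@aws-cdk/aws-autoscaling';
import { InstanceType } from '@aws-cdk/aws-ec2';
import { AsgCapacityProvider, Cluster, EcsOptimizedImage } from '@aws-cdk/aws-ecs';

// The code below runs inside a construct or stack where `vpc` and `props` are in scope.
const cluster = new Cluster(this, 'EcsCluster', {
  vpc,
  clusterName: props.clusterName,
});

const autoScalingGroup = new AutoScalingGroup(this, 'Asg', {
  vpc,
  machineImage: EcsOptimizedImage.amazonLinux2(),
  instanceType: new InstanceType('t3.micro'),
  minCapacity: 1,
  maxCapacity: 100,
});

const capacityProvider = new AsgCapacityProvider(
  this,
  'AsgCapacityProvider',
  {
    autoScalingGroup,
    capacityProviderName: props.clusterName,
  },
);
cluster.addAsgCapacityProvider(capacityProvider);
```

  2. Delete the stack (I did it through the AWS Console)
  3. Wait for it...
  4. Go grab a cup of ☕
  5. Realize that the stack deletion will never finish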

What did you expect to happen?

The CF stack should be properly and gracefully removed.

What actually happened?

The CF stack got stuck in DELETE_IN_PROGRESS

AWS::EC2::InternetGateway

The internetGateway 'igw-03ec296b77d21956f' has dependencies and cannot be deleted. (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: f50851df-172c-4365-b761-e6b710f5b30b; Proxy: null)

AWS::ECS::Cluster

Resource handler returned message: "Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsContainerInstancesException; Request ID: 005e0a22-5547-44da-a51e-5e6b45b39b84; Proxy: null)'." (RequestToken: 50d43055-7cc8-6306-0de1-48c93e63cf96, HandlerErrorCode: GeneralServiceException)

AWS::AutoScaling::LaunchConfiguration

Cannot delete launch configuration lulz-cluster-AsgLaunchConfig6D4F96BB-15LZGM814H5M4 because it is attached to AutoScalingGroup lulz-cluster-AsgASGD1D7B4E2-R02BX4676AJJ (Service: AmazonAutoScaling; Status Code: 400; Error Code: ResourceInUse; Request ID: 3583ab1b-7c1a-47de-929a-67cb705f684f; Proxy: null)

AWS::AutoScaling::AutoScalingGroup

Group did not stabilize. {current/minSize/maxSize} group size = {1/0/0}.

The stack finished deleting after I manually removed the ASG.

Environment

  • CDK CLI Version: 1.104.0
  • Framework Version: 1.104.0
  • Node.js Version: 14.16.1
  • OS: macOS 10.15.7
  • Language (Version): TypeScript 4.2.4

Other

This seems like a related discussion: aws/containers-roadmap#631 (comment)


This is a 🐛 Bug Report

@hnrc hnrc added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 17, 2021
@hnrc hnrc changed the title from "(aws-ecs): Can't delete a stack with ASG Capacity Provider" to "(aws-ecs): Can't delete a stack with ASG Capacity providers" May 17, 2021
@peterwoodworth peterwoodworth added the @aws-cdk/aws-ecs Related to Amazon Elastic Container label May 17, 2021
@SoManyHs SoManyHs added p2 and removed needs-triage This issue or PR still needs to be triaged. labels Jun 1, 2021
@SoManyHs
Contributor

SoManyHs commented Jun 4, 2021

I believe this happens because managed termination protection defaults to true. CloudFormation therefore cannot delete the ASG associated with the ASG capacity provider, because its instances are protected from scale-in.

Two ways you can get around this:

  1. Manually delete the ASG using the AWS EC2 console or AWS CLI, then delete the CFN stack.
  2. Set enableManagedTerminationProtection on the ecs.AsgCapacityProvider to false. This will allow you to run cdk destroy as usual.

See: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ecs-capacityprovider-autoscalinggroupprovider.html#cfn-ecs-capacityprovider-autoscalinggroupprovider-managedterminationprotection
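
For reference, workaround 2 might look roughly like the following in CDK v1 (the version reported in this issue). This is a minimal sketch, not an official recommendation: the stack class name `ClusterStack`, the `Vpc` construct, and its `maxAzs` setting are assumptions added so the fragment is complete; the ASG settings are copied from the reproduction steps.

```typescript
import * as autoscaling from '@aws-cdk/aws-autoscaling';
import * as ec2 from '@aws-cdk/aws-ec2';
import * as ecs from '@aws-cdk/aws-ecs';
import * as cdk from '@aws-cdk/core';

// Hypothetical stack used only to give the snippet a complete scope.
export class ClusterStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Assumption: a fresh VPC; the original report passes in an existing one.
    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });

    const cluster = new ecs.Cluster(this, 'EcsCluster', { vpc });

    const autoScalingGroup = new autoscaling.AutoScalingGroup(this, 'Asg', {
      vpc,
      machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
      instanceType: new ec2.InstanceType('t3.micro'),
      minCapacity: 1,
      maxCapacity: 100,
    });

    const capacityProvider = new ecs.AsgCapacityProvider(this, 'AsgCapacityProvider', {
      autoScalingGroup,
      // Workaround 2: opt out of managed termination protection so the
      // ASG's instances are not protected from scale-in at teardown.
      enableManagedTerminationProtection: false,
    });
    cluster.addAsgCapacityProvider(capacityProvider);
  }
}
```

The trade-off is that, with the protection off, EC2 Auto Scaling is no longer prevented from terminating instances that still have tasks running on them.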

I am still investigating if there are other ways around this while still creating an ASG with managed termination protection, but see if either of the above works for you @hnrc !

@SoManyHs SoManyHs added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Jun 4, 2021
@hnrc
Author

hnrc commented Jun 5, 2021

> I am still investigating if there are other ways around this while still creating an ASG with managed termination protection, but see if either of the above works for you @hnrc !

Thanks for looking into this.

Both workarounds work for me.
I think we can live with having managed termination protection disabled, so this is no longer very critical (for me personally).

@SoManyHs
Contributor

SoManyHs commented Jun 7, 2021

Great, glad the workarounds helped! Closing this issue.

@SoManyHs SoManyHs closed this as completed Jun 7, 2021
@SoManyHs SoManyHs removed the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Jun 7, 2021
@github-actions

github-actions bot commented Jun 7, 2021

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

lucamilanesio pushed a commit to GerritCodeReview/aws-gerrit that referenced this issue Nov 30, 2021
Deleting stacks that use ECS clusters with capacityProviders (i.e.
the dual-primary and primary-replica recipes) fails with:

```
The Cluster cannot be deleted while Container Instances are active
or draining.
```

This issue also manifests itself via Terraform [1] or CDK
[2].

Explicitly deleting the Autoscaling Groups _before_ the ECS cluster
deletion fixes the problem, since it ensures that no instances are
active or draining, as the error suggests.

This is safe to do, because prior to deleting the Autoscaling Groups,
every ECS service has already been destroyed, thus no instance is
actually running.

[1] hashicorp/terraform-provider-aws#4852
[2] aws/aws-cdk#14732
Bug: Issue 14698
Change-Id: I216307ef88bd7b7317706d2dc0a6a6e6fb367bd4

Change-Id: I27ece0f6971b157a474d91d7f3d9243dcff596e6
@metametadata

metametadata commented Dec 11, 2021

@SoManyHs

We are experiencing the same issue with the defaults in addAsgCapacityProvider. This was surprising, since we had no such issue with the now-deprecated addCapacity and there are no ECS tasks in the ASG when we delete the stack.

  1. Feature request. Ideally, CloudFormation should not hang but should fail as fast as possible with an error message about the termination protection.

  2. Documentation enhancement request. From https://docs.aws.amazon.com/cdk/api/latest/docs/aws-ecs-readme.html:

    By default, an Auto Scaling Group Capacity Provider will manage the Auto Scaling Group's size for you. It will also enable managed termination protection, in order to prevent EC2 Auto Scaling from terminating EC2 instances that have tasks running on them. If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.

    It's not fully clear from the description that the flag simply blocks deletion of the ASG. I got the incorrect impression that it somehow cleverly understands that there are no ECS tasks running and allows deletion in that case.

  3. Question/documentation enhancement request. We'll likely have to set enableManagedTerminationProtection to false in our automated undeploy code. But what are the risks of turning this protection off? E.g. we don't want ECS tasks to shut down at random times.

  4. Question/documentation enhancement request. Is it OK to set enableManagedTerminationProtection=false + enableManagedScaling=true? It seems to work but is against the documentation ("If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.").

edit: I later created a new issue for this: #18179.

@gshpychka
Contributor

@SoManyHs this is still an issue, since we have to use manual hacks to destroy the stack.

@nathanpeck
Member

Hey all, I've created a reference CloudFormation template that demonstrates how to avoid this issue. The end to end solution for the capacity provider with working teardown can be found here: https://containersonaws.com/pattern/ecs-ec2-capacity-provider-scaling

You can also refer directly to the sample code for the Lambda function here: https://github.com/aws-samples/container-patterns/blob/main/pattern/ecs-ec2-capacity-provider-scaling/files/cluster-capacity-provider.yml#L48-L123

In short, this solution implements a custom ASG destroyer resource, which is used to force kill the ASG so that it does not block the CloudFormation stack teardown.

A similar approach could be implemented in CDK. I've added a todo item for me to make a CDK specific example as well.
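
A minimal TypeScript sketch of the destroyer idea, assuming a Lambda-backed custom resource that receives the ASG name as a resource property. The names `CustomResourceEvent`, `AsgClient`, `planForceDelete`, and `onEvent` are hypothetical, and `AsgClient` stands in for a real SDK client so the sketch runs without dependencies. The `AutoScalingGroupName` and `ForceDelete` parameters are those of the Auto Scaling `DeleteAutoScalingGroup` API. This is not the code from the linked template.

```typescript
// Minimal shape of a CloudFormation custom resource event (only the fields used here).
interface CustomResourceEvent {
  RequestType: 'Create' | 'Update' | 'Delete';
  PhysicalResourceId?: string;
  ResourceProperties: { AutoScalingGroupName: string };
}

// Parameters of the Auto Scaling DeleteAutoScalingGroup API call.
interface DeleteAsgParams {
  AutoScalingGroupName: string;
  ForceDelete: boolean;
}

// Hypothetical client abstraction; a real handler would wrap an AWS SDK client.
interface AsgClient {
  deleteAutoScalingGroup(params: DeleteAsgParams): Promise<void>;
}

// Decide which delete call to make, or return undefined when nothing should happen.
function planForceDelete(event: CustomResourceEvent): DeleteAsgParams | undefined {
  if (event.RequestType !== 'Delete') {
    return undefined; // Create and Update are no-ops for a destroyer resource.
  }
  return {
    AutoScalingGroupName: event.ResourceProperties.AutoScalingGroupName,
    ForceDelete: true, // delete the group together with its instances
  };
}

// Handler: force-delete the ASG only when the custom resource itself is being deleted.
async function onEvent(
  event: CustomResourceEvent,
  client: AsgClient,
): Promise<{ PhysicalResourceId: string }> {
  const params = planForceDelete(event);
  if (params !== undefined) {
    await client.deleteAutoScalingGroup(params);
  }
  const fallbackId = `asg-destroyer-${event.ResourceProperties.AutoScalingGroupName}`;
  return { PhysicalResourceId: event.PhysicalResourceId ?? fallbackId };
}
```

For this to help, the custom resource would have to be ordered so that CloudFormation deletes it before the ASG itself. The return value assumes a provider framework that reports the result back to CloudFormation; a bare Lambda-backed resource would also need to send its response to the pre-signed URL in the event.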

8 participants