
[ECS] Full support for Capacity Providers in CloudFormation. #631

Open
coultn opened this issue Dec 5, 2019 · 126 comments
Assignees
Labels
ECS Amazon Elastic Container Service Work in Progress

Comments

@coultn

coultn commented Dec 5, 2019

CloudFormation does not currently have support for capacity providers in any of the ECS resource types. We will be adding this support in the near future.

@lawrencepit

Related to this, in order to support capacity providers with managedTerminationProtection, we also need to be able to set the new-instances-protected-from-scale-in property when creating the ASG via CloudFormation. This latter property was added 4 years ago to the AWS SDK / AWS CLI, but is still not supported in CF -- hopefully full support for CP in CF is added a bit faster.

@geof2001

geof2001 commented Jan 7, 2020

Has there been any progress made on this?

Add support for Capacity providers #1

@coultn
Author

coultn commented Jan 7, 2020

We are working on it and will provide updates as soon as more information is available.

@psuj

psuj commented Jan 10, 2020

Related to this, in order to support capacity providers with managedTerminationProtection, we also need to be able to set the new-instances-protected-from-scale-in property when creating the ASG via CloudFormation. This latter property was added 4 years ago to the AWS SDK / AWS CLI, but is still not supported in CF -- hopefully full support for CP in CF is added a bit faster.

Additionally, when the new-instances-protected-from-scale-in property is set on the ASG, a scheduled action to scale in instances cannot be executed. A feature like force-scale-in for scheduled actions would be useful if, for example, we have a dev environment and would like to turn instances off at night and back on in the morning.

@pparth

pparth commented Jan 21, 2020

+1

@tobymiller

When this is implemented, will it be possible to do a rolling update to the launch template under autoscaling and a change to a service in ecs, such that the new tasks run on instances from the new launch template while the old ones stay on the old instances as they roll over?

I'm struggling to achieve this with custom resources at the moment, partly as the dependencies are all in funny directions. Would be great to have it all defined declaratively in cfn.

@sopel

sopel commented Feb 5, 2020

Cross-linking the resp. request in aws-cloudformation/cloudformation-coverage-roadmap#301

@RomanCRS

Any ETA on this?

@pauldraper

Does this depend on #632?

@RomanCRS

Does this depend on #632?

I think no.

@andreaswittig

Sadly, that's the reason why using CloudFormation is becoming more and more frustrating.

@gabegorelick

FWIW, Terraform has supported this since shortly after the API was released: hashicorp/terraform-provider-aws#11151

Of course, it can't delete capacity providers since there's no API:
https://www.terraform.io/docs/providers/aws/r/ecs_capacity_provider.html

@RomanCRS

RomanCRS commented Apr 6, 2020

I don't want to use, rely on and support third-party software if I have a chance to use the official product.

@Vince-Cercury

any update?

@XBeg9

XBeg9 commented Apr 27, 2020

same here, any updates?

@ronan-cunningham

any update?

@darrenweiner

The lack of CFN support for this, 6 months in, is really disappointing. It puts the burden on anyone building CI/CD with CFN to add extra custom CLI/SDK pieces just to tie in capacity providers, which then have to be ripped out once support that should have shipped in a point release is in place.
You can do better. Communicating timeframes would help as well.

@andreaswittig

Have you had a deeper look into Capacity Providers and Cluster Auto Scaling? Does not match with my requirements at all. Does not scale down properly. Does not work with CloudFormation rolling updates for the ASG. So missing CloudFormation support is not the only problem here. :)

@coultn
Author

coultn commented May 6, 2020

Have you had a deeper look into Capacity Providers and Cluster Auto Scaling? Does not match with my requirements at all. Does not scale down properly. Does not work with CloudFormation rolling updates for the ASG. So missing CloudFormation support is not the only problem here. :)

Thanks for the feedback - can you explain more what you mean by "does not scale down properly"?

@darrenweiner

coultn: Here's what I think is a common use case: A CI/CD pipeline where services are spun up on an ASG backed EC2 cluster.
Services do not pre-exist; the CI/CD creates them.
Currently, you cannot use CFN to create a capacity-provider-enabled service.
If the underlying cluster doesn't have the memory or CPU, I would expect that when a new service is deployed, it would add another EC2 instance and deploy the new service... but there's no way to do that currently. I suppose what might work right now is: deploy the service with no capacity provider, perhaps with a quantity of 0 so it stabilizes, then via the CLI update the service to use a capacity provider, then make another CLI call to increase the quantity to 1... but that seems like hoop-jumping.
With regards to downscaling, reading the documentation, it seems a bit unclear exactly how this is meant to work: if the goal is to optimize resources, I would actually want the CP to be intelligent enough to a) determine that the cluster is currently overprovisioned and b) if so, drain EC2 instances accordingly and have the ASG terminate the drained instances... all with standard, appropriate cooldown periods, etc.

@coultn
Author

coultn commented May 6, 2020

Currently, you cannot use CFN to create a capacity-provider-enabled service.

Thanks for the feedback! We are working on full support for capacity providers in CloudFormation, and we definitely understand the need for that. However, I do want to point out that you can actually create a capacity-provider enabled service in CloudFormation today. You can accomplish this by first configuring a default capacity provider strategy for the cluster. This default capacity provider strategy will be used by any service you create that does not specify a launch type. Next, when you create your service in CloudFormation, do not include the LaunchType parameter. The service will use the capacity provider strategy defined by the cluster, and will auto-scale from zero instances if necessary.
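The workaround described above can be sketched in a template roughly like this (a minimal sketch, assuming the Cluster-level CapacityProviders and DefaultCapacityProviderStrategy properties; the MyCapacityProvider and TaskDefinition resources are placeholders assumed to be defined elsewhere):

```yaml
Resources:
  Cluster:
    Type: AWS::ECS::Cluster
    Properties:
      CapacityProviders:
        - !Ref MyCapacityProvider          # placeholder, defined elsewhere
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !Ref MyCapacityProvider
          Weight: 1

  Service:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref Cluster
      TaskDefinition: !Ref TaskDefinition  # placeholder, defined elsewhere
      DesiredCount: 1
      # No LaunchType here: the service falls back to the cluster's
      # default capacity provider strategy.
```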

With regards to downscaling, reading the documentation, it seems a bit unclear exactly how this is meant to work: if the goal is to optimize resources, I would actually want the CP to be intelligent enough to a) determine that the cluster is currently overprovisioned and b) if so, drain EC2 instances accordingly and have the ASG terminate the drained instances... all with standard, appropriate cooldown periods, etc.

Understood. In the first version of ECS cluster auto scaling, we took a more conservative route where instances would not scale in unless no tasks are running on them. We are looking at the idea of automating an "instance drainer" that will automatically find underutilized instances and set them to draining. With ECS cluster auto scaling, those instances would automatically shut down once no tasks are running on them. It's possible to do this already today, but you would need to implement your own Lambda function (or similar) to do the evaluation of the instance and call the ECS API to set the instance to the DRAINING state.

@darrenweiner

Really awesome feedback, thank you. As far as the workaround for setting it at Cluster creation, I'll take a look at that..easy enough to implement for QA/Dev..a little trickier for existing prod environments.

Trying to avoid custom tooling since...this seems sooo close to being a solid solution.

Any timing on better cfn support? I know that's a different, probably very overwhelmed team, but would be nice to see some improvements here. ECS rocks, and once this is dialed in, it's going to really round out the offering.

Will keep checking for ECS updates!

@verbitan

verbitan commented Apr 9, 2021

@taylorb-syd I tried out the AWS::ECS::ClusterCapacityProviderAssociations this morning using the CloudFormation below, but I'm still getting an error on deletion.

Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining.

Resources:
  Cluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Ref AWS::StackName

  ClusterCapacityProvider:
    Type: AWS::ECS::CapacityProvider
    DependsOn: Cluster
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref AutoScalingGroup
        ManagedScaling:
          Status: ENABLED
          TargetCapacity: 100
        ManagedTerminationProtection: DISABLED

  ClusterCapacityProviderAssociation:
    Type: AWS::ECS::ClusterCapacityProviderAssociations
    Properties:
      Cluster: !Ref Cluster
      CapacityProviders:
        - !Ref ClusterCapacityProvider
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !Ref ClusterCapacityProvider
          Weight: 1

@rhlarora84

In my case, the stack after the update remains in UPDATE_COMPLETE_CLEANUP_IN_PROGRESS.
The stack is trying to delete the old ASG; its desired/min/max capacity is set to 0, but the instances are still active (maybe because of scale-in protection, but that is required for managed termination protection to work).


@ptwohig

ptwohig commented Apr 13, 2021

@ipmb To be fair, I missed it the first time too. It's a little screwy to call the parameter "AutoScalingGroupArn" but accept either the short name or the ARN.

@chase1124

For the future thousands who will read the property name and whose brains will filter out "or short name": please add an additional property that doesn't have "Arn" in the name, and properly describe this in the CloudFormation docs. This is such a ridiculous violation of common-sense design.

@a-zich

a-zich commented May 14, 2021

@verbitan
The problem with active or draining instances (aka the bug) kept me busy for a week. But I finally found a way to convince CloudFormation to do as it should. Your script was almost right. One more thing is necessary: a DependsOn on the cluster at the AutoScalingGroup.
This will make CloudFormation delete the ASG completely before it tries to delete the Cluster. Thus no instances exist and the cluster will delete just fine.

  ecsCluster:
    Type: AWS::ECS::Cluster

  ecsClusterCpAssoc:
    Type: "AWS::ECS::ClusterCapacityProviderAssociations"
    Properties:
      Cluster: !Ref ecsCluster
      CapacityProviders:
        - !Ref ecsCp
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !Ref ecsCp
          Weight: 100
      
  ecsCp:
    Type: AWS::ECS::CapacityProvider
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref ecsClusterASG
        ManagedScaling:
          MaximumScalingStepSize: 1
          MinimumScalingStepSize: 1
          Status: ENABLED
          TargetCapacity: 90
  
  ecsClusterASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    DependsOn: ecsCluster
    Properties: 
      MinSize: 1
      MaxSize: 4
      DesiredCapacity: 2
      HealthCheckType: EC2
      LaunchTemplate:
        LaunchTemplateId: !Ref ecsLaunchTemplate
        Version: !GetAtt ecsLaunchTemplate.LatestVersionNumber
      VPCZoneIdentifier: ....

@verbitan

verbitan commented May 14, 2021

@a-zich That's worked nicely for the delete, thank you!

However, when I tried to change something in the AWS::AutoScaling::AutoScalingGroup, for example the LaunchConfigurationName, it throws this error against the AWS::ECS::ClusterCapacityProviderAssociations resource.

Resource handler returned message: "Out of retries. Last encountered error was: The specified capacity provider is in use and cannot be removed.

So it's getting closer, but still isn't ready for me to use in production.

@redflag-cloud

When updating the AWS::AutoScaling::AutoScalingGroup, I get an error during the update of the AWS::ECS::ClusterCapacityProviderAssociations resource.

Error: The specified capacity provider is in use and cannot be removed. (Service: AmazonECS; Status Code: 400; Error Code: ResourceInUseException)

Is there any workaround how I can update the ASG?

@taylorb-syd

When updating the AWS::AutoScaling::AutoScalingGroup, I get an error during the update of the AWS::ECS::ClusterCapacityProviderAssociations resource.

Error: The specified capacity provider is in use and cannot be removed. (Service: AmazonECS; Status Code: 400; Error Code: ResourceInUseException)

Is there any workaround how I can update the ASG?

What kind of update are you doing? Are you doing a AutoScalingReplacingUpdate update in this circumstance?

This is gonna be difficult because you need to "drain" the instances in the old ASG before the PutClusterCapacityProviders API will let you remove the old ASG, and this won't happen until the cleanup phase of the CloudFormation stack update.

So frankly, without some out of band management I can't see a way to do this with replacement updates. Rolling updates should work however, for 2 reasons:

  1. You're not replacing the Capacity Provider with a new one (this is what is actually calling PutClusterCapacityProvider).
  2. CloudFormation manages the removal/draining of instances, meaning by the time the stack is complete all tasks will be migrated to the new Launch Template.

Caveats:

  1. Schedule your updates for the lowest traffic period you can.
  2. Suspend all processes except Launch and Terminate. This is critical in this case because the Capacity Provider will quickly determine it needs to add more capacity every time you terminate/drain an instance, and CloudFormation rolling updates assume there is no activity happening when you run them.
  3. If using a draining lifecycle hook, make sure you are not running any long-running batch processes via RunTask, as the rolling update does not play well with lifecycle hooks that delay instance termination by too much. The maximum stopTimeout of 2 minutes should be fine, however.
  4. Make sure the maximum batch size is: floor(f(service) = minimum number of running tasks - minimum viable running tasks). Some of you will note that this function could equal zero. If it does, you need to increase the minimum running tasks on your services until it doesn't; otherwise you cannot specify a max batch size that will not result in an outage for one of your services.^ (This assumes you are not binpacking your tasks and are running with distinctInstances. If you're binpacking, make sure you have a spread directive, like AZ, before binpacking, set the Max Batch size to 1, and hope for the best.)

^ As an example, say you have a web service that requires 2 active tasks (i.e. minimum viable running) to serve the traffic you'll get during your outage window, with a minimum of 4 in the service auto-scaling policy, and another service where the minimum active tasks is 1 and the minimum in the auto-scaling policy is 6. To ensure there is no interruption of service, you will need to set the Max Batch size to 2, since 2 is the maximum number of tasks (and thus instances) you can terminate without risking putting a service below minimum viable running.
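For instance, the worked example above maps onto a rolling-update policy roughly like this (a sketch under the assumptions in the caveats; the suspend list is everything except Launch and Terminate):

```yaml
UpdatePolicy:
  AutoScalingRollingUpdate:
    # min running (4) - min viable (2) = 2 for the web service, and
    # min running (6) - min viable (1) = 5 for the other service, so
    # the smaller bound, 2, is the safe maximum batch size.
    MaxBatchSize: 2
    WaitOnResourceSignals: true
    PauseTime: PT5M
    SuspendProcesses:
      - AddToLoadBalancer
      - AlarmNotification
      - AZRebalance
      - HealthCheck
      - ReplaceUnhealthy
      - ScheduledActions
```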

@Komorebi-E

Komorebi-E commented Oct 27, 2021

For an ECS Cluster that already has a manually created Capacity Provider, it is not possible to use the AWS::ECS::ClusterCapacityProviderAssociations with a CFN defined Capacity Provider, because CloudFormation gives the following error:

"Resource handler returned message: "The cluster already contains capacity provider associations" (RequestToken: *, HandlerErrorCode: AlreadyExists)"

However, if the Association resource is commented out, it is possible to create a Capacity Provider with CFN that shows as a managed resource for the stack (as expected), but it does not appear in the ECS Console for Capacity Providers, or when editing a service to add/change one.

Checking the existing Capacity Providers with aws ecs describe-capacity-providers, the managed CP is listed, and shows as ACTIVE, ENABLED and, linked to the ASG.

Without the CFN Association resource though, the Capacity Provider has no CloudWatch metric or alarms for CapacityProviderReservation and no Dynamic Scaling policy attached to the ASG.

It is even possible to attach an ECS service to the CFN managed CP:

aws ecs update-service --cluster dev-scaling-test --service scaling-test-svc --force-new-deployment --capacity-provider-strategy capacityProvider=dev-scaling-test-CapacityProviderV2-**,weight=100,base=0

This results in a new ECS Deployment for the service, but no tasks are created (nothing PROVISIONING/PENDING).

As it is not possible to unlink a Capacity Provider from an ECS Service, the only option with non-dynamic service names is to delete the services and redeploy them without a Capacity Provider. See #838 for not being able to unlink Services/CPs.

This has been raised as a support case to AWS.

@verbitan

verbitan commented Nov 9, 2021

@a-zich That's worked nicely for the delete, thank you!

However, when I tried to change something in the AWS::AutoScaling::AutoScalingGroup, for example the LaunchConfigurationName, it throws this error against the AWS::ECS::ClusterCapacityProviderAssociations resource.

Resource handler returned message: "Out of retries. Last encountered error was: The specified capacity provider is in use and cannot be removed.

So it's getting closer, but still isn't ready for me to use in production.

@taylorb-syd's answer above was the final piece for me to get everything working!

For reference, below is a snippet of my final working template for capacity providers with managed termination protection enabled.

Resources:
  Cluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Ref AWS::StackName

  ClusterCapacityProvider:
    Type: AWS::ECS::CapacityProvider
    DependsOn: Cluster
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref AutoScalingGroup
        ManagedScaling:
          Status: ENABLED
          TargetCapacity: 100
        ManagedTerminationProtection: ENABLED

  ClusterCapacityProviderAssociation:
    Type: AWS::ECS::ClusterCapacityProviderAssociations
    Properties:
      Cluster: !Ref Cluster
      CapacityProviders:
        - !Ref ClusterCapacityProvider
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !Ref ClusterCapacityProvider
          Weight: 1
          
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    DependsOn: Cluster
    Properties:
      LaunchConfigurationName: !Ref LaunchConfiguration
      MinSize: !Ref MinASGSize
      MaxSize: !Ref MaxASGSize
      NewInstancesProtectedFromScaleIn: true
      ... etc ...
    CreationPolicy:
      ResourceSignal:
        Count: !Ref MinASGSize
        Timeout: PT5M
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MinSuccessfulInstancesPercent: 100
        WaitOnResourceSignals: true
        PauseTime: PT5M
        SuspendProcesses:
          # Suspend everything except Launch and Terminate.
          - AddToLoadBalancer
          - AlarmNotification
          - AZRebalance
          - HealthCheck
          - ReplaceUnhealthy
          - ScheduledActions
          
  LaunchConfiguration:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: !Ref ECSOptimisedAMI
      InstanceType: !Ref EC2InstanceType
      ... etc ...

This has been an issue for us for years, so I'm very happy to finally get this one sorted. Special thanks to @a-zich, @anoopkapoor, @coultn and @taylorb-syd for your help.

@zsimjee

zsimjee commented Nov 16, 2021

Hi, just wanted to point out that this is also failing in CDK. I have an example that's super close to what's in the documentation, but I get the same "Out of retries...." error

@taylorb-syd

Hi, just wanted to point out that this is also failing in CDK. I have an example that's super close to what's in the documentation, but I get the same "Out of retries...." error

Hi @zsimjee, if you could either raise a Support Case with AWS or provide more details here, we can investigate and determine whether there is an issue with the CDK implementation. Please also provide your CDK version and other details. Thanks!

@Ten0

Ten0 commented Mar 1, 2022

Hi,
I'm trying to create an extra CapacityProvider and link it to my cluster in addition to an already linked CapacityProvider:

  • ClusterStack (Cluster + CapacityProvider + ClusterCapacityProviderAssociations)
  • Service Stack (CapacityProvider + ClusterCapacityProviderAssociations + Service)
    That's because that specific service cannot use the default capacity provider and needs its own (it needs a larger instance type than we use by default).

However, this results in the following error when setting up the service stack's ClusterCapacityProviderAssociations:

[Cluster name] already exists in stack arn:aws:cloudformation:eu-west-1:603365952889:stack/[Cluster stack name]/0cdb9130-98ed-11ec-90fa-06d60c44e8d1

Can there only be a single ClusterCapacityProviderAssociations?
If so, how do we link capacity providers from several stacks?

@AbhishekNautiyal self-assigned this Mar 3, 2022
@sanjeev-mb

When will this be closed? We've been waiting for 2+ years.

@taylorb-syd

Hi, I'm trying to create an extra CapacityProvider and link it to my cluster in addition to an already linked CapacityProvider:

* ClusterStack (Cluster + CapacityProvider + ClusterCapacityProviderAssociations)

* Service Stack (CapacityProvider + ClusterCapacityProviderAssociations + Service)
  That's because that specific service cannot use the default capacity provider and needs its own (it needs a larger instance type than we use by default).

However, this results in the following error when setting up the service stack's ClusterCapacityProviderAssociations:

[Cluster name] already exists in stack arn:aws:cloudformation:eu-west-1:603365952889:stack/[Cluster stack name]/0cdb9130-98ed-11ec-90fa-06d60c44e8d1

Can there only be a single ClusterCapacityProviderAssociations? If so, how do we link capacity providers from several stacks?

The resource as currently designed is an interface for the PutClusterCapacityProviders API call, which is idempotent. To fix this, we would need stateful operations at the API level, i.e. AddClusterCapacityProvider, UpdateClusterCapacityProvider, and RemoveClusterCapacityProvider API calls where you can update a single entry in the cluster's capacity provider associations. Our architecture for CloudFormation does not allow us to deal with this stateful information transparently, because we have an underlying assumption of parallelism and idempotency. This assumption will require considerable engineering to work through, and it is better for the ECS team to instead bake this capability into their API rather than trying to shoehorn it into CloudFormation.

So yes, there is a 1:1 relationship between an AWS::ECS::Cluster and a AWS::ECS::ClusterCapacityProviderAssociations. To overcome this limitation, you can use a Custom Resource, which reads the configuration, and adds/removes entries as appropriate before submitting a new configuration but you will need to make sure you have some kind of stateful mechanism, like a lock entry in a database somewhere, to ensure that you are performing these operations atomically.
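A hypothetical shape for such a custom resource could look like the sketch below. Everything here is an assumed name: Custom::CapacityProviderPatch, PatchFunction, and SharedClusterName are not real AWS types or exports. The Lambda behind the ServiceToken would call DescribeClusters, merge in the new entries, and call PutClusterCapacityProviders, guarded by a lock as described above.

```yaml
  ServiceCapacityProviderPatch:
    Type: Custom::CapacityProviderPatch      # hypothetical custom resource
    Properties:
      ServiceToken: !GetAtt PatchFunction.Arn  # hypothetical Lambda, not shown
      Cluster: !ImportValue SharedClusterName  # hypothetical cross-stack export
      # Entries to merge into the existing association rather than replace it:
      AddCapacityProviders:
        - !Ref ServiceCapacityProvider
      AddToDefaultStrategy: false
```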

@Nevon

Nevon commented Jul 28, 2022

However, I do want to point out that you can actually create a capacity-provider enabled service in CloudFormation today. You can accomplish this by first configuring a default capacity provider strategy for the cluster. This default capacity provider strategy will be used by any service you create that does not specify a launch type.

This works when creating a new service, but how do you do this for existing services that have a launch type set?

Removing the launch type from the cloudformation template makes the service continue to use the existing launch type and not use the capacity provider strategy, and the ECS Service still doesn't have a way to set a capacity provider strategy via CloudFormation, 2 years later.

Updating each service manually using aws ecs update-service ... --capacity-provider-strategy $(aws ecs describe-clusters ... | jq '.clusters[0].defaultCapacityProviderStrategy[0].capacityProvider') sounds like it should do the trick, except you have to force a new deployment using the --force-new-deployment flag. Fair enough, except if you're using CodeDeploy then it just says it can't force a new deployment when using the CodeDeploy deployment controller:

Cannot force a new deployment on services with a CODE_DEPLOY deployment controller. Use AWS CodeDeploy to trigger a new deployment.

@BwL1289

BwL1289 commented Feb 8, 2024

Is there progress being made on this? The ticket has been open for 4+ years.

FWIW I'm eternally grateful to the hardworking engineers who make all this possible
