feat!(ecs/services): deployment failure handling (#154)
* feat(ecs/services): added wait_for_steady_state variable, enabled by default

* feat(ecs/services): enable deployment circuit breaker

* feat(ecs/services): configurable deployment timeout

* docs(ecs/services): deployment failure handling

* fix(ecs/services): require at least aws 4.22.0

* ci: update providers
mskrajnowski authored Oct 26, 2023
1 parent 74c7195 commit 6da014c
Showing 13 changed files with 172 additions and 12 deletions.
27 changes: 25 additions & 2 deletions ecs/services/statsd/README.md
@@ -2,20 +2,40 @@

Adds a statsd server, using cloudwatch agent, to each ECS instance

## Deployment failures

By default, a failed ECS service update is reported as a Terraform failure through these settings:

- `wait_for_steady_state = true` makes Terraform wait until the service has been deployed
- `deployment_timeout = "10m"` makes Terraform wait up to 10 minutes; if the update takes longer, Terraform marks it as failed
- `deployment_rollback = true` enables rollback using the [ECS deployment circuit breaker](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-circuit-breaker.html)

`deployment_rollback` and `deployment_timeout` work independently of each other,
because we have no control over the ECS deployment circuit breaker thresholds.
There is also currently no way to make Terraform wait for a deployment failure.

This means you have to make sure `deployment_timeout` is shorter than
the time it takes ECS to mark the deployment as failed. Otherwise the
rollback might bring the service back to a steady state, which Terraform
will then pick up, reporting `terraform apply` as successful even though
the deployment failed.

https://github.com/hashicorp/terraform-provider-aws/issues/20858
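
As a rough illustration of the settings described in this section, a caller might override the defaults like this. The module label, the relative `source` path and the `"5m"` value are assumptions made for the example, not something this repository prescribes.

```hcl
# Hypothetical caller; the module label and source path are assumptions.
module "statsd" {
  source = "./ecs/services/statsd"

  # Inputs unrelated to deployment failure handling are omitted here.

  wait_for_steady_state = true # block terraform apply until the service is steady
  deployment_timeout    = "5m" # fail the apply if the update takes longer than this
  deployment_rollback   = true # enable the ECS deployment circuit breaker with rollback
}
```

A shorter timeout trades tolerance for slow deployments against a better chance that Terraform gives up before ECS finishes rolling back.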

<!-- prettier-ignore-start -->
<!-- BEGIN_TF_DOCS -->
## Requirements

| Name | Version |
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 0.12, <2.0 |
- | <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 2.40.0 |
+ | <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 4.22.0 |

## Providers

| Name | Version |
|------|---------|
- | <a name="provider_aws"></a> [aws](#provider\_aws) | >= 2.40.0 |
+ | <a name="provider_aws"></a> [aws](#provider\_aws) | >= 4.22.0 |

## Modules

@@ -44,9 +64,12 @@ No modules.
| <a name="input_collection_interval"></a> [collection\_interval](#input\_collection\_interval) | How often should the metrics be collected in seconds | `number` | `10` | no |
| <a name="input_create"></a> [create](#input\_create) | Whether any resources should be created | `bool` | `true` | no |
| <a name="input_debug"></a> [debug](#input\_debug) | Whether to enable cloudwatch agent debug mode | `bool` | `false` | no |
| <a name="input_deployment_rollback"></a> [deployment\_rollback](#input\_deployment\_rollback) | Whether ECS should roll back to the previous version when it detects a failure using deployment circuit breaker | `bool` | `true` | no |
| <a name="input_deployment_timeout"></a> [deployment\_timeout](#input\_deployment\_timeout) | Timeout for updating the ECS service | `string` | `"10m"` | no |
| <a name="input_name"></a> [name](#input\_name) | Name for the service and task definition | `string` | `"statsd"` | no |
| <a name="input_port"></a> [port](#input\_port) | Port to listen on on each ECS instance | `number` | `8125` | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Tags to add to resources | `map(string)` | `{}` | no |
| <a name="input_wait_for_steady_state"></a> [wait\_for\_steady\_state](#input\_wait\_for\_steady\_state) | Wait for the service to reach a steady state | `bool` | `true` | no |

## Outputs

13 changes: 13 additions & 0 deletions ecs/services/statsd/main.tf
@@ -129,6 +129,19 @@ resource "aws_ecs_service" "service" {
  task_definition = aws_ecs_task_definition.task[0].arn
  launch_type = "EC2"
  scheduling_strategy = "DAEMON"

  wait_for_steady_state = var.wait_for_steady_state

  deployment_circuit_breaker {
    enable   = var.deployment_rollback
    rollback = var.deployment_rollback
  }

  timeouts {
    create = var.deployment_timeout
    update = var.deployment_timeout
    delete = var.deployment_timeout
  }
}

# outputs
18 changes: 18 additions & 0 deletions ecs/services/statsd/variables.tf
@@ -44,3 +44,21 @@ variable "debug" {
  type = bool
  default = false
}

variable "wait_for_steady_state" {
  description = "Wait for the service to reach a steady state"
  type        = bool
  default     = true
}

variable "deployment_timeout" {
  description = "Timeout for updating the ECS service"
  type        = string
  default     = "10m"
}

variable "deployment_rollback" {
  description = "Whether ECS should roll back to the previous version when it detects a failure using deployment circuit breaker"
  type        = bool
  default     = true
}
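
Because `deployment_timeout` is a plain string, a malformed duration would only surface when the provider parses it. The following is a hedged sketch, not part of this commit, of how the variable could reject bad values earlier. It assumes Terraform 0.13 or newer (the module itself still allows 0.12, where `validation` blocks are not available), and the regular expression is deliberately stricter than the provider's duration parser: it only accepts whole numbers with `h`, `m` and `s` units.

```hcl
variable "deployment_timeout" {
  description = "Timeout for updating the ECS service"
  type        = string
  default     = "10m"

  # Assumption: validation blocks need Terraform >= 0.13.
  validation {
    # Accepts values such as "45s", "10m" or "1h30m".
    condition     = can(regex("^([0-9]+(h|m|s))+$", var.deployment_timeout))
    error_message = "The deployment_timeout value must be a duration such as \"10m\" or \"1h30m\"."
  }
}
```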
2 changes: 1 addition & 1 deletion ecs/services/statsd/versions.tf
@@ -2,6 +2,6 @@ terraform {
  required_version = ">= 0.12, <2.0"

  required_providers {
-   aws = ">= 2.40.0"
+   aws = ">= 4.22.0"
  }
}
27 changes: 25 additions & 2 deletions ecs/services/web/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,20 +2,40 @@

Creates an ECS service exposed to the internet using an Application Load Balancer.

## Deployment failures

By default, a failed ECS service update is reported as a Terraform failure through these settings:

- `wait_for_steady_state = true` makes Terraform wait until the service has been deployed
- `deployment_timeout = "10m"` makes Terraform wait up to 10 minutes; if the update takes longer, Terraform marks it as failed
- `deployment_rollback = true` enables rollback using the [ECS deployment circuit breaker](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-circuit-breaker.html)

`deployment_rollback` and `deployment_timeout` work independently of each other,
because we have no control over the ECS deployment circuit breaker thresholds.
There is also currently no way to make Terraform wait for a deployment failure.

This means you have to make sure `deployment_timeout` is shorter than
the time it takes ECS to mark the deployment as failed. Otherwise the
rollback might bring the service back to a steady state, which Terraform
will then pick up, reporting `terraform apply` as successful even though
the deployment failed.

https://github.com/hashicorp/terraform-provider-aws/issues/20858
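
A sketch of opting out of the blocking behaviour, for example when an apply should not wait on ECS. The module label and `source` path are assumptions, and required inputs such as `task_definition_arn` and `vpc_id` are left out for brevity. With `wait_for_steady_state = false` Terraform no longer waits for the deployment to reach a steady state, while the circuit breaker can still roll back on the ECS side.

```hcl
# Hypothetical caller; the module label and source path are assumptions.
module "web" {
  source = "./ecs/services/web"

  # Required inputs (for example task_definition_arn and vpc_id) are omitted here.

  wait_for_steady_state = false # do not block terraform apply on the deployment
  deployment_rollback   = true  # ECS still rolls back failed deployments on its own
}
```

In this setup a failed deployment is not reported by `terraform apply`, so it has to be detected some other way, such as ECS service events.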

<!-- prettier-ignore-start -->
<!-- BEGIN_TF_DOCS -->
## Requirements

| Name | Version |
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 0.12, <2.0 |
- | <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 2.42.0 |
+ | <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 4.22.0 |

## Providers

| Name | Version |
|------|---------|
- | <a name="provider_aws"></a> [aws](#provider\_aws) | >= 2.42.0 |
+ | <a name="provider_aws"></a> [aws](#provider\_aws) | >= 4.22.0 |

## Modules

@@ -62,6 +82,8 @@ Creates an ECS service exposed to the internet using an Application Load Balance
| <a name="input_create"></a> [create](#input\_create) | Should resources be created | `bool` | `true` | no |
| <a name="input_deployment_max_percent"></a> [deployment\_max\_percent](#input\_deployment\_max\_percent) | The upper limit (as a percentage of the service's desiredCount) of the number of running tasks that can be running in a service during a deployment. Rounded down to get the maximum number of running tasks. | `number` | `200` | no |
| <a name="input_deployment_min_percent"></a> [deployment\_min\_percent](#input\_deployment\_min\_percent) | The lower limit (as a percentage of the service's desiredCount) of the number of running tasks that must remain running and healthy in a service during a deployment. Rounded up to get the minimum number of running tasks. | `number` | `50` | no |
| <a name="input_deployment_rollback"></a> [deployment\_rollback](#input\_deployment\_rollback) | Whether ECS should roll back to the previous version when it detects a failure using deployment circuit breaker | `bool` | `true` | no |
| <a name="input_deployment_timeout"></a> [deployment\_timeout](#input\_deployment\_timeout) | Timeout for updating the ECS service | `string` | `"10m"` | no |
| <a name="input_deregistration_delay"></a> [deregistration\_delay](#input\_deregistration\_delay) | Connection draining time in seconds. | `number` | `30` | no |
| <a name="input_desired_count"></a> [desired\_count](#input\_desired\_count) | The number of instances of the task definition to place and keep running. | `number` | `2` | no |
| <a name="input_healthcheck_interval"></a> [healthcheck\_interval](#input\_healthcheck\_interval) | How often, in seconds, healthchecks should be sent. | `number` | `5` | no |
@@ -81,6 +103,7 @@ Creates an ECS service exposed to the internet using an Application Load Balance
| <a name="input_task_definition_arn"></a> [task\_definition\_arn](#input\_task\_definition\_arn) | ECS task definition ARN to run as a service | `string` | n/a | yes |
| <a name="input_unhealthy_threshold"></a> [unhealthy\_threshold](#input\_unhealthy\_threshold) | The number of consecutive health check failures required before considering the target unhealthy | `number` | `2` | no |
| <a name="input_vpc_id"></a> [vpc\_id](#input\_vpc\_id) | VPC id | `string` | n/a | yes |
| <a name="input_wait_for_steady_state"></a> [wait\_for\_steady\_state](#input\_wait\_for\_steady\_state) | Wait for the service to reach a steady state | `bool` | `true` | no |

## Outputs

13 changes: 13 additions & 0 deletions ecs/services/web/main.tf
@@ -35,12 +35,25 @@ resource "aws_ecs_service" "service" {
  deployment_minimum_healthy_percent = var.deployment_min_percent
  iam_role = var.role_arn

  wait_for_steady_state = var.wait_for_steady_state

  load_balancer {
    target_group_arn = aws_lb_target_group.service[0].arn
    container_name   = local.container
    container_port   = var.container_port
  }

  deployment_circuit_breaker {
    enable   = var.deployment_rollback
    rollback = var.deployment_rollback
  }

  timeouts {
    create = var.deployment_timeout
    update = var.deployment_timeout
    delete = var.deployment_timeout
  }

  lifecycle {
    create_before_destroy = true
  }
17 changes: 17 additions & 0 deletions ecs/services/web/variables.tf
@@ -142,3 +142,20 @@ variable "unhealthy_threshold" {
  default = 2
}

variable "wait_for_steady_state" {
  description = "Wait for the service to reach a steady state"
  type        = bool
  default     = true
}

variable "deployment_timeout" {
  description = "Timeout for updating the ECS service"
  type        = string
  default     = "10m"
}

variable "deployment_rollback" {
  description = "Whether ECS should roll back to the previous version when it detects a failure using deployment circuit breaker"
  type        = bool
  default     = true
}
2 changes: 1 addition & 1 deletion ecs/services/web/versions.tf
@@ -2,6 +2,6 @@ terraform {
  required_version = ">= 0.12, <2.0"

  required_providers {
-   aws = ">= 2.42.0"
+   aws = ">= 4.22.0"
  }
}
27 changes: 25 additions & 2 deletions ecs/services/worker/README.md
@@ -2,20 +2,40 @@

Creates an ECS service for background workers

## Deployment failures

By default, a failed ECS service update is reported as a Terraform failure through these settings:

- `wait_for_steady_state = true` makes Terraform wait until the service has been deployed
- `deployment_timeout = "10m"` makes Terraform wait up to 10 minutes; if the update takes longer, Terraform marks it as failed
- `deployment_rollback = true` enables rollback using the [ECS deployment circuit breaker](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-circuit-breaker.html)

`deployment_rollback` and `deployment_timeout` work independently of each other,
because we have no control over the ECS deployment circuit breaker thresholds.
There is also currently no way to make Terraform wait for a deployment failure.

This means you have to make sure `deployment_timeout` is shorter than
the time it takes ECS to mark the deployment as failed. Otherwise the
rollback might bring the service back to a steady state, which Terraform
will then pick up, reporting `terraform apply` as successful even though
the deployment failed.

https://github.com/hashicorp/terraform-provider-aws/issues/20858
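
Since the right `deployment_timeout` depends on how long ECS needs to mark a deployment as failed, it may differ between environments. A minimal sketch of picking the value per environment follows. The `environment` variable, the timeout values, the module label and the `source` path are all assumptions for illustration.

```hcl
# Hypothetical input selecting the environment; not part of this module.
variable "environment" {
  type    = string
  default = "staging"
}

locals {
  # Assumed values: choose timeouts shorter than the time ECS needs to
  # mark a failed deployment in each environment.
  deployment_timeouts = {
    staging    = "5m"
    production = "15m"
  }
}

module "worker" {
  source = "./ecs/services/worker" # assumed relative path

  # Other inputs (for example name and task_definition_arn) are omitted here.

  deployment_timeout = local.deployment_timeouts[var.environment]
}
```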

<!-- prettier-ignore-start -->
<!-- BEGIN_TF_DOCS -->
## Requirements

| Name | Version |
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 0.12, <2.0 |
- | <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 2.40.0 |
+ | <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 4.22.0 |

## Providers

| Name | Version |
|------|---------|
- | <a name="provider_aws"></a> [aws](#provider\_aws) | >= 2.40.0 |
+ | <a name="provider_aws"></a> [aws](#provider\_aws) | >= 4.22.0 |

## Modules

@@ -46,10 +66,13 @@ Creates an ECS service for background workers
| <a name="input_create"></a> [create](#input\_create) | Should resources be created | `bool` | `true` | no |
| <a name="input_deployment_max_percent"></a> [deployment\_max\_percent](#input\_deployment\_max\_percent) | The upper limit (as a percentage of the service's desiredCount) of the number of running tasks that can be running in a service during a deployment. Rounded down to get the maximum number of running tasks. | `number` | `200` | no |
| <a name="input_deployment_min_percent"></a> [deployment\_min\_percent](#input\_deployment\_min\_percent) | The lower limit (as a percentage of the service's desiredCount) of the number of running tasks that must remain running and healthy in a service during a deployment. Rounded up to get the minimum number of running tasks. | `number` | `50` | no |
| <a name="input_deployment_rollback"></a> [deployment\_rollback](#input\_deployment\_rollback) | Whether ECS should roll back to the previous version when it detects a failure using deployment circuit breaker | `bool` | `true` | no |
| <a name="input_deployment_timeout"></a> [deployment\_timeout](#input\_deployment\_timeout) | Timeout for updating the ECS service | `string` | `"10m"` | no |
| <a name="input_desired_count"></a> [desired\_count](#input\_desired\_count) | The number of instances of the task definition to place and keep running. | `number` | `2` | no |
| <a name="input_launch_type"></a> [launch\_type](#input\_launch\_type) | The launch type on which to run your service. Either EC2 or FARGATE. | `string` | `"EC2"` | no |
| <a name="input_name"></a> [name](#input\_name) | ECS service name | `string` | n/a | yes |
| <a name="input_task_definition_arn"></a> [task\_definition\_arn](#input\_task\_definition\_arn) | ECS task definition ARN to run as a service | `string` | n/a | yes |
| <a name="input_wait_for_steady_state"></a> [wait\_for\_steady\_state](#input\_wait\_for\_steady\_state) | Wait for the service to reach a steady state | `bool` | `true` | no |

## Outputs

13 changes: 13 additions & 0 deletions ecs/services/worker/main.tf
@@ -20,6 +20,19 @@ resource "aws_ecs_service" "service" {
  deployment_maximum_percent = var.deployment_max_percent
  deployment_minimum_healthy_percent = var.deployment_min_percent

  wait_for_steady_state = var.wait_for_steady_state

  deployment_circuit_breaker {
    enable   = var.deployment_rollback
    rollback = var.deployment_rollback
  }

  timeouts {
    create = var.deployment_timeout
    update = var.deployment_timeout
    delete = var.deployment_timeout
  }

  lifecycle {
    create_before_destroy = true
  }
17 changes: 17 additions & 0 deletions ecs/services/worker/variables.tf
@@ -43,3 +43,20 @@ variable "deployment_max_percent" {
  default = 200
}

variable "wait_for_steady_state" {
  description = "Wait for the service to reach a steady state"
  type        = bool
  default     = true
}

variable "deployment_timeout" {
  description = "Timeout for updating the ECS service"
  type        = string
  default     = "10m"
}

variable "deployment_rollback" {
  description = "Whether ECS should roll back to the previous version when it detects a failure using deployment circuit breaker"
  type        = bool
  default     = true
}
2 changes: 1 addition & 1 deletion ecs/services/worker/versions.tf
@@ -2,6 +2,6 @@ terraform {
  required_version = ">= 0.12, <2.0"

  required_providers {
-   aws = ">= 2.40.0"
+   aws = ">= 4.22.0"
  }
}
6 changes: 3 additions & 3 deletions versions.tf
@@ -6,9 +6,9 @@ terraform {
  required_providers {
    aws = "4.67.0"

-   archive = "2.1.0"
-   null = "3.1.0"
-   random = "3.1.0"
+   archive = "2.4.0"
+   null = "3.2.1"
+   random = "3.5.1"
    tls = "4.0.4"
  }
}
