aws_security_group: timeout while waiting for state to become 'success'. Subsequent terraform runs fails on that resource #3128

mildred · 2018-01-25T09:47:24Z

Short story: we know that AWS is throttling our API requests. Sometimes we timeout on creating a security group. The problem is however that subsequent terraform runs are failing because the security group was created but is not completely present in tfstate. Security group rules are not recorded in tfstate.

Terraform Version

terraform 0.11.2
aws provider ersion 1.7.1

Affected Resource(s)

aws_security_group

There might be a problem on how terraform handles resources that fails. perhaps on failure this resource should be tainted so subsequent runs succeeds.

Terraform Configuration Files

resource "aws_security_group" "base_sg" {
  name        = "base_project_sg_${var.sqsc_project_name}_${var.environment}"
  description = "Basic Security Group for ${var.sqsc_project_name} ${var.environment}"
  vpc_id      = "${data.aws_vpc.main.id}"

  tags {
    Name        = "base_project_sg_${var.sqsc_project_name}_${var.environment}"
    Environment = "${var.environment}"
    Project     = "${var.sqsc_project_name}"
    ProjectUuid = "${var.sqsc_project_uuid}"
  }
}

resource "aws_security_group_rule" "base_sg_ingress_ssh" {
  security_group_id = "${aws_security_group.base_sg.id}"
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
}

resource "aws_security_group_rule" "base_sg_ingress_http" {
  security_group_id = "${aws_security_group.base_sg.id}"
  type              = "ingress"
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
}

// ...

resource "aws_security_group_rule" "base_sg_egress" {
  security_group_id = "${aws_security_group.base_sg.id}"
  type              = "egress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = ["0.0.0.0/0"]
}

Debug Output

This is a transident error with terraform running in an automated environment. We do not have debug output for this run at the moment.

However, we run terraform multiples times, and the first time we run it, we have the following error

1 error(s) occurred:

* aws_security_group.base_sg: 1 error(s) occurred:

* aws_security_group.base_sg: timeout while waiting for state to become 'success' (timeout: 5m0s)

Then all subsequent terraform apply executions fails with:

1 error(s) occurred:

* aws_security_group_rule.base_sg_egress: 1 error(s) occurred:

* aws_security_group_rule.base_sg_egress: [WARN] A duplicate Security Group rule was found on (sg-048b7c7e). This may be
a side effect of a now-fixed Terraform issue causing two security groups with
identical attributes but different source_security_group_ids to overwrite each
other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
information and instructions for recovery. Error message: the specified rule "peer: 0.0.0.0/0, ALL, ALLOW" already exists

Full logs here: https://gist.github.com/mildred/9245356ec1ef599f91eb15f2bd9a6666

Expected Behavior

Terraform should taint the security group if it fails on it due to a timeout so next run will create it anew. Or perhaps just taint the security_group_rules within it. Or it should register the security group rules properly in the tfstate.

Actual Behavior

Terraform timeouts then fails to create the resource because a rule it thought was not present is created.

Steps to Reproduce

Run terraform enough to be throttled by AWS

Important Factoids

Our API requests are being throttled by AWS (we asked the support for that and they confirmed it)
We increased max_retried setting for the aws provider to 40

The text was updated successfully, but these errors were encountered:

mildred · 2018-01-29T11:07:28Z

Perhaps an easy way to mitigate this problem could be to allow configurable timeouts on aws_security_group resources. As it is now, it is not configurable:

[ERR] Error decoding timeout: Timeout Key (create) is not supported

domdom82 · 2018-02-20T09:17:00Z

I am having a similar issue. Our TF deployments fail intermittently on aws_security_group timeouts. However, it is always a different security group so there is no clear pattern. IMHO just adding a timeout is only half the story if rate throttling indeed is the root cause. Then, TF should gradually decrease the rate at which it fires API calls while waiting to the timeout at the same time

domdom82 · 2018-02-22T10:07:11Z

Just found there is already a max_retries option on the provider here.

However, it still keeps running into the 5min timeout even-though I bumped max_retries. I think the global timeout does not respect the max_retries setting.

domdom82 · 2018-02-27T09:18:22Z

Did some more investigation: I am getting these error messages from time to time:

* aws_security_group.slave: timeout while waiting for state to become 'success' (timeout: 5m0s)

However, when I look at the code of aws_security_group.resourceAwsSecurityGroupCreate in https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_security_group.go#L277 you can see there is a hard-coded time out of 10 minutes for SG creation:

	// Wait for the security group to truly exist
	log.Printf(
		"[DEBUG] Waiting for Security Group (%s) to exist",
		d.Id())
	stateConf := &resource.StateChangeConf{
		Pending: []string{""},
		Target:  []string{"exists"},
		Refresh: SGStateRefreshFunc(conn, d.Id()),
		Timeout: 10 * time.Minute,
	}

	resp, err := stateConf.WaitForState()
	if err != nil {
		return fmt.Errorf(
			"Error waiting for Security Group (%s) to become available: %s",
			d.Id(), err)
	}

So I am wondering: Could this longer timeout be overshadowed by a generic 5 minute timeout so that it does not even apply?

domdom82 · 2018-03-01T15:10:11Z

I dug even deeper and found out that the real timeout occurs not in the security_group but in the tag on the security group:

	resp, err := stateConf.WaitForState()  // this works just fine and adheres the 10 minute timeout
	if err != nil {
		return fmt.Errorf(
			"Error waiting for Security Group (%s) to become available: %s",
			d.Id(), err)
	}

	if err := setTags(conn, d); err != nil {      // this fails
		return err
	}

setTags in turn calls tags.go:

	// Set tags
	if len(remove) > 0 {
		err := resource.Retry(5*time.Minute, func() *resource.RetryError {
			log.Printf("[DEBUG] Removing tags: %#v from %s", remove, d.Id())
(...)
	if len(create) > 0 {
			err := resource.Retry(5*time.Minute, func() *resource.RetryError {
				log.Printf("[DEBUG] Creating tags: %s for %s", create, d.Id())

which has a hard-coded timeout of 5 minutes. Since tags can be on any resource, this timeout acts in fact as a hidden timeout on all AWS resources that have tags. Not cool.

domdom82 · 2018-03-01T15:48:19Z

@mildred this becomes even worse as timeouts can currently only be defined on resources but tags are not modelled as resources. So a potential fix would have to "inherit" the timeout from the resource that wants to be tagged or otherwise use a default value. hmm.

domdom82 · 2018-03-02T10:28:36Z

Opened two PRs that should fix both problems mentioned in this issue.

2rs2ts · 2018-03-13T22:49:46Z

We actually see that this happens with creating EIP tags as well. I think #3586 will help facilitate fixing that but we'll need something like #3599 but for EIPs to see it through. Should I make a new issue for that?

fullybonkers · 2018-03-14T21:22:34Z

Does anyone have a hack/workaround to handle this problem (and destroy the resource somehow) while a fix is being developed?

mildred · 2018-03-20T21:55:17Z

I didn't follow up on this issue, sorry. Our solution was to automatically retry failed terraform execution once. It worked for most of the occurrences but we still have a small number of cases where terraform is failing and the second run it continues to fail because the first failure made a half-resource.

This specific case can be identified by the first terraform rule showing an error like:

aws_security_group_rule.base_sg_ingress_ssh: Error finding matching ingress Security Group Rule (sgrule-1225671002) for Group sg-1d6d1567

And the next terraform is failing with:

aws_security_group_rule.base_sg_ingress_ssh: [WARN] A duplicate Security Group rule was found on (sg-1d6d1567). This may be
a side effect of a now-fixed Terraform issue causing two security groups with
identical attributes but different source_security_group_ids to overwrite each
other in the state. See Consider security groups with source security groups when hashing terraform#2376 for more
information and instructions for recovery. Error message: the specified rule "peer: 0.0.0.0/0, TCP, from port: 22, to port: 22, ALLOW" already exists

We could dig this up to a hardcoded 5m timeout too on the DescribeSecurityGroups API call. The code here is obvious:

https://github.com/terraform-providers/terraform-provider-aws/blob/faeda15f5259d4925824774f1419f69b0f000d5c/aws/resource_aws_security_group.go#L324-L350

The solution is probably not to have configurable timeouts, but for terraform to honor throttling algorithms for AWS requests. In probably multiple places too...

We contacted AWS support and they told us:

Please be aware as I was indicated recently you have plenty of throttled calls during the time frame you mentioned, not only in relation to pointed API call "DescribeSecurityGroups". To avoid similar issues in the future, please follow this workaround related to situations where high amount of API calls affects efficiency. To deal with it is crucial to implement exponential back off and API retry logic, as well as jitter.

This is a method to stagger your API calls by having each server have a random sleep interval (this is called jitter), and make sure that if it gets throttled or an error occurs, it will wait for a small random interval before trying again (this is implemented as exponential back off). A more detailed description of how to implement this can be found in the documentation found at [1]. You can find more details on implementing these under those links [2] and [3].

[1] https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
[2] http://docs.aws.amazon.com/general/latest/gr/api-retries.html
[3] https://www.awsarchitectureblog.com/2015/03/backoff.html

mildred · 2018-03-20T21:56:30Z

We now have implemented saving debug output on our failed terraform calls. I'll soon attach full debug output here.

mildred · 2018-03-26T07:50:41Z

Did not attach full debug output because it contained credentials. However, there is some debug output I posted in #3586 (comment) that supports the need of the pull request #3911:

I have the following terraform error:

* aws_security_group_rule.base_sg_ingress_services: Error finding matching ingress Security Group Rule (sgrule-4087802998) for Group sg-8eb99cf4

The debug logs tell me terraform performs the following request:

2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [INFO] Security Group ID: sg-f284a188
2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] Waiting for Security Group (sg-f284a188) to exist
2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] Waiting for state to become: [exists]
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DescribeSecurityGroups Details:
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: ---[ REQUEST POST-SIGN ]-----------------------------
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: POST / HTTP/1.1
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Host: ec2.eu-west-1.amazonaws.com
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: User-Agent: aws-sdk-go/1.12.62 (go1.9.2; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.2
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Content-Length: 70
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Content-Type: application/x-www-form-urlencoded; charset=utf-8
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: X-Amz-Date: 20180322T050116Z
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Accept-Encoding: gzip
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Action=DescribeSecurityGroups&GroupId.1=sg-8eb99cf4&Version=2016-11-15
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: -----------------------------------------------------

And the next response in the logs for ec2/DescribeSecurityGroups I get is:

2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DescribeSecurityGroups Details:
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: ---[ RESPONSE ]--------------------------------------
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: HTTP/1.1 503 Service Unavailable
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Connection: close
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Transfer-Encoding: chunked
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Date: Thu, 22 Mar 2018 05:01:16 GMT
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Server: AmazonEC2
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: -----------------------------------------------------
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: <Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>936c5bbe-1957-4911-a928-8d08d3b7d1d3</RequestID></Response>

I would say this confirms the rate limiting is causing the error

When being throttled on AWS requests, read requests are the first ones to be throttled, and reading security group rules can take longer than 5m to complete. Transform the hard timeout of 5m with a configurable timeout to avoid this problem. Fixes part of hashicorp#3128

domdom82 · 2018-04-04T07:40:38Z

@mildred terraform does implement an exponential backoff algorithm, for example in waitForState here: https://github.com/hashicorp/terraform/blob/master/helper/resource/state.go#L51

However, I think the algorithm is not used in every API call throughout.

heldersepu · 2018-07-02T20:58:10Z

@terraform-providers/terraform-provider-aws
Any update on this issue?
Can someone please confirm that the exponential backoff algorithm is used on every call?

mildred · 2018-07-03T07:39:50Z

@domdom82 Here the problem is not that the exponential backoff algorithm is not in use, it's that the terraform resource doesn't have a configurable timeout, and that in my case the default timeout is too low.

heldersepu · 2018-07-03T14:09:31Z

@mildred in your case have you seen Throttling: Rate exceeded ?
I do get that every once in a while with different resources

When being throttled on AWS requests, read requests are the first ones to be throttled, and reading security group rules can take longer than 5m to complete. Transform the hard timeout of 5m with a configurable timeout to avoid this problem. Fixes part of hashicorp#3128

mildred · 2018-08-13T09:26:12Z

No, I'm getting those AWS errors (in the debug output):

<Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>b699395c-cdd1-40ce-a17c-3045da933370</RequestID></Response>

When being throttled on AWS requests, read requests are the first ones to be throttled, and reading security group rules can take longer than 5m to complete. Transform the hard timeout of 5m with a configurable timeout to avoid this problem. Fixes part of hashicorp#3128

github-actions · 2022-03-16T17:43:09Z

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed. Maintainers can also remove the stale label.

If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!

When being throttled on AWS requests, read requests are the first ones to be throttled, and reading security group rules can take longer than 5m to complete. Transform the hard timeout of 5m with a configurable timeout to avoid this problem. Fixes part of hashicorp#3128

github-actions · 2022-05-16T02:23:15Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

jen20 added bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service. labels Jan 25, 2018

This was referenced Mar 1, 2018

tags should retry without time bounds on EC2 throttling #3586

Merged

add configurable timeouts to security groups #3599

Merged

ctso mentioned this issue Mar 13, 2018

Add support for timeouts on aws_eip resource #3769

Merged

mildred mentioned this issue Mar 21, 2018

ECR creation sometimes fails with Resource '...' does not have attribute 'repository_url' for variable '...' #3849

Closed

mildred mentioned this issue Mar 26, 2018

Allow configurable timeout when reading security group rule #3911

Closed

github-actions bot added the stale Old or inactive issues managed by automation, if no further action taken these will get closed. label Mar 16, 2022

github-actions bot closed this as completed Apr 15, 2022

github-actions bot locked as resolved and limited conversation to collaborators May 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

aws_security_group: timeout while waiting for state to become 'success'. Subsequent terraform runs fails on that resource #3128

aws_security_group: timeout while waiting for state to become 'success'. Subsequent terraform runs fails on that resource #3128

mildred commented Jan 25, 2018

mildred commented Jan 29, 2018

domdom82 commented Feb 20, 2018

domdom82 commented Feb 22, 2018

domdom82 commented Feb 27, 2018

domdom82 commented Mar 1, 2018

domdom82 commented Mar 1, 2018

domdom82 commented Mar 2, 2018

2rs2ts commented Mar 13, 2018

fullybonkers commented Mar 14, 2018

mildred commented Mar 20, 2018

mildred commented Mar 20, 2018

mildred commented Mar 26, 2018

domdom82 commented Apr 4, 2018

heldersepu commented Jul 2, 2018

mildred commented Jul 3, 2018

heldersepu commented Jul 3, 2018 •

edited

Loading

mildred commented Aug 13, 2018

github-actions bot commented Mar 16, 2022

github-actions bot commented May 16, 2022

aws_security_group: timeout while waiting for state to become 'success'. Subsequent terraform runs fails on that resource #3128

aws_security_group: timeout while waiting for state to become 'success'. Subsequent terraform runs fails on that resource #3128

Comments

mildred commented Jan 25, 2018

Terraform Version

Affected Resource(s)

Terraform Configuration Files

Debug Output

Expected Behavior

Actual Behavior

Steps to Reproduce

Important Factoids

mildred commented Jan 29, 2018

domdom82 commented Feb 20, 2018

domdom82 commented Feb 22, 2018

domdom82 commented Feb 27, 2018

domdom82 commented Mar 1, 2018

domdom82 commented Mar 1, 2018

domdom82 commented Mar 2, 2018

2rs2ts commented Mar 13, 2018

fullybonkers commented Mar 14, 2018

mildred commented Mar 20, 2018

mildred commented Mar 20, 2018

mildred commented Mar 26, 2018

domdom82 commented Apr 4, 2018

heldersepu commented Jul 2, 2018

mildred commented Jul 3, 2018

heldersepu commented Jul 3, 2018 • edited Loading

mildred commented Aug 13, 2018

github-actions bot commented Mar 16, 2022

github-actions bot commented May 16, 2022

heldersepu commented Jul 3, 2018 •

edited

Loading