tags should retry without time bounds on EC2 throttling #3586

domdom82 · 2018-03-01T17:18:30Z

this PR addresses the "security_group timeout due to tag timeout" part of issue #3128

since tags are not resources in the sense of terraform, they have no configurable timeouts per se.

in order to avoid hard-coded timeouts on tags, I have provided this PR which tries to use the Update timeout of the resource that is being tagged. If no timeout is defined for that resource, the regular default of ResourceData is used. This should at least provide some means of configuring timeouts on tags via the to-be-tagged resource.

domdom82 · 2018-03-08T08:28:20Z

hi @bflad any chance we can get this in for 1.11 ? timeouts on tags are hitting us pretty hard these days.

bflad · 2018-03-08T14:59:30Z

Can you provide debug logs that show that you're hitting EC2 rate limiting and not masking some other error?

domdom82 · 2018-03-08T15:41:34Z

@bflad sure can. I also described in #3128 that we are hitting a 5 minute timeout on a security_group create but its timeout is at 10 minutes:

* aws_security_group.slave: timeout while waiting for state to become 'success' (timeout: 5m0s)

So when digging deeper we found the hard-coded timeout of 5 minutes on tags.setTags:

// Set tags
if len(remove) > 0 {
   err := resource.Retry(5*time.Minute, func() *resource.RetryError {

So then we bumped the hard-coded timeout to 10 minutes - same as the security_group itself - for testing. And it worked just fine repeatedly. This got me thinking that we could make this a bit smarter than just bumping a hard timeout and instead make it dependent on the resource that wants to be tagged.

The main issue I see currently is that people run into timeouts on certain resources, then bump their timeouts to fix it but then wonder why their deployment still fails because there is another "hidden" timeout on the tag of their resource which they cannot change atm.

2rs2ts · 2018-03-13T22:53:42Z

It seems to me that this effectively doubles the timeout setting you're actually using for the resource. What do you think of taking the timeout from the schema, minus the time elapsed since initiating the create/update function, and using that for the tag timeout?

domdom82 · 2018-03-14T14:59:21Z

@2rs2ts good points. How would you pass the start time? In the ResourceData.meta map?

2rs2ts · 2018-03-14T18:10:40Z

@domdom82 I don't know, I'm not really familiar with the code, I just thought of the idea. Sorry I'm not of much help 😅

mildred · 2018-03-22T17:08:47Z

edit: I was probably mistaken to post this debug output or this PR. I have in fact a related problem, but it does not appear to be exactly the same. See #3128 (comment)

@bflad I cannot share the full debug logs I have (since they contain credentials) but I have the following terraform error:

* aws_security_group_rule.base_sg_ingress_services: Error finding matching ingress Security Group Rule (sgrule-4087802998) for Group sg-8eb99cf4

The debug logs tell me terraform performs the following request:

2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [INFO] Security Group ID: sg-f284a188
2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] Waiting for Security Group (sg-f284a188) to exist
2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] Waiting for state to become: [exists]
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DescribeSecurityGroups Details:
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: ---[ REQUEST POST-SIGN ]-----------------------------
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: POST / HTTP/1.1
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Host: ec2.eu-west-1.amazonaws.com
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: User-Agent: aws-sdk-go/1.12.62 (go1.9.2; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.2
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Content-Length: 70
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Content-Type: application/x-www-form-urlencoded; charset=utf-8
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: X-Amz-Date: 20180322T050116Z
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Accept-Encoding: gzip
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Action=DescribeSecurityGroups&GroupId.1=sg-8eb99cf4&Version=2016-11-15
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: -----------------------------------------------------

And the next response in the logs for ec2/DescribeSecurityGroups I get is:

2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DescribeSecurityGroups Details:
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: ---[ RESPONSE ]--------------------------------------
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: HTTP/1.1 503 Service Unavailable
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Connection: close
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Transfer-Encoding: chunked
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Date: Thu, 22 Mar 2018 05:01:16 GMT
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Server: AmazonEC2
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: -----------------------------------------------------
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: <Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>936c5bbe-1957-4911-a928-8d08d3b7d1d3</RequestID></Response>

I would say this confirms the rate limiting is causing the error

domdom82 · 2018-04-04T07:29:27Z

@mildred same here. I think it is not the tag creation itself because tags are very small entities that don't take long to create, however if you are rate throttled while you are creating multiple resources at a time (in my case many security groups along with rules and tags) it can happen that you run into an early timeout (in my case the hard-coded 5 minutes on tags) - even though you might have set a longer timeout on the parent resource (e.g. 10 minutes on security groups).

2rs2ts · 2018-07-03T18:36:21Z

bump, what's the status of this PR?

domdom82 · 2018-07-04T08:07:12Z

@2rs2ts I'd love to see it merged. Tag timeouts are one of the most annoying things in our CI pipeline right now. It happens especially often on large sets of security groups getting deployed in one TF file.
Since @stjimmy88 approved I don't know what's blocking this merge tbh.

domdom82 · 2018-11-02T13:06:02Z

@bflad bump for merge

bflad · 2018-11-09T04:25:55Z

Hi @domdom82 👋 Sorry for the delayed response here.

In #6409, we introduce a helper function (isResourceTimeoutError(err)) that checks whether the error returned by resource.Retry() is strictly a time-based timeout error, as happens when the SDK is stuck in its own retry logic and never returns (e.g. throttling errors). We could leverage that new function here by calling it after the current time-based retry loop to retry the call one last time. This effectively removes the time element (which is guesswork on the operator's part) and switches to SDK-based retries for throttling (we default to 20, which will back off in excess of 30 minutes in many cases, but is configurable via max_retries per provider).

What do you think?

domdom82 · 2018-11-09T13:54:55Z

@bflad I think this is a great idea. Ideally, I wouldn't have to configure timeouts on a per-resource basis but only have a provider-level setting. As you said, it is guesswork by the operator to tweak those timeouts manually and there is never a right setting.

bflad · 2018-11-09T20:12:24Z

isResourceTimeoutError() is merged and available now - would you mind tweaking this PR so we can get it into the next release? If you don't have time, no big deal, I can add a commit after yours too. Thanks so much for your help and hopefully this gets less annoying to work around. 😄

domdom82 · 2018-11-14T08:10:12Z

@bflad LGTM? I also renamed the PR to match the code change more accurately.

domdom82 · 2018-11-14T16:19:52Z

bumped the beast a final time 🤞

bflad

LGTM, thanks @domdom82! 🚀 (We could return early on !isResourceTimeoutError() to remove the additional nesting but that's more of a nitpick)

(Test failures unrelated)

Tests failed: 2, passed: 245

bflad · 2018-11-15T00:52:54Z

This has been released in version 1.44.0 of the AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

ghost · 2020-04-02T17:25:05Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

ghost added the size/XS label Mar 1, 2018

bflad added the bug and service/ec2 labels Mar 1, 2018

bflad added the waiting-response label Mar 8, 2018

2rs2ts mentioned this pull request Mar 13, 2018

aws_security_group: timeout while waiting for state to become 'success'. Subsequent terraform runs fails on that resource #3128

Closed

ctso mentioned this pull request Mar 13, 2018

Add support for timeouts on aws_eip resource #3769

Merged

stjimmy88 approved these changes Apr 9, 2018

bflad added this to the v1.44.0 milestone Nov 9, 2018

domdom82 added 2 commits November 12, 2018 15:23

tags should inherit timeout from tagged resources

1cbe263

Implement retry on tags using isResourceTimeoutError

db7189e

domdom82 force-pushed the tags_timeout_from_resource branch from d0066f2 to f13c950 Compare November 12, 2018 15:25

bflad reviewed Nov 14, 2018

ghost added the size/S label and removed the size/M label Nov 14, 2018

domdom82 force-pushed the tags_timeout_from_resource branch from 7dd771f to 21a9845 Compare November 14, 2018 07:43

Retry without time bounds on EC2 throttling

1256563

domdom82 force-pushed the tags_timeout_from_resource branch from 21a9845 to 1256563 Compare November 14, 2018 07:45

domdom82 changed the title ~~tags should inherit timeout from tagged resources~~ tags should retry without time bounds on EC2 throttling Nov 14, 2018

bflad reviewed Nov 14, 2018

Return err if errored but not isResourceTimeoutError

5df2e9a

bflad approved these changes Nov 14, 2018

bflad merged commit c3c296c into hashicorp:master Nov 14, 2018

bflad added a commit that referenced this pull request Nov 14, 2018

Update CHANGELOG for #3586

137cd3c

mdlavin mentioned this pull request Nov 21, 2018

V1.46.0 patched lifeomic mdlavin/terraform-provider-aws#3

Closed

ghost locked and limited conversation to collaborators Apr 2, 2020
