provider/aws: Timeout waiting for state to become 'success' on large deploys #13407
Comments
I agree. I saw this in our nightly acceptance tests today too:
I'm also experiencing this issue.
Deploying > 100 instances, I pretty often get random errors on different objects, like:
@radeksimko Thank you for the fix. Here is another troublemaker:
Cheers @radeksimko!
Hey all,
Referencing #5460 and the number of issues that it fixed, I think I've discovered another edge case where this might be happening. I'm not entirely sure it's exactly related, as it's about the waiter's timeout value rather than a problem with the waiter itself.
I'm currently working on a large VPC deploy of roughly 70 subnets, all with the routes and other lovely things they need to be interconnected. All in all it's around 365 resources, with some of that being entries in our IP database (so the figure is probably a little under 300).
Getting the same intermittent error as was described in the other issues:
After retrying the TF run a couple of times, everything succeeds, presumably because it manages to make its way through the rest of the resources without being throttled.
Looking at the route table code I didn't see much that would cause this, although in the debug log there was a decent amount of chatter regarding tags that I found curious. Checking builtin/providers/aws/tags.go, there are a number of retry waiters that are, coincidentally, sitting at a 2 minute timeout. This may be the culprit, because on another run, when I turned concurrency down to 2, the failures happened on subnet resources rather than route table resources.
What I'm guessing is that the large volume of API requests (write requests, at that) is causing far more throttling on the AWS side than would normally be expected, so a longer timeout on the tag waiters is probably warranted to fix this.
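To make that concrete, here is a minimal sketch of the kind of bump I have in mind, assuming the tag waiters are built on `resource.Retry` from `helper/resource`; the helper name, the `CreateTags` call, and the error-code check below are illustrative, not the provider's actual code:

```go
package aws

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/hashicorp/terraform/helper/resource"
)

// setTagsWithRetry is a hypothetical helper showing the proposed change:
// widen the retry window from 2 to 5 minutes so that heavy throttling on
// large plans has time to clear before the waiter gives up.
func setTagsWithRetry(conn *ec2.EC2, resourceID string, tags []*ec2.Tag) error {
	return resource.Retry(5*time.Minute, func() *resource.RetryError { // was 2*time.Minute
		_, err := conn.CreateTags(&ec2.CreateTagsInput{
			Resources: []*string{aws.String(resourceID)},
			Tags:      tags,
		})
		if err != nil {
			// Treat throttling as retryable; anything else fails immediately.
			if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == "RequestLimitExceeded" {
				return resource.RetryableError(err)
			}
			return resource.NonRetryableError(err)
		}
		return nil
	})
}
```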
I'll try to put together a repro in the next day or so as an acceptance test just to make sure before sending in a PR.
PS: There seem to be other places affected as well: builtin/providers/aws/resource_aws_route_table_association.go has a waiter that I hit after fixing the ones in tags. This could be a more systemic issue with 2 minute waiters, and we might want to raise all of these to maybe 5 minutes instead. Raising the timeout there, and the timeout in tags, resolved the issue.
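Since the error in the title ("waiting for state to become 'success'") points at a `resource.StateChangeConf` waiter, the equivalent change there would look roughly like the sketch below; the pending/target states and the refresh function are placeholders rather than the provider's actual values:

```go
package aws

import (
	"time"

	"github.com/hashicorp/terraform/helper/resource"
)

// waitForAssociation sketches raising a StateChangeConf timeout from 2 to
// 5 minutes; the states and Refresh function are illustrative placeholders.
func waitForAssociation(refresh resource.StateRefreshFunc) error {
	stateConf := &resource.StateChangeConf{
		Pending: []string{"pending"},
		Target:  []string{"success"},
		Refresh: refresh,
		Timeout: 5 * time.Minute, // previously 2 * time.Minute
	}
	_, err := stateConf.WaitForState()
	return err
}
```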