
provider/aws: Timeout waiting for state to become 'success' on large deploys #13407

Closed · vancluever opened this issue Apr 5, 2017 · 6 comments · Fixed by #14345

Comments

@vancluever (Contributor) commented Apr 5, 2017

Hey all,

  • NOTE: TF v0.8.8 in this case, but it looks like this applies to master too.

Referencing #5460 and the number of issues it fixed, I think I've discovered another edge case where this might be happening. I'm not sure it's exactly related, though, as it concerns the waiter timeout value rather than a problem with the waiter itself.

I'm currently working on a large VPC deploy of roughly 70 subnets, all with routes and the other lovely things they need to be interconnected. All in all it's around 365 resources, though some of those are entries in our IP database, so the AWS figure is probably a little under 300.

I'm getting the same intermittent error that was described in the other issues:

aws_route_table.private_route_tables.47: timeout while waiting for state to become 'success' (timeout: 2m0s)

After retrying the TF run a couple of times, everything succeeds, presumably because it manages to make its way through the rest of the resources without being throttled.

Looking at the route table code, I didn't see much that would cause this, although the debug log had a decent amount of chatter regarding tags that I found curious. Checking builtin/providers/aws/tags.go, there are a number of retry waiters that are, coincidentally, sitting at a 2-minute timeout.

This may be the culprit: on another run, when I turned concurrency down to 2, the failures happened on subnet resources rather than route table resources.

What I'm guessing is that the large volume of API requests (write requests, at that) is causing much more throttling on the AWS side than would normally be expected, and hence a longer timeout on tags is probably warranted to fix this issue.
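For context, the waiter pattern in question looks roughly like the minimal sketch below. It's an approximation, not the provider's actual setTags code: the applyTags callback, the function name, and the bare "Throttling" error-code check are stand-ins for illustration, assuming the helper/resource retry API and aws-sdk-go's awserr interface as they existed around this time.

```go
package aws

import (
	"time"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/hashicorp/terraform/helper/resource"
)

// setTagsWithRetry sketches the tags.go waiter pattern. With only a
// 2-minute window, sustained throttling on a large deploy can exhaust
// the retries before AWS ever accepts the request, surfacing as
// "timeout while waiting for state to become 'success'".
func setTagsWithRetry(applyTags func() error) error {
	return resource.Retry(2*time.Minute, func() *resource.RetryError {
		err := applyTags()
		if err == nil {
			return nil
		}
		// Treat throttling responses as retryable so the waiter keeps
		// polling; anything else fails the operation immediately.
		if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == "Throttling" {
			return resource.RetryableError(err)
		}
		return resource.NonRetryableError(err)
	})
}
```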

I'll try to put together a repro in the next day or so as an acceptance test just to make sure before sending in a PR.

PS: There seem to be other cases affected as well - builtin/providers/aws/resource_aws_route_table_association.go has a waiter that I hit after fixing the ones in tags. This could be a more systemic issue with 2-minute waiters, and we might want to raise all of these to maybe 5 minutes instead. Raising the timeout there, along with the timeout in tags, resolved the issue.
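To make that concrete, raising one of those waiters would look roughly like the sketch below. This is not the actual diff: refresh stands in for the provider's real state-refresh function, and the pending/target state names are illustrative, assuming the helper/resource StateChangeConf API.

```go
package aws

import (
	"time"

	"github.com/hashicorp/terraform/helper/resource"
)

// waitForRouteTableAssociation sketches the proposed change: the same
// StateChangeConf waiter, with the window raised from 2 to 5 minutes.
func waitForRouteTableAssociation(refresh resource.StateRefreshFunc) error {
	stateConf := &resource.StateChangeConf{
		Pending: []string{"pending"},
		Target:  []string{"ready"},
		Refresh: refresh,
		Timeout: 5 * time.Minute, // was 2 * time.Minute
	}
	_, err := stateConf.WaitForState()
	return err
}
```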

@radeksimko (Member) commented:

> we might want to raise all of these to maybe 5 minutes instead

I agree.

I saw this in our nightly acceptance tests today too:

=== RUN   TestAccAWSRouteTable_vpcPeering
--- FAIL: TestAccAWSRouteTable_vpcPeering (194.95s)
    testing.go:344: Error destroying resource! WARNING: Dangling resources
        may exist. The full state and error is shown below.
        
        Error: Error applying: 1 error(s) occurred:
        
        * aws_route_table.foo (destroy): 1 error(s) occurred:
        
        * aws_route_table.foo: Error waiting for route table (rtb-88da1cee) to become destroyed: timeout while waiting for resource to be gone (last state: 'ready', timeout: 2m0s)

@kamsz commented May 3, 2017

I'm also experiencing this issue.

@zzzuzik commented May 10, 2017

Deploying > 100 instances, I pretty often get random errors on different objects, like:
aws_route_table_association.xxx: timeout while waiting for state to become 'success' (timeout: 2m0s)

@zzzuzik commented May 10, 2017

@radeksimko Thank you for the fix.

Here's another troublemaker:
aws_eip.xxx Failure associating EIP: timeout while waiting for state to become 'success' (timeout: 1m0s)

@zzzuzik commented May 13, 2017

Cheers @radeksimko !

@ghost commented Apr 12, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
