provider/aws: Timeout waiting for state to become 'success' on large deploys #13407
Comments
I agree. I saw this in our nightly acceptance tests today too:
I'm also experiencing this issue.
Deploying > 100 instances, I pretty often get random errors on different objects, like:
@radeksimko Thank you for the fix. Here is another troublemaker:
Cheers @radeksimko!
Hey all,
Referencing #5460 and the number of issues that it fixed, I think I've discovered another edge case where this might be happening. I'm not entirely sure it's exactly related, as it's about the waiter's timeout value rather than a problem with the waiter itself.
I'm currently working on a large VPC deploy of roughly 70 subnets, all with the routes and other lovely things they need to be interconnected. All in all it's around 365 resources, with some of that being entries in our IP database (so the figure is probably a little under 300).
Getting the same intermittent error as was described in the other issues:
After retrying the TF run a couple of times, everything succeeds, presumably because it manages to make its way through the rest of the resources without being throttled.
Looking at the route table code I didn't see much that would cause this, although in the debug log there was a decent amount of chatter regarding tags that I found curious. Checking builtin/providers/aws/tags.go, there are a number of retry waiters that are, coincidentally, sitting at a 2 minute timeout. This may be the culprit, because on another run, when I turned concurrency down to 2, the failures happened on subnet resources rather than route table resources.
What I'm guessing is that the large volume of API requests (write requests, at that) is causing far more throttling on the AWS side than would normally be expected, so a longer timeout on the tag waiters is probably warranted to fix this.
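To make that concrete, here is a minimal sketch of the kind of bump I have in mind, assuming the tag waiters are built on `resource.Retry` from `helper/resource`; the helper name, the `CreateTags` call, and the error-code check below are illustrative, not the provider's actual code:

```go
package aws

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/hashicorp/terraform/helper/resource"
)

// setTagsWithRetry is a hypothetical helper showing the proposed change:
// widen the retry window from 2 to 5 minutes so that heavy throttling on
// large plans has time to clear before the waiter gives up.
func setTagsWithRetry(conn *ec2.EC2, resourceID string, tags []*ec2.Tag) error {
	return resource.Retry(5*time.Minute, func() *resource.RetryError { // was 2*time.Minute
		_, err := conn.CreateTags(&ec2.CreateTagsInput{
			Resources: []*string{aws.String(resourceID)},
			Tags:      tags,
		})
		if err != nil {
			// Treat throttling as retryable; anything else fails immediately.
			if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == "RequestLimitExceeded" {
				return resource.RetryableError(err)
			}
			return resource.NonRetryableError(err)
		}
		return nil
	})
}
```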
I'll try to put together a repro in the next day or so as an acceptance test just to make sure before sending in a PR.
PS: There seem to be other places affected as well: builtin/providers/aws/resource_aws_route_table_association.go has a waiter that I hit after fixing the ones in tags. This could be a more systemic issue with 2 minute waiters, and we might want to raise all of these to maybe 5 minutes instead. Raising the timeout there, and the timeout in tags, resolved the issue.
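Since the error in the title ("waiting for state to become 'success'") points at a `resource.StateChangeConf` waiter, the equivalent change there would look roughly like the sketch below; the pending/target states and the refresh function are placeholders rather than the provider's actual values:

```go
package aws

import (
	"time"

	"github.com/hashicorp/terraform/helper/resource"
)

// waitForAssociation sketches raising a StateChangeConf timeout from 2 to
// 5 minutes; the states and Refresh function are illustrative placeholders.
func waitForAssociation(refresh resource.StateRefreshFunc) error {
	stateConf := &resource.StateChangeConf{
		Pending: []string{"pending"},
		Target:  []string{"success"},
		Refresh: refresh,
		Timeout: 5 * time.Minute, // previously 2 * time.Minute
	}
	_, err := stateConf.WaitForState()
	return err
}
```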