Add a way to handle the API rate limit on AWS #1051

If you have a big infrastructure, you will hit the AWS API rate limit when trying to plan it. Find a way to work around this limitation in the provider itself, in some global way.

Comments
We currently implement parallelism with a semaphore, which helps throttle connections. The semaphore's size could potentially be made configurable.
@pearkes I think we should introduce a global, provider-wide semaphore in the provider as well, to artificially rate limit those resources and avoid this. The global parallelism semaphore helps, but isn't meant to solve this problem.
@mitchellh Yeah, that's a good point. Provider A may not mind, but provider B may want a maximum of 30 req/minute.
This should also be configurable per provider config because, e.g., in AWS each account may have different limits which aren't actually public anywhere (not even via the API), but you may ask AWS to increase them. We should allow customers with higher limits to bootstrap their infrastructure faster if the API allows it.
@radeksimko Sounds fair!
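As an illustration of the provider-wide semaphore discussed above, here is a minimal Go sketch. The type, the names (`apiSemaphore`, `withThrottle`), and the limit of 10 concurrent calls are all hypothetical, not Terraform's actual implementation.

```go
package main

import "fmt"

// apiSemaphore bounds the number of concurrent API calls for one provider.
// A buffered channel is the idiomatic Go semaphore: sends acquire a slot,
// receives release it.
type apiSemaphore chan struct{}

func newAPISemaphore(size int) apiSemaphore {
	return make(apiSemaphore, size)
}

// withThrottle runs fn while holding a semaphore slot, so at most
// cap(s) calls are in flight at any time.
func (s apiSemaphore) withThrottle(fn func() error) error {
	s <- struct{}{}        // acquire
	defer func() { <-s }() // release
	return fn()
}

func main() {
	sem := newAPISemaphore(10) // hypothetical per-provider limit
	err := sem.withThrottle(func() error {
		fmt.Println("calling the AWS API...")
		return nil
	})
	_ = err
}
```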
I'm running into these API throttling limits right now, deploying a reasonably sized configuration using master. Some kind of rate limiting of API calls is definitely required, and a rate limit that could be set a priori as part of the configuration would definitely help. However, Terraform also needs to handle the corner case of hitting this limit by automatically retrying and backing off its requests.

Terraform cannot assume it is the only consumer of the provider's API request budget: for AWS, the request rate limit is account-wide, and other applications may be depleting the budget independently. I would be happy if it retried failed requests and issued warnings prompting me to adjust a rate limit in the config. Extra bonus points if it could automatically modulate the request rate up and down when limits are hit. As an example, I needed to run an apply twice to get it to complete.
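A configurable, a-priori rate limit like the one requested above can be modeled as a token bucket. Below is a minimal sketch using golang.org/x/time/rate; the 30 requests/minute setting and the `limitedCall` wrapper are illustrative assumptions, not anything Terraform actually ships.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Hypothetical per-provider setting: one request every two seconds
	// (30 req/minute), with a small burst allowance of 5.
	limiter := rate.NewLimiter(rate.Every(2*time.Second), 5)

	limitedCall := func(ctx context.Context, call func() error) error {
		// Wait blocks until a token is available (or ctx is done),
		// smoothing requests down to the configured rate.
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		return call()
	}

	err := limitedCall(context.Background(), func() error {
		fmt.Println("DescribeInstances...")
		return nil
	})
	_ = err
}
```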
True, but AWS is making us blind here... they don't provide any useful stats, unlike many other APIs from GitHub, Twitter, Google, etc. If we had such things, we could make Terraform actually clever. :)
Right... it's quite annoying, and their only answer to questions on the matter is that you need to implement exponential back-off when you receive the RequestLimitExceeded error.

I've reached a point where I cannot deploy my configuration anymore with 0.3.7 or master because I hit this API throttling error constantly. It seems to depend on where the apply run errors out with the limit-exceeded error; more often than not it now leaves the partially deployed configuration in an unrecoverable state that requires me to manually destroy the VPC from the AWS console and start over. Other times I can run apply 3-4 times and it eventually succeeds, deploying more resources incrementally on each run.

This is a show-stopper for my automated deployments, so my next stop will be AWS support, to see if I can get the API throttling limit bumped up for my account until Terraform can gracefully back off its requests.
The request limits definitely differ per API call. If you have any way to set some sane attempt limits and retry thresholds on a per-API-call basis, this would go a long way. It might be difficult with the move to aws-sdk-go, unless they're doing something already.

Example: RunInstances tends to baulk around 20-30/sec, and it will throttle you quite hard for 5-10 seconds, whereas DescribeInstances will allow a lot more. AWS does define 3 categories in their documentation; maybe the medium/high complexity calls could be listed and treated differently from the default? In my experience AWS doesn't want to reveal the thresholds, to avoid people abusing them.

Nathan Sullivan
I talked with our AWS support people today and they basically said that the EC2 API rate limits are already the highest of all services. They will not raise them for an account even if you have a very expensive support agreement. An application not implementing back-off was definitely not regarded as anything even close to a sufficient reason for them to consider raising limits.

Because my configuration would basically never deploy without hitting the API limits on my account and crapping out... sometimes leaving everything in a bad state... I did some hackery in the EC2 request code in aws-sdk-go to add an exponential back-off for requests that hit the rate limit, and now it works reliably without ever failing on API rate limit errors. I had never written a line of Go in my life until today, so it was a gross hack and only deals with the particular case that was blocking me.

A proper implementation would put a more general retry mechanism in place around the request code, one that is aware of the different kinds of request error responses that should be retried in EC2. The AWS docs have a table that lists the different errors that need to be retried (5xx server errors, RequestLimitExceeded, ConcurrentTagAccess, some DependencyViolation cases due to eventual consistency, etc.): http://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html The retry logic differs across services, so you need to deal with each one individually.
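A general retry mechanism like the one described would first need to classify which errors are retryable. A minimal sketch, assuming the error codes from the EC2 errors table linked above; the function and the string matching are illustrative only (real code would inspect a typed SDK error such as awserr.Error):

```go
package retry

import "strings"

// retryableCodes lists EC2 error codes that the AWS documentation
// says clients should retry, typically with back-off.
var retryableCodes = map[string]bool{
	"RequestLimitExceeded": true, // throttling: back off before retrying
	"ConcurrentTagAccess":  true, // tag operations racing each other
	"DependencyViolation":  true, // can be eventual consistency; retry a few times
	"InternalError":        true, // 5xx-class server-side failures
	"Unavailable":          true,
}

// shouldRetry reports whether an EC2 API error looks retryable.
func shouldRetry(err error) bool {
	if err == nil {
		return false
	}
	for code := range retryableCodes {
		if strings.Contains(err.Error(), code) {
			return true
		}
	}
	return false
}
```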
@willmcg can you share more details on how you implemented the back-off? |
It was some simple, horrible hackery in aws-sdk-go/aws/ec2.go, in the EC2Client Do() method, that put a retry loop around the request logic to keep retrying the request with a brain-dead back-off delay whenever the EC2 API call returned the "RequestLimitExceeded" error.
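The original snippet was not preserved in this thread. As a rough, hedged reconstruction matching the description — a retry loop with a simple linear delay — it may have looked something like the following, where the function name, attempt count, and base delay are all invented:

```go
package retry

import (
	"strings"
	"time"
)

// doWithRetry wraps a single request attempt (doOnce) in a retry loop.
// maxAttempts and the 2-second base delay are made-up values.
func doWithRetry(doOnce func() error) error {
	const maxAttempts = 10
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = doOnce()
		if err == nil || !strings.Contains(err.Error(), "RequestLimitExceeded") {
			return err // success, or an error we don't retry
		}
		// Linear back-off: wait attempt * 2s before the next try
		// (a proper fix would use exponential back-off with jitter).
		time.Sleep(time.Duration(attempt) * 2 * time.Second)
	}
	return err
}
```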
Actually, it is such a hack that it's really only a linear back-off, not even exponential at present :-) I see that in the source JSON for the APIs there is a _retry.json that actually details all the retry conditions, but it doesn't look like the Go request code uses this information for the retry policy on failed requests. That would be the right way to handle retries, rather than my hackery.
It looks like master of aws-sdk-go implements retries with an exponential back-off on Throttling errors: https://github.com/awslabs/aws-sdk-go/blob/master/aws/service.go#L124-L142. The fork that Terraform is using doesn't include it, but perhaps this will help once it catches up with upstream. If the version of aws-sdk-go used were more up to date, would that solve the problem, or would it still make sense to have something in Terraform controlling the number of requests being generated? Either way, I'm very interested in helping move this forward. I'm new to Go, but very interested in learning.
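For context, exponential back-off of the kind the SDK implements grows the delay with each retry and usually adds random jitter so that concurrent clients don't retry in lockstep. A generic sketch; the base delay and cap here are arbitrary examples, not the SDK's actual constants:

```go
package retry

import (
	"math/rand"
	"time"
)

// backoffDelay returns how long to sleep before retry number retryCount
// (0-based). The delay doubles with each retry (base * 2^retryCount) and
// gets full random jitter, capped so long retry chains don't sleep forever.
func backoffDelay(retryCount int) time.Duration {
	const (
		base     = 100 * time.Millisecond // arbitrary example base
		maxDelay = 30 * time.Second       // arbitrary example ceiling
	)
	d := base << uint(retryCount) // base * 2^retryCount
	if d <= 0 || d > maxDelay {   // d <= 0 guards against shift overflow
		d = maxDelay
	}
	return time.Duration(rand.Int63n(int64(d))) // jitter: uniform in [0, d)
}
```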
Happy to help, how can we move this forward? |
I think most if not all of the work has been done to support moving to the official aws-sdk-go, which implements back-off/retry. This should be in the next release.
+1 as this is really annoying and impactful. |
I'm really hoping https://github.com/awslabs/aws-sdk-go/blob/a79c7d95c012010822e27aaa5551927f5e8a6ab6/aws/service.go#L134 helps, but I'm concerned that the default max retries is too low at 3. In my case it's the AutoScaling API that is throwing rate-limit-exceeded errors, and I've seen the command retry up to 9 times before it succeeds. Granted, I've been using an older version with some custom retry logic added in while I waited for the aws-sdk-go library to catch up with upstream, but I copy/pasted the logic from the upstream aws-sdk-go repo, so the behavior should be similar.
@fromonesrc - When is the next release? |
Looks like it will be in 0.5.0 (https://github.com/hashicorp/terraform/blob/master/CHANGELOG.md#050-unreleased) but I don't know when that will ship. |
Any word on when 0.5.0 will be released? And is it possible to get the rate limiting by building what's currently in the repo, or is rate limiting a feature that hasn't been developed yet but is on the roadmap?
So I'm running into this issue. Route53 seems to be VERY aggressive with throttling; I can't get an apply to complete. Anybody have a work-around? Otherwise I might have to downgrade temporarily.
I just started hitting this as well 😒 Much needed 👍 |
We are hitting this as well. |
I was able to hack around it by delaying all outbound traffic on my workstation by 1 second. There may be a more elegant solution.
When running Terraform from master with the retry logic enabled, we were still hit by the API rate limits. After increasing MaxRetries to 11, we no longer experienced the issue. It looks like the default of 3 retries is not enough. In #1787, the number of retries is made configurable, with a default of 11 (i.e., a delay of 61 seconds before the last retry).
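For reference, with the official aws-sdk-go the retry count is a one-line configuration change. A minimal sketch using the SDK's later v1 API (which postdates this thread); the value 11 mirrors the default proposed in #1787, and whether Terraform exposes this knob depends on #1787 itself:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// MaxRetries controls how many times the SDK retries throttled or
	// 5xx responses, with exponential back-off between attempts.
	sess, err := session.NewSession(&aws.Config{
		Region:     aws.String("us-east-1"),
		MaxRetries: aws.Int(11),
	})
	if err != nil {
		panic(err)
	}

	svc := ec2.New(sess)
	out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("reservations:", len(out.Reservations))
}
```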
@koendc, are you running on AWS or some other provider? I can change the number of retries for OpenStack, but can't for AWS.
I should have made myself a bit more clear.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further. |