CDK deploy: Lambda LogRetention resources fail with rate exceeded errors #26837
Comments
Thanks @jaapvanblaaderen for reporting and apologies for this regression. The retry mechanism changed from SDK v2 to SDK v3, and was supposed to be smarter. But clearly something did not work. Looking into it.
Thanks @mrgrain, please let me know if I can assist in any way.
With the patched version from my PR, your reproduction example now succeeds! 🥳
That sounds great! |
The LogRetention Custom Resource used to be able to handle server-side throttling when a lot of requests to the CloudWatch Logs service are made at the same time. Handling of this error case got lost during the migration to SDK v3. If there are many `LogRetention` Custom Resources in a _single_ stack, CloudFormation apparently limits the amount of parallelism internally. For example, it appears that resources are batched into smaller groups that need to complete before the next group is provisioned, and within a group there appears to be an ever so slight delay between individual resources. Together this is enough to avoid rate limiting in most circumstances. **Therefore, in practice this issue only occurs when multiple stacks are deployed in parallel.** To test this scenario, I have added support for `integ-runner` to deploy all stacks of a test case concurrently. Support for arbitrary command args already existed, but it needed to explicitly include the `concurrency` option. I then created an integration test that deploys 3 stacks with 25 LogRetention resources each. This triggers the error cases described in #26837.

The fix itself is twofold:

- Pass the `maxRetries` prop value to the SDK client to increase the number of attempts of the SDK's internal throttling handling, but also enforce a minimum for these retries since they might catch additional retryable failures that our custom outer loop does not account for.
- Explicitly catch `ThrottlingException` errors in the outer retry loop.

Closes #26837

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
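For illustration, here is a rough sketch of what that two-part approach looks like in SDK v3 terms. This is not the actual CDK handler code: the function names, the minimum-attempts constant, and the backoff values are invented for the example; only `CloudWatchLogsClient`, `PutRetentionPolicyCommand`, and the `maxAttempts` client option come from `@aws-sdk/client-cloudwatch-logs`.

```ts
import {
  CloudWatchLogsClient,
  PutRetentionPolicyCommand,
} from '@aws-sdk/client-cloudwatch-logs';

// Hypothetical minimum; the value used by the real handler may differ.
const MIN_SDK_ATTEMPTS = 5;

// Pass the user-provided maxRetries to the SDK client, but enforce a minimum
// so the SDK's built-in retry handling can still absorb other retryable errors.
function makeClient(maxRetries?: number): CloudWatchLogsClient {
  return new CloudWatchLogsClient({
    maxAttempts: Math.max(maxRetries ?? 0, MIN_SDK_ATTEMPTS),
  });
}

// Outer retry loop that explicitly treats ThrottlingException as retryable.
async function putRetentionWithRetry(
  client: CloudWatchLogsClient,
  logGroupName: string,
  retentionInDays: number,
  outerRetries = 5,   // illustrative
  baseDelayMs = 100,  // illustrative
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await client.send(new PutRetentionPolicyCommand({ logGroupName, retentionInDays }));
      return;
    } catch (err: any) {
      if (err?.name === 'ThrottlingException' && attempt < outerRetries) {
        // Simple exponential backoff before the next outer attempt.
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
        continue;
      }
      throw err;
    }
  }
}
```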
The issue was resolved by updating to 2.93.0 on our side, but it has started to occur again without changing the version. Is it reproducible again on your side @jaapvanblaaderen? We have the same setup as described in the issue.
Thanks @nikolay-radkov for the report! Have you tried setting a higher number for `maxRetries`?
Never mind, this still needs to be released. The release of the fix should be imminent.
I can confirm that after upgrading to 2.94, the test project can be deployed without rate limit errors. We will upgrade our services to 2.94 in the coming weeks, but so far so good. Nice job!
Thanks, but it should have worked in the first place!
Sure, will keep you posted.
Describe the bug
Deploying multiple stacks with a lot of Lambda functions results in a lot of CloudWatch Log groups being created by CDK. CDK provides retry options through the `logRetentionRetryOptions` property of a Lambda `Function` construct to prevent rate exceeded errors from happening.
Until CDK 2.89.0 this worked fine. In CDK 2.90.0, the retry implementation changed in this PR and the rate exceeded errors occur again.
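For reference, the retry options are set on the function construct roughly like this (a minimal sketch; the stack name, function name, runtime, and retry count are placeholders, not taken from the affected project):

```ts
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'ExampleStack');

new lambda.Function(stack, 'ExampleFn', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async () => {};'),
  // Requests a LogRetention custom resource that sets the log group retention.
  logRetention: logs.RetentionDays.ONE_WEEK,
  // Retry options forwarded to that custom resource.
  logRetentionRetryOptions: {
    maxRetries: 10, // placeholder value
  },
});
```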
Expected Behavior
The retry mechanism should function properly, the same way it did in older versions (CDK 2.89 and lower). Stacks should deploy without issues.
Current Behavior
Stacks cannot be deployed with CDK 2.90.0 and up as they fail with a `Rate exceeded` error.
Reproduction Steps
I created a small test project to reproduce the issue: https://github.com/jaapvanblaaderen/log-retention-rate-limit. With this simple setup, the issue can be easily reproduced when deploying the stacks in parallel as described in the readme of the repo. I even increased the amount of retries to 50 but then it still fails. Testing it against 2.89.0 (which works fine) is possible by checking out the cdk-289 tag.
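The linked repository contains the actual reproduction; the sketch below only illustrates the shape of such a setup (stack and function names and the counts are invented, loosely following the 3-stacks-of-25 scenario described in the fix PR above). Deploying all stacks in parallel, for example with `cdk deploy --all --concurrency 3`, is what triggers the throttling.

```ts
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';

// Each stack creates many functions with log retention enabled, so many
// LogRetention custom resources call CloudWatch Logs around the same time.
class ManyFunctionsStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);
    for (let i = 0; i < 25; i++) {
      new lambda.Function(this, `Fn${i}`, {
        runtime: lambda.Runtime.NODEJS_18_X,
        handler: 'index.handler',
        code: lambda.Code.fromInline('exports.handler = async () => {};'),
        logRetention: logs.RetentionDays.ONE_WEEK,
        logRetentionRetryOptions: { maxRetries: 50 }, // even a high value did not help
      });
    }
  }
}

const app = new cdk.App();
for (let s = 0; s < 3; s++) {
  new ManyFunctionsStack(app, `ReproStack${s}`);
}
```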
Possible Solution
Revert the retry mechanism as implemented in CDK 2.89.0 or troubleshoot why the new retry mechanism doesn't work properly.
Additional Information/Context
I'm actually a bit surprised and disappointed that this issue popped up again. I fixed this issue in CDK myself 3 years ago, and now we have a regression that causes it to fail again.
CDK CLI Version
2.90.0
Framework Version
No response
Node.js Version
18
OS
OSX
Language
TypeScript
Language Version
No response
Other information
Related issues: