
CDK deploy: Lambda LogRetention resources fail with rate exceeded errors #26837

Closed
jaapvanblaaderen opened this issue Aug 22, 2023 · 11 comments · Fixed by #26858
Assignees
Labels
@aws-cdk/aws-lambda · bug · effort/medium · p0

Comments

@jaapvanblaaderen
Contributor

jaapvanblaaderen commented Aug 22, 2023

Describe the bug

Deploying multiple stacks with a lot of Lambda functions, results in a lot of CloudWatch Log groups being created by CDK. CDK provides retry options through the logRetentionRetryOptions property of a Lambda Function construct to prevent rate exceeded errors from happening.

Until CDK 2.89.0 this worked fine. In CDK 2.90.0 the retry implementation was changed in this PR, and the rate exceeded errors occur again.
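
For reference, the retry options are configured roughly like this (function details and values here are simplified placeholders, not the exact code from our stacks):

```ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';

export class HelloStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new lambda.Function(this, 'Hello', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromInline('exports.handler = async () => "ok";'),
      // Setting a retention period creates a Custom::LogRetention resource
      // that calls the CloudWatch Logs API during deployment.
      logRetention: logs.RetentionDays.ONE_WEEK,
      // Retry options that are supposed to absorb "Rate exceeded" throttling errors.
      logRetentionRetryOptions: {
        maxRetries: 10, // placeholder value
      },
    });
  }
}
```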

Expected Behavior

The retry mechanism should function properly, the same way it did in older versions (CDK 2.89.0 and lower). Stacks should deploy without issues.

Current Behavior

Stacks cannot be deployed with CDK 2.90.0 and up, as they fail with a Rate exceeded error:

LogRetentionRateLimitStack3 |  57/183 | 11:34:35 AM | CREATE_FAILED        | Custom::LogRetention        | hello_11/LogRetention (hello11LogRetention0EC20DD0) Received response status [FAILED] from custom resource. Message returned: Rate exceeded (RequestId: b84c54c4-3053-487a-82b9-48671904d05a)

Reproduction Steps

I created a small test project to reproduce the issue: https://github.com/jaapvanblaaderen/log-retention-rate-limit. With this simple setup, the issue can easily be reproduced by deploying the stacks in parallel, as described in the readme of the repo. I even increased the number of retries to 50, but it still fails.

Testing it against 2.89.0 (which works fine) is possible by checking out the cdk-289 tag.
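
For context, the reproduction boils down to something like the following (names and counts are a sketch; see the repo for the exact setup). Deploying all stacks in parallel then triggers the throttling:

```ts
import { App, Stack } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';

const app = new App();

// Several stacks, each with many functions, so that a parallel deployment
// produces a burst of CloudWatch Logs API calls from the Custom::LogRetention resources.
for (let s = 1; s <= 3; s++) {
  const stack = new Stack(app, `LogRetentionRateLimitStack${s}`);
  for (let f = 0; f < 25; f++) {
    new lambda.Function(stack, `hello_${f}`, {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromInline('exports.handler = async () => "ok";'),
      logRetention: logs.RetentionDays.ONE_WEEK,
      logRetentionRetryOptions: { maxRetries: 50 }, // even 50 retries did not help
    });
  }
}
```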

Possible Solution

Revert to the retry mechanism as implemented in CDK 2.89.0, or troubleshoot why the new retry mechanism doesn't work properly.

Additional Information/Context

I'm actually a bit surprised and disappointed that this issue popped up again. I fixed this issue in CDK myself three years ago, and now we have a regression that causes it to fail again.

CDK CLI Version

2.90.0

Framework Version

No response

Node.js Version

18

OS

OSX

Language

Typescript

Language Version

No response

Other information

Related issues:

@jaapvanblaaderen added the bug and needs-triage labels Aug 22, 2023
@github-actions bot added the @aws-cdk/aws-lambda label Aug 22, 2023
@mrgrain self-assigned this Aug 22, 2023
@mrgrain
Contributor

mrgrain commented Aug 22, 2023

Thanks @jaapvanblaaderen for reporting and apologies for this regression. The retry mechanism changed from SDK v2 to SDK v3, and was supposed to be smarter. But clearly something did not work. Looking into it.

@jaapvanblaaderen
Contributor Author

Thanks @mrgrain, please let me know if I can assist in any way.

@pahud added the p2 and effort/medium labels and removed the needs-triage label Aug 22, 2023
@peterwoodworth added the p0 label and removed the p2 label Aug 22, 2023
@mrgrain
Contributor

mrgrain commented Aug 24, 2023

With the patched version from my PR, your reproduction example now succeeds! 🥳

@jaapvanblaaderen
Contributor Author

That sounds great!

@mergify bot closed this as completed in #26858 Aug 24, 2023
mergify bot pushed a commit that referenced this issue Aug 24, 2023

The LogRetention Custom Resource used to be able to handle the server-side throttling that occurs when a lot of requests to the CloudWatch Logs service are made at the same time.
Handling of this error case was lost during the migration to SDK v3.

If we have many (read: a lot of) `LogRetention` Custom Resources in a _single_ Stack, CloudFormation apparently applies some internal brakes to the amount of parallelism. For example, it appears that resources are batched in smaller groups that need to be completed before the next group is provisioned, and within the groups there appears to be an ever so slight delay between individual resources. Together this is enough to avoid rate limiting in most circumstances.

**Therefore, in practice this issue only occurs when multiple stacks are deployed in parallel.**

To test this scenario, I have added support for `integ-runner` to deploy all stacks of a test case concurrently.
Support for arbitrary command args already existed, but the `concurrency` option needed to be explicitly included.

I then created an integration test that deploys 3 stacks with 25 LogRetention resources each.
This triggers the error cases described in #26837.

The fix itself is twofold (a rough sketch follows below):
- Pass the `maxRetries` prop value to the SDK client to increase the number of attempts of the SDK internal throttling handling. But also enforce a minimum for these retries since they might catch additional retryable failures that our custom outer loop does not account for.
- Explicitly catch `ThrottlingException` errors in the outer retry loop.
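
Not the actual implementation, but the shape of the change is roughly as follows (client setup, error handling, and backoff are simplified assumptions here):

```ts
import { CloudWatchLogsClient, PutRetentionPolicyCommand } from '@aws-sdk/client-cloudwatch-logs';

// Simplified sketch of the retry behaviour; the real code lives in the
// LogRetention custom resource provider.
async function setRetentionWithRetries(logGroupName: string, retentionInDays: number, maxRetries = 5) {
  // Pass the retry budget down to the SDK, with a floor so its built-in
  // throttling handling gets enough attempts of its own.
  const client = new CloudWatchLogsClient({ maxAttempts: Math.max(maxRetries, 5) });

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      await client.send(new PutRetentionPolicyCommand({ logGroupName, retentionInDays }));
      return;
    } catch (error: any) {
      // Explicitly treat server-side throttling as retryable in the outer loop as well.
      if (error.name === 'ThrottlingException' && attempt < maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** attempt)); // simple backoff
        continue;
      }
      throw error;
    }
  }
}
```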

Closes #26837

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*

@nikolay-radkov

The issue was resolved by updating to 2.93.0 on our side, but it has started to occur again without changing the version. Is it reproducible again on your side @jaapvanblaaderen? We have the same setup as described in the issue.

@mrgrain
Contributor

mrgrain commented Aug 30, 2023

Thanks @nikolay-radkov for the report! Have you tried setting a higher number for maxRetries? The default might well be too conservative.

@mrgrain
Contributor

mrgrain commented Sep 1, 2023

@jaapvanblaaderen Did you ever get around to confirming if this has been resolved for you?

Never mind, this still needs to be released. The release of the fix should be imminent.

@wwjhu

wwjhu commented Sep 3, 2023

I can confirm that after upgrading to 2.94, the test project can be deployed without rate limit errors. We will upgrade our services to 2.94 in the coming weeks, but so far so good. Nice job!

@mrgrain
Contributor

mrgrain commented Sep 4, 2023

> I can confirm that after upgrading to 2.94, the test project can be deployed without rate limit errors. We will upgrade our services to 2.94 in the coming weeks, but so far so good. Nice job!

Thanks, but it should have worked in the first place!
Once you have upgraded your services, could you please let me know if it works with a real-world project and whether you have to customize maxRetries? We are very open to adjusting the default number of retries to one that works for most.

@wwjhu

wwjhu commented Sep 4, 2023

> Once you have upgraded your services, could you please let me know if it works with a real-world project and whether you have to customize maxRetries? We are very open to adjusting the default number of retries to one that works for most.

Sure, will keep you posted.
