(cdk-assets): Parallel zipping same asset with `cdk deploy --concurrency 2` has chance to create and permanently upload corrupted asset to S3 #23290
Comments
I think that this bug is a problem in the core CDK asset bundling functionality, not just in aws-logs or logRetention, but shared assets are much more likely to have this happen.

Oh no @comcalvi, this issue was closed automatically because we closed our internal fix which mentioned it. I didn't know that this is tied up automatically through GitHub. I will reopen this issue.
I'm not able to reproduce this @vaclavbarta, but maybe I'm not trying to reproduce it quite right. I think a GitHub repo to demonstrate exactly what you are trying to do would be helpful in getting to the bottom of this 🙂

@peterwoodworth I created a POC repository with a readme: https://github.com/littlebart/cdk-issue-23290

Maybe this will help; it takes 1-3 tries to hit the problem. But today I can't reproduce the corrupted file locally. Nevertheless, half of the installs failed 💥 It is a tricky bug 😄 Have a nice day.

Is there anything you need from me? Was the example proof of concept OK for reproducing it? Can I help with something? 🙌 We wanted to use concurrent deploys for our project to cut the time spent waiting for the pipeline, which in our scenario is about 40% faster, but we sadly have to fall back to sequential deploys due to this bug (the risk is higher than the benefit from the faster deployment). I know that there are higher priority bugs. Happy new year 🙂
I just wanted to chime in and say this is killing us as well — we've got a CI/CD pipeline that uses the --concurrency flag and it fails about 20% of the time on zipfile corruption.

Thanks for the confirmation @rscharer

I filed a very simple proposed fix here: #23677, but it was rejected by the linter. Good luck writing a test for a race condition 😢

@rscharer unfortunately, I also don't have any clue how to write tests for this 🧐 #23677 The fix you wrote makes sense. Maybe @tomas-mazak could help us?
#23677: Resolves #23290. A very simple fix for the issue where builds with `--concurrency` specified can lead to corrupt archives. Rather than use the outputFile as the basis for the temp file name we simply use a random UUID. Please note that I was unable to run the integration tests in this instance, which are likely necessary given that this change impacts the behavior of the archiver.

### All Submissions:

* [x] Have you followed the guidelines in our [Contributing guide?](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md)

### Adding new Construct Runtime Dependencies:

* [ ] This PR adds new construct runtime dependencies following the process described [here](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md/#adding-construct-runtime-dependencies)

### New Features

* [ ] Have you added the new feature to an [integration test](https://github.com/aws/aws-cdk/blob/main/INTEGRATION_TESTS.md)?
* [ ] Did you use `yarn integ` to deploy the infrastructure and generate the snapshot (i.e. `yarn integ` without `--dry-run`)?

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
This is a re-roll of #23677, which was reverted in #23994 because the `randomUUID()` function from the original solution was not available in Node versions below 14.17 (and we advertise compatibility with Node 14.*). We didn't actually need a UUID, just any random string, so replace it with a function that generates a random string in a different way.

Resolves #23290. A very simple fix for the issue where builds with `--concurrency` specified can lead to corrupt archives. Rather than use the outputFile as the basis for the temp file name we simply use a random string.

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
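For illustration, a minimal sketch of the kind of Node-14-compatible helper the re-roll describes; the names here are assumptions, not the code that was actually merged:

```ts
// Sketch only: a random temp-file suffix that works on Node < 14.17, where
// crypto.randomUUID() is unavailable. Names are hypothetical.
import { randomBytes } from 'crypto';

function randomTempSuffix(bytes = 8): string {
  // crypto.randomBytes is available on all Node 14.x releases.
  return randomBytes(bytes).toString('hex');
}

// Instead of the shared `${outputFile}._tmp`, each writer gets its own temp file:
const temporaryOutputFile = (outputFile: string) =>
  `${outputFile}.${randomTempSuffix()}._tmp`;
```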
Describe the bug
I suspect that there is a possible race condition in the CDK bundling code: it does not correctly handle parallel compression of the same asset at the same time (bundling into the same file).

https://github.com/aws/aws-cdk/blame/2ed006e50b15dfca96395d442ccee648abdbb374/packages/cdk-assets/lib/private/archive.ts#L11-L17 (the problem seems to be line 14)
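Roughly, the pattern being described boils down to something like this (a paraphrased sketch, not the actual cdk-assets source):

```ts
// Paraphrased sketch of the suspected race, not the actual cdk-assets code:
// two concurrent deploys bundling the same asset both write to the same
// `${outputFile}._tmp` path, so one process can end up renaming a
// half-written zip into the final asset location.
import { promises as fs } from 'fs';

declare function writeZipFile(directory: string, outputFile: string): Promise<void>; // hypothetical helper

async function zipDirectorySketch(directory: string, outputFile: string): Promise<void> {
  const temporaryOutputFile = `${outputFile}._tmp`; // constant suffix, shared by every writer
  await writeZipFile(directory, temporaryOutputFile);
  await fs.rename(temporaryOutputFile, outputFile); // may move a file another process is still writing
}
```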
I can suggest some unique/randomized postfix for the temporaryOutputFile name to prevent the race condition when used with `cdk deploy --concurrency 2 ...` or a higher concurrency value. At the same time, it would be appropriate to somehow ensure that such an error does not permanently break the app, or even any other installations of other apps inside the whole account, in the future (see possible solutions).
Have a nice day ;)
Expected Behavior
Parallel zipping with the `--concurrency 2` option should not produce corrupted assets, should not upload them into the asset bucket, and should recover from this error in the future. More attention to the atomicity and independence of the operations during parallel processing would be good.

Current Behavior
Yesterday we deployed our aws-cdk project with GitHub pipelines; the flow is:

cdk synth
cdk deploy --all --concurrency 2 --app 'cdk.out/assembly-stage-prod/'

The project also uses the Stage construct, and every stage contains two independent Stacks with API Gateway and NodejsFunction constructs with logRetention props.

After a code change which added logRetention, our pipeline failed with this error and hung for about 30 minutes.

CloudFormation waited "indefinitely" (more than half an hour) on this event; every *LogRetention resource got stuck in CREATE_IN_PROGRESS.

I tried to cancel the CloudFormation deploy manually, but this also hung for maybe 1 hour in UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS; after this crazy time we finally reached the UPDATE_ROLLBACK_COMPLETE state. Every deploy before worked fine.

I repeated this with the same stage and another stage (prod2) in the same account. The ENOENT file error did not occur again, but the project always hangs with the same behavior (always failing at funcLogRetention4D3714B1, timeout, ROLLBACK). Pipeline or manual deploy doesn't matter, and IAM seems correct.

Deploying into another account or another stage environment (from another cdk.out built locally or from the pipeline), with or without `--concurrency 2`, always works, and after a few minutes we deploy our application successfully there. But this one prod account has a different deploy/asset history and fails.
Conclusion:
After forensic analysis I suspect that the problem is persisted in the account (other AWS accounts with the same code of this project work flawlessly and deploy successfully in minutes). A complete uninstall/reinstall of the app also doesn't help.

Today I analyzed the asset eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip from more accounts and found that the files have the same size and the same name as in the error above, but the checksums are different.

Both work with `unzip -l`. The file asset_corrupted contains content only for index.d.ts (which can be unzipped), but index.js is missing and unzip fails on it (inside the zip the data is mostly zero bytes):

asset_good__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
asset_corrupted__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
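(Side note: `unzip -l` only reads the archive's central directory, so a listing can succeed even when the entry data is corrupt. A check that actually decompresses every entry could look like this hypothetical helper, assuming the `unzip` CLI is available.)

```ts
// Hypothetical helper: `unzip -t` decompresses and CRC-checks every entry,
// which catches zero-filled data that a plain `unzip -l` listing misses.
import { execFileSync } from 'child_process';

function isZipIntact(zipPath: string): boolean {
  try {
    execFileSync('unzip', ['-t', zipPath], { stdio: 'pipe' });
    return true;  // every entry extracted and passed its CRC check
  } catch {
    return false; // corrupted entry or unreadable archive
  }
}
```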
After hours of analysis I found that the CDK accidentally created and uploaded a corrupted asset for the CustomResource lambda used for handling LogRetention. The asset zip is listable but cannot be unzipped (it contains zero data), so the CloudFormation deploy hangs without any meaningful error. The CDK also never repairs this file in the cdk-hnb659fds-assets-* S3 bucket, because it has the same key name for all stages and environments.

I suspect a race condition in some CDK bundling code, and by searching the GitHub repo for occurrences of _tmp I found only this suspicious code, which seems not to handle parallel compression of the same asset at the same time (bundling into/from the same file):

https://github.com/aws/aws-cdk/blame/2ed006e50b15dfca96395d442ccee648abdbb374/packages/cdk-assets/lib/private/archive.ts#L14
Reproduction Steps
I can't reproduce it again right now, but maybe only time matters. I created a repo with a code example which fails most of the time: https://github.com/littlebart/cdk-issue-23290

An account with a corrupted asset stays broken until the code of the LogRetention CustomResource is changed, the bundling process changes its filename or zip output (if the process is deterministic this might last a very long time), or the asset is deleted manually from the S3 cdk-assets bucket.
Possible Solution
I can suggest some unique/randomized postfix for the temporaryOutputFile name (not the constant ._tmp) to prevent the race condition when used with `cdk deploy --concurrency 2 ...` or a higher concurrency value.

At the same time, it would be appropriate to somehow ensure that such an error does not permanently break the app, or even any other installations of other apps inside the whole account, in the future. Maybe some self-recovery mechanism using a content checksum comparison, e.g. against x-amz-meta-checksum metadata, could help prevent an unrecoverable error of this type; a rough sketch of that idea is shown below.
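For illustration only, a minimal sketch of the self-recovery idea above, assuming the uploader stores a SHA-256 checksum in the object's user metadata (the bucket, key, and metadata key names here are hypothetical, not part of the actual cdk-assets uploader):

```ts
// Hypothetical self-recovery sketch: compare a checksum stored in the S3
// object's user metadata against the checksum of the freshly built zip and
// re-upload when they differ. Bucket/key/metadata names are assumptions.
import { createHash } from 'crypto';
import { readFile } from 'fs/promises';
import { S3Client, HeadObjectCommand, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

async function ensureAssetMatches(bucket: string, key: string, localZipPath: string): Promise<void> {
  const body = await readFile(localZipPath);
  const localSha256 = createHash('sha256').update(body).digest('hex');

  let remoteSha256: string | undefined;
  try {
    const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
    remoteSha256 = head.Metadata?.checksum; // surfaced from x-amz-meta-checksum
  } catch {
    // Object not found (or not readable): fall through and upload.
  }

  if (remoteSha256 !== localSha256) {
    // Missing or corrupted asset: overwrite it with the known-good local zip.
    await s3.send(new PutObjectCommand({
      Bucket: bucket,
      Key: key,
      Body: body,
      Metadata: { checksum: localSha256 },
    }));
  }
}
```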
Additional Information/Context
No response
CDK CLI Version
2.53.0
Framework Version
No response
Node.js Version
16.18.1
OS
Linux Ubuntu
Language
Typescript
Language Version
Typescript 4.9.3
Other information
No response