
(cdk-assets): Parallel zipping same asset with cdk deploy --concurrency 2 has chance to create and permanently upload corrupted asset to S3 #23290

Closed
vaclavbarta opened this issue Dec 9, 2022 · 12 comments · Fixed by #23677 or #24026
Assignees
Labels
@aws-cdk/aws-logs Related to Amazon CloudWatch Logs bug This issue is a bug. effort/small Small work item – less than a day of effort p1

Comments

@vaclavbarta

vaclavbarta commented Dec 9, 2022

Describe the bug

I suspect there is a possible race condition in the cdk bundling code: it does not correctly handle parallel compression of the same asset at the same time (both workers bundle into the same file).

https://github.com/aws/aws-cdk/blame/2ed006e50b15dfca96395d442ccee648abdbb374/packages/cdk-assets/lib/private/archive.ts#L11-L17; the problem seems to be line 14.
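
For context, the pattern there is roughly the following (a paraphrased sketch, not the exact source):

```ts
// Paraphrased sketch of the racy pattern (not the exact cdk-assets source).
// Every publisher zipping the same asset derives the SAME temporary path:
function temporaryOutputFileFor(outputFile: string): string {
  return `${outputFile}._tmp`; // constant suffix -> concurrent workers collide here
}
// Worker A and worker B both write to <asset>.zip._tmp and then both rename it
// to <asset>.zip; one rename either fails with ENOENT (the file is already gone)
// or moves a half-written archive into place, which then gets uploaded to S3.
```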

I suggest adding a unique/randomized postfix to the temporaryOutputFile name to prevent the race condition when cdk deploy --concurrency 2 (or a higher concurrency value) is used.

At the same time, it would be appropriate to somehow ensure that such an error does not permanently break the app, or even other installations of other apps across the whole account, in the future (see Possible Solution).

Have a nice day ;)

Expected Behavior

Parallel zipping with the --concurrency 2 option should not produce corrupted assets, should not upload them to the asset bucket, and should be able to recover from this error in the future. More attention to the atomicity and independence of operations during parallel processing would be good.

Current Behavior

Yesterday we deployed our aws-cdk project with GitHub pipelines; the flow is:

cdk synth
cdk deploy --all --concurrency 2 --app 'cdk.out/assembly-stage-prod/'

The project also uses the Stage construct, and every stage contains two independent Stacks with API Gateway and NodejsFunction constructs using the logRetention prop:

import * as path from 'path';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { RetentionDays } from 'aws-cdk-lib/aws-logs';

const func = new NodejsFunction(this, 'func', {
  runtime: lambda.Runtime.NODEJS_16_X,
  entry: path.join(__dirname, 'src/lambda.ts'),
  handler: 'handler',
  logRetention: RetentionDays.ONE_MONTH,
});

After a code change that adds logRetention, our pipeline failed with the error below and hung for about 30 minutes:

[0%] start: Publishing 624c688f90a2958730a01ea851f96aeebfec60adb98217e2590973a0613788d9:***-eu-central-1
[20%] fail: ENOENT: no such file or directory, rename '/home/runner/work/REDACTED/cdk.out/assembly-REDACTED/.cache/eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip._tmp' -> '/home/runner/work/REDACTED/cdk.out/assembly-REDACTED/.cache/eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip'
[33%] success: Published eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8:***-eu-central-1

...
[100%] success: Published 749868b79ce52eeb1e213bea4612695d0c6d34aaec83a4a62c1e367ea85e9f83:***-eu-central-1

 ❌  stage-prod/ApiGwEndpoints (stage-prod-ApiGwEndpoints) failed: Error: Failed to publish one or more assets. See the error messages above for more information.
    at publishAssets (/home/runner/work/REDACTED/node_modules/aws-cdk/lib/util/asset-publishing.ts:60:11)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at CloudFormationDeployments.publishStackAssets (/home/runner/work/REDACTED/node_modules/aws-cdk/lib/api/cloudformation-deployments.ts:572:7)
    at CloudFormationDeployments.deployStack (/home/runner/work/REDACTED/node_modules/aws-cdk/lib/api/cloudformation-deployments.ts:419:7)
    at deployStack2 (/home/runner/work/REDACTED/node_modules/aws-cdk/lib/cdk-toolkit.ts:264:24)
    at /home/runner/work/REDACTED/node_modules/aws-cdk/lib/deploy.ts:39:11
    at run (/home/runner/work/REDACTED/node_modules/p-queue/dist/index.js:163:29)
    stage-prod-ApiGwEndpoints: creating CloudFormation changeset...
...

CloudFormation then waited "indefinitely" on this event for more than half an hour (every *LogRetention resource got stuck in CREATE_IN_PROGRESS):

stage-prod-ApiGwEndpoints |  5/11 | 2:17:09 PM | CREATE_IN_PROGRESS   | Custom::LogRetention           | stage-prod/ApiGwEndpoints/func/LogRetention (funcLogRetention4D3714B1) 
5/11 Currently in progress: stage-prod-ApiGwEndpoints, funcLogRetention4D3714B1

I tried to cancel the CloudFormation deploy manually, but this also hung for about an hour in UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS; after that crazy time we finally reached the UPDATE_ROLLBACK_COMPLETE state. Every deploy before this one had worked fine.

I repeated this with the same stage and with another stage (prod2) in the same account. The ENOENT file error did not occur again, but the project always hangs with the same behavior (always failing at funcLogRetention4D3714B1, then timeout and rollback). Pipeline vs. manual deploy doesn't matter, and IAM seems correct.

Deploying to another account and another stage environment (from another cdk.out, built locally or in the pipeline), with or without --concurrency 2, always works, and the application deploys successfully there within minutes.

But this one prod account has a different deploy/asset history and fails.

Conclusion:
After forensic analysis I suspect the problem persists in the account itself (other AWS accounts running the same code of this project work flawlessly and deploy successfully in minutes). A complete uninstall/reinstall of the app also doesn't help.

Today I analyzed the asset eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip from several accounts and found files with the same name and the same size as in the error above:

stat *.zip
  File: asset_corrupted__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
  Size: 1643      	Blocks: 8          IO Block: 4096   regular file
Device: fd02h/64770d	Inode: 11134444    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/ username)   Gid: ( 1000/ username)
Access: 2022-12-09 00:16:26.439200220 +0100
Modify: 2022-12-08 23:39:08.405444014 +0100
Change: 2022-12-09 00:16:08.659185874 +0100
 Birth: -
  File: asset_good__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
  Size: 1643      	Blocks: 8          IO Block: 4096   regular file
Device: fd02h/64770d	Inode: 11132877    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/ username)   Gid: ( 1000/ username)
Access: 2022-12-09 00:16:26.439200220 +0100
Modify: 2022-12-08 23:41:38.541560741 +0100
Change: 2022-12-09 00:16:18.171193552 +0100
 Birth: -

But the checksums are different

md5sum asset_*.zip
4262ceb38e4a978ac30eb305c9b495ce  asset_corrupted__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
5fd153d5c1cd5334cd81991054b2603c  asset_good__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip

Both can be listed with unzip -l:

unzip -l asset_corrupted__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
Archive:  asset_corrupted__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      128  01-01-1980 00:00   index.d.ts
     4194  01-01-1980 00:00   index.js
---------                     -------
     4322                     2 files

unzip -l asset_good__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
Archive:  asset_good__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      128  01-01-1980 00:00   index.d.ts
     4194  01-01-1980 00:00   index.js
---------                     -------
     4322                     2 files

The asset_corrupted file contains extractable content only for index.d.ts; the data for index.js is missing and unzip fails on it (inside the zip it is mostly zero bytes).

asset_good__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
asset_corrupted__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip

After hours of analysis I found that cdk accidentally created and uploaded a corrupted asset for the CustomResource lambda used for handling LogRetention. The asset zip is listable but cannot be unzipped (it contains zero data), so the CloudFormation deploy hangs without any meaningful error. cdk also never repairs this file in the cdk-hnb659fds-assets-* S3 bucket, because the key name is the same for all stages and environments.

I suspect a race condition in the cdk bundling code; searching the GitHub repo for _tmp, I found only this suspicious code, which does not seem to handle parallel compression of the same asset at the same time (bundling into/from the same file):

https://github.com/aws/aws-cdk/blame/2ed006e50b15dfca96395d442ccee648abdbb374/packages/cdk-assets/lib/private/archive.ts#L14

Reproduction Steps

I can't reproduce it again right now (maybe only timing matters). I created a repo with a code example that fails most of the time: https://github.com/littlebart/cdk-issue-23290

An account with a corrupted asset stays broken until the code of the LogRetention CustomResource is changed, the bundling process changes its file name or zip content (if the process is deterministic this might take a very long time), or the object is deleted manually from the S3 cdk-assets bucket.

Possible Solution

I suggest adding a unique/randomized postfix to the temporaryOutputFile name (instead of the constant ._tmp) to prevent the race condition when cdk deploy --concurrency 2 or a higher concurrency value is used; see the sketch below.
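
A minimal sketch of the idea; zipDirectorySafely and writeZipFile are illustrative names, not the actual cdk-assets code:

```ts
import { randomBytes } from 'crypto';
import { promises as fs } from 'fs';

// Sketch only: give every invocation its own temp file, then rename into place.
// `writeZipFile` stands in for whatever actually produces the archive.
async function zipDirectorySafely(
  writeZipFile: (tempPath: string) => Promise<void>,
  outputFile: string,
): Promise<void> {
  // Unique suffix per call, so two concurrent publishers of the same asset
  // never write to (or rename) the same temporary file.
  const temporaryOutputFile = `${outputFile}.${randomBytes(8).toString('hex')}._tmp`;
  await writeZipFile(temporaryOutputFile);
  await fs.rename(temporaryOutputFile, outputFile);
}
```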

At the same time, it would be appropriate to somehow ensure that such an error cannot permanently break the app, or any other installation of other apps, across the whole account in the future. Maybe a self-recovery mechanism based on comparing a content checksum (e.g. against x-amz-meta-checksum metadata) could help prevent this kind of unrecoverable error; a sketch follows.
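
A rough sketch of what such a check could look like, assuming (hypothetically) that a checksum were written to the object's metadata at upload time; this is not how cdk-assets works today:

```ts
import { createHash } from 'crypto';
import { readFileSync } from 'fs';
import { S3Client, HeadObjectCommand } from '@aws-sdk/client-s3';

// Self-recovery idea (hypothetical): before treating an asset as "already
// published", compare the checksum stored in custom object metadata
// (surfaced as x-amz-meta-checksum) with the local zip, and re-upload on mismatch.
async function assetNeedsReupload(
  s3: S3Client,
  bucket: string,
  key: string,
  localZipPath: string,
): Promise<boolean> {
  const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
  const remoteChecksum = head.Metadata?.['checksum']; // user metadata key, without the x-amz-meta- prefix
  const localChecksum = createHash('sha256').update(readFileSync(localZipPath)).digest('hex');
  return remoteChecksum === undefined || remoteChecksum !== localChecksum;
}
```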

Additional Information/Context

No response

CDK CLI Version

2.53.0

Framework Version

No response

Node.js Version

16.18.1

OS

Linux Ubuntu

Language

Typescript

Language Version

Typescript 4.9.3

Other information

No response

@vaclavbarta vaclavbarta added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Dec 9, 2022
@github-actions github-actions bot added the @aws-cdk/aws-logs Related to Amazon CloudWatch Logs label Dec 9, 2022
@vaclavbarta vaclavbarta changed the title (cdk-assets): Parallel zipping same asset with cdk deploy --concurrency 2 has chance to create and permamently upload corrupted asset to S3 (cdk-assets): Parallel zipping same asset with cdk deploy --concurrency 2 has chance to create and permanently upload corrupted asset to S3 Dec 9, 2022
@vaclavbarta
Author

I think this bug is a problem in the core cdk asset bundling functionality, not just in aws-logs or logRetention, but shared assets are much more likely to be affected by it.

@github-actions

github-actions bot commented Dec 9, 2022

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@vaclavbarta
Author

Oh no @comcalvi, this issue was closed automatically because we closed our internal fix which mentioned it. I didn't know this is tied together automatically through GitHub. I will reopen this issue.

@vaclavbarta vaclavbarta reopened this Dec 9, 2022
@peterwoodworth
Contributor

I'm not able to reproduce this @vaclavbarta, but maybe I'm not trying to reproduce it quite right. I think a GitHub repo to demonstrate exactly what you are trying to do would be helpful in getting to the bottom of this 🙂

@peterwoodworth peterwoodworth added p1 response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. effort/small Small work item – less than a day of effort needs-reproduction This issue needs reproduction. and removed needs-triage This issue or PR still needs to be triaged. labels Dec 13, 2022
@vaclavbarta
Author

@peterwoodworth I created a POC repository with a readme: https://github.com/littlebart/cdk-issue-23290

@vaclavbarta
Copy link
Author

@peterwoodworth I have created a POC repository; see the readme: https://github.com/littlebart/cdk-issue-23290

Maybe this will help; it takes 1-3 tries to hit the problem. Today I can't reproduce the corrupted file locally, but nevertheless half of the installs failed 💥

It is a tricky bug 😄

Have a nice day.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 14, 2022
@vaclavbarta
Author

Is there anything you need from me? Was the example proof of concept OK for reproducing it? Can I help with something? 🙌

We wanted to use concurrent deploys for our project to reduce time spent waiting for the pipeline, which in our case is about 40% faster, but we sadly had to fall back to sequential deploys due to this bug (the risk is higher than the benefit of the faster deployment).

I know that there are higher-priority bugs.

Happy new year 🙂

@rscharer
Contributor

I just wanted to chime in and say this is killing us as well — we've got a CI/CD pipeline that uses the --concurrency flag and it fails about 20% of the time on zipfile corruption.

@vaclavbarta
Author

I just wanted to chime in and say this is killing us as well — we've got a CI/CD pipeline that uses the --concurrency flag and it fails about 20% of the time on zipfile corruption.

Thanks for confirmation @rscharer

@rscharer
Contributor

rscharer commented Jan 13, 2023

I filed a very simple proposed fix here: #23677, but it was rejected by the linter. Good luck writing a test for a race condition 😢

@vaclavbarta
Author

@rscharer unfortunately, I also don't have any clue how to write tests for this 🧐 #23677. The fix you wrote makes sense. Maybe @tomas-mazak could help us?

@mergify mergify bot closed this as completed in #23677 Jan 18, 2023
mergify bot pushed a commit that referenced this issue Jan 18, 2023
#23677)

Resolves #23290 

A very simple fix for the issue where builds with `--concurrency` specified can lead to corrupt archives. Rather than use the outputFile as the basis for the temp file name we simply use a random UUID.

Please note that I was unable to run the integration tests in this instance, which are likely necessary given that this change impacts the behavior of the archiver.

----

### All Submissions:

* [ X] Have you followed the guidelines in our [Contributing guide?](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md)

### Adding new Construct Runtime Dependencies:

* [ ] This PR adds new construct runtime dependencies following the process described [here](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md/#adding-construct-runtime-dependencies)

### New Features

* [ ] Have you added the new feature to an [integration test](https://github.com/aws/aws-cdk/blob/main/INTEGRATION_TESTS.md)?
	* [ ] Did you use `yarn integ` to deploy the infrastructure and generate the snapshot (i.e. `yarn integ` without `--dry-run`)?

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

rix0rrr pushed a commit that referenced this issue Feb 6, 2023
#23677)

Resolves #23290

A very simple fix for the issue where builds with `--concurrency` specified can lead to corrupt archives. Rather than use the outputFile as the basis for the temp file name we simply use a random UUID.

Please note that I was unable to run the integration tests in this instance, which are likely necessary given that this change impacts the behavior of the archiver.
mergify bot pushed a commit that referenced this issue Feb 6, 2023
This is a re-roll of #23677 which was reverted in #23994 because the `randomUUID()` function from the original solution was not available in Node versions below 14.17 (and we advertise compatibility with Node 14.*).

We didn't actually need a UUID, just any random string, so replace it with a function that generates a random string in a different way.

----------

Resolves #23290

A very simple fix for the issue where builds with `--concurrency` specified can lead to corrupt archives. Rather than use the outputFile as the basis for the temp file name we simply use a random string.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
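
For illustration, such a random suffix can be generated without crypto.randomUUID() (a sketch, not necessarily the exact helper that landed in the re-roll):

```ts
import * as crypto from 'crypto';

// crypto.randomBytes() is available on all supported Node versions, unlike
// crypto.randomUUID(), which requires Node >= 14.17.
function randomTempSuffix(bytes = 8): string {
  return crypto.randomBytes(bytes).toString('hex'); // e.g. 'f3a91c07d2b456e8'
}
```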
@pahud pahud removed the needs-reproduction This issue needs reproduction. label Aug 12, 2024