(cdk-assets): Parallel zipping same asset with `cdk deploy --concurrency 2` has chance to create and permanently upload corrupted asset to S3 #23290
Comments
I think that this bug is a problem in the core CDK asset bundling functionality, not just in aws-logs or logRetention, but shared assets are much more likely to have this happen.

Oh no @comcalvi, this issue was closed automatically because we closed our internal fix which mentioned it. I didn't know that this is tied up automatically through GitHub. I will reopen this issue.
I'm not able to reproduce this @vaclavbarta, but maybe I'm not trying to reproduce it quite right. I think a GitHub repo to demonstrate exactly what you are trying to do would be helpful in getting to the bottom of this 🙂

@peterwoodworth I created a POC repository with a readme: https://github.com/littlebart/cdk-issue-23290

Maybe this will help; it takes 1-3 tries to hit the problem. But today I can't reproduce the corrupted file locally. Nevertheless, half of the installs failed 💥 It is a tricky bug 😄 Have a nice day.

Is there anything you need from me? Was the example proof of concept OK for reproducing it? Can I help with something? 🙌 We wanted to use concurrent deploys for our project to cut the time spent waiting for the pipeline, which in our scenario is about 40% faster, but we sadly have to fall back to sequential deploys due to this bug (the risk is higher than the benefit from the faster deployment). I know that there are higher priority bugs. Happy new year 🙂
I just wanted to chime in and say this is killing us as well — we've got a CI/CD pipeline that uses the --concurrency flag and it fails about 20% of the time on zipfile corruption.

Thanks for the confirmation @rscharer

I filed a very simple proposed fix here: #23677, but it was rejected by the linter. Good luck writing a test for a race condition 😢

@rscharer unfortunately, I also don't have any clue how to write tests for this 🧐 #23677 The fix you wrote makes sense. Maybe @tomas-mazak could help us?
#23677: Resolves #23290. A very simple fix for the issue where builds with `--concurrency` specified can lead to corrupt archives. Rather than use the outputFile as the basis for the temp file name we simply use a random UUID. Please note that I was unable to run the integration tests in this instance, which are likely necessary given that this change impacts the behavior of the archiver.

### All Submissions:

* [x] Have you followed the guidelines in our [Contributing guide?](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md)

### Adding new Construct Runtime Dependencies:

* [ ] This PR adds new construct runtime dependencies following the process described [here](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md/#adding-construct-runtime-dependencies)

### New Features

* [ ] Have you added the new feature to an [integration test](https://github.com/aws/aws-cdk/blob/main/INTEGRATION_TESTS.md)?
* [ ] Did you use `yarn integ` to deploy the infrastructure and generate the snapshot (i.e. `yarn integ` without `--dry-run`)?

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
This is a re-roll of #23677, which was reverted in #23994 because the `randomUUID()` function from the original solution was not available in Node versions below 14.17 (and we advertise compatibility with Node 14.*). We didn't actually need a UUID, just any random string, so replace it with a function that generates a random string in a different way.

Resolves #23290. A very simple fix for the issue where builds with `--concurrency` specified can lead to corrupt archives. Rather than use the outputFile as the basis for the temp file name we simply use a random string.

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
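For illustration, a minimal sketch of the kind of Node-14-compatible helper the re-roll describes; the names here are assumptions, not the code that was actually merged:

```ts
// Sketch only: a random temp-file suffix that works on Node < 14.17, where
// crypto.randomUUID() is unavailable. Names are hypothetical.
import { randomBytes } from 'crypto';

function randomTempSuffix(bytes = 8): string {
  // crypto.randomBytes is available on all Node 14.x releases.
  return randomBytes(bytes).toString('hex');
}

// Instead of the shared `${outputFile}._tmp`, each writer gets its own temp file:
const temporaryOutputFile = (outputFile: string) =>
  `${outputFile}.${randomTempSuffix()}._tmp`;
```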
Describe the bug
I suspect that there is a possible race condition in the CDK bundling code: it does not correctly handle parallel compression of the same asset at the same time (bundling into the same file).

https://github.com/aws/aws-cdk/blame/2ed006e50b15dfca96395d442ccee648abdbb374/packages/cdk-assets/lib/private/archive.ts#L11-L17 (the problem seems to be line 14)
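Roughly, the pattern being described boils down to something like this (a paraphrased sketch, not the actual cdk-assets source):

```ts
// Paraphrased sketch of the suspected race, not the actual cdk-assets code:
// two concurrent deploys bundling the same asset both write to the same
// `${outputFile}._tmp` path, so one process can end up renaming a
// half-written zip into the final asset location.
import { promises as fs } from 'fs';

declare function writeZipFile(directory: string, outputFile: string): Promise<void>; // hypothetical helper

async function zipDirectorySketch(directory: string, outputFile: string): Promise<void> {
  const temporaryOutputFile = `${outputFile}._tmp`; // constant suffix, shared by every writer
  await writeZipFile(directory, temporaryOutputFile);
  await fs.rename(temporaryOutputFile, outputFile); // may move a file another process is still writing
}
```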
I can suggest some unique/randomized postfix for the temporaryOutputFile name to prevent the race condition when used with `cdk deploy --concurrency 2 ...` or a higher concurrency value. At the same time, it would be appropriate to somehow ensure that such an error does not permanently break the app, or even any other installations of other apps inside the whole account, in the future (see possible solutions).
Have a nice day ;)
Expected Behavior
Parallel zipping with the `--concurrency 2` option should not produce corrupted assets, should not upload them into the asset bucket, and should recover from this error in the future. More attention to the atomicity and independence of the operations during parallel processing would be good.

Current Behavior
Yesterday we deployed our aws-cdk project with GitHub pipelines; the flow is:

cdk synth
cdk deploy --all --concurrency 2 --app 'cdk.out/assembly-stage-prod/'

The project also uses the Stage construct, and every stage contains two independent Stacks with API Gateway and NodejsFunction constructs with logRetention props.

After a code change which added logRetention, our pipeline failed with this error and hung for about 30 minutes.

CloudFormation waited "indefinitely" (more than half an hour) on this event; every *LogRetention resource got stuck in CREATE_IN_PROGRESS.

I tried to cancel the CloudFormation deploy manually, but this also hung for maybe 1 hour in UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS; after this crazy time we finally reached the UPDATE_ROLLBACK_COMPLETE state. Every deploy before worked fine.

I repeated this with the same stage and another stage (prod2) in the same account. The ENOENT file error did not occur again, but the project always hangs with the same behavior (always failing at funcLogRetention4D3714B1, timeout, ROLLBACK). Pipeline or manual deploy doesn't matter, and IAM seems correct.

Deploying into another account or another stage environment (from another cdk.out built locally or from the pipeline), with or without `--concurrency 2`, always works, and after a few minutes we deploy our application successfully there. But this one prod account has a different deploy/asset history and fails.
Conclusion:
After forensic analysis I suspect that the problem is persisted in the account (other AWS accounts with the same code of this project work flawlessly and deploy successfully in minutes). A complete uninstall/reinstall of the app also doesn't help.

Today I analyzed the asset eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip from more accounts and found that the files have the same size and the same name as in the error above, but the checksums are different.

Both work with `unzip -l`. The file asset_corrupted contains content only for index.d.ts (which can be unzipped), but index.js is missing and unzip fails on it (inside the zip the data is mostly zero bytes):

asset_good__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
asset_corrupted__eb5b005c858404ea0c8f68098ed5dcdf5340e02461f149751d10f59c210d5ef8.zip
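(Side note: `unzip -l` only reads the archive's central directory, so a listing can succeed even when the entry data is corrupt. A check that actually decompresses every entry could look like this hypothetical helper, assuming the `unzip` CLI is available.)

```ts
// Hypothetical helper: `unzip -t` decompresses and CRC-checks every entry,
// which catches zero-filled data that a plain `unzip -l` listing misses.
import { execFileSync } from 'child_process';

function isZipIntact(zipPath: string): boolean {
  try {
    execFileSync('unzip', ['-t', zipPath], { stdio: 'pipe' });
    return true;  // every entry extracted and passed its CRC check
  } catch {
    return false; // corrupted entry or unreadable archive
  }
}
```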
After hours of analysis I found that the CDK accidentally created and uploaded a corrupted asset for the CustomResource lambda used for handling LogRetention. The asset zip is listable but cannot be unzipped (it contains zero data), so the CloudFormation deploy hangs without any meaningful error. The CDK also never repairs this file in the cdk-hnb659fds-assets-* S3 bucket, because it has the same key name for all stages and environments.

I suspect a race condition in some CDK bundling code, and by searching the GitHub repo for occurrences of _tmp I found only this suspicious code, which seems not to handle parallel compression of the same asset at the same time (bundling into/from the same file):

https://github.com/aws/aws-cdk/blame/2ed006e50b15dfca96395d442ccee648abdbb374/packages/cdk-assets/lib/private/archive.ts#L14
Reproduction Steps
I can't reproduce it again right now, but maybe only time matters. I created a repo with a code example which fails most of the time: https://github.com/littlebart/cdk-issue-23290

An account with a corrupted asset stays broken until the code of the LogRetention CustomResource is changed, the bundling process changes its filename or zip output (if the process is deterministic this might last a very long time), or the asset is deleted manually from the S3 cdk-assets bucket.
Possible Solution
I can suggest some unique/randomized postfix for the temporaryOutputFile name (not the constant ._tmp) to prevent the race condition when used with `cdk deploy --concurrency 2 ...` or a higher concurrency value.

At the same time, it would be appropriate to somehow ensure that such an error does not permanently break the app, or even any other installations of other apps inside the whole account, in the future. Maybe some self-recovery mechanism using a content checksum comparison, e.g. against x-amz-meta-checksum metadata, could help prevent an unrecoverable error of this type; a rough sketch of that idea is shown below.
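For illustration only, a minimal sketch of the self-recovery idea above, assuming the uploader stores a SHA-256 checksum in the object's user metadata (the bucket, key, and metadata key names here are hypothetical, not part of the actual cdk-assets uploader):

```ts
// Hypothetical self-recovery sketch: compare a checksum stored in the S3
// object's user metadata against the checksum of the freshly built zip and
// re-upload when they differ. Bucket/key/metadata names are assumptions.
import { createHash } from 'crypto';
import { readFile } from 'fs/promises';
import { S3Client, HeadObjectCommand, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

async function ensureAssetMatches(bucket: string, key: string, localZipPath: string): Promise<void> {
  const body = await readFile(localZipPath);
  const localSha256 = createHash('sha256').update(body).digest('hex');

  let remoteSha256: string | undefined;
  try {
    const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
    remoteSha256 = head.Metadata?.checksum; // surfaced from x-amz-meta-checksum
  } catch {
    // Object not found (or not readable): fall through and upload.
  }

  if (remoteSha256 !== localSha256) {
    // Missing or corrupted asset: overwrite it with the known-good local zip.
    await s3.send(new PutObjectCommand({
      Bucket: bucket,
      Key: key,
      Body: body,
      Metadata: { checksum: localSha256 },
    }));
  }
}
```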
Additional Information/Context
No response
CDK CLI Version
2.53.0
Framework Version
No response
Node.js Version
16.18.1
OS
Linux Ubuntu
Language
Typescript
Language Version
Typescript 4.9.3
Other information
No response