Garbage Collection for Assets #64

eladb · 2018-12-12T08:11:04Z

Description

Assets which are uploaded to the CDK's S3 bucket and ECR repository are never deleted. This will incur costs for users in the long term. We should come up with a story on how those should be garbage collected safely.

Initially we should offer cdk gc which will track down unused assets (e.g. by tracing them back from deployed stacks) and offering users to delete them. We can offer an option to automatically run this after every deployment (either in CLI or through CI/CD). Later we can even offer a construct that you deploy to your environment and it can do that for you.

Proposed usage:

cdk gc [ENVIRONMENT...] [--list] [--type=s3|ecr]

Examples:

This command will find all orphaned S3 and ECR assets in a specific AWS environment and will delete them:

cdk gc aws://ACCOUNT/REGION

This command will garbage collect all assets in all environments that belong to the current CDK app (if cdk.json exists):

cdk gc

Just list orphaned assets:

cdk gc --list

Roles

Role	User
Proposed by	@eladb
Author(s)	@kaizen3031593
API Bar Raiser	@njlynch
Stakeholders	@rix0rrr @nija-at

See RFC Process for details

Workflow

Author is responsible to progress the RFC according to this checklist, and
apply the relevant labels to this issue so that the RFC table in README gets
updated.

The text was updated successfully, but these errors were encountered:

sam-goodwin · 2018-12-12T08:36:07Z

How do we generate keys when uploading? Just random?

If it were deterministic, we could use lifecycle rules of object versions: https://aws.amazon.com/about-aws/whats-new/2014/05/20/amazon-s3-now-supports-lifecycle-rules-for-versioning/

That way objects are only considered for deletion if they are not 'current'.

sam-goodwin · 2018-12-12T08:39:34Z

Now that I think about it, versioning would only help if we were regularly re-uploading files. That's probably not the case?

eladb · 2018-12-12T08:41:36Z

Life cycle rules sounds like a good option for sure. The object keys are based on the hash of the contents of the asset, so to avoid uploading in case the content hasn't changed (code)

rix0rrr · 2018-12-12T09:05:53Z

I'm not sure this works if there are two stacks, and only one is deployed with a new version of the asset.

In my mind, the other stack should still point to the old version of the asset (should not be automatically updated), but now the asset will be aged out after a while and the undeployed stack will magically break after a certain time.

Alternative idea (but much more involved and potentially not Free Tier): a Lambda that runs ever N hours which enumerates all stacks, detects the assets that are in use (from stack parameters or in some other way) and clears out the rest?

rix0rrr · 2018-12-12T09:06:49Z

Now that I think about it, versioning would only help if we were regularly re-uploading files. That's probably not the case?
I thought you meant expiration for non-current versions. The latest version still stays alive indefinitely.

But I think this runs afoul of reuse across stacks.

eladb · 2018-12-12T09:08:23Z

👍 on garbage collecting lambda that runs every week or so, with ability to opt out and some cost warnings on docs and online

sam-goodwin · 2018-12-12T09:11:10Z

Seems risky. What if a deployment happen while crawling?

eladb · 2018-12-12T09:15:25Z

Yeah perhaps only collect old assets (month old) and we can salt the object key such that if a whole month had passed, it will be a new object?

Thinking out loud.... requires a design

thekevinbrown · 2020-06-22T01:23:43Z

I've got quite a few assets in my bucket now after a month or so of deploying from CD.

How do I determine which ones are in use, even manually? I can't seem to figure out the correlation between the names of the files in S3 and anything else I could use to determine what's being used. The lambdas don't point back at them in any way I can see.

I want to eventually write a script to do this safely for my use case, but absent a way of telling what's being used I'm stuck.

ansman · 2021-02-11T19:09:06Z

S3 now has lifecycle rules that can automatically delete objects a number of days after creation which might be a solution too.

mlafeldt · 2021-02-11T19:21:00Z

How do I determine which ones are in use, even manually?

The ones in use are those referenced in active CloudFormation stacks deployed via CDK.

Those stack templates will include something like this:

    "GenFeedFunc959C5085": {
      "Type": "AWS::Lambda::Function",
      "Properties": {
        "Code": {
          "S3Bucket": {
            "Fn::Sub": "cdk-xxx-assets-${AWS::AccountId}-${AWS::Region}"
          },
          "S3Key": "5946c35f6797cf45370f262c1e5992bc54d8e7dd824e3d5aa32152a2d1e85e5d.zip"
        },

S3 now has lifecycle rules that can automatically delete objects a number of days after creation which might be a solution too.

Unfortunately, that won't help since old objects might still be in use, e.g. when a Lambda wasn't deployed in a while.

(It doesn't help that all assets are stored in the same "folder" either.)

ansman · 2021-02-11T19:22:21Z

Doesn't lambda copy the resources during deployment?

mlafeldt · 2021-02-11T19:23:44Z

Doesn't lambda copy the resources during deployment?

The Lambda service will cache functions, but AFAIK there's no guarantee that they will be cached forever.

ansman · 2021-02-11T19:31:33Z

I'm fairly certain that Lambda only reads the assets during deployment and that they aren't needed afterwards. You can for example deploy to lambda without using S3 for smaller assets and those aren't stored in there.

mlafeldt · 2021-02-11T19:41:21Z

Would love to read about the behavior in the docs somewhere. That 50 MB upload limit must exist for a reason. Haven't found anything so far though.

ansman · 2021-02-11T19:59:22Z

I can't find any concrete resources on this but I haven't found any docs mentioning that you cannot delete the object after deletion. Also, my lambdas doesn't have any permission to read my staging bucket nor does it mention that the object is from S3 so I doubt it's required to keep the object around.

kjpgit · 2021-04-07T21:15:35Z

I would suggest writing s3 objects as transient/YYYY/MM/DD/HASH.zip and having a lifecycle policy to remove transient/* files after 3 days. You'd get caching for builds done in the same day. This is similar to the salted hash suggested, but a lot more explicit, observable, and not subject to collisions. Also, the hash could stay the same day to day, so as to not trigger a spurious Lambda redeploy.

The main issue here is you need to pick your date / prefix only once, and stick with that, for the whole build process. You don't want to upload files at Tue 23:59 and a later process/function looks in Wednesday for the object. Perhaps just having an --transient-asset-prefix argument to cdk deploy would be enough?

Option 2 for S3 is to check the last-modifed date on objects, and just force a re-upload every N days. Then you could have a lifecycle policy to delete old objects (> N+1 days), and that shouldn't race with deploys. You'd need to be 100% sure the re-upload doesn't race with S3's lifecycle process, however. That's why I prefer the immutable objects in option 1, there is no chance of a race.

ECR is different because it's not transient, it is the long term backing store for Lambda. Just spitballing here, but if there was a "transient ECR repo" with a 7 day deletion policy, you could push new builds to that, and then during cloudformation deploy, those images would then be "copied/hardlinked" to a "runtime ECR repo" with lifetime managed by cloudformation, e.g. removed upon stack update/delete. Maybe the same thing could be accomplished if cloudformation could set ECR Tags that "pin" images in the transient repo that are in use, and tagged images are excluded from lifecycle rules. However, to avoid races, builds have to push something (e.g. at least some tiny change) to the transient ECR repo to refresh the image age (at least if the image is older than a day), so it won't be deleted right before cloudformation starts tracking / pinning it.

polothy · 2021-05-06T18:27:28Z

Word of caution: doing just a time based deletion on assets is a little risky. We have had this scenario play out:

A stack is deployed and is healthy.
The stack's assets are removed from S3.
The stack is then updated at a later date (new assets in S3), but the update fails and triggers a rollback.
The rollback fails because CFN looks for the old assets in the prior template.
This puts your stack into the UPDATE_ROLLBACK_FAILED state. For us, we were able to get out of it by carefully skipping some rollback on resources. It's a rather scary state to have a stack in TBH.

So, unsure how you would even accomplish this, but ideally don't delete any S3 assets that are referenced in existing CFN templates.

polothy · 2021-05-06T18:33:45Z

So, unsure how you would even accomplish this, but ideally don't delete any S3 assets that are referenced in existing CFN templates.

Sorry, was reading quickly - sound like the RFC is going to try to do this, so yah!

mrpackethead · 2023-12-18T08:29:52Z

@awsmjs , since when does the cdk teams priorities not align with Amazon Leadership principals?

This is deliberately causing AWS customers to spend money unnessarily.

sholtomaud · 2023-12-18T08:47:27Z

hey @awsmjs here is an idea ... umm ... maybe hire some more staff in either/both the cdk team and/or the AI team and fix this shit. $8k/yr for a CDK Landingzone is just dumb, now we don't even get garbage collection because what? Don't cry poor. Fix it and don't piss people off. You have responsibilities now so bad luck.

vincent-dm · 2023-12-18T09:38:06Z

No need for the aggressive tone; we don't know if @awsmjs has any authority over hiring.

But I do agree that, since CDK is a tool of the for-profit AWS product, maintainers should aim higher than the average volunteer-run open-source project.

I understand that staffing is a constraint and priorities have to be made, but CDK team should then inform whoever is responsible for staffing, that the CDK team size is inadequate vis-à-vis its responsibilities. And also, put such tickets on some backlog then, instead of closing them as if they do not represent a real need from the customers who are -in the end- the ones paying for this.

a-h · 2023-12-18T09:48:02Z

For S3 assets, I now have 1.8TB of junk in my test account. The staging account probably has 1/3 of that amount, and the production account 1/3 of the amount of the staging account.

So, in total, I'm probably wasting around 2.5TB of S3 storage on this project.

$0.023 per GB x 2.5TB * 1024GB = 2560GB * $0.023 = $58.88, or around $700 per year.

It's not a lot of money, especially compared to CloudWatch or AWS Config, but it is a waste of some money and I don't like the idea that this will continue increasing, forever, with no prospect of being reduced.

CDK's NPM stats show 1,209,474 downloads per week. Out of those, I assume a lot are waste, but I can make a guess that 1 in 20 downloads give me a usage count of around 60k teams using CDK every week. Out of those teams, I'm probably in the top 3% of use, so that's 1800 teams with the same or greater level of use than me.

If they're in the same position, that's $700 per year * 1800 teams = around $1,260,000 of waste. I don't know what the value of other roadmap items look like, but it looks like it's worth putting 2 people on it for a year to me - although it would be a net loss for AWS revenue. 😁

For ECR, I'm using my custom cleaner https://github.com/a-h/cdk-ecr-asset-cleaner which means I'm still at 28 Docker containers. I would like for AWS to take that implementation and use it for inspiration to build an official CDK version.

mrpackethead · 2023-12-18T09:49:33Z

hey @awsmjs here is an idea ... umm ... maybe hire some more staff in either/both the cdk team and/or the AI team and fix this shit. $8k/yr for a CDK Landingzone is just dumb, now we don't even get garbage collection because what? Don't cry poor. Fix it and don't piss people off. You have responsibilities now so bad luck.

Yes. all of that +1

moltar · 2023-12-18T09:49:46Z

I wholly agree with the majority here - this is a real pain and should be addressed ASAP.

However, the incentives are not there, as AWS has a direct benefit from wasted storage. Thus systems thinking will suggest that there is no incentive to prioritize fixing this. The only lever we have as a community is to find a counter-incentive that will apply pressure. I think posting comments will not cut it. I think it's a good start to voice your opinion, but I am just saying it won't be enough to change the status quo.

Meanwhile, for those who are hurting, and might be interested in alternative solutions today, I recommend looking at these two solutions:

ToolkitCleaner - from @jogold, a big contributor to this project. We've been using this solution for a year now, and thus far have not encountered any issues.
App Staging Synthesizer - this is an alternative synthesizer, which may not be compatible with every workflow yet, notably pipelines, but worth a shot if you can use it. I think it is fine to use at least for sandbox and throwaway accounts. It has the "Auto Delete Staging Assets on Deletion" and "Lifecycle Rules on ECR Repositories" features, which help with old assets.

mrpackethead · 2023-12-18T09:50:39Z

No need for the aggressive tone; we don't know if @awsmjs has any authority over hiring.

But I do agree that, since CDK is a tool of the for-profit AWS product, maintainers should aim higher than the average volunteer-run open-source project.

I understand that staffing is a constraint and priorities have to be made, but CDK team should then inform whoever is responsible for staffing, that the CDK team size is inadequate vis-à-vis its responsibilities. And also, put such tickets on some backlog then, instead of closing them as if they do not represent a real need from the customers who are -in the end- the ones paying for this.

People have been polite for long enough. We spend millions of dollars a year with AWS. Time to put customers interests first.

moltar · 2023-12-18T10:04:40Z

💡 Idea for a counter-incentive

As a community, we'll create a CDK construct and a shared CF template, which will provide minimal access to get the total bucket size of the asset bucket. I think this can be done through just a CloudWatch metric (BucketSizeBytes), and filter by a tag.

Then we will create a central account where we will keep a tally of the sizes of all of the buckets out there of all of the participants of this experiment. I understand that many will not be able to participate due to corporate and security limitations, so it's not the full picture, of course.

We will have a public website, where we will host the BIG NUMBER, and explain the situation.

This will allow us, as a community, to measure the impact of this misfeature. If the impact is significant, then I think it would be easy to get the attention of the shareholders through the grapevine (Twitter/X, blog posts, conference talks, press) and thus apply pressure on AWS to allocate more resources to fix this.

eddie-atkinson · 2023-12-18T11:16:28Z

Look folks I appreciate that this is frustrating, but can we please not flame someone who's trying to do their job.

After a few months of reflection the solution for my use case of this construct was to separate the build and deploy stages of my service.

Essentially I had two CDK stacks, one of which provisioned an ECR repo to which I could apply a life cycle policy on images, and the other which deployed my service.

Then when it came to deployment I simply deployed my first stack, parsed out the ECR repo ARN and piped it into the call to cdk deploy of the other stack using CDK config. That way no CloudFormation dependency was created between the two stacks, but I could also build an deploy an image to a service in one fell swoop. This also has the neat side effect that if you need to redeploy your service due to a config change you don't need to rebuild your image which was the most time consuming part of the process for me.

This approach is more in line with the build, release, run philosophy of the 12 Factor App, which I broadly agree with.

kornicameister · 2024-01-23T22:49:18Z

Personally I had refined default bootstrap solution that prepares accounts with cdk toolkit. That includes reasonable retention policies and I find it to a very good job.

vkdot · 2024-03-07T11:01:09Z

It seems several solutions clean up CDK staging bucket from the perspective of listing deployed CloudFormation templates. This might not be ideal in case multiple CDK projects are being used in a single AWS account. and we want a different set up per project.

I have tried a different approach that modifies Default Stack Synthesizer to use S3 object key prefix and leverage cloud assembly output from CDK app synth method to determine, which assets should be kept after successful deployment.

The full description and example Java code is at https://github.com/NewTownData/events-monolith/blob/main/infrastructure/docs/clean-up.md

revmischa · 2024-03-14T01:46:26Z

Just speaking for myself here, I have a small team and just noticed we're wasting 500GB on CDK assets in various buckets.
I emptied the one for dev envs and everything seems to still work fine.
I think it can be solved with a lifecycle policy on the bucket. CDK itself creates the bucket so maybe it could just apply some policy. Maybe just delete something if it's over 6mo old to be conservative?

kaizencc · 2024-10-23T01:27:45Z

hi everyone, some updates for this feature:

aws/aws-cdk#31611 is released in CDK 2.163.0 under an --unstable flag :)
aws/aws-cdk#31841 is under development but should be out next week

between the two of them and any follow ups, we should be able to close out this feature as completed.

@rix0rrr

This is a request for comments about Asset Garbage Collection. See #64 for additional details. APIs are signed off by @rix0rrr . --- _By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license_ --------- Co-authored-by: Rico Hermans <rix0rrr@gmail.com>

kaizencc · 2024-10-30T15:56:54Z

The RFC for this issue has been approved and merged (#379). The implementation has also been completed and should be fully released this week in CDK 2.165.0 under the --unstable flag for the foreseeable future. Please let me know if anyone runs into issues when using it!

TiagoVentosa · 2024-11-13T17:20:37Z

@kaizencc Very exited for this new feature, and just tried to run it now and...

npx cdk gc --unstable=gc --rollback-buffer-days 5
 ⏳  Garbage Collecting environment aws://<ACCOUNT>/eu-west-1...
Error refreshing stacks: AccessDenied: User: arn:aws:sts::<ACCOUNT>:assumed-role/<ROLE> is not authorized to perform: cloudformation:GetTemplateSummary on resource: arn:aws:cloudformation:eu-west-1:<ACCOUNT>:stack/<ORGANIZATION-STACK> with an explicit deny in a service control policy

My first instinct is to say that the gc command should probably ignore stacks that it doesn't have access, but leave a warning in the console. Maybe even have a parameter to control this behaviour?

Should I open a separate ticket for this?

kaizencc · 2024-12-03T20:14:50Z

#32322

NGL321 assigned eladb Sep 10, 2019

eladb transferred this issue from aws/aws-cdk Jan 22, 2020

eladb added the effort/medium Reasonable effort required for implementation label Jan 22, 2020

eladb changed the title ~~assets: garbage collection~~ Asset Garbage Collector Jan 27, 2020

eladb added the status/proposed Newly proposed RFC label Jan 27, 2020

eladb changed the title ~~Asset Garbage Collector~~ Garbage Collection for Assets Feb 4, 2020

eladb mentioned this issue Apr 10, 2020

Instruct CDK to remove old assets after deploy aws/aws-cdk#7194

Closed

2 tasks

wangzhihao mentioned this issue Jul 25, 2020

How to reuse S3 File assets? MoonsetJS/Moonset#89

Closed

eladb removed their assignment Feb 25, 2021

NetaNir mentioned this issue Apr 6, 2021

bootstrap s3 bucket: add lifecycle policy option aws/aws-cdk#13968

Closed

eladb mentioned this issue Apr 7, 2021

(assets) unused images not being deleted from ECR aws/aws-cdk#13967

Closed

mrpackethead mentioned this issue Dec 18, 2023

Garbage Collection for Assets #590

Closed

11 tasks

evgenyka reopened this Dec 18, 2023

evgenyka added the roadmap label Dec 18, 2023

awsmjs removed the roadmap label Dec 18, 2023

evgenyka added roadmap status/planning RFC implementation is being planned and removed status/proposed Newly proposed RFC status/accepted labels Dec 18, 2023

evgenyka mentioned this issue Feb 21, 2024

cdk bootstrap - no destroy option aws/aws-cdk#986

Closed

jlnstrk mentioned this issue Jun 1, 2024

(aws-ecr-assets): Build and save images during synthesis rather than at deployment aws/aws-cdk#24936

Open

2 tasks

kaizencc closed this as completed Oct 30, 2024

kaizencc added status/done Implementation complete and removed status/planning RFC implementation is being planned labels Nov 4, 2024

Garbage Collection for Assets #64

Garbage Collection for Assets #64

Comments

eladb commented Dec 12, 2018 • edited Loading

Description

Roles

Workflow

sam-goodwin commented Dec 12, 2018

sam-goodwin commented Dec 12, 2018

eladb commented Dec 12, 2018

rix0rrr commented Dec 12, 2018

rix0rrr commented Dec 12, 2018

eladb commented Dec 12, 2018

sam-goodwin commented Dec 12, 2018

eladb commented Dec 12, 2018

thekevinbrown commented Jun 22, 2020

ansman commented Feb 11, 2021

mlafeldt commented Feb 11, 2021 • edited Loading

ansman commented Feb 11, 2021

mlafeldt commented Feb 11, 2021

ansman commented Feb 11, 2021

mlafeldt commented Feb 11, 2021

ansman commented Feb 11, 2021

kjpgit commented Apr 7, 2021 • edited Loading

polothy commented May 6, 2021

polothy commented May 6, 2021

mrpackethead commented Dec 18, 2023

sholtomaud commented Dec 18, 2023

vincent-dm commented Dec 18, 2023

a-h commented Dec 18, 2023 • edited Loading

mrpackethead commented Dec 18, 2023

moltar commented Dec 18, 2023

mrpackethead commented Dec 18, 2023

moltar commented Dec 18, 2023

eddie-atkinson commented Dec 18, 2023

kornicameister commented Jan 23, 2024

vkdot commented Mar 7, 2024

revmischa commented Mar 14, 2024

kaizencc commented Oct 23, 2024

kaizencc commented Oct 30, 2024

TiagoVentosa commented Nov 13, 2024

kaizencc commented Dec 3, 2024

eladb commented Dec 12, 2018 •

edited

Loading

mlafeldt commented Feb 11, 2021 •

edited

Loading

kjpgit commented Apr 7, 2021 •

edited

Loading

a-h commented Dec 18, 2023 •

edited

Loading