gc: clean up tmpdirs with dirEngine #17

Closed · cyphar opened this issue Nov 6, 2016 · 6 comments

Comments
@cyphar (Member) commented Nov 6, 2016

This is going to be a bit of a layer violation, so I'll have to think about it. But basically we need to have a way to remove all of the temporary directories that have been left behind inside an OCI image.

@wking (Contributor) commented Nov 6, 2016

My preferred approach is to follow Git and give content a grace period before cleanup. If a temp file, temp dir, or unreferenced blob has not been touched in 24 hours (or whatever grace period we pick), it's a safe bet that nobody cares about it anymore. That means you don't reclaim space until 24 hours after the crash or abandoned push, but those events should be rare locally.
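
For illustration only, a minimal sketch of that kind of age-based sweep in Go (the `.tmp-` prefix, the 24-hour default, and the function names are assumptions, not umoci's actual layout):

```go
// Minimal sketch of an age-based sweep: remove top-level temporary entries in
// an image directory that have not been modified within the grace period.
package main

import (
	"log"
	"os"
	"path/filepath"
	"strings"
	"time"
)

func cleanStaleTempDirs(imageDir string, pruneExpire time.Duration) error {
	entries, err := os.ReadDir(imageDir)
	if err != nil {
		return err
	}
	cutoff := time.Now().Add(-pruneExpire)
	for _, entry := range entries {
		if !strings.HasPrefix(entry.Name(), ".tmp-") {
			continue // only touch entries we know are temporary
		}
		info, err := entry.Info()
		if err != nil {
			continue
		}
		if info.ModTime().Before(cutoff) {
			path := filepath.Join(imageDir, entry.Name())
			if err := os.RemoveAll(path); err != nil {
				log.Printf("warning: failed to remove %s: %v", path, err)
			}
		}
	}
	return nil
}

func main() {
	// Example invocation with a 24-hour grace period.
	if err := cleanStaleTempDirs(".", 24*time.Hour); err != nil {
		log.Fatal(err)
	}
}
```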

@cyphar (Member, Author) commented Nov 6, 2016

Thanks for your insight. However, that's not relevant here because umoci gc is an explicit command operating on a local image we are currently building.

@wking (Contributor) commented Nov 6, 2016

So is git gc; see gc.pruneExpire.

@cyphar (Member, Author) commented Nov 6, 2016

Not really comparable IMO. An OCI image is not like a git repo (aside from both being Merkle DAGs), and umoci is not intended to be used like git.

@wking (Contributor) commented Nov 6, 2016

Merkle DAG seems like enough similarity for sharing similar CAS GC approaches to me. But I'm sure there are other approaches that will work too ;).

wking referenced this issue Nov 10, 2016
Signed-off-by: Aleksa Sarai <asarai@suse.com>
@cyphar (Member, Author) commented Nov 16, 2016

TODO: Use syscall.Flock so that we can make sure we don't delete directories we shouldn't.
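
A minimal sketch of what that could look like with syscall.Flock (the helper name and directory layout are assumptions, not the eventual umoci code): a consumer holds an exclusive flock(2) on its temporary directory so that a GC pass can tell the directory is still in use.

```go
// Sketch: lock a temporary directory with flock(2) so GC can detect it is live.
// Helper names are illustrative, not umoci's actual API.
package main

import (
	"fmt"
	"os"
	"syscall"
)

// lockTempDir opens dir and takes an exclusive, non-blocking flock on it.
// The lock is dropped when the returned file is closed or the process exits.
func lockTempDir(dir string) (*os.File, error) {
	f, err := os.Open(dir)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("flock %s: %v", dir, err)
	}
	return f, nil
}

func main() {
	dir, err := os.MkdirTemp(".", ".umoci-tmp-")
	if err != nil {
		panic(err)
	}
	lock, err := lockTempDir(dir)
	if err != nil {
		panic(err)
	}
	defer lock.Close() // closing drops the lock, making dir eligible for GC again
	// ... write temporary content into dir, then rename or remove as usual ...
}
```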

cyphar referenced this issue Dec 19, 2016
Fixes: cyphar/umoci#17
Closes: cyphar/umoci#63
LGTMs: @cyphar
wking referenced this issue in wking/umoci Oct 18, 2017
Following Git and defaulting to two weeks [1], as discussed in [2].
This allows the use of other non-flocking consumers (e.g. other CAS
implementations) in the same base directory.

The added robustness comes with a price, though, since we now have to
balance cleanliness vs. robustness for those other consumers.

* The hard-coded two-week cutoff may be insufficient for some use
  cases.  Maybe somebody will need to download a huge layer through a
  very slow pipe, and they'll bump into this limit.  Or maybe they
  have a high-cruft workflow or a small disk and would like to shorten
  this limit.  If that happens or we become concerned that it might,
  we should make pruneExpire configurable.

* Increasing pruneExpire protects more flock-less consumers from
  Clean, but also means that cruft can survive for longer before Clean
  is confident enough to remove it.  Setting an infinite pruneExpire
  would be very safe but would make Clean a no-op.  Two weeks seems
  like a safe choice, since well-behaved consumers will clean up after
  themselves when they are closed.  Clean is just handling
  poorly-behaved consumers (or well-behaved consumers that had a hard
  shutdown).  I don't expect cruft to build up quickly enough for the
  two-week default to be an issue for most users.

Consumers who do not wish to rely on pruneExpire (perhaps they have
long-running operations or are locally setting pruneExpire very short)
can continue to flock their temporary files and directories to protect
them from Clean.  Flocking a directory will also protect all content
within that directory.

Removal errors (because of insufficient permissions, etc.) seemed like
they should be non-fatal so we could continue to remove other cruft.
However, I didn't want to silently ignore the failed removal.  In this
commit, I log those errors with a "warning" level.

The order of walk means that an old directory containing an old file
will not be removed in a single pass.  The first Clean will fail to
remove the old directory because it is not empty, but will remove the
old file (assuming it has sufficient permissions, etc.).  A second
Clean pass will remove the now-empty old directory.  With cruft
building slowly and Clean running fairly frequently, the delayed
old-directory removal didn't seem like a big enough issue to warrant
an immediate re-clean attempt.

[1]: https://git-scm.com/docs/git-gc
[2]: https://github.com/openSUSE/umoci/issues/17#issuecomment-258686340

Signed-off-by: W. Trevor King <wking@tremily.us>
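
To illustrate the policy that commit message describes, here is a rough sketch of the "old enough and not flocked" check (the constant, helper names, and example path are assumptions, not the actual wking/umoci code):

```go
// Rough illustration of the Clean policy described above: an entry is only a
// removal candidate if it is older than pruneExpire AND no live consumer holds
// a flock on it. Names and constants are illustrative.
package main

import (
	"log"
	"os"
	"syscall"
	"time"
)

const pruneExpire = 14 * 24 * time.Hour // two weeks, following git-gc's default

// isFlocked reports whether some other process currently holds a flock on path.
func isFlocked(path string) bool {
	f, err := os.Open(path)
	if err != nil {
		return false
	}
	defer f.Close()
	// Try to grab the lock without blocking; failure (EWOULDBLOCK) means it is held.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		return true
	}
	// We got the lock, so nobody else held it; release it immediately.
	syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return false
}

// shouldRemove decides whether Clean may delete path.
func shouldRemove(path string) bool {
	info, err := os.Stat(path)
	if err != nil {
		return false
	}
	old := time.Since(info.ModTime()) > pruneExpire
	return old && !isFlocked(path)
}

func main() {
	log.Println(shouldRemove("/var/tmp/example-tmpdir")) // example path, purely illustrative
}
```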