gc: clean up tmpdirs with dirEngine #17

Closed · cyphar opened this issue Nov 6, 2016 · 6 comments

Comments
@cyphar (Member) commented Nov 6, 2016

This is going to be a bit of a layer violation, so I'll have to think about it. But basically we need to have a way to remove all of the temporary directories that have been left behind inside an OCI image.

@wking (Contributor) commented Nov 6, 2016

My preferred approach is to follow Git and give content a grace period before cleanup. If a temp file, temp dir, or unreferenced blob has not been touched in 24 hours (or whatever grace period we pick), it's a safe bet that nobody cares about it anymore. That means you don't reclaim space until 24 hours after the crash or abandoned push, but those events should be rare locally.
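
For illustration only, a minimal sketch of that kind of age-based sweep in Go (the `.tmp-` prefix, the 24-hour default, and the function names are assumptions, not umoci's actual layout):

```go
// Minimal sketch of an age-based sweep: remove top-level temporary entries in
// an image directory that have not been modified within the grace period.
package main

import (
	"log"
	"os"
	"path/filepath"
	"strings"
	"time"
)

func cleanStaleTempDirs(imageDir string, pruneExpire time.Duration) error {
	entries, err := os.ReadDir(imageDir)
	if err != nil {
		return err
	}
	cutoff := time.Now().Add(-pruneExpire)
	for _, entry := range entries {
		if !strings.HasPrefix(entry.Name(), ".tmp-") {
			continue // only touch entries we know are temporary
		}
		info, err := entry.Info()
		if err != nil {
			continue
		}
		if info.ModTime().Before(cutoff) {
			path := filepath.Join(imageDir, entry.Name())
			if err := os.RemoveAll(path); err != nil {
				log.Printf("warning: failed to remove %s: %v", path, err)
			}
		}
	}
	return nil
}

func main() {
	// Example invocation with a 24-hour grace period.
	if err := cleanStaleTempDirs(".", 24*time.Hour); err != nil {
		log.Fatal(err)
	}
}
```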

@cyphar (Member, Author) commented Nov 6, 2016

Thanks for your insight. However, that's not relevant here because umoci gc is an explicit command operating on a local image we are currently building.

@wking (Contributor) commented Nov 6, 2016

So is git gc; see gc.pruneExpire.

@cyphar (Member, Author) commented Nov 6, 2016

Not really comparable IMO. An OCI image is not like a git repo (aside from both being Merkle DAGs), and umoci is not intended to be used like git.

@wking (Contributor) commented Nov 6, 2016

Merkle DAG seems like enough similarity for sharing similar CAS GC approaches to me. But I'm sure there are other approaches that will work too ;).

wking referenced this issue Nov 10, 2016
Signed-off-by: Aleksa Sarai <asarai@suse.com>
@cyphar (Member, Author) commented Nov 16, 2016

TODO: Use syscall.Flock so that we can make sure we don't delete directories we shouldn't.
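
A minimal sketch of what that could look like with syscall.Flock (the helper name and directory layout are assumptions, not the eventual umoci code): a consumer holds an exclusive flock(2) on its temporary directory so that a GC pass can tell the directory is still in use.

```go
// Sketch: lock a temporary directory with flock(2) so GC can detect it is live.
// Helper names are illustrative, not umoci's actual API.
package main

import (
	"fmt"
	"os"
	"syscall"
)

// lockTempDir opens dir and takes an exclusive, non-blocking flock on it.
// The lock is dropped when the returned file is closed or the process exits.
func lockTempDir(dir string) (*os.File, error) {
	f, err := os.Open(dir)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("flock %s: %v", dir, err)
	}
	return f, nil
}

func main() {
	dir, err := os.MkdirTemp(".", ".umoci-tmp-")
	if err != nil {
		panic(err)
	}
	lock, err := lockTempDir(dir)
	if err != nil {
		panic(err)
	}
	defer lock.Close() // closing drops the lock, making dir eligible for GC again
	// ... write temporary content into dir, then rename or remove as usual ...
}
```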

cyphar referenced this issue Dec 19, 2016
Fixes: cyphar/umoci#17
Closes: cyphar/umoci#63
LGTMs: @cyphar
wking referenced this issue in wking/umoci Oct 18, 2017
Following Git and defaulting to two weeks [1], as discussed in [2].
This allows the use of other non-flocking consumers (e.g. other CAS
implementations) in the same base directory.

The added robustness comes with a price, though, since we now have to
balance cleanliness vs. robustness for those other consumers.

* The hard-coded two-week cutoff may be insufficient for some use
  cases.  Maybe somebody will need to download a huge layer through a
  very slow pipe, and they'll bump into this limit.  Or maybe they
  have a high-cruft workflow or a small disk and would like to shorten
  this limit.  If that happens or we become concerned that it might,
  we should make pruneExpire configurable.

* Increasing pruneExpire protects more flock-less consumers from
  Clean, but also means that cruft can survive for longer before Clean
  is confident enough to remove it.  Setting an infinite pruneExpire
  would be very safe but would make Clean a no-op.  Two weeks seems
  like a safe choice, since well-behaved consumers will clean up after
  themselves when they are closed.  Clean is just handling
  poorly-behaved consumers (or well-behaved consumers that had a hard
  shutdown).  I don't expect cruft to build up quickly enough for the
  two-week default to be an issue for most users.

Consumers who do not wish to rely on pruneExpire (perhaps they have
long-running operations or are locally setting pruneExpire very short)
can continue to flock their temporary files and directories to protect
them from Clean.  Flocking a directory will also protect all content
within that directory.

Removal errors (because of insufficient permissions, etc.) seemed like
they should be non-fatal so we could continue to remove other cruft.
However, I didn't want to silently ignore the failed removal.  In this
commit, I log those errors with a "warning" level.

The order of walk means that an old directory containing an old file
will not be removed in a single pass.  The first Clean will fail to
remove the old directory because it is not empty, but will remove the
old file (assuming it has sufficient permissions, etc.).  A second
Clean pass will remove the now-empty old directory.  With cruft
building slowly and Clean running fairly frequently, the delayed
old-directory removal didn't seem like a big enough issue to warrant
an immediate re-clean attempt.

[1]: https://git-scm.com/docs/git-gc
[2]: https://github.com/openSUSE/umoci/issues/17#issuecomment-258686340

Signed-off-by: W. Trevor King <wking@tremily.us>
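
To illustrate the policy that commit message describes, here is a rough sketch of the "old enough and not flocked" check (the constant, helper names, and example path are assumptions, not the actual wking/umoci code):

```go
// Rough illustration of the Clean policy described above: an entry is only a
// removal candidate if it is older than pruneExpire AND no live consumer holds
// a flock on it. Names and constants are illustrative.
package main

import (
	"log"
	"os"
	"syscall"
	"time"
)

const pruneExpire = 14 * 24 * time.Hour // two weeks, following git-gc's default

// isFlocked reports whether some other process currently holds a flock on path.
func isFlocked(path string) bool {
	f, err := os.Open(path)
	if err != nil {
		return false
	}
	defer f.Close()
	// Try to grab the lock without blocking; failure (EWOULDBLOCK) means it is held.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		return true
	}
	// We got the lock, so nobody else held it; release it immediately.
	syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return false
}

// shouldRemove decides whether Clean may delete path.
func shouldRemove(path string) bool {
	info, err := os.Stat(path)
	if err != nil {
		return false
	}
	old := time.Since(info.ModTime()) > pruneExpire
	return old && !isFlocked(path)
}

func main() {
	log.Println(shouldRemove("/var/tmp/example-tmpdir")) // example path, purely illustrative
}
```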