
Huge number of files in .drake/keys .drake/data #154

Closed
kendonB opened this issue Nov 13, 2017 · 19 comments

@kendonB
Contributor

kendonB commented Nov 13, 2017

I'm running up against my file count quota. Is there any way to consolidate these? I used clean to remove unwanted targets already.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

Edit: TL;DR

  • drake_gc(): garbage collection with no cleaning.
  • clean(garbage_collection = TRUE): cleaning with garbage collection.
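In code, a minimal usage sketch (assuming a default .drake/ cache in the working directory):

library(drake)
drake_gc()                        # garbage collection only
clean(garbage_collection = TRUE)  # remove targets, then garbage-collect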

Original post

Ultimately, I think it will be up to you to define bigger, fewer targets. Unfortunately, there is no way to consolidate targets post-hoc.

What kind of quota is it, and is it imposed by the operating system or by storr? How many targets do you have, and what is the cap on the number of files?

Is the quota on a per-folder basis? If so, you could cleverly use multiple storr/drake caches and have targets from one become imports for the other. The solution is similar if you have caches spread over multiple drives.

To consolidate targets, the closest thing drake has is gather_plan(), but that assumes the individual targets are already there.
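For example, a sketch with a toy two-target plan (the argument names and the list-style gathering command are assumed from drake's documentation):

library(drake)
plan <- data.frame(
  target = c("a", "b"),
  command = c("1", "2"),
  stringsAsFactors = FALSE
)
gathered <- gather_plan(plan, target = "combined")
# gathered has one row whose command is list(a = a, b = b),
# so "combined" depends on every target in the original plan.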

#129 probably exacerbated the problem, but it was the right design choice, and I would rather not go back.

@wlandau-lilly
Collaborator

I also recommend posting on the storr issue tracker. @richfitz may have better advice.

@wlandau-lilly
Collaborator

Can you run quotacheck --all?

@richfitz
Member

Just to garbage collect the underlying storr (#118 and links within). If you really want to restrict the number of files, you can move to a non-rds based storr (this will require changes in drake but these are restricted to cache initialisation). There is support for SQLite in the current storr release, but this is significantly slower than rds (factor of x10 to x100 from memory). I have a new package (thor) which will be about as fast as rds, perhaps a little faster.
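For reference, a sketch of a SQLite-backed storr via storr's DBI driver (storr_dbi() and its table-name arguments are assumed from the storr documentation; the table names themselves are arbitrary):

library(storr)
con <- DBI::dbConnect(RSQLite::SQLite(), "cache.sqlite")
st <- storr_dbi(tbl_data = "storr_data", tbl_keys = "storr_keys", con = con)
st$set("mykey", mtcars)            # one SQLite file instead of many rds files
identical(st$get("mykey"), mtcars)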

@wlandau-lilly
Collaborator

@kendonB try drake::get_cache()$gc(). @richfitz how fast is gc()? I am thinking about adding it to drake::clean().

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

By the way: drake does support user-supplied non-rds storr caches. I have only tested with storr_rds() and storr_environment(), but theoretically, an SQLite-backed storr should work too. Thanks for sharing thor.

I am thinking of adding a garbage_collection flag to drake::clean().
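A sketch of the user-supplied cache workflow mentioned above (storr_environment() is in-memory; my_plan and the target name small come from the basic example):

library(drake)
load_basic_example()
cache <- storr::storr_environment()
make(my_plan, cache = cache)  # all targets go to the supplied storr
readd(small, cache = cache)   # read a target back from the same cache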

@richfitz
Member

It's not very fast: it has to read every file that contains hash references (the contents of keys/objects, etc.). That's probably the limiting factor, and unfortunately I don't see a way around it.

But it's not crazy slow either. You might make it something that can be disabled within clean().

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

Exactly what I am thinking.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

As of 29a299f, you can use clean(garbage_collection = TRUE). @kendonB, please let me know if it solves your file quota difficulties. I would also like to know what you think of the speed. The default value of garbage_collection is currently FALSE, but I will set it to TRUE if it is fast enough for your massive project.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

On second thought, you may just want to use the new drake_gc() (effective from 07d871e). You should not need clean() just to run garbage collection. I am still curious about speed, but now that we have drake_gc(), let's keep garbage_collection = FALSE by default in clean().

@kendonB
Contributor Author

kendonB commented Nov 13, 2017

> system.time(drake_gc())
cache /....
   user  system elapsed
 49.299  19.371 282.253

Not super fast, but it's an infrequent operation, so not a big deal. I'm also not sure it makes sense to have garbage collection in clean() (functions should do one thing well).

# GBs in .drake/ after garbage collection
> sum(file.info(list.files("./.drake", all.files = TRUE, full.names = TRUE, recursive = TRUE))$size) / 1024^3
[1] 25.22917
# Files in .drake/ after garbage collection
> sum(length(list.files("./.drake", all.files = TRUE, full.names = TRUE, recursive = TRUE)))
[1] 234441

Unfortunately, this didn't actually solve anything in my case; I guess my builds were recent enough that there wasn't much (or any) junk left behind. I have taken some data off the server to make room in the file count for now.

I presume thor will be a solution with far fewer files once it matures. That sounds like the best long-term solution for drake.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

You can already use custom storrs, so an SQLite-based storr is an option today. For thor, I expect integration to be simple, if not trivial. We can continue the thread, but I am closing the issue. As you say, different storr-like backends are the best solution, and drake has already done its due diligence.

Ordinarily, yes, a function should do one thing well. I think I will keep the garbage_collection argument for now, though. Cleaning is a vague concept, and depending on how you think about it, garbage collection may or may not be part of it. Also, garbage collection in clean() is disabled by default, and the storr interface makes it beautifully simple, so there is not actually much clutter or confusion.

@wlandau-lilly
Collaborator

Also, I will reiterate how lucky I am to have someone test drive drake on a project with 25 GB spread over 234441 files!

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 16, 2017

Update: effective 3a6e613, a bunch of superfluous storr namespaces are eliminated. For new drake projects, this will help protect against hitting restrictions on the number of files. For existing caches, you might be able to do this:

cache <- get_cache()
cache$clear(namespace = "attempts") # Totally harmless.
cache$clear(namespace = "imported") # My tests say you don't need this one.
cache$clear(namespace = "type") # Or this one.
cache$gc() # Garbage collection
# cache$clear(namespace = "readd") # Not actually sure about this one. You might need it for existing caches.

Otherwise, I really like spreading out the data for a target over different namespaces (#129). Namespaces are one of my favorite features of storr, and they help organize, clean up, and future-proof drake's internals (avoiding potential back-compatibility problems going forward). I am sorry that this approach creates lots of tiny files for each target, but as @richfitz mentioned, different storr backends such as RSQLite and the upcoming thor should reduce the number of files for new projects.

@wlandau-lilly
Collaborator

FYI: clean() now has a purge argument so you can remove target-level metadata like build times and error logs.
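For example, combining the flags discussed in this thread:

library(drake)
clean(purge = TRUE, garbage_collection = TRUE)  # drop metadata, reclaim files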

@wlandau-lilly
Collaborator

wlandau-lilly commented Dec 12, 2017

In light of #181, I am considering condensing some namespaces into a single "meta" namespace.

  • build_times
  • commands
  • depends
  • meta
  • mtimes
  • progress

So we would be left with the following target-level namespaces.

  • errors (should not write many files)
  • kernels
  • meta
  • objects

I think that should halve the number of small files, but the change would break back compatibility with the current development version. So I would publish a GitHub release of the current version 4.4.1.9001 and then make the change for 4.4.1.9002.

@wlandau-lilly wlandau-lilly reopened this Dec 12, 2017
@wlandau-lilly
Collaborator

wlandau-lilly commented Dec 12, 2017

I have a solution that is nearly ready to deploy. It dramatically reduces the number of tiny files in the cache, and I predict that drake will continue to run reasonably fast.

I condensed all the target-level namespaces into a single "meta" namespace except for the following:

  • kernels: stores the reproducibly-tracked representation of each target, which could get quite large. It needs to stay its own separate namespace.
  • objects: stores the actual values of the targets, so it also needs to be its own namespace.
  • errors: only used for failed targets, so it should be very small. You can delete the files with get_cache()$clear(namespace = "errors"); drake_gc().
  • progress: usually needs to be cleared at the beginning of every make(), and putting progress logs in the meta namespace would really slow that step down. Use make(..., log_progress = FALSE) to turn off progress logging and prevent this namespace from being populated at all. Clear out progress files with get_cache()$clear(namespace = "progress"); drake_gc(). See the sketch after this list.
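Putting the optional cleanup steps from the list above together (my_plan stands in for your own plan):

library(drake)
cache <- get_cache()
cache$clear(namespace = "errors")    # only populated by failed targets
cache$clear(namespace = "progress")  # safe to clear between makes
drake_gc()                           # remove the newly orphaned files
make(my_plan, log_progress = FALSE)  # skip progress logging next time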

To get this to work, I needed to define "subspaces" of storr namespaces so I could put the content of multiple namespaces into a single file. I wonder if storr could solve that problem somehow for the general case.
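A conceptual sketch of the subspace idea, using plain storr calls rather than drake's actual internals: several per-target fields share one key in one namespace, so they cost one file instead of several.

cache <- drake::get_cache()
cache$set(key = "my_target",
          value = list(command = "...", depends = "...", build_time = 1.23),
          namespace = "meta")
meta <- cache$get("my_target", namespace = "meta")
meta$build_time  # one file on disk now backs several former namespaces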

@wlandau-lilly
Collaborator

I wanted to record progress in stages during run_parallel(), but that does not work for "Makefile" parallelism. The only solution flexible enough is to store the progress of each target separately in the cache.

@wlandau-lilly
Collaborator

To summarize our storr namespaces,

load_basic_example()
make(my_plan)
get_cache()$list_namespaces()
## [1] "config"   "kernels"  "meta"     "objects"  "progress" "session"

The "kernels", "meta", "objects", and "progress" namespaces have key and value files for every target built in a make(). You can turn off "progress" if you want, and "errors" appears if there are any failed targets.
