
Huge number of files in .drake/keys .drake/data #154

Closed
kendonB opened this issue Nov 13, 2017 · 19 comments

@kendonB
Contributor

kendonB commented Nov 13, 2017

I'm running up against my file count quota. Is there any way to consolidate these? I used clean to remove unwanted targets already.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

Edit: TL;DR

  • drake_gc(): garbage collection with no cleaning.
  • clean(garbage_collection = TRUE): cleaning with garbage collection.
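In code, a minimal usage sketch (assuming a default .drake/ cache in the working directory):

library(drake)
drake_gc()                        # garbage collection only
clean(garbage_collection = TRUE)  # remove targets, then garbage-collect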

Original post

Ultimately, I think it will be up to you to define bigger, fewer targets. Unfortunately, there is no way to consolidate targets post-hoc.

What kind of quota is it, and is it imposed by the operating system or by storr? How many targets do you have, and what is the cap on the number of files?

Is the quota on a per-folder basis? If so, you could cleverly use multiple storr/drake caches and have targets from one become imports for the other. The solution is similar if you have caches spread over multiple drives.

To consolidate targets, the closest thing drake has is gather_plan(), but that assumes the individual targets are already there.
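For example, a sketch with a toy two-target plan (the argument names and the list-style gathering command are assumed from drake's documentation):

library(drake)
plan <- data.frame(
  target = c("a", "b"),
  command = c("1", "2"),
  stringsAsFactors = FALSE
)
gathered <- gather_plan(plan, target = "combined")
# gathered has one row whose command is list(a = a, b = b),
# so "combined" depends on every target in the original plan.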

#129 probably exacerbated the problem, but it was the right design choice, and I would rather not go back.

@wlandau-lilly
Collaborator

I also recommend posting on the storr issue tracker. @richfitz may have better advice.

@wlandau-lilly
Collaborator

Can you run quotacheck --all?

@richfitz
Member

Just to garbage collect the underlying storr (#118 and links within). If you really want to restrict the number of files, you can move to a non-rds based storr (this will require changes in drake but these are restricted to cache initialisation). There is support for SQLite in the current storr release, but this is significantly slower than rds (factor of x10 to x100 from memory). I have a new package (thor) which will be about as fast as rds, perhaps a little faster.
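For reference, a sketch of a SQLite-backed storr via storr's DBI driver (storr_dbi() and its table-name arguments are assumed from the storr documentation; the table names themselves are arbitrary):

library(storr)
con <- DBI::dbConnect(RSQLite::SQLite(), "cache.sqlite")
st <- storr_dbi(tbl_data = "storr_data", tbl_keys = "storr_keys", con = con)
st$set("mykey", mtcars)            # one SQLite file instead of many rds files
identical(st$get("mykey"), mtcars)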

@wlandau-lilly
Collaborator

@kendonB try drake::get_cache()$gc(). @richfitz how fast is gc()? I am thinking about adding it to drake::clean().

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

By the way: drake does support user-supplied non-rds storr caches. I have only tested with storr_rds() and storr_environment(), but theoretically, an SQLite-backed storr should work too. Thanks for sharing thor.

I am thinking of adding a garbage_collection flag to drake::clean().
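A sketch of the user-supplied cache workflow mentioned above (storr_environment() is in-memory; my_plan and the target name small come from the basic example):

library(drake)
load_basic_example()
cache <- storr::storr_environment()
make(my_plan, cache = cache)  # all targets go to the supplied storr
readd(small, cache = cache)   # read a target back from the same cache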

@richfitz
Member

It's not very fast: it has to read every file that contains hash references (the contents of keys/objects, etc.). That's probably the limiting factor, and unfortunately I don't see a way around it.

But it's not crazy slow either. You might make it something that can be disabled within clean().

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

Exactly what I am thinking.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

As of 29a299f, you can use clean(garbage_collection = TRUE). @kendonB, please let me know if it solves your file quota difficulties. I would also like to know what you think of the speed. The default value of garbage_collection is currently FALSE, but I will set it to TRUE if it is fast enough for your massive project.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

On second thought, you may just want to use the new drake_gc() (effective from 07d871e). You should not need clean() just to run garbage collection. I am still curious about speed, but now that we have drake_gc(), let's keep garbage_collection = FALSE by default in clean().

@kendonB
Contributor Author

kendonB commented Nov 13, 2017

> system.time(drake_gc())
cache /....
   user  system elapsed
 49.299  19.371 282.253

Not super fast, but it's an infrequent operation, so not a big deal. I'm also not sure it makes sense to have garbage collection in clean() (functions should do one thing well).

# GBs in .drake/ after garbage collection
> sum(file.info(list.files("./.drake", all.files = TRUE, full.names = TRUE, recursive = TRUE))$size) / 1024^3
[1] 25.22917
# Files in .drake/ after garbage collection
> sum(length(list.files("./.drake", all.files = TRUE, full.names = TRUE, recursive = TRUE)))
[1] 234441

Unfortunately, this didn't actually solve anything in my case; I guess my builds were recent enough that there wasn't much (or any) junk left behind. I have taken some data off the server to make room in the file count for now.

I presume thor will be a solution with far fewer files once it matures. That sounds like the best long-term solution for drake.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 13, 2017

You can already use custom storrs, so an SQLite-based storr is an option today. For thor, I expect integration to be simple, if not trivial. We can continue the thread, but I am closing the issue. As you say, different storr-like backends are the best solution, and drake has already done its due diligence.

Ordinarily, yes, a function should do one thing well. I think I will keep the garbage_collection argument for now, though. Cleaning is a vague concept, and depending on how you think about it, garbage collection may or may not be part of it. Also, garbage collection in clean() is disabled by default, and the storr interface makes it beautifully simple, so there is not actually much clutter or confusion.

@wlandau-lilly
Collaborator

Also, I will reiterate how lucky I am to have someone test drive drake on a project with 25 GB spread over 234441 files!

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 16, 2017

Update: effective 3a6e613, a bunch of superfluous storr namespaces are eliminated. For new drake projects, this will help protect against hitting restrictions on the number of files. For existing caches, you might be able to do this:

cache <- get_cache()
cache$clear(namespace = "attempts") # Totally harmless.
cache$clear(namespace = "imported") # My tests say you don't need this one.
cache$clear(namespace = "type") # Or this one.
cache$gc() # Garbage collection
# cache$clear(namespace = "readd") # Not actually sure about this one. You might need it for existing caches.

Otherwise, I really like spreading out the data for a target over different namespaces (#129). Namespaces are one of my favorite features of storr, and they help organize, clean up, and future-proof drake's internals (avoiding potential back-compatibility problems going forward). I am sorry that this approach creates lots of tiny files for each target, but as @richfitz mentioned, different storr backends such as RSQLite and the upcoming thor should reduce the number of files for new projects.

@wlandau-lilly
Collaborator

FYI: clean() now has a purge argument so you can remove target-level metadata like build times and error logs.
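For example, combining the flags discussed in this thread:

library(drake)
clean(purge = TRUE, garbage_collection = TRUE)  # drop metadata, reclaim files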

@wlandau-lilly
Collaborator

wlandau-lilly commented Dec 12, 2017

In light of #181, I am considering condensing some namespaces into a single "meta" namespace.

  • build_times
  • commands
  • depends
  • meta
  • mtimes
  • progress

So we would be left with the following target-level namespaces.

  • errors (should not write many files)
  • kernels
  • meta
  • objects

I think that should halve the number of small files, but the change would break back compatibility with the current development version. So I would publish a GitHub release of the current version 4.4.1.9001 and then make the change for 4.4.1.9002.

@wlandau-lilly wlandau-lilly reopened this Dec 12, 2017
@wlandau-lilly
Collaborator

wlandau-lilly commented Dec 12, 2017

I have a solution that is nearly ready to deploy. It dramatically reduces the number of tiny files in the cache, and I predict that drake will continue to run reasonably fast.

I condensed all the target-level namespaces into a single "meta" namespace except for the following:

  • kernels: stores the reproducibly-tracked representation of each target, which could get quite large. It needs to stay its own separate namespace.
  • objects: stores the actual values of the targets, so it also needs to be its own namespace.
  • errors: only used for failed targets, so it should be very small. You can delete the files with get_cache()$clear(namespace = "errors"); drake_gc().
  • progress: usually needs to be cleared at the beginning of every make(), and putting progress logs in the meta namespace would really slow that step down. Use make(..., log_progress = FALSE) to turn off progress logging and prevent this namespace from being populated at all. Clear out progress files with get_cache()$clear(namespace = "progress"); drake_gc(). See the sketch after this list.
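Putting the optional cleanup steps from the list above together (my_plan stands in for your own plan):

library(drake)
cache <- get_cache()
cache$clear(namespace = "errors")    # only populated by failed targets
cache$clear(namespace = "progress")  # safe to clear between makes
drake_gc()                           # remove the newly orphaned files
make(my_plan, log_progress = FALSE)  # skip progress logging next time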

To get this to work, I needed to define "subspaces" of storr namespaces so I could put the content of multiple namespaces into a single file. I wonder if storr could solve that problem somehow for the general case.
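A conceptual sketch of the subspace idea, using plain storr calls rather than drake's actual internals: several per-target fields share one key in one namespace, so they cost one file instead of several.

cache <- drake::get_cache()
cache$set(key = "my_target",
          value = list(command = "...", depends = "...", build_time = 1.23),
          namespace = "meta")
meta <- cache$get("my_target", namespace = "meta")
meta$build_time  # one file on disk now backs several former namespaces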

@wlandau-lilly
Collaborator

I wanted to record progress in stages during run_parallel(), but that does not work for "Makefile" parallelism. The only solution flexible enough is to store the progress of each target separately in the cache.

@wlandau-lilly
Collaborator

To summarize our storr namespaces,

load_basic_example()
make(my_plan)
get_cache()$list_namespaces()
## [1] "config"   "kernels"  "meta"     "objects"  "progress" "session"

The "kernels", "meta", "objects", and "progress" namespaces have key and value files for every target built in a make(). You can turn off "progress" if you want, and "errors" appears if there are any failed targets.
