
Is there a way to disable drake's caching system? #384

Closed
guilhermealles opened this issue May 18, 2018 · 9 comments

@guilhermealles

I am trying to use Drake's parallelism features to speed up a data manipulation workflow. My workflow reads some large .csv files from disk, manipulates them and outputs some optimized .feather files to be used by another application.

The problem is that, by using Drake (even in parallel), the workflow takes considerably longer to complete (twice as long or more) than its sequential version. My hypothesis is that it is taking so long because Drake caches its targets to disk, which is not necessary in the specific case I am working with. Is there a way to disable Drake's caching system?

@wlandau
Member

wlandau commented May 18, 2018

Unfortunately, drake's cache cannot be totally disabled, but there may be workarounds. First, it would help to know some details:

  • The version of drake you are using (and the SHA key of the git commit if you are using the GitHub version). Parallel computing in drake has changed a lot recently.
  • The parallelism and jobs arguments to make().
  • The specs of your hardware.

I suspect drake is reading from disk a lot because each parallel process needs to have the dependencies in memory for the target it builds, and there is even more re-reading for transient workers. In the sequential version, drake keeps objects in memory until no more downstream targets need them.

You might also use custom files for all your targets and make sure the outputs of your commands are light. That way, drake will check timestamps first.
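A minimal sketch of that approach, assuming a hypothetical plan (the file paths and the transformation are illustrative, not from this thread):

```r
library(drake)

# Each command reads its inputs with file_in() and writes its result with
# file_out(), so the target's value in the cache stays tiny and drake can
# lean on file metadata instead of serializing large objects.
plan <- drake_plan(
  cleaned = {
    raw <- read.csv(file_in("data/raw.csv"))
    saveRDS(transform(raw, value = value * 2), file_out("data/cleaned.rds"))
  }
)
```

The command's return value (the path from `saveRDS()`) is what gets cached, so the heavy data frame itself never goes through drake's storage layer.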

@guilhermealles
Author

@wlandau, thanks for your response.

  • The version of drake that I am using is 5.1.2, as tagged by the commit 4db383bfd6b202595458b66b0d10dc15fe9d8b2b.
  • I used the default "mclapply" parallelism and also tried "Makefile", with no success. For jobs I used max_useful_jobs(), which in my case is 8.
  • I am running the workflow on an Intel Xeon E5-2620 v4 (2.10GHz, 8 cores).

From your reply, I understand that it is not possible to keep dependencies in memory between targets for parallel workflows. Is that correct?

@wlandau
Member

wlandau commented May 18, 2018

From looking at the console output of make(jobs = 1, verbose = 4) and make(jobs = 4, verbose = 4), I do see now that drake reads from disk more often than it is supposed to. I thought I fixed that, and I am glad you brought it to my attention. This could be a bug in prune_envir() or its usage, and I will address it.

Until that is fixed, there are a couple things we could try. Are you open to modifying your workflow so that it relies less on the cache? Is it possible to have most of the commands read in CSV files and output feather files? In your case, the more often you have file_out() targets, the less of a problem reads from disk will be.
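For the CSV-to-feather workflow described above, such a plan might look like this (the paths and job count are illustrative):

```r
library(drake)
library(feather)

# Each command streams a CSV straight into a feather file; the large data
# frame is never stored in drake's cache, only the output file's metadata.
plan <- drake_plan(
  trace = {
    dat <- read.csv(file_in("input/trace.csv"))
    write_feather(dat, file_out("output/trace.feather"))
  }
)
make(plan, jobs = 8)
```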

You could also try using an in-memory cache, but I do not generally recommend it for parallel processing because it is not thread safe.

cache <- storr::storr_environment()
make(cache = cache, jobs = 8) # May error, but could be worth a shot.

What is supposed to happen

drake's approach to memory management is implemented in prune_envir(). At various points in the workflow, the environment is "pruned", or tailored to a given set of targets. In pruning, first, everything is unloaded that is not a dependency of the targets or anything downstream. Next, the dependencies of the targets are loaded. In addition, any newly-built targets are assigned to memory. For sequential execution, all this turns out to minimize large reads from disk. I had hoped it would also minimize reads for "mclapply" parallelism in version 5.1.2, and I am concerned that it is so much slower for you.
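The pruning idea can be illustrated with a toy function (a simplified sketch, not drake's actual prune_envir() implementation):

```r
# Keep only the objects that some not-yet-built target still depends on;
# everything else is dropped from the environment to free memory.
prune_envir_sketch <- function(envir, deps_of_remaining_targets) {
  stale <- setdiff(ls(envir), deps_of_remaining_targets)
  rm(list = stale, envir = envir)
  invisible(envir)
}

e <- new.env()
e$a <- 1; e$b <- 2; e$c <- 3
prune_envir_sketch(e, deps_of_remaining_targets = c("a", "c"))
ls(e)  # "a" "c"
```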

Future work

I do realize that drake needs to get better when it comes to memory management across multiple workers. One of the sticking points, especially for persistent workers in the new parallel computing functionality in the GitHub version, is that targets are assigned to the next available worker instead of waiting for a worker that already has more of their dependencies in memory. This is a hard problem, and it will take a long time to solve.

@wlandau
Member

wlandau commented May 18, 2018

@guilhermealles after closer inspection of development drake's behavior, I am more convinced that the slowdown you see is because drake assigns targets to the next available worker rather than the one that already has its dependencies loaded. Sorry, but this isn't likely to be solved any time soon. We could work on making your workflow read from disk less often. On the other hand, restructuring your work will probably make it rely more on custom output files, in which case you may be better off with a tool like snakemake for the time being.

@wlandau
Member

wlandau commented May 18, 2018

And back to your original question, unfortunately, disabling drake's cache is just not possible. Eventually, I want to separate the high-performance computing piece from the reproducibility piece, but where that effort is concerned, I am currently stuck.

@wlandau
Member

wlandau commented Oct 27, 2018

Ah, I now see how this could be related to StarVZ. The development version of drake now supports a customizable "hasty mode": https://ropenscilabs.github.io/drake-manual/hpc.html#hasty-mode

@wlandau
Member

wlandau commented Oct 28, 2018

To clarify, "hasty mode" disables the caching system, but it does not skip up-to-date targets unless you come up with your own system for doing so (maybe in the hasty_build argument to make()).
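In other words (a sketch; `plan` is assumed to be an existing drake plan):

```r
library(drake)

# Hasty mode: build everything, cache nothing. Targets are never skipped,
# because there is no cache to compare against; any skipping logic would
# have to live in a user-supplied build function via hasty_build.
make(plan, parallelism = "hasty")
```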

@wlandau
Member

wlandau commented Oct 28, 2018

Another approach: maybe try make(parallelism = "clustermq") (which can skip up-to-date targets) and use custom file_out() files for all your targets.
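A sketch of that setup (the scheduler choice and job count are illustrative; `plan` is assumed to be a plan whose commands use file_out()):

```r
library(drake)

# clustermq workers are persistent, so dispatching targets is cheap.
# "multicore" runs the workers on the local machine; on a cluster you
# would set a scheduler such as "slurm" instead.
options(clustermq.scheduler = "multicore")
make(plan, parallelism = "clustermq", jobs = 8)
```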

@nettoyoussef mentioned this issue Jun 13, 2019
@wlandau
Member

wlandau commented Feb 22, 2020

drake has gotten a lot faster since we last spoke, especially with https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets. But a lot of the speed gains on the user's side come from choosing good targets: https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets. With too many targets, overhead slows things down; with too few, repeated make()s save less time by skipping targets.
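For example, a data-frame target can opt into a specialized storage format (clean_my_data() and the path are hypothetical):

```r
library(drake)

plan <- drake_plan(
  big_data = target(
    clean_my_data(file_in("data/raw.csv")),  # clean_my_data() is hypothetical
    format = "fst"  # store this target with the fast fst serializer
  )
)
```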
