
Is there a way to disable drake's caching system? #384

Closed
guilhermealles opened this issue May 18, 2018 · 9 comments

@guilhermealles

I am trying to use Drake's parallelism features to speed up a data manipulation workflow. My workflow reads some large .csv files from disk, manipulates them and outputs some optimized .feather files to be used by another application.

The problem is that, by using Drake (even in parallel), the workflow takes considerably longer to complete (twice as long or more) than its sequential version. My hypothesis is that it is taking so long because Drake caches its targets to disk, which is not necessary in the specific case I am working with. Is there a way to disable Drake's caching system?

@wlandau
Member

wlandau commented May 18, 2018

Unfortunately, drake's cache cannot be totally disabled, but there may be workarounds. First, it would help to know some details:

  • The version of drake you are using (and the SHA key of the git commit if you are using the GitHub version). Parallel computing in drake has changed a lot recently.
  • The parallelism and jobs arguments to make().
  • The specs of your hardware.

I suspect drake is reading from disk a lot because each parallel process needs to have the dependencies in memory for the target it builds, and there is even more re-reading for transient workers. In the sequential version, drake keeps objects in memory until no more downstream targets need them.

You might also use custom files for all your targets and make sure the outputs of your commands are light. That way, drake will check timestamps first.
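A minimal sketch of that approach, assuming a hypothetical plan (the file paths and the transformation are illustrative, not from this thread):

```r
library(drake)

# Each command reads its inputs with file_in() and writes its result with
# file_out(), so the target's value in the cache stays tiny and drake can
# lean on file metadata instead of serializing large objects.
plan <- drake_plan(
  cleaned = {
    raw <- read.csv(file_in("data/raw.csv"))
    saveRDS(transform(raw, value = value * 2), file_out("data/cleaned.rds"))
  }
)
```

The command's return value (the path from `saveRDS()`) is what gets cached, so the heavy data frame itself never goes through drake's storage layer.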

@guilhermealles
Author

@wlandau, thanks for your response.

  • The version of drake that I am using is 5.1.2, as tagged by the commit 4db383bfd6b202595458b66b0d10dc15fe9d8b2b.
  • I used the default "mclapply" parallelism and also tried "Makefile", with no success. For jobs I used max_useful_jobs(), which in my case is 8.
  • I am running the workflow on an Intel Xeon E5-2620 v4 (2.10GHz, 8 cores).

From your reply, I understand that it is not possible to keep dependencies in memory between targets for parallel workflows. Is that correct?

@wlandau
Member

wlandau commented May 18, 2018

From looking at the console output of make(jobs = 1, verbose = 4) and make(jobs = 4, verbose = 4), I do see now that drake reads from disk more often than it is supposed to. I thought I fixed that, and I am glad you brought it to my attention. This could be a bug in prune_envir() or its usage, and I will address it.

Until that is fixed, there are a couple things we could try. Are you open to modifying your workflow so that it relies less on the cache? Is it possible to have most of the commands read in CSV files and output feather files? In your case, the more often you have file_out() targets, the less of a problem reads from disk will be.
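For the CSV-to-feather workflow described above, such a plan might look like this (the paths and job count are illustrative):

```r
library(drake)
library(feather)

# Each command streams a CSV straight into a feather file; the large data
# frame is never stored in drake's cache, only the output file's metadata.
plan <- drake_plan(
  trace = {
    dat <- read.csv(file_in("input/trace.csv"))
    write_feather(dat, file_out("output/trace.feather"))
  }
)
make(plan, jobs = 8)
```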

You could also try using an in-memory cache, but I do not generally recommend it for parallel processing because it is not thread safe.

cache <- storr::storr_environment()
make(cache = cache, jobs = 8) # May error, but could be worth a shot.

What is supposed to happen

drake's approach to memory management is implemented in prune_envir(). At various points in the workflow, the environment is "pruned", or tailored to a given set of targets. In pruning, first, everything is unloaded that is not a dependency of the targets or anything downstream. Next, the dependencies of the targets are loaded. In addition, any newly-built targets are assigned to memory. For sequential execution, all this turns out to minimize large reads from disk. I had hoped it would also minimize reads for "mclapply" parallelism in version 5.1.2, and I am concerned that it is so much slower for you.
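The pruning idea can be illustrated with a toy function (a simplified sketch, not drake's actual prune_envir() implementation):

```r
# Keep only the objects that some not-yet-built target still depends on;
# everything else is dropped from the environment to free memory.
prune_envir_sketch <- function(envir, deps_of_remaining_targets) {
  stale <- setdiff(ls(envir), deps_of_remaining_targets)
  rm(list = stale, envir = envir)
  invisible(envir)
}

e <- new.env()
e$a <- 1; e$b <- 2; e$c <- 3
prune_envir_sketch(e, deps_of_remaining_targets = c("a", "c"))
ls(e)  # "a" "c"
```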

Future work

I do realize that drake needs to get better when it comes to memory management across multiple workers. One of the sticking points, especially for persistent workers in the new parallel computing functionality in the GitHub version, is that targets are assigned to the next available worker instead of waiting for a worker that already has more of their dependencies in memory. This is a hard problem, and it will take a long time to solve.

@wlandau
Member

wlandau commented May 18, 2018

@guilhermealles after closer inspection of development drake's behavior, I am more convinced that the slowdown you see is because drake assigns targets to the next available worker rather than the one that already has its dependencies loaded. Sorry, but this isn't likely to be solved any time soon. We could work on making your workflow read from disk less often. On the other hand, restructuring your work will probably make it rely more on custom output files, in which case you may be better off with a tool like snakemake for the time being.

@wlandau
Member

wlandau commented May 18, 2018

And back to your original question, unfortunately, disabling drake's cache is just not possible. Eventually, I want to separate the high-performance computing piece from the reproducibility piece, but where that effort is concerned, I am currently stuck.

@wlandau
Member

wlandau commented Oct 27, 2018

Ah, I now see how this could be related to StarVZ. The development version of drake now supports a customizable "hasty mode": https://ropenscilabs.github.io/drake-manual/hpc.html#hasty-mode

@wlandau
Member

wlandau commented Oct 28, 2018

To clarify, "hasty mode" disables the caching system, but it does not skip up-to-date targets unless you come up with your own system for doing so (maybe in the hasty_build argument to make()).
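In other words (a sketch; `plan` is assumed to be an existing drake plan):

```r
library(drake)

# Hasty mode: build everything, cache nothing. Targets are never skipped,
# because there is no cache to compare against; any skipping logic would
# have to live in a user-supplied build function via hasty_build.
make(plan, parallelism = "hasty")
```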

@wlandau
Member

wlandau commented Oct 28, 2018

Another approach: maybe try make(parallelism = "clustermq") (which can skip up-to-date targets) and use custom file_out() files for all your targets.
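A sketch of that setup (the scheduler choice and job count are illustrative; `plan` is assumed to be a plan whose commands use file_out()):

```r
library(drake)

# clustermq workers are persistent, so dispatching targets is cheap.
# "multicore" runs the workers on the local machine; on a cluster you
# would set a scheduler such as "slurm" instead.
options(clustermq.scheduler = "multicore")
make(plan, parallelism = "clustermq", jobs = 8)
```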

@nettoyoussef mentioned this issue Jun 13, 2019
@wlandau
Member

wlandau commented Feb 22, 2020

drake has gotten a lot faster since we last spoke, especially with https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets. But a lot of the speed gains on the user's side come from choosing good targets: https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets. With too many targets, overhead slows things down; with too few, repeated make()s save less time by skipping targets.
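For example, a data-frame target can opt into a specialized storage format (clean_my_data() and the path are hypothetical):

```r
library(drake)

plan <- drake_plan(
  big_data = target(
    clean_my_data(file_in("data/raw.csv")),  # clean_my_data() is hypothetical
    format = "fst"  # store this target with the fast fst serializer
  )
)
```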
