
Experiment with clustermq for persistent workers #431

Closed
wlandau wants to merge 2 commits into master from clustermq_persistent

Conversation

@wlandau (Member) commented Jun 27, 2018

Summary

This PR is for discussion only. Do not merge.

Implementing persistent clustermq workers was relatively easy:

run_clustermq_persistent <- function(config){
  # Fail early if the clustermq package is not installed.
  if (!requireNamespace("clustermq")){
    drake_error(
      "drake::make(parallelism = \"clustermq_persistent\") requires ",
      "the clustermq package: https://github.com/mschubert/clustermq.",
      config = config
    )
  }
  # Set up the file-system cache and worker bookkeeping for distributed runs.
  prepare_distributed(config = config)
  mc_init_worker_cache(config)
  console_persistent_workers(config)
  path <- normalizePath(config$cache_path, winslash = "/")
  # Find the Rscript executable so the master loop can run as a background process.
  rscript <- grep(
    "Rscript",
    dir(R.home("bin"), full.names = TRUE),
    value = TRUE
  )
  tmp <- system2(
    rscript,
    shQuote(c("-e", paste0("drake::remote_master('", path, "')"))),
    wait = FALSE
  )
  # Launch one persistent worker per job. The worker id is the iterated
  # argument; cache_path is the same for every worker, so it belongs in
  # `const` rather than as a second iterated argument.
  clustermq::Q(
    worker = mc_worker_id(seq_len(config$jobs)),
    fun = function(worker, cache_path){
      drake::remote_worker(worker = worker, cache_path = cache_path)
    },
    const = list(cache_path = config$cache_path),
    n_jobs = config$jobs
  )
  finish_distributed(config = config)
}
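
For reference, the backend would be selected like any other drake backend. The sketch below is only illustrative: the "clustermq_persistent" parallelism name is taken from the error message above, and the plan is a throwaway placeholder.

library(drake)
# A toy plan; any plan would do.
plan <- drake_plan(
  small = rnorm(100),
  result = mean(small)
)
# Select the experimental backend from this PR and request two workers.
make(plan, parallelism = "clustermq_persistent", jobs = 2)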

Workers start up quickly, but, just as I suspected, execution still suffers from a lot of overhead, probably at the target level. Transient workers, with caching on the master process, are the best way to use clustermq.
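
To make that contrast concrete, here is a rough sketch of the transient-worker idea with all caching kept on the master: the master resolves a target's dependencies from its local cache, ships only the command and its inputs to a short-lived worker, and stores the returned value itself, so workers never touch the file system. build_on_worker() is a hypothetical helper written for illustration; it is not part of drake or clustermq.

# Hypothetical illustration: the worker only evaluates the command;
# it never reads or writes the cache.
build_on_worker <- function(command, deps, n_jobs = 1){
  out <- clustermq::Q(
    fun = function(command, deps){
      eval(parse(text = command), envir = list2env(deps))
    },
    command = command,         # iterated argument (a single command here)
    const = list(deps = deps), # dependencies shipped as constants
    n_jobs = n_jobs
  )
  out[[1]]
}

# Local test, e.g. with options(clustermq.scheduler = "multicore"):
# build_on_worker("mean(x)", deps = list(x = rnorm(100)))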

Related GitHub issues

Checklist

  • I have read drake's code of conduct, and I agree to follow its rules.
  • I have read the guidelines for contributing.
  • I have listed any substantial changes in the development news.
  • I have added testthat unit tests to tests/testthat to confirm that any new features or functionality work correctly.
  • I have tested this pull request locally with devtools::check().
  • This pull request is ready for review.
  • I think this pull request is ready to merge.

drake has so much overhead of its own on the cluster that persistent clustermq workers are not worth it. Transient workers with caching on master would be the way to go.
@codecov-io

Codecov Report

Merging #431 into master will decrease coverage by 0.53%.
The diff coverage is 12.12%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #431      +/-   ##
==========================================
- Coverage     100%   99.46%   -0.54%     
==========================================
  Files          66       67       +1     
  Lines        5349     5379      +30     
==========================================
+ Hits         5349     5350       +1     
- Misses          0       29      +29
Impacted Files              Coverage Δ
R/clustermq_persistent.R    0% <0%> (ø)
R/parallel_ui.R             100% <100%> (ø) ⬆️
R/future_lapply.R           100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8948b17...c09644d. Read the comment docs.

@lintr-bot

inst/examples/sge_future/run.R:13:3: style: Commented code should be removed.

# make(my_plan, parallelism = "future_lapply", jobs = 4) # persistent workers
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@wlandau (Member, Author) commented Jun 30, 2018

This solution still relies too heavily on the file system. A clustermq backend should maintain a common pool of non-blocking, persistent, refreshable workers (spawned with clustermq::workers()) to which the master can send targets as they become ready. Ref: mschubert/clustermq#86 (comment).
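
For what it's worth, here is roughly what that could look like with clustermq's reusable-worker interface as I understand it (clustermq::workers() plus the workers argument of clustermq::Q()). This is a sketch only: Q() still blocks on each wave of ready targets, unlike the fully non-blocking design the linked issue describes, and next_ready_targets() and build_target() are hypothetical stand-ins for drake internals.

# Sketch only: one persistent pool, reused across successive waves of
# ready targets. next_ready_targets() and build_target() are placeholders.
pool <- clustermq::workers(n_jobs = 4)
repeat {
  targets <- next_ready_targets(config)  # hypothetical: targets whose deps are built
  if (!length(targets)) break
  clustermq::Q(
    fun = function(target) build_target(target),  # hypothetical builder
    target = targets,
    workers = pool  # reuse the same persistent pool every time
  )
}
pool$cleanup()  # shut the pool down; method name per my reading of clustermq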

@wlandau wlandau closed this Jun 30, 2018
@wlandau wlandau deleted the clustermq_persistent branch June 30, 2018 00:47