
Experiment with clustermq for persistent workers #431

Closed
wlandau wants to merge 2 commits into master from clustermq_persistent

Conversation

@wlandau (Member) commented Jun 27, 2018

Summary

This PR is for discussion only. Do not merge.

Implementing persistent clustermq workers was relatively easy:

run_clustermq_persistent <- function(config){
  # Fail early if the clustermq package is not installed.
  if (!requireNamespace("clustermq")){
    drake_error(
      "drake::make(parallelism = \"clustermq_persistent\") requires ",
      "the clustermq package: https://github.com/mschubert/clustermq.",
      config = config
    )
  }
  # Set up the file-system cache and worker bookkeeping for distributed runs.
  prepare_distributed(config = config)
  mc_init_worker_cache(config)
  console_persistent_workers(config)
  path <- normalizePath(config$cache_path, winslash = "/")
  # Find the Rscript executable so the master loop can run as a background process.
  rscript <- grep(
    "Rscript",
    dir(R.home("bin"), full.names = TRUE),
    value = TRUE
  )
  tmp <- system2(
    rscript,
    shQuote(c("-e", paste0("drake::remote_master('", path, "')"))),
    wait = FALSE
  )
  # Launch one persistent worker per job. The worker id is the iterated
  # argument; cache_path is the same for every worker, so it belongs in
  # `const` rather than as a second iterated argument.
  clustermq::Q(
    worker = mc_worker_id(seq_len(config$jobs)),
    fun = function(worker, cache_path){
      drake::remote_worker(worker = worker, cache_path = cache_path)
    },
    const = list(cache_path = config$cache_path),
    n_jobs = config$jobs
  )
  finish_distributed(config = config)
}
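
For reference, the backend would be selected like any other drake backend. The sketch below is only illustrative: the "clustermq_persistent" parallelism name is taken from the error message above, and the plan is a throwaway placeholder.

library(drake)
# A toy plan; any plan would do.
plan <- drake_plan(
  small = rnorm(100),
  result = mean(small)
)
# Select the experimental backend from this PR and request two workers.
make(plan, parallelism = "clustermq_persistent", jobs = 2)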

Workers start up quickly, but, just as I suspected, execution still suffers from a lot of overhead, probably at the target level. Transient workers, with caching on the master process, are the best way to use clustermq.
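
To make that contrast concrete, here is a rough sketch of the transient-worker idea with all caching kept on the master: the master resolves a target's dependencies from its local cache, ships only the command and its inputs to a short-lived worker, and stores the returned value itself, so workers never touch the file system. build_on_worker() is a hypothetical helper written for illustration; it is not part of drake or clustermq.

# Hypothetical illustration: the worker only evaluates the command;
# it never reads or writes the cache.
build_on_worker <- function(command, deps, n_jobs = 1){
  out <- clustermq::Q(
    fun = function(command, deps){
      eval(parse(text = command), envir = list2env(deps))
    },
    command = command,         # iterated argument (a single command here)
    const = list(deps = deps), # dependencies shipped as constants
    n_jobs = n_jobs
  )
  out[[1]]
}

# Local test, e.g. with options(clustermq.scheduler = "multicore"):
# build_on_worker("mean(x)", deps = list(x = rnorm(100)))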

Related GitHub issues

Checklist

  • I have read drake's code of conduct, and I agree to follow its rules.
  • I have read the guidelines for contributing.
  • I have listed any substantial changes in the development news.
  • I have added testthat unit tests to tests/testthat to confirm that any new features or functionality work correctly.
  • I have tested this pull request locally with devtools::check().
  • This pull request is ready for review.
  • I think this pull request is ready to merge.

drake has so much overhead of its own on the cluster that persistent clustermq workers are not worth it. Transient workers with caching on master would be the way to go.
@codecov-io

Codecov Report

Merging #431 into master will decrease coverage by 0.53%.
The diff coverage is 12.12%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #431      +/-   ##
==========================================
- Coverage     100%   99.46%   -0.54%     
==========================================
  Files          66       67       +1     
  Lines        5349     5379      +30     
==========================================
+ Hits         5349     5350       +1     
- Misses          0       29      +29
Impacted Files              Coverage Δ
R/clustermq_persistent.R    0% <0%> (ø)
R/parallel_ui.R             100% <100%> (ø) ⬆️
R/future_lapply.R           100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8948b17...c09644d. Read the comment docs.

@lintr-bot

inst/examples/sge_future/run.R:13:3: style: Commented code should be removed.

# make(my_plan, parallelism = "future_lapply", jobs = 4) # persistent workers
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@wlandau (Member, Author) commented Jun 30, 2018

This solution still relies too heavily on the file system. A clustermq backend should maintain a common pool of non-blocking, persistent, refreshable workers (spawned with clustermq::workers()) to which the master can send targets as they become ready. Ref: mschubert/clustermq#86 (comment).
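
For what it's worth, here is roughly what that could look like with clustermq's reusable-worker interface as I understand it (clustermq::workers() plus the workers argument of clustermq::Q()). This is a sketch only: Q() still blocks on each wave of ready targets, unlike the fully non-blocking design the linked issue describes, and next_ready_targets() and build_target() are hypothetical stand-ins for drake internals.

# Sketch only: one persistent pool, reused across successive waves of
# ready targets. next_ready_targets() and build_target() are placeholders.
pool <- clustermq::workers(n_jobs = 4)
repeat {
  targets <- next_ready_targets(config)  # hypothetical: targets whose deps are built
  if (!length(targets)) break
  clustermq::Q(
    fun = function(target) build_target(target),  # hypothetical builder
    target = targets,
    workers = pool  # reuse the same persistent pool every time
  )
}
pool$cleanup()  # shut the pool down; method name per my reading of clustermq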

@wlandau wlandau closed this Jun 30, 2018
@wlandau wlandau deleted the clustermq_persistent branch June 30, 2018 00:47