Issues getting SLURM + future to work #1359

Closed
6 tasks done
pat-s opened this issue Feb 27, 2021 · 3 comments
Comments

@pat-s
Member

pat-s commented Feb 27, 2021

Prework

  • Read and agree to the code of conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • Readable: format your code according to the tidyverse style guide.

Description

I was trying out future (batchtools) + SLURM to play around with transient workers in contrast to clustermq + SLURM.

I got a bit confused on the following points:

  • drake_config(template = list()) is only valid for clustermq (took me hours to find this :/), but it is stated in the help page, so that is my failure 😆
  • It seems for future (batchtools) + SLURM I need to add a resources column to every target? Otherwise this target will run sequentially? I am asking this because I could not get it running yet and therefore could not observe the behavior. If this applies, how much memory is devoted to every target?
  • I could only find one project on GitHub which uses _drake.R and has a future.batchtools template here. I did not see how the resources for the individual workers were specified though 🤔
  • When using the default slurm batchtools template in {drake}, I am running into the following error when invoking r_make(). Is the template still valid? Inspecting that part of the template, I do not actually see a parsing error, so I am wondering why this error occurs.
  • My specific use case is a large dynamic target (with 174 subtargets). Each subtarget should get 4 cores and 3500 MB memory. I do this for quite some time with clustermq + Slurm already but wanted to see how the future backend feels here.
r_make()
Starting parallelization in mode=multicore with cpus=4.
dynamic benchmark_no_models_new_buffer2
subtarget benchmark_no_models_new_buffer2_6a312eeb
Error : Error brewing template: Error in parse(text = code, srcfile = NULL) : 18:42: unexpected ')'
17: .brew.cat(20,22)
18: cat( if (!is.null(resources$walltime)) { )

Maybe you can still help with some pointers getting me running here - I might be missing something obvious 🤔

Reprex

The same issue arises when I try to use the drake+slurm+batchtools examples with _drake.R:

library(future.batchtools)
library(drake)

# Create the template file. You may have to modify it.
drake_hpc_template_file("slurm_batchtools.tmpl")

# Use future::plan(multicore) instead for a dry run.
future::plan(batchtools_slurm, template = "slurm_batchtools.tmpl")

load_mtcars_example()
drake_config(my_plan, parallelism = "future", jobs = 4)
@wlandau
Member

wlandau commented Feb 27, 2021

It seems for future (batchtools) + SLURM I need to add a resources column to every target? Otherwise this target will run sequentially? I am asking this because I could not get it running yet and therefore could not observe the behavior. If this applies, how much memory is devoted to every target?

You could give some targets a resources element of list() to defer to the defaults of the template file. The memory is controlled here:

<%= if (!is.null(resources$memory)) { %>
#SBATCH --mem-per-cpu=<%= resources$memory %>
<%= } %>

I am not sure what the default memory would be without this.
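As a rough sketch of what that could look like in a plan: the target names and commands below are made up for illustration, and `ncpus` is an assumed field that only has an effect if the template file actually reads `resources$ncpus`. Only `memory` and `walltime` appear in the template excerpts in this thread.

```r
library(drake)

plan <- drake_plan(
  # Empty list: defer to whatever defaults the template file defines.
  small = target(
    summary(mtcars),
    resources = list()
  ),
  # Field names must match what the template reads, e.g. resources$memory.
  big = target(
    lm(mpg ~ wt, data = mtcars),
    resources = list(memory = 3500, walltime = 60, ncpus = 4) # ncpus: assumption
  )
)

# The per-target settings are stored in a `resources` column of the plan.
plan$resources

# Not run here: with a future.batchtools plan already set, something like
# make(plan, parallelism = "future", jobs = 4)
```

Whether a target without any resources entry is handled differently from one with `list()` is not something this sketch demonstrates.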

By the way, targets::tar_option_set() has a resources argument where you can set defaults for these brew patterns, and you can set non-default resource configurations with the resources argument of tar_target(). I think this is less awkward than a column of a drake_plan() data frame.
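For comparison, a hedged sketch of the targets side. It assumes a more recent targets release, where (as far as I know) resources are wrapped in `tar_resources()` and `tar_resources_future()`; at the time of this thread a plain named list was passed instead. The target name, command, and memory values are illustrative only.

```r
library(targets)

# Pipeline-wide default: fills resources$memory in the brew template.
tar_option_set(
  resources = tar_resources(
    future = tar_resources_future(resources = list(memory = 3500))
  )
)

# A single target that overrides the default with its own settings.
big_fit <- tar_target(
  big_fit,
  lm(mpg ~ wt, data = mtcars),
  resources = tar_resources(
    future = tar_resources_future(resources = list(memory = 8000))
  )
)
```

In a real project both calls would live in `_targets.R`, with the target returned as part of the pipeline list.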

When using the default slurm batchtools template in {drake}, I am running into the following error when invoking r_make(). Is the template still valid? Inspecting that part of the template, I do not actually see a parsing error, so I am wondering why this error occurs.
Error : Error brewing template: Error in parse(text = code, srcfile = NULL) : 18:42: unexpected ')'
17: .brew.cat(20,22)
18: cat( if (!is.null(resources$walltime)) { )

Turns out these kinds of errors are reproducible in brew alone.

library(brew)
library(drake)
drake_hpc_template_file("slurm_batchtools.tmpl", to = tempdir())
path <- file.path(tempdir(), "slurm_batchtools.tmpl")
log.file <- "x"
job.name <- "y"
uri <- "uri"
resources <- list(walltime = 60)
brew(file = path)
#> Error in parse(text = code, srcfile = NULL): 18:42: unexpected ')'
#> 17: .brew.cat(22,24)
#> 18: cat( if (!is.null(resources$walltime)) { )
#>                                              ^

Created on 2021-02-27 by the reprex package (v1.0.0)

Maybe I just need to update the SLURM template file.
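The `cat( if (...) { )` line in the error hints at the likely cause: brew turns `<%= ... %>` into a `cat()` call, so control flow such as `if (...) {` cannot go inside it, whereas `<% ... %>` runs code as-is. The sketch below reproduces that difference on a three-line template. It is an assumption about the cause, not a description of the change made to the template in drake, and `render()` is a hypothetical helper written only for this example.

```r
library(brew)

# Hypothetical helper: brew a template into a temporary file and return the
# rendered lines. `resources` is exposed to the template through `envir`.
render <- function(template, resources) {
  out <- tempfile(fileext = ".sh")
  res <- brew(
    text = paste(template, collapse = "\n"),
    output = out,
    envir = list2env(list(resources = resources))
  )
  # Treat a returned try-error the same as a raised error, in case brew
  # returns it rather than raising it.
  if (inherits(res, "try-error")) stop("brewing failed: ", as.character(res))
  readLines(out, warn = FALSE)
}

resources <- list(memory = 3500)

# Control flow inside the printing tag <%= %>, as in the failing template.
uses_print_tag <- c(
  "<%= if (!is.null(resources$memory)) { %>",
  "#SBATCH --mem-per-cpu=<%= resources$memory %>",
  "<%= } %>"
)

# Same template with control flow inside the plain code tag <% %>.
uses_code_tag <- c(
  "<% if (!is.null(resources$memory)) { %>",
  "#SBATCH --mem-per-cpu=<%= resources$memory %>",
  "<% } %>"
)

bad <- try(render(uses_print_tag, resources), silent = TRUE)
bad_failed <- inherits(bad, "try-error")

ok_lines <- render(uses_code_tag, resources)
print(bad_failed)
print(ok_lines)
```

If the first variant fails and the second renders the `#SBATCH` line, switching the control-flow tags in a local copy of the template is a cheap thing to try.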

@wlandau
Member

wlandau commented Feb 27, 2021

I just updated inst/templates/hpc/slurm_batchtools.tmpl so it brews correctly. Beyond that, I am afraid there is not much else I can do because I do not have access to a SLURM cluster. If you get this template file to work with batchtools alone and then future.batchtools, I think it should work with drake.
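A minimal sketch of the "batchtools alone" check. The `smoke_test()` helper is hypothetical, and the resource values are taken from numbers mentioned in this thread rather than from any recommendation. The interactive cluster functions are used here only so that the example runs without a scheduler; the SLURM call is left commented out because it needs a real cluster.

```r
library(batchtools)

# Hypothetical helper: run three tiny jobs through the given cluster functions
# and return their results.
smoke_test <- function(cluster_functions, resources = list()) {
  reg <- makeRegistry(file.dir = NA, seed = 1) # temporary registry
  reg$cluster.functions <- cluster_functions
  batchMap(fun = function(x) x^2, x = 1:3, reg = reg)
  submitJobs(resources = resources, reg = reg)
  waitForJobs(reg = reg)
  unlist(reduceResultsList(reg = reg))
}

# Local check that the helper itself works, without SLURM.
local_result <- smoke_test(makeClusterFunctionsInteractive())

# On the cluster (not run here): this is the step that exercises the template.
# smoke_test(
#   makeClusterFunctionsSlurm(template = "slurm_batchtools.tmpl"),
#   resources = list(walltime = 60, memory = 3500)
# )
```

Once the SLURM variant returns results, the next step would be the same kind of trivial job through `future::plan(batchtools_slurm, template = "slurm_batchtools.tmpl")` before going back to drake.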

Any particular reason you are using drake rather than targets and future.batchtools rather than clustermq?

@pat-s
Member Author

pat-s commented Feb 27, 2021

This is an old project with like 400 targets and I am not sure if I want to put in the work to port it to {targets}.
New projects will start with {targets} :)

I wanted to explore whether transient workers could help me in this project. I am sometimes blocking the whole HPC with persistent workers for many days, and at some point most workers are idle.

But I found out that the current implementation of transient workers via {future.batchtools} is quite slow and does not support array execution and other stuff (e.g. the template argument in drake_config).
All these downsides were not apparent to me until now and I am happy I dived in more deeply now.

I then picked up the discussion about transient workers in clustermq via {future.clustermq} in mschubert/clustermq#86 and HenrikBengtsson/future#204 and am playing around a bit now (even though I do not really have a clear plan 😄).
