Specify resources within target #942

Closed
2 tasks done
joelnitta opened this issue Jul 14, 2019 · 2 comments

Comments

@joelnitta

Prework

Description

I am following the drake manual section '9.7.5 The resources column for transient workers' to run my plan on a cluster specifying memory and run times, etc. for targets. The example in the manual shows adding a resources column to the drake plan as a list. However, this is unwieldy and error-prone in the case of a large plan. I would like to be able to specify resources (as a named list) as a custom column with target(), with reasonable defaults instead of NA values (e.g., requesting only a small amount of memory).

This is what the manual shows (note that plan$resources is a list-column):

library(drake)

plan <- drake_plan(
  data = download_data(),
  model = big_machine_learning_model(data)
)

plan$resources <- list(
  list(cores = 1, gpus = 0),
  list(cores = 4, gpus = 1)
)

plan
#> # A tibble: 2 x 3
#>   target command                          resources       
#>   <chr>  <expr>                           <list>          
#> 1 data   download_data()                  <named list [2]>
#> 2 model  big_machine_learning_model(data) <named list [2]>

Here is an example of what I would like to be able to do:

library(drake)

# Define plan
plan <- drake_plan(
  
  data = read_csv("https://bit.ly/ppgi_taxonomy"),
  
  # As a test, change memory settings for a single target
  data_slice_1 = target(
    slice(data, 1:10),
    resources = list(
      queue = "mThC.q",
      memory = "mres=2G,h_data=2G,h_vmem=2G")
  ),
  
  data_slice_2 = slice(data, 11:20),
  
  data_out_1 = write_csv(data_slice_1, file_out("data1.csv")),
  
  data_out_2 = write_csv(data_slice_2, file_out("data2.csv"))
  
)

plan
#> # A tibble: 5 x 3
#>   target       command                                        resources 
#>   <chr>        <expr>                                         <list>    
#> 1 data         read_csv("https://bit.ly/ppgi_taxonomy")       <lgl [1]> 
#> 2 data_slice_1 slice(data, 1:10)                              <language>
#> 3 data_slice_2 slice(data, 11:20)                             <lgl [1]> 
#> 4 data_out_1   write_csv(data_slice_1, file_out("data1.csv")) <lgl [1]> 
#> 5 data_out_2   write_csv(data_slice_2, file_out("data2.csv")) <lgl [1]>

In this case, plan$resources is again a list-column, but the entry for the target whose resources I tried to specify is a language object (an unevaluated call), not a named list.

I tried to get this to work anyway by filling in the remaining NAs in plan$resources with a loop:

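# Set default memory settings to lowest (short run time, 1 GB memory)
for (i in seq_len(nrow(plan))) {
  # is.na() on a one-element list is TRUE only when that element is a single NA,
  # so the language object in row 2 is left untouched
  if (is.na(plan$resources[i])) {
    plan$resources[i] <- list(
      list(queue = "sThC.q", memory = "mres=1G")
    )
  }
}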

plan
#> # A tibble: 5 x 3
#>   target       command                                      resources      
#>   <chr>        <expr>                                       <list>         
#> 1 data         read_csv("https://bit.ly/ppgi_taxonomy")   … <named list [2…
#> 2 data_slice_1 slice(data, 1:10)                          … <language>     
#> 3 data_slice_2 slice(data, 11:20)                         … <named list [2…
#> 4 data_out_1   write_csv(data_slice_1, file_out("data1.csv… <named list [2…
#> 5 data_out_2   write_csv(data_slice_2, file_out("data2.csv… <named list [2…

When I tried to run this with future.batchtools, I got the following error:

target data_slice_1
Error in batchtools::submitJobs(reg = reg, ids = jobid, resources = resources) : 
  Assertion on 'resources' failed: Must have names, but element 1 is empty.

Indeed, running names() on plan$resources shows this:

purrr::map(plan$resources, names)
#> [[1]]
#> [1] "queue"  "memory"
#> 
#> [[2]]
#> [1] ""       "queue"  "memory"
#> 
#> [[3]]
#> [1] "queue"  "memory"
#> 
#> [[4]]
#> [1] "queue"  "memory"
#> 
#> [[5]]
#> [1] "queue"  "memory"
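The empty first name in element 2 comes from the entry still being an unevaluated call: the first slot of a call is the function itself (here `list`), and that slot has no name. A minimal base R sketch, independent of drake, reproduces this and shows that evaluating the call gives the fully named list:

```r
# A call captured without evaluation, like the entry stored for data_slice_1
res_call <- quote(list(queue = "mThC.q", memory = "mres=2G,h_data=2G,h_vmem=2G"))

is.call(res_call)   # still a language object, not a list
names(res_call)     # first slot is the function `list`, so its name is ""

# Evaluating the call yields a plain named list
res_list <- eval(res_call)
is.list(res_list)
names(res_list)     # only "queue" and "memory" remain
```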

For now I'm probably just going to manually tweak the resources column after creating the entire plan, since there aren't that many targets that need special treatment. But I think being able to specify it on-the-fly within target() would be nice.
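One way to do that post-hoc tweak in a single pass is sketched below. `fill_resources()` is a hypothetical helper, not part of drake, and the example works on a bare list standing in for plan$resources. It assumes each entry is either a single NA, an unevaluated `list(...)` call, or an already evaluated list.

```r
# Hypothetical helper (not a drake function): normalise a resources list-column.
fill_resources <- function(resources, default) {
  lapply(resources, function(x) {
    # Unevaluated calls such as list(queue = ...) become real lists
    if (is.language(x)) x <- eval(x)
    # Anything that is still not a list (e.g. a single NA) gets the defaults
    if (!is.list(x)) return(default)
    # Partial specifications inherit any field they do not set
    utils::modifyList(default, x)
  })
}

# Stand-in for plan$resources from the example above
resources <- list(
  NA,
  quote(list(queue = "mThC.q", memory = "mres=2G,h_data=2G,h_vmem=2G")),
  NA
)
default <- list(queue = "sThC.q", memory = "mres=1G")

filled <- fill_resources(resources, default)
str(filled)
```

With a real plan, the same call would presumably be `plan$resources <- fill_resources(plan$resources, default)`. Note that `eval()` here runs whatever expression was written inside `target()`, so this is only appropriate for plans you wrote yourself.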

Created on 2019-07-14 by the reprex package (v0.2.1)

@wlandau
Member

wlandau commented Jul 14, 2019

Actually, this is already possible. You can define any custom column with target(). Unfortunately, drake_plan() seems to interpret lists as language objects, but once I fix that, it will be totally seamless.

library(drake)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)

plan <- drake_plan(
  data = target(
    download_data(),
    resources = list(cores = 1, gpus = 0)
  ),
  model = target(
    big_machine_learning_model(data),
    resources = list(cores = 4, gpus = 1)
  )
)

plan
#> # A tibble: 2 x 3
#>   target command                          resources                
#>   <chr>  <expr>                           <expr>                   
#> 1 data   download_data()                  list(cores = 1, gpus = 0)
#> 2 model  big_machine_learning_model(data) list(cores = 4, gpus = 1)

plan <- plan %>%
  mutate(resources = map(resources, eval))

plan
#> # A tibble: 2 x 3
#>   target command                          resources       
#>   <chr>  <expr>                           <list>          
#> 1 data   download_data()                  <named list [2]>
#> 2 model  big_machine_learning_model(data) <named list [2]>

plan$resources
#> [[1]]
#> [[1]]$cores
#> [1] 1
#> 
#> [[1]]$gpus
#> [1] 0
#> 
#> 
#> [[2]]
#> [[2]]$cores
#> [1] 4
#> 
#> [[2]]$gpus
#> [1] 1

Created on 2019-07-14 by the reprex package (v0.3.0)
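Before submitting jobs, it may be worth checking that every entry of the column is a fully named list, since that is the condition the batchtools assertion in the error above enforces. The sketch below uses a hypothetical helper, `bad_resources()`, which is not part of drake or batchtools, and a bare list in place of plan$resources:

```r
# Hypothetical pre-flight check (not a drake or batchtools function):
# return the positions of entries that are not non-empty, fully named lists.
bad_resources <- function(resources) {
  ok <- vapply(resources, function(x) {
    nms <- names(x)
    is.list(x) && !is.null(nms) && all(nzchar(nms))
  }, logical(1))
  which(!ok)
}

resources <- list(
  list(cores = 1, gpus = 0),
  quote(list(cores = 4, gpus = 1)),  # unevaluated call, as before the eval() step
  list(cores = 4, gpus = 1)
)

bad_resources(resources)                # flags the unevaluated entry
bad_resources(lapply(resources, eval))  # nothing flagged after evaluation
```

The check treats an empty or unnamed list as invalid as well, which is a design choice of this sketch rather than something the thread specifies.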

I should probably update the manual too.

@wlandau
Member

wlandau commented Jul 14, 2019

Thanks for bringing up this use case. #943 should make it easier to define custom resources.

wlandau pushed a commit to ropensci-books/drake that referenced this issue Jul 14, 2019