
plan with one target runs out of memory (only inside drake and not manually) #930

Closed · 3 tasks done
cimentadaj opened this issue Jul 4, 2019 · 13 comments

@cimentadaj

Prework

Description

Processing a big file returns a 'cannot allocate vector of size X' or 'cannot allocate buffer' error, but ONLY inside drake. That is, I can read the file and process it outside of drake without any error.

Reproducible example

I had some trouble making this a reprex because I'm very unfamiliar with drake. In fact, this is my first project with it, but since I found this weird error I thought it would be useful to report it here. Instead of a reprex, I have a minimal working repository on GitHub that contains the workflow. I explain it below.

  1. Clone the repo:
git clone https://github.com/cimentadaj/spain_census.git
  2. Run renv (the next iteration of packrat) for package management:
devtools::install_github("rstudio/renv")
renv::restore() # should only take 1-2 mins
  3. Load drake and run r_make():
library(drake)
r_make()
# This will take a few mins because it downloads the data, which is about 4M rows

There are four files (the same as in drake's documentation)

  • code/01-packages.R loads packages
  • code/02-reading_data.R has one function which downloads, reads and saves the data in output/
  • code/plan.R outlines the plan.
  • _drake.R

If I run r_make() (my workflow is very interactive), everything runs OK (although it takes some time because everything is very heavy) until the plan in code/plan.R. That is, line 13 reads the heavy data, but when the plan executes the target process_data (which only selects a few columns), drake crashes with memory-related problems. The specific error is Error : cannot allocate vector of size 7.9GB or 'cannot allocate buffer'.

However, if I run all the scripts inside the code/ folder, manually run everything up to line 13 in code/plan.R, and then just do select(read_data, CPRO), it works. The error only happens inside drake.

Session info

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] drake_7.4.0     workflowr_1.4.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       ps_1.3.0         crayon_1.3.4     assertthat_0.2.1
 [5] digest_0.6.19    R6_2.4.0         backports_1.1.4  storr_1.2.1     
 [9] magrittr_1.5     evaluate_0.14    cli_1.1.0        rlang_0.4.0     
[13] renv_0.5.0-66    callr_3.2.0      rmarkdown_1.13   tools_3.6.0     
[17] igraph_1.2.4.1   processx_3.3.1   xfun_0.7         compiler_3.6.0  
[21] pkgconfig_2.0.2  base64url_1.4    htmltools_0.3.6  knitr_1.23      

Expected output

What output would the correct behavior have produced?
No error, and readd(process_data) returns the correct data frame.


wlandau commented Jul 4, 2019

Thanks for being so thorough.

One thing I notice is that read_data is an object in the global environment.

https://github.com/cimentadaj/spain_census/blob/77c22888472f3b08e11af4ca305390609c4d1a29/code/plan.R#L13

Let's see what happens when read_data is a target. We can use drake's memory management and garbage collection to try to mitigate some of this (https://ropenscilabs.github.io/drake-manual/memory.html). Please see my fork at https://github.com/wlandau/spain_census/tree/drake-930. I cannot actually run the code myself because I get warnings downloading the zip file:

downloaded length 6423328 != reported length 155860498
  URL 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip': status was 'Failure when receiving data from the peer'

Another note: drake is designed specifically to skip steps you don't need to rerun, so there is no need to manually do this yourself (here and here). This is another reason why read_data should be a target.
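
To make that concrete, here is a minimal sketch of the kind of plan I mean (not the actual code in my fork; read_census_file() and the file path are hypothetical placeholders, and the memory settings are the ones discussed further down this thread):

library(drake)
library(dplyr)

plan <- drake_plan(
  read_data = read_census_file(file_in("output/processed_data.csv")), # formerly a global object
  process_data = select(read_data, CPRO)
)

# In _drake.R, the memory management arguments can go straight to drake_config():
drake_config(
  plan,
  memory_strategy = "autoclean",  # development version only at the time of writing
  garbage_collection = TRUE
)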


wlandau commented Jul 4, 2019

By the way, is there a smaller version of the zip file we can test on?


wlandau commented Jul 4, 2019

A more complete error log of the download attempt:

> if (!file.exists(processed_path)) write_census(data_path, processed_path)
The data is being downloaded
trying URL 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip'
Content type 'application/x-zip-compressed' length 155860498 bytes (148.6 MB)
===
downloaded 11.4 MB

Error in download.file(data_link, destfile = "./data/raw_data.zip") : 
  download from 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip' failed
In addition: Warning messages:
1: In download.file(data_link, destfile = "./data/raw_data.zip") :
  downloaded length 11909743 != reported length 155860498
2: In download.file(data_link, destfile = "./data/raw_data.zip") :
  URL 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip': status was 'Failure when receiving data from the peer'


wlandau commented Jul 4, 2019

One hypothesis about your original issue: since read_data lives in one environment (drake_config(...)$envir) and commands are evaluated in another (drake_config(...)$eval), select() might be making a deep copy of read_data into config$eval, which could double the memory usage. I do not know enough about R's copy-on-modify semantics to be sure; it is just an idea. But it may be worth trying read_data as a target along with the memory management arguments in _drake.R in https://github.com/wlandau/spain_census/tree/drake-930.


wlandau commented Jul 5, 2019

I took a step back and did some memory benchmarking. Even with the memory strategies at https://ropenscilabs.github.io/drake-manual/memory.html, drake is going to consume extra memory. This appears to be due to storr, which drake uses to save targets to disk. When storr saves a target, we have both the object itself and a serialized copy simultaneously in memory (value and value_ser, respectively, in this line). This may be difficult to change in storr, but that is the place to try first (cc @richfitz). If the memory penalty is prohibitively large, you could use the file_in()/file_out() workaround described at https://ropenscilabs.github.io/drake-manual/memory.html.

Here are some benchmarks to show what you can expect. Each chunk below is a fresh R session, and each comment shows the maximum memory usage at that point (using Activity Monitor on macOS, refreshing once per second). Note: raw(1e8) is 100 MB.

Without drake

# restart R: 67.2 MB
library(drake) # 80.8 MB
library(dplyr) # 108.8 MB
data1 <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8)) # 386.2 MB
data2 <- select(data1, c("x", "y")) # 386.3

# With storr:
storage <- storr::storr_rds(tempfile()) # 386.8 MB
storage$set("data", data1, use_cache = FALSE) # 959 MB

A one-target approach

# restart R: 67.5 MB
library(drake) # 86.9 MB
library(dplyr) # 108.3 MB
data1 <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8)) # 386.6 MB
plan <- drake_plan(data2 = select(data1, c("x", "y"))) # 386.6 MB
make(plan) # 945 MB

Two targets and no memory management

# restart R: 71.1 MB
library(drake) # 90.5 MB
library(dplyr) # 109.8 MB
plan <- drake_plan( # 109.8 MB
  data1 = {
    out <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8))
    Sys.sleep(10)
    out
  },
  data2 = {
    out <- select(data1, c("x", "y"))
    Sys.sleep(10)
    out
  }
)
make(plan) # 974 MB

Two targets, a memory strategy, and garbage collection

# restart R: 72.3 MB
library(drake) # 90.3 MB
library(dplyr) # 114.3 MB
plan <- drake_plan( # 115.0 MB
  data1 = {
    out <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8))
    Sys.sleep(10)
    out
  },
  data2 = {
    out <- select(data1, c("x", "y"))
    Sys.sleep(10)
    out
  }
)
make( # 980.2 MB
  plan,
  memory_strategy = "autoclean", # development version only
  garbage_collection = TRUE
)


wlandau commented Jul 5, 2019

The following usually maxes out at about 580 MB, but I have seen it jump to around 716 MB. So we really are limited by in-memory serialization.

library(drake)
library(dplyr)
data1 <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8))
data2 <- select(data1, c("x", "y"))
saveRDS( # 716 MB
  serialize(data2, NULL, ascii = FALSE, xdr = TRUE, version = 2L),
  tempfile()
)

@cimentadaj (Author)

Thanks for all the tests @wlandau, amazing work! Yeah, I did not know that make() basically duplicates the size in memory. Is there a way to avoid serializing/caching a specific target? One idea would be to avoid serializing the first 2-3 steps until the data is reduced to smaller chunks.

@cimentadaj (Author)

By the way, read_data is outside the drake plan because leaving it as a target gave a different error, about serializing something too big. I figured drake might have difficulty creating the cache for such big files, so I moved it outside the plan. That's why I'm asking whether you can exclude caching/serialization for targets that are not worth caching, such as reading really big datasets.


wlandau commented Jul 5, 2019

Yes: manually save that object to a file_out() file and read it into later targets with a file_in() file. The memory management chapter I linked to discusses this.


wlandau commented Jul 5, 2019

And for speed, I recommend the fst package if the object is a data frame.
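
Putting the last two suggestions together, a minimal sketch of the file_out()/file_in() workaround with fst (read_census_file() and the .fst path are hypothetical placeholders, not code from the repo):

library(drake)
library(dplyr)
library(fst)

plan <- drake_plan(
  raw_file = {
    data <- read_census_file(file_in("data/raw_data.zip"))   # hypothetical reader
    write_fst(data, file_out("output/raw_data.fst"))          # big data goes to disk, not the cache
    "output/raw_data.fst"   # return a small value so drake does not serialize the data frame
  },
  process_data = select(read_fst(file_in("output/raw_data.fst")), CPRO)
)
make(plan)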

@wlandau wlandau mentioned this issue Jul 7, 2019

wlandau commented Aug 4, 2019

Update: #971 (comment) should really help reduce memory consumption in big data applications.


wlandau commented Aug 8, 2019

A quick workaround is now available. If you specify a format for a target, drake saves the data directly and should not waste memory. See the example from #977. I recommend "fst" for data frames, "keras" for Keras models, and "rds" for everything else.
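
For example, something along these lines (a sketch only; read_census_file() stands in for the repo's actual reader):

library(drake)
library(dplyr)

plan <- drake_plan(
  read_data = target(
    read_census_file("data/raw_data.zip"),  # hypothetical reader returning a data frame
    format = "fst"                          # store via fst instead of serialized RDS
  ),
  process_data = select(read_data, CPRO)
)
make(plan)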

@cimentadaj (Author)

@wlandau thanks a bunch! This works fine now. Great work and thank you for drake!
