
plan with one target runs out of memory (only inside drake and not manually) #930

Closed · 3 tasks done
cimentadaj opened this issue Jul 4, 2019 · 13 comments

@cimentadaj

Prework

Description

Processing a big file returns a 'cannot allocate vector of size X' or 'cannot allocate buffer' error, but ONLY inside drake. That is, I can read the file and process it outside of drake without any error.

Reproducible example

I had some trouble making this a reprex because I'm very unfamiliar with drake. In fact, this is my first project with it, but since I found this weird error I thought it would be useful to report it here. Instead of a reprex, I have a minimal working repository on GitHub that contains the workflow. I explain it below.

  1. Clone the repo:
git clone https://github.com/cimentadaj/spain_census.git
  2. Run renv (the next iteration of packrat) for package management:
devtools::install_github("rstudio/renv")
renv::restore() # should only take 1-2 mins
  3. Load drake and run r_make():
library(drake)
r_make()
# This will take a few mins because it downloads the data, which is about 4M rows

There are four files (the same as in drake's documentation)

  • code/01-packages.R loads packages
  • code/02-reading_data.R has one function which downloads, reads and saves the data in output/
  • code/plan.R outlines the plan.
  • _drake.R

If I run r_make() (my workflow is very interactive), everything runs OK (although it takes some time because everything is very heavy) until the plan in code/plan.R. That is, line 13 reads the heavy data, but when the plan executes the target process_data (which only selects a few columns), drake crashes with memory-related problems. The specific error is Error : cannot allocate vector of size 7.9GB or 'cannot allocate buffer'.

However, if I run all the scripts inside the code/ folder, manually run everything up to line 13 in code/plan.R, and then just do select(read_data, CPRO), it works. The error only happens inside drake.

Session info

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] drake_7.4.0     workflowr_1.4.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       ps_1.3.0         crayon_1.3.4     assertthat_0.2.1
 [5] digest_0.6.19    R6_2.4.0         backports_1.1.4  storr_1.2.1     
 [9] magrittr_1.5     evaluate_0.14    cli_1.1.0        rlang_0.4.0     
[13] renv_0.5.0-66    callr_3.2.0      rmarkdown_1.13   tools_3.6.0     
[17] igraph_1.2.4.1   processx_3.3.1   xfun_0.7         compiler_3.6.0  
[21] pkgconfig_2.0.2  base64url_1.4    htmltools_0.3.6  knitr_1.23      

Expected output

What output would the correct behavior have produced?
No error, and readd(process_data) returns the correct data frame.


wlandau commented Jul 4, 2019

Thanks for being so thorough.

One thing I notice is that read_data is an object in the global environment.

https://github.com/cimentadaj/spain_census/blob/77c22888472f3b08e11af4ca305390609c4d1a29/code/plan.R#L13

Let's see what happens when read_data is a target. We can use drake's memory management and garbage collection to try to mitigate some of this (https://ropenscilabs.github.io/drake-manual/memory.html). Please see my fork at https://github.com/wlandau/spain_census/tree/drake-930. I cannot actually run the code myself because I get warnings downloading the zip file:

downloaded length 6423328 != reported length 155860498
  URL 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip': status was 'Failure when receiving data from the peer'

Another note: drake is designed specifically to skip steps you don't need to rerun, so there is no need to manually do this yourself (here and here). This is another reason why read_data should be a target.
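
To make that concrete, here is a minimal sketch of the kind of plan I mean (not the actual code in my fork; read_census_file() and the file path are hypothetical placeholders, and the memory settings are the ones discussed further down this thread):

library(drake)
library(dplyr)

plan <- drake_plan(
  read_data = read_census_file(file_in("output/processed_data.csv")), # formerly a global object
  process_data = select(read_data, CPRO)
)

# In _drake.R, the memory management arguments can go straight to drake_config():
drake_config(
  plan,
  memory_strategy = "autoclean",  # development version only at the time of writing
  garbage_collection = TRUE
)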


wlandau commented Jul 4, 2019

By the way, is there a smaller version of the zip file we can test on?


wlandau commented Jul 4, 2019

A more complete error log of the download attempt:

> if (!file.exists(processed_path)) write_census(data_path, processed_path)
The data is being downloaded
trying URL 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip'
Content type 'application/x-zip-compressed' length 155860498 bytes (148.6 MB)
===
downloaded 11.4 MB

Error in download.file(data_link, destfile = "./data/raw_data.zip") : 
  download from 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip' failed
In addition: Warning messages:
1: In download.file(data_link, destfile = "./data/raw_data.zip") :
  downloaded length 11909743 != reported length 155860498
2: In download.file(data_link, destfile = "./data/raw_data.zip") :
  URL 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip': status was 'Failure when receiving data from the peer'


wlandau commented Jul 4, 2019

One hypothesis about your original issue: since read_data lives in one environment (drake_config(...)$envir) and commands are evaluated in another (drake_config(...)$eval), select() might be making a deep copy of read_data into config$eval, which could double the memory usage. I do not know enough about R's copy-on-modify semantics to be sure; it is just an idea. But it may be worth trying read_data as a target along with the memory management arguments in _drake.R in https://github.com/wlandau/spain_census/tree/drake-930.


wlandau commented Jul 5, 2019

I took a step back and did some memory benchmarking. Even with the memory strategies at https://ropenscilabs.github.io/drake-manual/memory.html, drake is going to consume extra memory. This appears to be due to storr, which drake uses to save targets to disk. When storr saves a target, we have both the object itself and a serialized copy simultaneously in memory (value and value_ser, respectively, in this line). This may be difficult to change in storr, but that is the place to try first (cc @richfitz). If the memory penalty is prohibitively large, you could use the file_in()/file_out() workaround described at https://ropenscilabs.github.io/drake-manual/memory.html.

Here are some benchmarks to show what you can expect. Each chunk below is a fresh R session, and each comment shows the maximum memory usage at that point (using Activity Monitor on macOS, refreshing once per second). Note: raw(1e8) is 100 MB.

Without drake

# restart R: 67.2 MB
library(drake) # 80.8 MB
library(dplyr) # 108.8 MB
data1 <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8)) # 386.2 MB
data2 <- select(data1, c("x", "y")) # 386.3

# With storr:
storage <- storr::storr_rds(tempfile()) # 386.8 MB
storage$set("data", data1, use_cache = FALSE) # 959 MB

A one-target approach

# restart R: 67.5 MB
library(drake) # 86.9 MB
library(dplyr) # 108.3 MB
data1 <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8)) # 386.6 MB
plan <- drake_plan(data2 = select(data1, c("x", "y"))) # 386.6 MB
make(plan) # 945 MB

Two targets and no memory management

# restart R: 71.1 MB
library(drake) # 90.5 MB
library(dplyr) # 109.8 MB
plan <- drake_plan( # 109.8 MB
  data1 = {
    out <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8))
    Sys.sleep(10)
    out
  },
  data2 = {
    out <- select(data1, c("x", "y"))
    Sys.sleep(10)
    out
  }
)
make(plan) # 974 MB

Two targets, a memory strategy, and garbage collection

# restart R: 72.3 MB
library(drake) # 90.3 MB
library(dplyr) # 114.3 MB
plan <- drake_plan( # 115.0 MB
  data1 = {
    out <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8))
    Sys.sleep(10)
    out
  },
  data2 = {
    out <- select(data1, c("x", "y"))
    Sys.sleep(10)
    out
  }
)
make( # 980.2 MB
  plan,
  memory_strategy = "autoclean", # development version only
  garbage_collection = TRUE
)


wlandau commented Jul 5, 2019

The following usually maxes out at about 580 MB, but I have seen it jump to around 716 MB. So we really are limited by in-memory serialization.

library(drake)
library(dplyr)
data1 <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8))
data2 <- select(data1, c("x", "y"))
saveRDS( # 716 MB
  serialize(data2, NULL, ascii = FALSE, xdr = TRUE, version = 2L),
  tempfile()
)

@cimentadaj (Author)

Thanks for all the tests @wlandau, amazing work! Yeah, I did not know that make() basically duplicates the size in memory. Is there a way to avoid serializing/caching a specific target? One idea would be to avoid serializing the first 2-3 steps until the data is reduced to smaller chunks.

@cimentadaj (Author)

By the way, read_data is outside the drake plan because leaving it as a target gave a different error, about serializing something too big. I figured drake might have difficulty creating the cache for such big files, so I moved it outside the plan. That's why I'm asking whether you can exclude caching/serialization for targets that are not worth caching, such as reading really big datasets.


wlandau commented Jul 5, 2019

Yes: manually save that object to a file_out() file and read it into later targets with a file_in() file. The memory management chapter I linked to discusses this.


wlandau commented Jul 5, 2019

And for speed, I recommend the fst package if the object is a data frame.
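
Putting the last two suggestions together, a minimal sketch of the file_out()/file_in() workaround with fst (read_census_file() and the .fst path are hypothetical placeholders, not code from the repo):

library(drake)
library(dplyr)
library(fst)

plan <- drake_plan(
  raw_file = {
    data <- read_census_file(file_in("data/raw_data.zip"))   # hypothetical reader
    write_fst(data, file_out("output/raw_data.fst"))          # big data goes to disk, not the cache
    "output/raw_data.fst"   # return a small value so drake does not serialize the data frame
  },
  process_data = select(read_fst(file_in("output/raw_data.fst")), CPRO)
)
make(plan)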

@wlandau wlandau mentioned this issue Jul 7, 2019

wlandau commented Aug 4, 2019

Update: #971 (comment) should really help reduce memory consumption in big data applications.


wlandau commented Aug 8, 2019

A quick workaround is now available. If you specify a format for a target, drake saves the data directly and should not waste memory. See the example from #977. I recommend "fst" for data frames, "keras" for Keras models, and "rds" for everything else.
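
For example, something along these lines (a sketch only; read_census_file() stands in for the repo's actual reader):

library(drake)
library(dplyr)

plan <- drake_plan(
  read_data = target(
    read_census_file("data/raw_data.zip"),  # hypothetical reader returning a data frame
    format = "fst"                          # store via fst instead of serialized RDS
  ),
  process_data = select(read_data, CPRO)
)
make(plan)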

@cimentadaj (Author)

@wlandau thanks a bunch! This works fine now. Great work and thank you for drake!
