# plan with one target runs out of memory (only inside drake and not manually) #930
Thanks for being so thorough. One thing I notice is that the download itself fails partway through:

```
downloaded length 6423328 != reported length 155860498
URL 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip': status was 'Failure when receiving data from the peer'
```

Another note: …
By the way, is there a smaller version of the zip file we can test on?
A more complete error log of the download attempt:

```r
> if (!file.exists(processed_path)) write_census(data_path, processed_path)
The data is being downloaded
trying URL 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip'
Content type 'application/x-zip-compressed' length 155860498 bytes (148.6 MB)
===
downloaded 11.4 MB
Error in download.file(data_link, destfile = "./data/raw_data.zip") :
  download from 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip' failed
In addition: Warning messages:
1: In download.file(data_link, destfile = "./data/raw_data.zip") :
  downloaded length 11909743 != reported length 155860498
2: In download.file(data_link, destfile = "./data/raw_data.zip") :
  URL 'ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip': status was 'Failure when receiving data from the peer'
```
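For anyone hitting the same partial-download failure, one possible mitigation (a sketch, not something discussed in this thread) is to download in binary mode with a longer timeout and retry a few times; `download.file()` returns `0` on success, so the status can drive the retry loop:

```r
# A defensive download loop (an assumption, not the issue's actual code):
# binary mode, a longer timeout, and a few retries on partial downloads.
data_link <- "ftp://www.ine.es/temas/censopv/cen11/Microdatos_personas_nacional.zip"
options(timeout = 600) # the default of 60 seconds is too short for 148.6 MB over FTP
for (attempt in 1:3) {
  status <- tryCatch(
    download.file(data_link, destfile = "./data/raw_data.zip", mode = "wb"),
    error = function(e) 1L # treat a hard failure like a nonzero status
  )
  if (identical(status, 0L)) break # download.file() returns 0 on success
}
```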
One hypothesis about your original issue: since …
I took a step back and did some memory benchmarking. Even with the memory strategies at https://ropenscilabs.github.io/drake-manual/memory.html, memory usage still spikes when drake stores a target. Here are some benchmarks to show you what you can expect. Each chunk below is a fresh new R session, and each comment shows the maximum memory usage at that point (using Activity Monitor in Mac OS, refreshing once per second).

**Without drake**

```r
# restart R: 67.2 MB
library(drake) # 80.8 MB
library(dplyr) # 108.8 MB
data1 <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8)) # 386.2 MB
data2 <- select(data1, c("x", "y")) # 386.3 MB

# With storr:
storage <- storr::storr_rds(tempfile()) # 386.8 MB
storage$set("data", data1, use_cache = FALSE) # 959 MB
```

**A one-target approach**

```r
# restart R: 67.5 MB
library(drake) # 86.9 MB
library(dplyr) # 108.3 MB
data1 <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8)) # 386.6 MB
plan <- drake_plan(data2 = select(data1, c("x", "y"))) # 386.6 MB
make(plan) # 945 MB
```

**Two targets and no memory management**

```r
# restart R: 71.1 MB
library(drake) # 90.5 MB
library(dplyr) # 109.8 MB
plan <- drake_plan( # 109.8 MB
  data1 = {
    out <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8))
    Sys.sleep(10)
    out
  },
  data2 = {
    out <- select(data1, c("x", "y"))
    Sys.sleep(10)
    out
  }
)
make(plan) # 974 MB
```

**Two targets, a memory strategy, and garbage collection**

```r
# restart R: 72.3 MB
library(drake) # 90.3 MB
library(dplyr) # 114.3 MB
plan <- drake_plan( # 115.0 MB
  data1 = {
    out <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8))
    Sys.sleep(10)
    out
  },
  data2 = {
    out <- select(data1, c("x", "y"))
    Sys.sleep(10)
    out
  }
)
make( # 980.2 MB
  plan,
  memory_strategy = "autoclean", # development version only
  garbage_collection = TRUE
)
```
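For readers without Activity Monitor, a portable way to reproduce rough numbers (an alternative measurement, not what was used above) is R's own `gc()`, which tracks the maximum memory R has used since the last `gc(reset = TRUE)`:

```r
# Measure the peak with gc() instead of Activity Monitor.
# Note: gc() only sees R's own heap, so it understates OS-level usage.
gc(reset = TRUE)
data1 <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8))
storage <- storr::storr_rds(tempfile())
storage$set("data", data1, use_cache = FALSE)
gc() # the "max used" column shows the peak inside R
```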
The following usually maxes out at about 580 MB, but I have seen it jump to around 716 MB. So we really are limited by in-memory serialization.

```r
library(drake)
library(dplyr)
data1 <- data.frame(x = raw(1e8), y = raw(1e8), z = raw(1e8))
data2 <- select(data1, c("x", "y"))
saveRDS( # 716 MB
  serialize(data2, NULL, ascii = FALSE, xdr = TRUE, version = 2L),
  tempfile()
)
```
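The doubling comes from `serialize(data2, NULL, ...)` materializing the entire serialized object as a raw vector before anything touches disk. Serializing straight to a file connection avoids that intermediate copy; a sketch of the idea, not what drake does internally:

```r
# Streaming serialization to a connection: no intermediate raw vector
# is built in memory, unlike serialize(object, NULL).
con <- file(tempfile(), "wb")
serialize(data2, con, ascii = FALSE, xdr = TRUE, version = 2L)
close(con)
```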
Thanks for all the tests @wlandau, amazing work! Yeah, I did not know that `make()` basically duplicated the size. Is there a way we can avoid serializing/caching a specific target? One idea would be to avoid serializing the first 2-3 steps until the data is reduced to smaller chunks.
Yes: manually save that object to a `file_out()` file and read it in later targets with a `file_in()` file. That chapter on memory management I linked to discusses this.
And for speed, I recommend the fst package if the object is a data frame.
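Roughly, the combined pattern looks like the sketch below (the data and the `data/raw.fst` path are made up for illustration; `write_fst()`/`read_fst()` come from the fst package):

```r
library(drake)
library(dplyr)
library(fst)

plan <- drake_plan(
  # Write the big object straight to disk so it never enters the cache.
  raw_data = {
    data <- data.frame(x = runif(1e6), y = runif(1e6), z = runif(1e6))
    write_fst(data, file_out("data/raw.fst"))
    NULL # return something small; drake tracks the file itself
  },
  # Declare the file as an input so drake links the targets correctly.
  small_data = select(read_fst(file_in("data/raw.fst")), x, y)
)

make(plan)
```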
Update: #971 (comment) should really help reduce memory consumption in big data applications.
A quick workaround is now available. If you specify a format for a target, drake saves the data directly and should not waste memory. See the example from #977. I recommend `"fst"` for data frames, `"keras"` for Keras models, and `"rds"` for everything else.
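For illustration, here is roughly what that looks like (the data is made up; see #977 for the real example):

```r
library(drake)
library(dplyr)

plan <- drake_plan(
  big_data = target(
    data.frame(x = runif(1e6), y = runif(1e6), z = runif(1e6)),
    format = "fst" # stored via fst, skipping the usual in-memory serialization
  ),
  small_data = target(
    select(big_data, x, y),
    format = "fst"
  )
)

make(plan)
```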
@wlandau thanks a bunch! This works fine now. Great work, and thank you for your help.
## Prework

- Read and abide by drake's code of conduct.
- If this is a bug, install the development version (`remotes::install_github("ropensci/drake")`) and mention the SHA-1 hash of the Git commit you install.

## Description
Processing a big file returns a 'cannot allocate vector of size X' or 'cannot allocate buffer' error, but ONLY inside drake. That is, I can read and process the file outside of drake with no error.

## Reproducible example
I had some trouble making this a reprex because I'm very unfamiliar with drake. In fact, this is my first project, but since I found this weird error I thought it would be useful to report it here. Instead, I have a minimal working repository on GitHub that contains the workflow. Below I explain.

- I use renv (the next iteration of packrat) for package management.
- I use drake and run the workflow with `r_make()`.

There are four files (the same as in drake's documentation):

- `code/01-packages.R` loads packages.
- `code/02-reading_data.R` has one function which downloads, reads, and saves the data in `output/`.
- `code/plan.R` outlines the plan.
- `_drake.R` configures the workflow for `r_make()` (a sketch of the usual layout follows this list).
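A minimal sketch of what such a `_drake.R` usually contains; this is the conventional layout for `r_make()`, not the actual file from the linked repository:

```r
# _drake.R -- r_make() runs this script in a fresh session and uses
# the drake_config() object it returns at the end.
source("code/01-packages.R")
source("code/02-reading_data.R")
source("code/plan.R")
drake_config(plan)
```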
If I run `r_make()` (because my workflow is very interactive), everything runs OK (although it takes some time because everything is very heavy) until the plan in `code/plan.R`. That is, line 13 reads the heavy data, but when the plan executes the target `process_data` (which only selects a few columns), drake crashes with memory-related problems. The specific error is `Error : cannot allocate vector of size 7.9GB` or 'cannot allocate buffer'.

However, if I run all the scripts inside the folder `code/`, manually run everything until line 13 in `code/plan.R`, and then just do `select(read_data, CPRO)`, it works. The error only happens inside drake.

## Session info

## Expected output

What output would the correct behavior have produced?

No error; `readd(process_data)` would return the correct data frame.
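One way the format workaround from the thread could be applied to this plan (a hypothetical sketch: only the target names `read_data` and `process_data`, the `select(read_data, CPRO)` call, and the `./data/raw_data.zip` path come from the issue; `read_census_data()` is an invented stand-in for the repository's reader function):

```r
# Hypothetical revision of code/plan.R using target(format = "fst"),
# so neither target goes through default in-memory serialization.
plan <- drake_plan(
  read_data = target(
    read_census_data("./data/raw_data.zip"), # stand-in for the real reader
    format = "fst"
  ),
  process_data = target(
    select(read_data, CPRO),
    format = "fst"
  )
)
```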