Storing collection of R6 objects makes them consume much more memory #383
Wow, that's so weird! Does memory blow up when you run …?

I don't know exactly what is going on, but I suspect it has something to do with environments and scoping. I managed to reproduce the problem entirely without drake:

library(pryr)
library(R6)
library(storr)
Call <- R6Class("CallItem", public = list(rstring = c(NA), id = "", initialize = function() {
self$rstring <- sample(letters)
self$id <- paste(collapse = "", self$rstring)
}))
CallCollections <- R6Class("CallCollections", public = list(callList = list(),
initialize = function(n = 1000) {
self$callList <- replicate(Call$new(), n = n)
}))
cache <- storr_rds("cache")
f <- function(key, cache) {
cache$get(key)
}
g <- function(key) {
cache <- storr_rds("cache")
cache$get(key)
}
x <- CallCollections$new()
cache$set("x", x)
object_size(f("x", cache))
#> 1.19 MB
object_size(g("x"))
#> 48 MB

It seems to matter where the storr cache object comes from. In drake:

cache <- storr_rds(".drake", mangle_key = TRUE) # drake's default cache
x <- CallCollections$new()
cache$set("x", x)
object_size(readd(x)) # Uses get_cache() by default.
#> 48 MB
object_size(readd(x), cache = storr_rds(".drake", mangle_key = TRUE)) # note: cache here is an argument to object_size(), not readd()
#> 48.2 MB
object_size(readd(x, cache = cache))
#> 1.19 MB
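One possible explanation for the f()/g() gap, stated as an assumption rather than a confirmed diagnosis: a storr object keeps an in-memory environment cache of values it has handled, so the cache used in f() (the same object that ran cache$set()) can hand back the original, still-shared object, while the fresh storr object created inside g() has to deserialize from the RDS files. A quick way to test this, assuming storr's flush_cache() method clears that in-memory layer:

cache <- storr_rds("cache")
cache$set("x", x)

object_size(cache$get("x"))  # small: likely served from storr's in-memory cache

cache$flush_cache()          # assumption: drops the in-memory copies
object_size(cache$get("x"))  # expected to be large: the value is deserialized from disk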
I think there are a bunch of weird things going on. The first is that pryr::object_size() can report very different sizes for two lists that are identical():

library(storr)
library(pryr)
f <- function(key, cache) {
cache$get(key)
}
g <- function(key) {
cache <- storr_rds("cache")
cache$get(key)
}
info <- function(x) {
cache <- storr_rds("cache")
cache$set("x", x)
y <- f("x", cache)
z <- g("x")
cat('object_size(x): ')
print(object_size(x))
cat('object_size(f("x", cache)): ')
print(object_size(y))
cat('object_size(g("x")): ')
print(object_size(z))
cat('object.size(x): ')
print(object.size(x))
cat('object.size(f("x", cache)): ')
print(object.size(y))
cat('object.size(g("x")): ')
print(object.size(z))
}
x1 <- lapply(1:1000, function(n) 1)
x2 <- as.list(rep(1, 1000))
info(x1)
#> object_size(x): 8.14 kB
#> object_size(f("x", cache)): 8.14 kB
#> object_size(g("x")): 56 kB
#> object.size(x): 56040 bytes
#> object.size(f("x", cache)): 56040 bytes
#> object.size(g("x")): 56040 bytes
info(x2)
#> object_size(x): 56 kB
#> object_size(f("x", cache)): 56 kB
#> object_size(g("x")): 56 kB
#> object.size(x): 56040 bytes
#> object.size(f("x", cache)): 56040 bytes
#> object.size(g("x")): 56040 bytes
identical(x1, x2, F, F, F, F)
#> [1] TRUE

But before you get too excited and think that base object.size() is the more trustworthy number, here is an R6 example where it clearly undercounts:

library(R6)
Call <- R6Class("CallItem", public = list(rstring = c(NA), id = "", initialize = function() {
self$rstring <- sample(letters)
self$id <- paste(collapse = "", self$rstring)
}))
CallCollections <- R6Class("CallCollections", public = list(callList = list(),
initialize = function(n = 1000) {
self$callList <- replicate(Call$new(), n = n)
}))
y <- CallCollections$new()
info(y)
#> object_size(x): 1.16 MB
#> object_size(f("x", cache)): 1.16 MB
#> object_size(g("x")): 47.3 MB
#> object.size(x): 328 bytes
#> object.size(f("x", cache)): 328 bytes
#> object.size(g("x")): 328 bytes Calculating object sizes is hard, because it's not clear exactly what should be counted as part of the object. See I also tried comparing with cache <- storr_rds("cache")
system.time(f("x", cache))
#> user system elapsed
#> 0.212 0.002 0.215
system.time(f("x", cache)) # Second run is very fast
#> user system elapsed
#> 0.000 0.000 0.001
system.time(g("x"))
#> user system elapsed
#> 0.217 0.004 0.221
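Going back to the x1/x2 puzzle above, my reading (an inference, not something verified in this thread) is that every iteration of lapply(1:1000, function(n) 1) returns the same constant from the function body, so all 1000 list elements point to one shared double vector, whereas as.list(rep(1, 1000)) allocates 1000 separate vectors. pryr::object_size() counts shared memory only once, which would explain 8.14 kB versus 56 kB for two lists that are identical(). One way to check, using lobstr (an assumption; any address-inspection tool would do):

library(lobstr)

x1 <- lapply(1:1000, function(n) 1)
x2 <- as.list(rep(1, 1000))

# Number of distinct allocations behind the list elements:
length(unique(obj_addrs(x1)))  # expected: 1, every element is the same shared vector
length(unique(obj_addrs(x2)))  # expected: 1000, each element is its own allocation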
See also this example, where the object logically should be much larger, which base object.size() does not reflect:

> library(pryr)
> library(R6)
> library(storr)
> Call <- R6Class("CallItem", public = list(rstring = c(NA), id = "", initialize = function() {
+ self$rstring <- sample(letters)
+ self$id <- paste(collapse = "", self$rstring)
+ }))
> CallCollections <- R6Class("CallCollections", public = list(callList = list(),
+ initialize = function(n = 1000) {
+ self$callList <- replicate(Call$new(), n = n)
+ }))
> x<-CallCollections$new(10)
> y<-CallCollections$new(1000)
> object.size(x)
328 bytes
> object_size(x)
68.7 kB
> object.size(y)
328 bytes
> object_size(y)
1.16 MB
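The reason base object.size() keeps reporting 328 bytes is that it does not follow environments, and an R6 instance is essentially an environment; pryr::object_size() does recurse into environments (and accounts for sharing). A minimal illustration of the same difference without R6, using a closure that captures a large vector:

library(pryr)

make_closure <- function() {
  big <- rnorm(1e5)        # roughly 800 kB captured in the closure's environment
  function() length(big)
}
cl <- make_closure()

object.size(cl)            # small: base R does not count the enclosing environment
object_size(cl)            # much larger: pryr includes the captured `big` vector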
At this point, @wch's helpful comments make me wonder how much memory your actual R process is consuming as a whole. What does htop say? Is it as bad as object_size() says? If so, can you get around it by passing the cache directly to readd()?
I guess my real use case is slightly more complex: it runs on a Slurm cluster using drake in combination with future.batchtools, and I always pass the cache explicitly. Here is what the R process's memory looks like when the object is read through f() (cache passed in) versus g() (cache created inside the function):

> library(pryr)
> library(R6)
> library(storr)
>
> Call <- R6Class("CallItem", public = list(rstring = c(NA), id = "", initialize = function() {
+ self$rstring <- sample(letters)
+ self$id <- paste(collapse = "", self$rstring)
+ }))
> CallCollections <- R6Class("CallCollections", public = list(callList = list(),
+ initialize = function(n = 1000) {
+ self$callList <- replicate(Call$new(), n = n)
+ }))
>
>
> cache <- storr_rds("cache")
>
> f <- function(key, cache) {
+ cache$get(key)
+ }
>
> g <- function(key) {
+ cache <- storr_rds("cache")
+ cache$get(key)
+ }
>
> x <- CallCollections$new(10000)
> cache$set("x", x)
> rm(x);gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 579122 31.0 940480 50.3 940480 50.3
Vcells 1066138 8.2 9859238 75.3 10054982 76.8
> system(paste0('ps u --pid ', Sys.getpid()))
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
bart 23086 100 0.5 199732 83852 pts/3 S+ 18:35 0:04 /usr/lib/R/bin/exec/R -f tmp.R --restore --save --no-readline
> xx<-(f("x", cache))
> system(paste0('ps u --pid ', Sys.getpid()))
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
bart 23086 100 0.5 199732 83912 pts/3 S+ 18:35 0:04 /usr/lib/R/bin/exec/R -f tmp.R --restore --save --no-readline
> rm(xx);gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 579431 31.0 940480 50.3 940480 50.3
Vcells 1066877 8.2 7887390 60.2 10054982 76.8
> system(paste0('ps u --pid ', Sys.getpid()))
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
bart 23086 102 0.5 199732 83912 pts/3 S+ 18:35 0:04 /usr/lib/R/bin/exec/R -f tmp.R --restore --save --no-readline
> xxx<-(g("x"))
> system(paste0('ps u --pid ', Sys.getpid()))
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
bart 23086 100 3.1 628320 512996 pts/3 S+ 18:35 0:08 /usr/lib/R/bin/exec/R -f tmp.R --restore --save --no-readline
> rm(xxx);gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 579435 31.0 7974897 426.0 8760842 467.9
Vcells 1066878 8.2 4038343 30.9 10054982 76.8
> system(paste0('ps u --pid ', Sys.getpid()))
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
bart 23086 104 3.1 628320 513016 pts/3 S+ 18:35 0:08 /usr/lib/R/bin/exec/R -f tmp.R --restore --save --no-readline
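For scripting before/after comparisons like the one above, a tiny helper that reports the current R process's resident set size can be handy. This is my own convenience sketch (the rss_mb name and the ps invocation are assumptions; it relies on a POSIX ps being available), reusing the f(), g(), and cache objects defined in the transcript:

# Resident set size of this R process, in MB (Linux/macOS only).
rss_mb <- function() {
  out <- system(paste("ps -o rss= -p", Sys.getpid()), intern = TRUE)
  as.numeric(out) / 1024
}

rss_mb()                  # baseline
xx <- f("x", cache)
rss_mb()                  # should barely move: same storr object that set the value
xxx <- g("x")
rss_mb()                  # should jump: the fresh storr object deserializes a bloated copy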
@bart1, you may also see increased memory usage on the cluster for other reasons. In any case, I suspect the original problem you brought up has nothing to do with R6 itself. Here it is again with plain lists:

library(drake)
library(pryr)
library(storr)
clean(destroy = TRUE)
gc(verbose = FALSE)
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 894781 47.8 1835641 98.1 1128155 60.3
#> Vcells 1496617 11.5 8388608 64.0 1939935 14.9
mem_used()
#> 62.1 MB
x0 <- c(lapply(1:1e+06, function(n) 1), as.list(rep(1, 1e+06)))
object_size(x0)
#> 72 MB
mem_used()
#> 142 MB
plan <- drake_plan(a = c(lapply(1:1e+06, function(n) 1), as.list(rep(1, 1e+06))))
make(plan)
#> target a
cache <- storr_rds(".drake", mangle_key = TRUE)
mem_used()
#> 148 MB
x <- cache$get("a")
mem_used()
#> 292 MB
object_size(x)
#> 128 MB
x <- readd(a)
mem_used()
#> 436 MB
object_size(x)
#> 128 MB

At first glance, this does not appear to affect pure storr:

library(pryr)
library(storr)
cache <- storr_rds("cache", mangle_key = TRUE)
x <- c(lapply(1:1e+06, function(n) 1), as.list(rep(1, 1e+06)))
cache$set("a", x)
object_size(x)
#> 72 MB
mem_used()
#> 117 MB
y <- cache$get("a")
object_size(y)
#> 72 MB
mem_used()
#> 117 MB
cache$destroy()

But if I restart my R session before reading from the cache, memory explodes again.

library(pryr)
library(storr)
cache <- storr_rds("cache", mangle_key = TRUE)
x <- c(lapply(1:1e6, function(n) 1), as.list(rep(1, 1e6)))
object_size(x)
#> 72 MB
cache$set("a", x)
# Restart R.
library(storr)
library(pryr)
cache <- storr_rds("cache", mangle_key = TRUE)
y <- cache$get("a")
object_size(y)
#> 128 MB
mem_used()
#> 184 MB

Together with @wch's comments, I think this is an unfortunate and confusing problem, but I believe the solution is outside the scope of drake.
@bart1, you might want to see richfitz/storr#76 (comment). As with #345, it seems like serialization just isn't cooperating.
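To see that the inflation really is a serialization effect, independent of storr and drake, one can round-trip the R6 collection through base serialize()/unserialize(). My understanding (an assumption, consistent with richfitz/storr#76) is that the function bodies shared across the 1000 instances are written out once per instance and come back as independent copies:

library(pryr)

x <- CallCollections$new()            # classes as defined earlier in the thread
object_size(x)                        # about 1 MB: shared function bodies counted once

y <- unserialize(serialize(x, NULL))  # round trip through base R serialization
object_size(y)                        # expected to be far larger: sharing is lost on the way back in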
We should probably dissuade users from setting R6 objects as targets. I am planning a new chapter of the manual to help people think about what should be a target and what should not: ropensci-books/drake#120.
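In the meantime, one possible workaround (my own sketch, not an official drake recommendation; the rebuild_calls helper is hypothetical) is to make plain data the target and rebuild the R6 objects outside the plan, so nothing environment-heavy ever goes through the cache:

library(drake)

# Store only the letter vectors as a target; plain lists serialize cleanly.
plan <- drake_plan(
  call_data = replicate(1000, sample(letters), simplify = FALSE)
)
make(plan)

# Rebuild the R6 objects after readd(), outside the cache,
# reusing the Call class defined earlier in the thread.
rebuild_calls <- function(rstrings) {
  lapply(rstrings, function(r) {
    obj <- Call$new()
    obj$rstring <- r
    obj$id <- paste(r, collapse = "")
    obj
  })
}
calls <- rebuild_calls(readd(call_data))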
Original issue description:

I have encountered this case while I was working on simulations. I base my simulations on R6 objects that in turn contain other R6 objects, and I use drake to run the simulations. I noticed that when I store the simulations they consume disproportionately much memory (some blow up to 10 GB). Below is some code with a small reproducible example where the object retrieved from drake is 40 times larger than the original. It does not seem to be related to storing it in the cache, since doing that directly does not have the same effect. I think R6 does some things to conserve memory by linking functions across environments. I could not figure out what causes this effect (I have been looking at functions like store_object, store_target, and build_and_store).