
Storing collection of R6 objects makes them consume much more memory #383

Closed
bart1 opened this issue May 12, 2018 · 8 comments

bart1 commented May 12, 2018

I encountered this while working on simulations. My simulations are based on R6 objects that in turn contain other R6 objects, and I use drake to run them. I noticed that when I store the simulations they consume a disproportionate amount of memory (some blow up to 10 GB).

Below is a small reproducible example where the object retrieved from drake is 40 times larger than the original. It does not seem to be related to storing it in the cache as such, since doing that directly does not have the same effect. I think R6 conserves memory by sharing functions across environments. I could not figure out what causes this effect (I have been looking at functions like store_object, store_target, and build_and_store).

require(R6)
Call <- R6Class(
  "CallItem",
  public = list(
    rstring = c(NA),
    id = "",
    initialize = function() {
      self$rstring <- sample(letters)
      self$id <- paste(collapse = "", self$rstring)
    }
  )
)
CallCollections <- R6Class(
  "CallCollections",
  public = list(
    callList = list(),
    initialize = function(n = 1000) {
      self$callList <- replicate(Call$new(), n = n)
    }
  )
)
require(drake)
require(pryr)
clean()
# just create an object and calculate its size
object_size(CallCollections$new())

# create the object in a drake plan and calculate its size
make(drake_plan(a=CallCollections$new()))
object_size(readd(a))
# resulting object is 40 times larger

# storing it in the cache directly, outside drake, does not make the object larger
ch<-get_cache()
r<-CallCollections$new()
ch$set('r',r)
object_size(ch$get('r'))
wlandau (Member) commented May 12, 2018

Wow, that's so weird! Does memory blow up when you run make(), or only when you call readd() afterwards?

I don't know exactly what is going on, but I suspect it has something to do with environments and scoping. I managed to reproduce the problem entirely without drake.

library(pryr)
library(R6)
library(storr)

Call <- R6Class("CallItem", public = list(rstring = c(NA), id = "", initialize = function() {
  self$rstring <- sample(letters)
  self$id <- paste(collapse = "", self$rstring)
}))
CallCollections <- R6Class("CallCollections", public = list(callList = list(), 
  initialize = function(n = 1000) {
    self$callList <- replicate(Call$new(), n = n)
  }))


cache <- storr_rds("cache")

f <- function(key, cache) {
  cache$get(key)
}

g <- function(key) {
  cache <- storr_rds("cache")
  cache$get(key)
}

x <- CallCollections$new()
cache$set("x", x)
object_size(f("x", cache))
#> 1.19 MB
object_size(g("x"))
#> 48 MB

It seems to matter where the storr cache object is created. @richfitz and @wch, any ideas?
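
If storr's in-memory memoisation is what separates f() from g(), then forcing a fresh read from disk in the same session should reproduce the blow-up. A quick check, assuming storr's use_cache argument to $get() (which skips the internal environment cache when FALSE):

cache <- storr_rds("cache")
object_size(cache$get("x"))                     # served from storr's in-memory cache
object_size(cache$get("x", use_cache = FALSE))  # forces a fresh unserialize from disk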

In drake, the readd() function can behave like f() or g(), depending on whether you give it the cache. So for your case, I recommend creating the cache explicitly and then passing it to readd() directly.

cache <- storr_rds(".drake", mangle_key = TRUE)  # drake's default cache
x <- CallCollections$new()
cache$set("x", x)
object_size(readd(x))  # Uses get_cache() by default.
#> 48 MB
object_size(readd(x, cache = storr_rds(".drake", mangle_key = TRUE)))  # a fresh cache behaves like g()
#> 48.2 MB
object_size(readd(x, cache = cache))
#> 1.19 MB

wch commented May 13, 2018

I think there are a bunch of weird things going on. The first is that object_size may not be giving the right answer. Check this out:

library(storr)
library(pryr)

f <- function(key, cache) {
  cache$get(key)
}

g <- function(key) {
  cache <- storr_rds("cache")
  cache$get(key)
}

info <- function(x) {
  cache <- storr_rds("cache")
  cache$set("x", x)
  y <- f("x", cache)
  z <- g("x")
  
  cat('object_size(x):             ')
  print(object_size(x))
  cat('object_size(f("x", cache)): ')
  print(object_size(y))
  cat('object_size(g("x")):        ')
  print(object_size(z))
  cat('object.size(x):             ')
  print(object.size(x))
  cat('object.size(f("x", cache)): ')
  print(object.size(y))
  cat('object.size(g("x")):        ')
  print(object.size(z))
}

x1 <- lapply(1:1000, function(n) 1)
x2 <- as.list(rep(1, 1000))
info(x1)
#> object_size(x):             8.14 kB
#> object_size(f("x", cache)): 8.14 kB
#> object_size(g("x")):        56 kB
#> object.size(x):             56040 bytes
#> object.size(f("x", cache)): 56040 bytes
#> object.size(g("x")):        56040 bytes
info(x2)
#> object_size(x):             56 kB
#> object_size(f("x", cache)): 56 kB
#> object_size(g("x")):        56 kB
#> object.size(x):             56040 bytes
#> object.size(f("x", cache)): 56040 bytes
#> object.size(g("x")):        56040 bytes
identical(x1, x2, F, F, F, F)
#> [1] TRUE

But before you get too excited and think that object.size() is the answer, look at this:

library(R6)

Call <- R6Class("CallItem", public = list(rstring = c(NA), id = "", initialize = function() {
  self$rstring <- sample(letters)
  self$id <- paste(collapse = "", self$rstring)
}))
CallCollections <- R6Class("CallCollections", public = list(callList = list(), 
  initialize = function(n = 1000) {
    self$callList <- replicate(Call$new(), n = n)
  }))

y <- CallCollections$new()
info(y)
#> object_size(x):             1.16 MB
#> object_size(f("x", cache)): 1.16 MB
#> object_size(g("x")):        47.3 MB
#> object.size(x):             328 bytes
#> object.size(f("x", cache)): 328 bytes
#> object.size(g("x")):        328 bytes

Calculating object sizes is hard, because it's not clear exactly what should be counted as part of the object. See ?object.size and ?object_size for more about that.
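
A third yardstick worth keeping in mind is the size of the serialized payload, since that is what storr actually writes to disk. Sharing within a single object survives serialization, but environments the object shares with the rest of the session get copied into the payload, which is one hypothesis for the inflated numbers (a sketch, base R only):

# Bytes in the serialized representation of y, i.e. roughly what lands on disk.
length(serialize(y, connection = NULL))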

I also tried comparing with system.time(). For f() and g(), the speed is the same. (Note that the first access of a storr_rds object is slow, but subsequent ones are fast. g() creates a new one each time, so it will be slow each time.)

cache <- storr_rds("cache")
system.time(f("x", cache))
#>   user  system elapsed 
#>  0.212   0.002   0.215 
system.time(f("x", cache))  # Second run is very fast
#>   user  system elapsed 
#>  0.000   0.000   0.001 

system.time(g("x"))
#>   user  system elapsed 
#>  0.217   0.004   0.221 

bart1 (Author) commented May 13, 2018

I think object.size is the wrong function to use, since it does not measure the size of environments well:

> This function is better than the built-in object.size() because it accounts for shared elements within an object and includes the size of environments. (http://adv-r.had.co.nz/memory.html)

See also this example, where the second object logically must be much larger; object_size reports this correctly but object.size does not:

> library(pryr)
> library(R6)
> library(storr)
> Call <- R6Class("CallItem", public = list(rstring = c(NA), id = "", initialize = function() {
+   self$rstring <- sample(letters)
+   self$id <- paste(collapse = "", self$rstring)
+ }))
> CallCollections <- R6Class("CallCollections", public = list(callList = list(), 
+                                                             initialize = function(n = 1000) {
+                                                               self$callList <- replicate(Call$new(), n = n)
+                                                             }))
> x<-CallCollections$new(10)
> y<-CallCollections$new(1000)
> object.size(x)
328 bytes
> object_size(x)
68.7 kB
> object.size(y)
328 bytes
> object_size(y)
1.16 MB

wlandau (Member) commented May 13, 2018

At this point, @wch's helpful comments make me wonder how much memory your actual R process is consuming as a whole. What does htop say? Is it as bad as object_size() says? If so, can you get around it by passing the cache directly to readd()?

bart1 (Author) commented May 13, 2018

My real use case is a bit more complex: it runs on a SLURM cluster using drake in combination with future.batchtools. I always pass the cache explicitly, retrieved through the recover_cache() function (see #381). I will check later whether it makes a difference if I recover the cache through storr functions instead; here I just tried to make a small reproducible example. I'm also fairly sure this relates to real memory usage: cluster jobs where I store R6 objects need 12 GB of memory reserved, whereas jobs that handle similar objects without storing them get away with about 2 GB. Below is an example showing that the process really does grow after reading from the cache (memory consumption rises to 3% after invoking g(); I assume it stays that high due to caching in storr).

> library(pryr)
> library(R6)
> library(storr)
> 
> Call <- R6Class("CallItem", public = list(rstring = c(NA), id = "", initialize = function() {
+   self$rstring <- sample(letters)
+   self$id <- paste(collapse = "", self$rstring)
+ }))
> CallCollections <- R6Class("CallCollections", public = list(callList = list(), 
+                                                             initialize = function(n = 1000) {
+                                                               self$callList <- replicate(Call$new(), n = n)
+                                                             }))
> 
> 
> cache <- storr_rds("cache")
> 
> f <- function(key, cache) {
+   cache$get(key)
+ }
> 
> g <- function(key) {
+   cache <- storr_rds("cache")
+   cache$get(key)
+ }
> 
> x <- CallCollections$new(10000)
> cache$set("x", x)
> rm(x);gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  579122 31.0     940480 50.3   940480 50.3
Vcells 1066138  8.2    9859238 75.3 10054982 76.8
> system(paste0('ps u --pid ', Sys.getpid()))
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bart     23086  100  0.5 199732 83852 pts/3    S+   18:35   0:04 /usr/lib/R/bin/exec/R -f tmp.R --restore --save --no-readline
> xx<-(f("x", cache))
> system(paste0('ps u --pid ', Sys.getpid()))
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bart     23086  100  0.5 199732 83912 pts/3    S+   18:35   0:04 /usr/lib/R/bin/exec/R -f tmp.R --restore --save --no-readline
> rm(xx);gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  579431 31.0     940480 50.3   940480 50.3
Vcells 1066877  8.2    7887390 60.2 10054982 76.8
> system(paste0('ps u --pid ', Sys.getpid()))
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bart     23086  102  0.5 199732 83912 pts/3    S+   18:35   0:04 /usr/lib/R/bin/exec/R -f tmp.R --restore --save --no-readline
> xxx<-(g("x"))
> system(paste0('ps u --pid ', Sys.getpid()))
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bart     23086  100  3.1 628320 512996 pts/3   S+   18:35   0:08 /usr/lib/R/bin/exec/R -f tmp.R --restore --save --no-readline
> rm(xxx);gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  579435 31.0    7974897 426.0  8760842 467.9
Vcells 1066878  8.2    4038343  30.9 10054982  76.8
> system(paste0('ps u --pid ', Sys.getpid()))
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bart     23086  104  3.1 628320 513016 pts/3   S+   18:35   0:08 /usr/lib/R/bin/exec/R -f tmp.R --restore --save --no-readline

wlandau (Member) commented May 13, 2018

@bart1, you may also see increased memory usage on the cluster because drake needs all the dependencies to be in memory in order to build a target. And in an even more extravagant move, drake keeps those dependencies and the newly-built target in memory until nothing downstream will need them again in the current make() (see prune_envir()). I still think this is the right design choice because otherwise we would waste time repeatedly reading the same large objects from the cache. However, it may explain some of the high memory usage you are seeing in future.batchtools jobs.
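
A rough way to watch this holding pattern from inside a pipeline (a sketch; the plan below is illustrative and assumes pryr is available where the targets build):

library(drake)
library(pryr)
plan <- drake_plan(
  big = as.list(rep(1, 1e6)),
  snapshot = {
    invisible(big)  # declare the dependency on `big`
    mem_used()      # measured while drake is still holding `big` in memory
  }
)
make(plan)
readd(snapshot)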

In any case, I suspect the original problem you brought up has nothing to do with drake or R6. Here is a drake run-through like yours, except without R6. The object should be 72 MB, but it comes back as 128 MB when read from the cache.

library(drake)
library(pryr)
library(storr)

clean(destroy = TRUE)
gc(verbose = FALSE)
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells  894781 47.8    1835641 98.1  1128155 60.3
#> Vcells 1496617 11.5    8388608 64.0  1939935 14.9

mem_used()
#> 62.1 MB
x0 <- c(lapply(1:1e+06, function(n) 1), as.list(rep(1, 1e+06)))
object_size(x0)
#> 72 MB
mem_used()
#> 142 MB
plan <- drake_plan(a = c(lapply(1:1e+06, function(n) 1), as.list(rep(1, 1e+06))))
make(plan)
#> target a
cache <- storr_rds(".drake", mangle_key = TRUE)
mem_used()
#> 148 MB
x <- cache$get("a")
mem_used()
#> 292 MB
object_size(x)
#> 128 MB
x <- readd(a)
mem_used()
#> 436 MB
object_size(x)
#> 128 MB

At first glance, this does not appear to affect pure storr.

library(pryr)
library(storr)
cache <- storr_rds("cache", mangle_key = TRUE)
x <- c(lapply(1:1e+06, function(n) 1), as.list(rep(1, 1e+06)))
cache$set("a", x)
object_size(x)
#> 72 MB
mem_used()
#> 117 MB
y <- cache$get("a")
object_size(y)
#> 72 MB
mem_used()
#> 117 MB
cache$destroy()

But if I restart my R session before reading from the cache, memory explodes again.

library(pryr)
library(storr)
cache <- storr_rds("cache", mangle_key = TRUE)
x <- c(lapply(1:1e6, function(n) 1), as.list(rep(1, 1e6)))
object_size(x)
#> 72 MB
cache$set("a", x)
# Restart R.
library(storr)
library(pryr)
cache <- storr_rds("cache", mangle_key = TRUE)
y <- cache$get("a")
object_size(y)
#> 128 MB
mem_used()
#> 184 MB

Together with @wch's comments, I think this is an unfortunate and confusing problem, but I believe the solution is outside the scope of drake.

wlandau (Member) commented May 23, 2018

@bart1, you might want to see richfitz/storr#76 (comment). As with #345, it seems like serialization just isn't cooperating.

wlandau (Member) commented Nov 3, 2019

We should probably dissuade users from setting R6 objects as targets. I am planning a new chapter of the manual to help people think about what should be a target and what should not: ropensci-books/drake#120.
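
In the meantime, one pattern that sidesteps the problem is to keep targets as plain data and rebuild R6 objects on demand. A minimal sketch using the classes from this thread (flatten_calls() is a hypothetical helper, not part of drake or R6):

library(drake)
# Hypothetical helper: strip the R6 layer down to plain lists before storing.
# Plain lists carry no environments, so they serialize at their natural size.
flatten_calls <- function(collection) {
  lapply(collection$callList, function(item) {
    list(rstring = item$rstring, id = item$id)
  })
}
plan <- drake_plan(a = flatten_calls(CallCollections$new()))
make(plan)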
