Speed up the cache #907

Closed
4 tasks done
nettoyoussef opened this issue Jun 13, 2019 · 13 comments
nettoyoussef commented Jun 13, 2019

Prework

Description

Following #891, I am now able to produce several datasets to be fed to a classifier algorithm, logging their parameters with MLflow.

However, I am now having trouble with drake's caching system.
I'm performing dimensionality reduction with algorithms such as t-SNE and word2vec under different parameters. Each run produces a different dataset that will be used with a different classifier.

Since the datasets are huge (~6 million rows), drake takes a long time to save them to the cache, and it does so in an uncompressed way. As a result, it takes about 5 minutes to save each rds file, and the files themselves are about 1.1 GB each.

If instead I use the fst library with maximum compression, doing this outside drake, I can store each file in about 13 s at roughly 300 MB.
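
For reference, the comparison I am making looks roughly like the sketch below (timings are machine dependent; the data frame matches the one in the Benchmarks section further down):

library(fst)

# A data frame of roughly the same size as my real datasets (~6 million rows, 20 columns).
temp <- as.data.frame(matrix(runif(20 * 6e6), nrow = 6e6, ncol = 20))

# Saving with saveRDS() takes on the order of minutes on my machine.
system.time(saveRDS(temp, tempfile(fileext = ".rds")))

# fst with maximum compression takes about 13 s and produces a much smaller file.
system.time(fst::write_fst(temp, tempfile(fileext = ".fst"), compress = 100))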

So, I'm divided on how to address this issue.

I could change the targets so they do not return the datasets, so drake would not store them in the cache, and instead save them directly to disk from inside the function using the fst library.

The problem, however, is that I will use those same datasets in the next step of the plan, passing them as arguments to the classifiers, so it is not clear how I would reference them. I could just make the classifiers read the files from disk directly, but then drake would not track those read/write operations.

On the other hand, I could use parallelism with the native rds cache to speed things up. That, however, would not reduce the size of the files, and it seems like a waste of resources.

Finally, I could make the saving and reading transparent to drake and reference the files in the plan in some other way. See the last example; it took about 20 s per object.

Reproducible example

This would be the basic plan, using drake's cache, but it is too slow:

plan <-     
        drake_plan(    
            max_expand = 5                   
            ,data = read_file(file_in("cleaned_data_to_dim_red"))        
            ,db_tsne = target(
                              tsne_function(db_cnae_bag = data
                                            ,perplexity = perplexity 
                                            ,other_arguments
                                            )
                              , transform = map(perplexity = !!seq_len(200))
                              )
         )

Using file_in and file_out gives a strange graph (max_expand set to aid visualization):

plan <-     
        drake_plan(    
            max_expand = 2          
            ,data = read_file(file_in("cleaned_data_to_dim_red"))
        
            ,db_tsne = target(
                              tsne_fst_write_on_disk(db_cnae_bag = data
                                                     ,perplexity = perplexity 
                                                     ,other_arguments
                                                     )
                              , transform = map(perplexity = !!seq_len(200))
                              )
        
          #This duplicates the files in the cache and in the project folder.
          #fst through drake is faster than drake's rds cache - about 1.5 min.
          #Saving directly, outside drake, takes about 13 s.
         ,db_tsne_write = target(    
                                      fst::write_fst(db_tsne
                                                     ,here(file_out(!!paste0("/data/"
                                                               , .id_chr, ".fst")))
                                                     ,compress = 100)
                                      ,transform = map(db_tsne)
                                    )                     
         
         ,classification_db = target(    
                                      {
                                        db_tsne_write # referenced only to create the dependency
                                        fst::read_fst(here(file_in(!!paste0("/data/"
                                                      , .id_chr, ".fst")))) # read_fst() has no compress argument
                                      }
                                      ,transform = map(db_tsne_write)
                                    )                     
         
                             
        )

Finally, the best solution so far (it doesn't make use of file_in or file_out, though):

plan <-     
        drake_plan(    
            max_expand = 5
            ,data = read_file(file_in("cleaned_data_to_dim_red"))
            
            ,db_tsne = target(
                              tsne_fst_write_on_disk(db_cnae_bag = data
                                                     ,perplexity = perplexity 
                                                     ,other_arguments
                                                     #It is necessary to do some 
                                                     #massaging on the names of 
                                                     #the saved files
                                                     )
                              , transform = map(perplexity = !!seq_len(200))
                              )
        
            ,db_tsne_read = target(    
                                    fst::read_fst(db_tsne
                                                   ,here(!!paste0("/data/", .id_chr, ".fst")))
                                    ,transform = map(db_tsne)
                                    ) 
                             
        )

Benchmarks

A similarly sized dataset would be:

temp <- as.data.frame(matrix(runif(20 * (6 * 10^6)), nrow = 6 * 10^6, ncol = 20))

Timings:

  • drake cache: the target took about 5.5 minutes (rds file) and produced a 1.1 GB file.
  • drake cache + fst: the target took about 1.5 minutes (fst file) and produced a 300 MB file.
  • fst write only (cache transparent, last example): the target took about 20 s and produced a 300 MB file.
wlandau changed the title from "Drake's cache too slow - Is there a workaround?" to "drake's cache too slow - Is there a workaround?" on Jun 13, 2019
wlandau (Member) commented Jun 13, 2019

This is a high-priority issue, and I expected it to come up at some point. I have some initial thoughts, and I will dive into your code when I have more time.

Near the end of the deep learning chapter of the manual, I have an example of the file_out()/file_in() workaround. It works because file_out()s in one step can be file_in()s of a downstream step. From the manual:

[Screenshot: dependency graph of the file_out()/file_in() pattern from the manual's deep learning chapter]

And a smaller standalone reprex:

library(drake)
plan <- drake_plan(
    x = write_stuff(file_out("large_dataset.fst")),
    y = read_stuff(file_in("large_dataset.fst"))
)
config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-06-13 by the reprex package (v0.3.0)
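
write_stuff() and read_stuff() above are just placeholders; with fst, they could look something like this (a sketch, not part of drake's API):

library(fst)

# Hypothetical helpers for the reprex above. The path flows through file_out()/file_in(),
# so drake tracks the file by its hash rather than storing the data in the cache.
write_stuff <- function(path) {
  data <- data.frame(x = runif(1e6)) # stand-in for the real large dataset
  fst::write_fst(data, path, compress = 100)
  path
}

read_stuff <- function(path) {
  fst::read_fst(path)
}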

In both cases, drake's cache only stores the hashes of the hdf5/fst files, not the contents of the files themselves. The slowdown you observe from file_out()/file_in() may be due to the fact that those files are still hashed. (But drake takes measures to avoid repeated hashing: #4).

But if people like you have to rely on file_out() and file_in() for large datasets, this is really not an ideal situation. I do have plans to make the cache faster for big data: richfitz/storr#103. The proposed multiformat storr will look at the class (and maybe other characteristics) of a data object and store it in the format best for that object. cc @richfitz.

nettoyoussef (Author) commented:

Hi Will! Thanks for the fast response, as usual.

I think this workaround will suffice for my immediate needs.

I am not very familiar with the storr package, but I was curious whether it could be made more format agnostic. For example, the user could specify the format of the cache for a given object through a flag in the call, and the function would find an appropriate library to do the saving.

That way the package would be more flexible and could adapt to changes in the ecosystem. Off the top of my head, the last few years have seen at least two very fast read/write implementations for R objects, the fst and feather packages, which could be valuable additions to storr.

Leaving the saving as just a wrapper around a call to another library would make it very easy to support new file formats. Faster implementations of existing formats could also be adopted with much less overhead.

This could be implemented not at the level of the overall cache, but at the level of individual objects. That way, files that have to be shared with users on other platforms would not need to be managed manually; they would be handled by the cache directly. In other words, you would spend less time managing names and read/write calls, and less time worrying about whether the files are up to date.

As a beneficial side effect, you wouldn't have to build tools to "guess" the best file format for an object. You just set a default, and users can change it to another format if they want. More flexibility with (apparently) less work.
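
Something along these lines is what I have in mind (a purely hypothetical interface; neither default_format nor a per-call format argument exists in storr today):

library(storr)

# Hypothetical sketch only: these format arguments do not exist in storr.
my_data <- data.frame(x = runif(10))
cache <- storr_rds(".cache", default_format = "rds")

cache$set("big_table", my_data, format = "fst") # opt into a faster backend for one object
cache$get("big_table")                          # reading back stays format agnostic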

I am not sure how difficult it would be to do something along those lines, though, or whether this kind of implementation belongs at the level of storr or of drake.

nettoyoussef (Author) commented:

I had some problems using file_out in my plan.

Basically, even after running the function, the target file would appear as "missing", thus making the cache start the plan from that point onwards. If I remove the file_out function, the plan appears as up to date after rerunning, and I have no issues whatsoever. Is this expected behaviour?

wlandau (Member) commented Jun 16, 2019

I am not very familiar with the Storr package but I was curious if it could not be made to be more file agnostic. For example, leaving to the user to specify the format of the cache for the object through a flag on a call, and letting the function find an appropriate library to do the saving.

I do have plans for this: richfitz/storr#103. It seems most natural to set the format at the class/type level.

Also, inspired by advice from @eddelbuettel here, I am attempting to leverage fst for arbitrary data. Benchmarks are encouraging:

library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#>    user  system elapsed 
#>   0.362   0.019   0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#>    user  system elapsed 
#>   0.314   1.340   1.689

but there are some roadblocks with big data / long vectors.

library(fst)
x <- data.frame(x = raw(2^32))
#> Warning in attributes(.Data) <- c(attributes(.Data), attrib): NAs
#> introduced by coercion to integer range
#> Error in if (mirn && nrows[i] > 0L) {: missing value where TRUE/FALSE needed
x <- list(x = raw(2^32))
as.data.frame(x)
#> Warning in attributes(.Data) <- c(attributes(.Data), attrib): NAs
#> introduced by coercion to integer range
#> Error in if (mirn && nrows[i] > 0L) {: missing value where TRUE/FALSE needed
class(x) <- "data.frame"
file <- tempfile()
write_fst(x, file) # No error here...
# read_fst(file)   # but I get a segfault here.

Created on 2019-06-16 by the reprex package (v0.3.0)

Basically, even after running the function, the target file would appear as "missing", thus making the cache start the plan from that point onwards. If I remove the file_out function, the plan appears as up to date after rerunning, and I have no issues whatsoever. Is this expected behaviour?

No, this is not expected. Would you open another issue and post a reprex?

nettoyoussef (Author) commented:

I do have plans for this: richfitz/storr#103. It seems most natural to set the format at the class/type level.

Also, inspired by advice from @eddelbuettel here, I am attempting to leverage fst for arbitrary data. Benchmarks are encouraging

library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#>    user  system elapsed 
#>   0.362   0.019   0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#>    user  system elapsed 
#>   0.314   1.340   1.689

Great! There seems to be a considerable speedup using fst to compress files, and saving the rds after compression, as in richfitz/storr#110, appears to save considerable trouble.

Created on 2019-06-16 by the reprex package (v0.3.0)

Basically, even after running the function, the target file would appear as "missing", thus making the cache start the plan from that point onwards. If I remove the file_out function, the plan appears as up to date after rerunning, and I have no issues whatsoever. Is this expected behaviour?

No, this is not expected. Would you open another issue and post a reprex?

I will create a reprex and post it as a new issue asap.

wlandau (Member) commented Jun 18, 2019

Since the datasets are huge (~6 million rows), drake takes a long time to save them to the cache, and it does so in an uncompressed way. As a result, it takes about 5 minutes to save each rds file, and the files themselves are about 1.1 GB each.

As I just learned, storr's default is actually to use compression with gzfile(). Even without gzfile, saveRDS() has its own compression, which seems to take quite a bit of time.
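
A quick way to see the cost of that built-in compression (a rough sketch; timings will vary by machine):

# Random doubles compress poorly, so gzip spends a lot of time for little gain.
x <- data.frame(matrix(runif(2e7), ncol = 20))
system.time(saveRDS(x, tempfile(), compress = FALSE)) # serialization only
system.time(saveRDS(x, tempfile()))                   # default gzip compression, typically much slower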

I totally inundated @richfitz this week (sorry about that) so richfitz/storr#111 could take a while. For now, you can try creating a storr with no compression and feeding it to drake. If you do, please let me know how it goes.

library(drake)
cache <- storr::storr_rds(tempfile(), compress = FALSE)
plan <- drake_plan(x = 1)
make(plan, cache = cache)
#> target x
readd(x, cache = cache)
#> [1] 1

Created on 2019-06-18 by the reprex package (v0.3.0)

wlandau (Member) commented Jun 20, 2019

nettoyoussef (Author) commented:

Hi Will,

Just a quick update. I tried recreating the file_out problem but I was not successful. Probably I was doing something wrong before.

wlandau (Member) commented Jul 21, 2019

From the benchmarks at richfitz/storr#111, it looks like storr's default gzip compression incurs a severe runtime penalty. We can address this properly in drake after richfitz/storr#111 is complete. In the meantime, does your work go any faster if you disable compression entirely? Example:

library(drake)
library(storr)
load_mtcars_example()
cache <- storr_rds(tempfile(), compress = FALSE)
make(my_plan, cache = cache)

wlandau changed the title from "drake's cache too slow - Is there a workaround?" to "Speed up the cache" on Jul 28, 2019
nettoyoussef (Author) commented Jul 31, 2019

Hi Will,

Sorry for the late response. I finally had the time to test some benchmarks.
At this time, I am not using drake's cache for large objects.
To make the comparison as straightforward as possible, I did the following (a sketch of both runs follows this list):

  • Set up the environment using cache <- storr_rds(tempfile(), compress = FALSE).
  1. Run 1: save the object directly to disk (using fst) with a pre-defined name and return NULL.
  2. Run 2: save the object directly to disk (using fst) with a pre-defined name and also return the object, so that drake saves it in its cache.
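
Schematically, the two runs looked like this (train_and_classify() and the target names are placeholders for my real code):

library(drake)

plan_run1 <- drake_plan(
  models = {
    res <- train_and_classify(data)
    fst::write_fst(res, here::here("data/models.fst"), compress = 100)
    NULL # run 1: return NULL, so drake stores almost nothing in its cache
  }
)

plan_run2 <- drake_plan(
  models = {
    res <- train_and_classify(data)
    fst::write_fst(res, here::here("data/models.fst"), compress = 100)
    res # run 2: also return the object, so drake stores it in the cache
  }
)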

Using drake 7.3.0

  • Run 1 took 13.57 minutes to train and classify about 1,000 models on a large dataset on my laptop. The saved file is about 1 GB.
  • Run 2 took 15.98 minutes to do the same. A new cache object of about 8.9 GB appeared.

Using drake 7.5.2

  • Run 1 took 14.88 minutes.
  • Run 2 took 16.8 minutes. A new cache object of about 8.7 GB appeared.

So it appears that drake takes about 1.5 to 2 minutes to encode the object in the cache.

I did not try to enable compression inside drake, because the overhead would be even greater.

wlandau (Member) commented Aug 1, 2019

Thank you for running benchmarks in a practical scenario; this is very useful! What happens if you install wlandau/storr@deea50d and try cache <- storr_rds(tempfile(), compress = "lz4")? If performance is still low, we may have to rethink richfitz/storr#111.
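
In other words, something like this (assuming that branch exposes the "lz4" option):

library(drake)
cache <- storr::storr_rds(tempfile(), compress = "lz4")
plan <- drake_plan(x = data.frame(y = runif(1e6)))
make(plan, cache = cache)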

nettoyoussef (Author) commented:

Not sure if I installed it correctly, but I did:

library(devtools)
devtools::install_github("wlandau/storr", ref = "110")

The version with lz4 compression took 15.33 minutes and created a cache object of about 3.2 GB.
So apparently, this version is indeed better.

wlandau (Member) commented Aug 4, 2019

Awesome, thanks! richfitz/storr#111 seems to get us halfway there. To fully achieve the efficiency of fst without requiring file_out() or file_in(), I have an idea that I think will work: #971 (comment).

wlandau (Member) commented Aug 5, 2019

I now consider this issue solved via #977. Now, all you need for large data frames is target(your_command, format = "fst").
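
A minimal sketch of the new syntax:

library(drake)

plan <- drake_plan(
  big_df = target(
    data.frame(x = runif(6e6), y = runif(6e6)), # any command that returns a data frame
    format = "fst"                              # store this target with fst instead of rds
  )
)

make(plan)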

wlandau closed this as completed on Aug 5, 2019