Speed up the cache #907

Closed
4 tasks done
nettoyoussef opened this issue Jun 13, 2019 · 13 comments
nettoyoussef commented Jun 13, 2019

Prework

Description

Following #891, I am now able to produce several datasets to be fed to a classifier algorithm, logging their parameters with MLflow.

However, I am now having trouble with drake's caching system.
I'm performing dimensionality reduction with algorithms such as t-SNE and word2vec under different parameters. Each run produces a different dataset that will be used with a different classifier.

Since the datasets are huge (~6 million rows), drake takes a long time to save them to the cache, and it does so in an uncompressed way. As a result, it takes about 5 minutes to save each rds file, and the files themselves are about 1.1 GB each.

If instead I use the fst library with maximum compression, doing this outside drake, I can store each file in about 13 s at roughly 300 MB.
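
For reference, the comparison I am making looks roughly like the sketch below (timings are machine dependent; the data frame matches the one in the Benchmarks section further down):

library(fst)

# A data frame of roughly the same size as my real datasets (~6 million rows, 20 columns).
temp <- as.data.frame(matrix(runif(20 * 6e6), nrow = 6e6, ncol = 20))

# Saving with saveRDS() takes on the order of minutes on my machine.
system.time(saveRDS(temp, tempfile(fileext = ".rds")))

# fst with maximum compression takes about 13 s and produces a much smaller file.
system.time(fst::write_fst(temp, tempfile(fileext = ".fst"), compress = 100))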

So, I'm divided on how to address this issue.

I could change the targets so they do not return the datasets, so drake would not store them in the cache, and instead save them directly to disk from inside the function using the fst library.

The problem, however, is that I will use those same datasets in the next step of the plan, passing them as arguments to the classifiers, so it is not clear how I would reference them. I could just make the classifiers read the files from disk directly, but then drake would not track those read/write operations.

On the other hand, I could use parallelism with the native rds cache to speed things up. That, however, would not reduce the size of the files, and it seems like a waste of resources.

Finally, I could make the saving and reading transparent to drake and reference the files in the plan in some other way. See the last example; it took about 20 s per object.

Reproducible example

This would be the basic plan, using drake's cache, but it is too slow:

plan <-     
        drake_plan(    
            max_expand = 5                   
            ,data = read_file(file_in("cleaned_data_to_dim_red"))        
            ,db_tsne = target(
                              tsne_function(db_cnae_bag = data
                                            ,perplexity = perplexity 
                                            ,other_arguments
                                            )
                              , transform = map(perplexity = !!seq_len(200))
                              )
         )

Using file_in and file_out gives a strange graph (max_expand set to aid visualization):

plan <-     
        drake_plan(    
            max_expand = 2          
            ,data = read_file(file_in("cleaned_data_to_dim_red"))
        
            ,db_tsne = target(
                              tsne_fst_write_on_disk(db_cnae_bag = data
                                                     ,perplexity = perplexity 
                                                     ,other_arguments
                                                     )
                              , transform = map(perplexity = !!seq_len(200))
                              )
        
          #This duplicates the files in the cache and in the project folder.
          #fst through drake is faster than drake's rds cache - about 1.5 min.
          #Saving directly, outside drake, takes about 13 s.
         ,db_tsne_write = target(    
                                      fst::write_fst(db_tsne
                                                     ,here(file_out(!!paste0("/data/"
                                                               , .id_chr, ".fst")))
                                                     ,compress = 100)
                                      ,transform = map(db_tsne)
                                    )                     
         
         ,classification_db = target(    
                                      {
                                        db_tsne_write # referenced only to create the dependency
                                        fst::read_fst(here(file_in(!!paste0("/data/"
                                                      , .id_chr, ".fst")))) # read_fst() has no compress argument
                                      }
                                      ,transform = map(db_tsne_write)
                                    )                     
         
                             
        )

Finally, the best solution so far (it doesn't make use of file_in or file_out, though):

plan <-     
        drake_plan(    
            max_expand = 5
            ,data = read_file(file_in("cleaned_data_to_dim_red"))
            
            ,db_tsne = target(
                              tsne_fst_write_on_disk(db_cnae_bag = data
                                                     ,perplexity = perplexity 
                                                     ,other_arguments
                                                     #It is necessary to do some 
                                                     #massaging on the names of 
                                                     #the saved files
                                                     )
                              , transform = map(perplexity = !!seq_len(200))
                              )
        
            ,db_tsne_read = target(    
                                    fst::read_fst(db_tsne
                                                   ,here(!!paste0("/data/", .id_chr, ".fst")))
                                    ,transform = map(db_tsne)
                                    ) 
                             
        )

Benchmarks

A similarly sized dataset would be:

temp <- as.data.frame(matrix(runif(20 * (6 * 10^6)), nrow = 6 * 10^6, ncol = 20))

Timings:

  • drake cache: the target took about 5.5 minutes (rds file) and produced a 1.1 GB file.
  • drake cache + fst: the target took about 1.5 minutes (fst file) and produced a 300 MB file.
  • fst write only (cache transparent, last example): the target took about 20 s and produced a 300 MB file.
wlandau changed the title from "Drake's cache too slow - Is there a workaround?" to "drake's cache too slow - Is there a workaround?" on Jun 13, 2019
wlandau (Member) commented Jun 13, 2019

This is a high-priority issue, and I expected it to come up at some point. I have some initial thoughts, and I will dive into your code when I have more time.

Near the end of the deep learning chapter of the manual, I have an example of the file_out()/file_in() workaround. It works because file_out()s in one step can be file_in()s of a downstream step. From the manual:

[Screenshot: dependency graph of the file_out()/file_in() pattern from the manual's deep learning chapter]

And a smaller standalone reprex:

library(drake)
plan <- drake_plan(
    x = write_stuff(file_out("large_dataset.fst")),
    y = read_stuff(file_in("large_dataset.fst"))
)
config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-06-13 by the reprex package (v0.3.0)
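
write_stuff() and read_stuff() above are just placeholders; with fst, they could look something like this (a sketch, not part of drake's API):

library(fst)

# Hypothetical helpers for the reprex above. The path flows through file_out()/file_in(),
# so drake tracks the file by its hash rather than storing the data in the cache.
write_stuff <- function(path) {
  data <- data.frame(x = runif(1e6)) # stand-in for the real large dataset
  fst::write_fst(data, path, compress = 100)
  path
}

read_stuff <- function(path) {
  fst::read_fst(path)
}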

In both cases, drake's cache only stores the hashes of the hdf5/fst files, not the contents of the files themselves. The slowdown you observe from file_out()/file_in() may be due to the fact that those files are still hashed. (But drake takes measures to avoid repeated hashing: #4).

But if people like you have to rely on file_out() and file_in() for large datasets, this is really not an ideal situation. I do have plans to make the cache faster for big data: richfitz/storr#103. The proposed multiformat storr will look at the class (and maybe other characteristics) of a data object and store it in the format best for that object. cc @richfitz.

nettoyoussef (Author) commented:

Hi Will! Thanks for the fast response, as usual.

I think this workaround will suffice for my immediate needs.

I am not very familiar with the storr package, but I was curious whether it could be made more format agnostic. For example, the user could specify the format of the cache for a given object through a flag in the call, and the function would find an appropriate library to do the saving.

That way the package would be more flexible and could adapt to changes in the ecosystem. Off the top of my head, the last few years have seen at least two very fast read/write implementations for R objects, the fst and feather packages, which could be valuable additions to storr.

Leaving the saving as just a wrapper around a call to another library would make it very easy to support new file formats. Faster implementations of existing formats could also be adopted with much less overhead.

This could be implemented not at the level of the overall cache, but at the level of individual objects. That way, files that have to be shared with users on other platforms would not need to be managed manually; they would be handled by the cache directly. In other words, you would spend less time managing names and read/write calls, and less time worrying about whether the files are up to date.

As a beneficial side effect, you wouldn't have to build tools to "guess" the best file format for an object. You just set a default, and users can change it to another format if they want. More flexibility with (apparently) less work.
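
Something along these lines is what I have in mind (a purely hypothetical interface; neither default_format nor a per-call format argument exists in storr today):

library(storr)

# Hypothetical sketch only: these format arguments do not exist in storr.
my_data <- data.frame(x = runif(10))
cache <- storr_rds(".cache", default_format = "rds")

cache$set("big_table", my_data, format = "fst") # opt into a faster backend for one object
cache$get("big_table")                          # reading back stays format agnostic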

I am not sure how difficult it would be to do something along those lines, though, or whether this kind of implementation belongs at the level of storr or of drake.

nettoyoussef (Author) commented:

I had some problems using file_out in my plan.

Basically, even after running the function, the target file would appear as "missing", thus making the cache start the plan from that point onwards. If I remove the file_out function, the plan appears as up to date after rerunning, and I have no issues whatsoever. Is this expected behaviour?

wlandau (Member) commented Jun 16, 2019

I am not very familiar with the Storr package but I was curious if it could not be made to be more file agnostic. For example, leaving to the user to specify the format of the cache for the object through a flag on a call, and letting the function find an appropriate library to do the saving.

I do have plans for this: richfitz/storr#103. It seems most natural to set the format at the class/type level.

Also, inspired by advice from @eddelbuettel here, I am attempting to leverage fst for arbitrary data. Benchmarks are encouraging:

library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#>    user  system elapsed 
#>   0.362   0.019   0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#>    user  system elapsed 
#>   0.314   1.340   1.689

but there are some roadblocks with big data / long vectors.

library(fst)
x <- data.frame(x = raw(2^32))
#> Warning in attributes(.Data) <- c(attributes(.Data), attrib): NAs
#> introduced by coercion to integer range
#> Error in if (mirn && nrows[i] > 0L) {: missing value where TRUE/FALSE needed
x <- list(x = raw(2^32))
as.data.frame(x)
#> Warning in attributes(.Data) <- c(attributes(.Data), attrib): NAs
#> introduced by coercion to integer range
#> Error in if (mirn && nrows[i] > 0L) {: missing value where TRUE/FALSE needed
class(x) <- "data.frame"
file <- tempfile()
write_fst(x, file) # No error here...
# read_fst(file)   # but I get a segfault here.

Created on 2019-06-16 by the reprex package (v0.3.0)

Basically, even after running the function, the target file would appear as "missing", thus making the cache start the plan from that point onwards. If I remove the file_out function, the plan appears as up to date after rerunning, and I have no issues whatsoever. Is this expected behaviour?

No, this is not expected. Would you open another issue and post a reprex?

nettoyoussef (Author) commented:

I do have plans for this: richfitz/storr#103. It seems most natural to set the format at the class/type level.

Also, inspired by advice from @eddelbuettel here, I am attempting to leverage fst for arbitrary data. Benchmarks are encouraging

library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#>    user  system elapsed 
#>   0.362   0.019   0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#>    user  system elapsed 
#>   0.314   1.340   1.689

Great! There seems to be a considerable speedup using fst to compress files, and saving the rds after compression, as in richfitz/storr#110, appears to save considerable trouble.

Created on 2019-06-16 by the reprex package (v0.3.0)

Basically, even after running the function, the target file would appear as "missing", thus making the cache start the plan from that point onwards. If I remove the file_out function, the plan appears as up to date after rerunning, and I have no issues whatsoever. Is this expected behaviour?

No, this is not expected. Would you open another issue and post a reprex?

I will create a reprex and post it as a new issue asap.

wlandau (Member) commented Jun 18, 2019

Since the datasets are huge (~6 million rows), drake takes a long time to save them to the cache, and it does so in an uncompressed way. As a result, it takes about 5 minutes to save each rds file, and the files themselves are about 1.1 GB each.

As I just learned, storr's default is actually to use compression with gzfile(). Even without gzfile, saveRDS() has its own compression, which seems to take quite a bit of time.
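
A quick way to see the cost of that built-in compression (a rough sketch; timings will vary by machine):

# Random doubles compress poorly, so gzip spends a lot of time for little gain.
x <- data.frame(matrix(runif(2e7), ncol = 20))
system.time(saveRDS(x, tempfile(), compress = FALSE)) # serialization only
system.time(saveRDS(x, tempfile()))                   # default gzip compression, typically much slower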

I totally inundated @richfitz this week (sorry about that) so richfitz/storr#111 could take a while. For now, you can try creating a storr with no compression and feeding it to drake. If you do, please let me know how it goes.

library(drake)
cache <- storr::storr_rds(tempfile(), compress = FALSE)
plan <- drake_plan(x = 1)
make(plan, cache = cache)
#> target x
readd(x, cache = cache)
#> [1] 1

Created on 2019-06-18 by the reprex package (v0.3.0)

wlandau (Member) commented Jun 20, 2019

nettoyoussef (Author) commented:

Hi Will,

Just a quick update. I tried recreating the file_out problem but I was not successful. Probably I was doing something wrong before.

wlandau (Member) commented Jul 21, 2019

From the benchmarks at richfitz/storr#111, it looks like storr's default gzip compression incurs a severe runtime penalty. We can address this properly in drake after richfitz/storr#111 is complete. In the meantime, does your work go any faster if you disable compression entirely? Example:

library(drake)
library(storr)
load_mtcars_example()
cache <- storr_rds(tempfile(), compress = FALSE)
make(my_plan, cache = cache)

wlandau changed the title from "drake's cache too slow - Is there a workaround?" to "Speed up the cache" on Jul 28, 2019
nettoyoussef (Author) commented Jul 31, 2019

Hi Will,

Sorry for the late response. I finally had the time to test some benchmarks.
At this time, I am not using drake's cache for large objects.
To make the comparison as straightforward as possible, I did the following (a sketch of both runs follows this list):

  • Set up the environment using cache <- storr_rds(tempfile(), compress = FALSE).
  1. Run 1: save the object directly to disk (using fst) with a pre-defined name and return NULL.
  2. Run 2: save the object directly to disk (using fst) with a pre-defined name and also return the object, so that drake saves it in its cache.
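
Schematically, the two runs looked like this (train_and_classify() and the target names are placeholders for my real code):

library(drake)

plan_run1 <- drake_plan(
  models = {
    res <- train_and_classify(data)
    fst::write_fst(res, here::here("data/models.fst"), compress = 100)
    NULL # run 1: return NULL, so drake stores almost nothing in its cache
  }
)

plan_run2 <- drake_plan(
  models = {
    res <- train_and_classify(data)
    fst::write_fst(res, here::here("data/models.fst"), compress = 100)
    res # run 2: also return the object, so drake stores it in the cache
  }
)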

Using drake 7.3.0

  • Run 1 took 13.57 minutes to train and classify about 1,000 models on a large dataset on my laptop. The saved file is about 1 GB.
  • Run 2 took 15.98 minutes to do the same. A new cache object of about 8.9 GB appeared.

Using drake 7.5.2

  • Run 1 took 14.88 minutes.
  • Run 2 took 16.8 minutes. A new cache object of about 8.7 GB appeared.

So it appears that drake takes about 1.5 to 2 minutes to encode the object in the cache.

I did not try to enable compression inside drake, because the overhead would be even greater.

wlandau (Member) commented Aug 1, 2019

Thank you for running benchmarks in a practical scenario; this is very useful! What happens if you install wlandau/storr@deea50d and try cache <- storr_rds(tempfile(), compress = "lz4")? If performance is still low, we may have to rethink richfitz/storr#111.
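
In other words, something like this (assuming that branch exposes the "lz4" option):

library(drake)
cache <- storr::storr_rds(tempfile(), compress = "lz4")
plan <- drake_plan(x = data.frame(y = runif(1e6)))
make(plan, cache = cache)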

nettoyoussef (Author) commented:

Not sure if I installed it correctly, but I did:

library(devtools)
devtools::install_github("wlandau/storr", ref = "110")

The version with lz4 compression took 15.33 minutes and created a cache object of about 3.2 GB.
So apparently, this version is indeed better.

wlandau (Member) commented Aug 4, 2019

Awesome, thanks! richfitz/storr#111 seems to get us halfway there. To fully achieve the efficiency of fst without requiring file_out() or file_in(), I have an idea that I think will work: #971 (comment).

wlandau (Member) commented Aug 5, 2019

I now consider this issue solved via #977. Now, all you need for large data frames is target(your_command, format = "fst").
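
A minimal sketch of the new syntax:

library(drake)

plan <- drake_plan(
  big_df = target(
    data.frame(x = runif(6e6), y = runif(6e6)), # any command that returns a data frame
    format = "fst"                              # store this target with fst instead of rds
  )
)

make(plan)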

wlandau closed this as completed on Aug 5, 2019