
Decorated cache and specialized data storage #971

Closed
wlandau opened this issue Aug 1, 2019 · 8 comments
Comments

wlandau commented Aug 1, 2019

Description

I am thinking about a decorated storr to contain things like config$progress_hashmap, config$cache_path (a guarantee on the path), and the encoders. Maybe it could also help with history somehow.

@wlandau wlandau changed the title from "Decorated cache?" to "Decorated cache and specialized data storage" Aug 4, 2019
wlandau commented Aug 4, 2019

A better use: increase the cache performance beyond storr's current capabilities. We can write data faster and halve the memory consumption.

Strategy

From @richfitz: richfitz/storr#77 (comment)

It might be worth thinking if you just want to special case these beasts though; it's going to put extra complexity somewhere, and it's probably worth thinking about whether you want to put that into a very fiddly configuration of the storr driver or if you want to just go "oh you're doing keras stuff, let me save a copy of that into a special keras directory and return a reference to it"

Command API

Users will have special return_*() functions for drake_plan() commands and imported functions. At minimum, we should have these:

  • return_fst()
  • return_hdf5()
  • return_rds()

Example:

plan <- drake_plan(
  data = return_fst(get_big_data_frame()),
  model = return_hdf5(fit_big_keras_model(data))
)

return_fst() will assign a special "return_fst" S3 class to the output of get_big_data_frame(). That will tell drake to store the object with fst::write_fst(). return_hdf5() is similar, but for Keras models, which should address richfitz/storr#77 (comment) quite nicely.
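
As a very rough sketch based on the description above (the real function may wrap the value in a special list instead of tagging it), return_fst() could be as small as:

return_fst <- function(value) {
  stopifnot(is.data.frame(value)) # fst only serializes data frames
  class(value) <- c("return_fst", class(value))
  value
}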

Storage

Let's use a return_fst() command as an example.

  1. Use fst::write_fst() to save the target value to a temporary file.
  2. Hash the file.
  3. In the inner storr, save the file hash and give it an S3 class of "return_fst".
  4. Get the hash of the storr object.
  5. Move the fst file from (1) to a special folder in the drake cache and give it a file name based on the hash from (4).
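
Here is a sketch of those steps, assuming the decorated cache exposes the inner storr as cache$storr and the cache directory as cache$path (the helper name store_format_fst() and the fst subfolder are illustrative, not drake's real internals):

store_format_fst <- function(cache, key, value) {
  tmp <- tempfile(fileext = ".fst")
  fst::write_fst(value, tmp)                    # 1. write the value to a temporary file
  file_hash <- digest::digest(tmp, file = TRUE) # 2. hash the file
  ref <- structure(list(file_hash = file_hash), class = "return_fst")
  cache$storr$set(key, ref)                     # 3. save the reference in the inner storr
  obj_hash <- cache$storr$get_hash(key)         # 4. hash of the storr object
  dir <- file.path(cache$path, "drake", "fst")
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  file.rename(tmp, file.path(dir, obj_hash))    # 5. move the fst file into the cache
  invisible(obj_hash)
}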

Retrieval

  1. Get the object using the get() method of the inner storr.
  2. If the object has class "return_fst", identify the path to the fst file using the hash from storage step (4).
  3. Read the fst file from (2) with read_fst(), applying any custom arguments saved during storage step (3).
  4. Return the data frame read in by (3).
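
A matching retrieval sketch under the same assumptions as the storage helper above:

retrieve_target <- function(cache, key) {
  ref <- cache$storr$get(key)           # 1. get the object from the inner storr
  if (!inherits(ref, "return_fst")) {
    return(ref)                         # ordinary target: return it as is
  }
  obj_hash <- cache$storr$get_hash(key) # 2. locate the fst file by the storr hash
  path <- file.path(cache$path, "drake", "fst", obj_hash)
  fst::read_fst(path)                   # 3-4. read and return the data frame
}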

Garbage collection

  1. Identify all the data hashes of the inner storr.
  2. Call the gc() method of the inner storr.
  3. Identify the data hashes removed with (2).
  4. Using the hashes from (3), identify and remove all the corresponding files saved via return_*().
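
A sketch of that garbage collection pass, again with the assumed cache$storr and cache$path layout:

gc_decorated <- function(cache) {
  before <- cache$storr$list_hashes()   # 1. data hashes before gc
  cache$storr$gc()                      # 2. run the inner storr's gc
  removed <- setdiff(before, cache$storr$list_hashes()) # 3. hashes that were dropped
  files <- file.path(cache$path, "drake", "fst", removed)
  unlink(files[file.exists(files)])     # 4. remove the orphaned fst files
  invisible(removed)
}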

Implementation

The implementation will be a decorated cache: an outer storr-like cache that contains a true storr on the inside. The storage, retrieval, and garbage collection routines above will be implemented in their own get(), set(), and gc() methods in the outer cache. Those methods will delegate to the get(), set(), and gc() methods of the inner storr as appropriate.
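
A minimal sketch of that decoration, reusing the helpers sketched above (an environment stands in for the outer cache here; the actual decorated cache would have many more methods):

decorate_storr <- function(storr, path) {
  cache <- new.env(parent = emptyenv())
  cache$storr <- storr
  cache$path <- path
  cache$set <- function(key, value, ...) {
    if (inherits(value, "return_fst")) {
      store_format_fst(cache, key, value) # specialized storage
    } else {
      storr$set(key, value, ...)          # delegate to the inner storr
    }
  }
  cache$get <- function(key, ...) retrieve_target(cache, key)
  cache$gc <- function(...) gc_decorated(cache)
  cache
}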

Benefits

  • We can take advantage of efficient domain-specific storage methods, such as fst, to save users' data.
  • We avoid copying the data into a serialized vector, so we no longer have to duplicate the data in memory.

Related

wlandau commented Aug 4, 2019

We should keep schnorr/starvz#6 in mind as well.

wlandau commented Aug 4, 2019

Let's implement a decorated driver too so the API is completely the same.

wlandau commented Aug 4, 2019

Nope. We can just use the existing driver. No need to decorate that. We just have to make sure it's there (and public).

wlandau commented Aug 4, 2019

Another thing: let's think about composing the encoder and hash tables in the decorated cache. Related: #967, #968.

@wlandau wlandau self-assigned this Aug 4, 2019
@wlandau wlandau changed the title from "Decorated cache and specialized data storage" to "Specialized cache and specialized data storage" Aug 4, 2019
@wlandau wlandau changed the title from "Specialized cache and specialized data storage" to "Decorated cache and specialized data storage" Aug 4, 2019
@wlandau wlandau mentioned this issue Aug 5, 2019
wlandau commented Aug 5, 2019

Change of direction

Disadvantages of return_*():

  1. Extra stuff added to the API.
  2. The return values of return_fst() etc. are not the actual values you want. They are special lists with special S3 classes, which drake understands, but users do not.

Proposal

Let the data format be a custom column in the plan.

plan <- drake_plan(
  data = target(get_big_data_frame(), format = "fst"),
  model = target(fit_big_keras_model(data), format = "keras")
)

The internals in https://github.com/ropensci/drake/tree/971 will not need much adjustment. All we need is an assign_format() function to call the right return_*() function to wrap up the value. Then, store_outputs() can call assign_format() based on the format in the layout. We can rename the return_*() functions to assign_format.*() and make them S3 methods.
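
One possible shape for that dispatch (the stub-based UseMethod() trick and the class names are assumptions for illustration, not drake's code):

assign_format <- function(format, value) {
  # Dispatch on the format string recorded in the plan ("fst", "keras", ...).
  UseMethod("assign_format", structure(list(), class = format))
}

assign_format.fst <- function(format, value) {
  structure(list(value = value), class = "drake_format_fst")
}

assign_format.keras <- function(format, value) {
  structure(list(value = value), class = "drake_format_keras")
}

store_outputs() would then wrap each value with assign_format() according to the target's format in the layout before handing it to the cache.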

wlandau commented Aug 5, 2019

Remaining for this issue: use cache$hash_algorithm and cache$path internally instead of config$hash_algorithm and config$path. We can worry about encoders, encoding hash tables, and the progress hash table later on when we refactor those data structures.

@wlandau wlandau closed this as completed in 0925893 Aug 7, 2019
wlandau commented Aug 7, 2019

Now, most of the hash tables are in the decorated cache where they belong. We might put some of these in a dedicated encoder, but that's for #968.
