
Decorated cache and specialized data storage #971

Closed
wlandau opened this issue Aug 1, 2019 · 8 comments
Comments

wlandau commented Aug 1, 2019

Description

I am thinking about a decorated storr to contain things like config$progress_hashmap, config$cache_path (a guarantee on the path), and the encoders. Maybe it could also help with history somehow.

@wlandau wlandau changed the title from "Decorated cache?" to "Decorated cache and specialized data storage" Aug 4, 2019
wlandau commented Aug 4, 2019

A better use: increase the cache performance beyond storr's current capabilities. We can write data faster and halve the memory consumption.

Strategy

From @richfitz: richfitz/storr#77 (comment)

It might be worth thinking if you just want to special case these beasts though; it's going to put extra complexity somewhere, and it's probably worth thinking about whether you want to put that into a very fiddly configuration of the storr driver or if you want to just go "oh you're doing keras stuff, let me save a copy of that into a special keras directory and return a reference to it"

Command API

Users will have special return_*() functions for drake_plan() commands and imported functions. At minimum, we should have these:

  • return_fst()
  • return_hdf5()
  • return_rds()

Example:

plan <- drake_plan(
  data = return_fst(get_big_data_frame()),
  model = return_hdf5(fit_big_keras_model(data))
)

return_fst() will assign a special "return_fst" S3 class to the output of get_big_data_frame(). That will tell drake to store the object with fst::write_fst(). return_hdf5() is similar, but for Keras models, which should address richfitz/storr#77 (comment) quite nicely.
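
As a very rough sketch based on the description above (the real function may wrap the value in a special list instead of tagging it), return_fst() could be as small as:

return_fst <- function(value) {
  stopifnot(is.data.frame(value)) # fst only serializes data frames
  class(value) <- c("return_fst", class(value))
  value
}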

Storage

Let's use a return_fst() command as an example.

  1. Use fst::write_fst() to save the target value to a temporary file.
  2. Hash the file.
  3. In the inner storr, save the file hash and give it an S3 class of "return_fst".
  4. Get the hash of the storr object.
  5. Move the fst file from (1) to a special folder in the drake cache and give it a file name based on the hash from (4).
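
Here is a sketch of those steps, assuming the decorated cache exposes the inner storr as cache$storr and the cache directory as cache$path (the helper name store_format_fst() and the fst subfolder are illustrative, not drake's real internals):

store_format_fst <- function(cache, key, value) {
  tmp <- tempfile(fileext = ".fst")
  fst::write_fst(value, tmp)                    # 1. write the value to a temporary file
  file_hash <- digest::digest(tmp, file = TRUE) # 2. hash the file
  ref <- structure(list(file_hash = file_hash), class = "return_fst")
  cache$storr$set(key, ref)                     # 3. save the reference in the inner storr
  obj_hash <- cache$storr$get_hash(key)         # 4. hash of the storr object
  dir <- file.path(cache$path, "drake", "fst")
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  file.rename(tmp, file.path(dir, obj_hash))    # 5. move the fst file into the cache
  invisible(obj_hash)
}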

Retrieval

  1. Get the object using the get() method of the inner storr.
  2. If the object has class "return_fst", identify the path to the fst file using the hash from storage step (4).
  3. Read the fst file from (2) with read_fst(), applying any custom arguments saved during storage step (3).
  4. Return the data frame read in by (3).
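
A matching retrieval sketch under the same assumptions as the storage helper above:

retrieve_target <- function(cache, key) {
  ref <- cache$storr$get(key)           # 1. get the object from the inner storr
  if (!inherits(ref, "return_fst")) {
    return(ref)                         # ordinary target: return it as is
  }
  obj_hash <- cache$storr$get_hash(key) # 2. locate the fst file by the storr hash
  path <- file.path(cache$path, "drake", "fst", obj_hash)
  fst::read_fst(path)                   # 3-4. read and return the data frame
}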

Garbage collection

  1. Identify all the data hashes of the inner storr.
  2. Call the gc() method of the inner storr.
  3. Identify the data hashes removed with (2).
  4. Using the hashes from (3), identify and remove all the corresponding files saved via return_*().
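
A sketch of that garbage collection pass, again with the assumed cache$storr and cache$path layout:

gc_decorated <- function(cache) {
  before <- cache$storr$list_hashes()   # 1. data hashes before gc
  cache$storr$gc()                      # 2. run the inner storr's gc
  removed <- setdiff(before, cache$storr$list_hashes()) # 3. hashes that were dropped
  files <- file.path(cache$path, "drake", "fst", removed)
  unlink(files[file.exists(files)])     # 4. remove the orphaned fst files
  invisible(removed)
}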

Implementation

The implementation will be a decorated cache: an outer storr-like cache that contains a true storr on the inside. The storage, retrieval, and garbage collection routines above will be implemented in their own get(), set(), and gc() methods in the outer cache. Those methods will delegate to the get(), set(), and gc() methods of the inner storr as appropriate.
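
A minimal sketch of that decoration, reusing the helpers sketched above (an environment stands in for the outer cache here; the actual decorated cache would have many more methods):

decorate_storr <- function(storr, path) {
  cache <- new.env(parent = emptyenv())
  cache$storr <- storr
  cache$path <- path
  cache$set <- function(key, value, ...) {
    if (inherits(value, "return_fst")) {
      store_format_fst(cache, key, value) # specialized storage
    } else {
      storr$set(key, value, ...)          # delegate to the inner storr
    }
  }
  cache$get <- function(key, ...) retrieve_target(cache, key)
  cache$gc <- function(...) gc_decorated(cache)
  cache
}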

Benefits

  • We can take advantage of efficient domain-specific storage methods, such as fst, to save users' data.
  • We avoid copying the data into a serialized vector, so we no longer have to duplicate the data in memory.

Related

wlandau commented Aug 4, 2019

We should keep schnorr/starvz#6 in mind as well.

wlandau commented Aug 4, 2019

Let's implement a decorated driver too so the API is completely the same.

wlandau commented Aug 4, 2019

Nope. We can just use the existing driver. No need to decorate that. We just have to make sure it's there (and public).

wlandau commented Aug 4, 2019

Another thing: let's think about composing the encoder and hash tables in the decorated cache. Related: #967, #968.

@wlandau wlandau self-assigned this Aug 4, 2019
@wlandau wlandau changed the title from "Decorated cache and specialized data storage" to "Specialized cache and specialized data storage" Aug 4, 2019
@wlandau wlandau changed the title from "Specialized cache and specialized data storage" to "Decorated cache and specialized data storage" Aug 4, 2019
@wlandau wlandau mentioned this issue Aug 5, 2019
wlandau commented Aug 5, 2019

Change of direction

Disadvantages of return_*():

  1. Extra stuff added to the API.
  2. The return values of return_fst() etc. are not the actual values you want. They are special lists with special S3 classes, which drake understands, but users do not.

Proposal

Let the data format be a custom column in the plan.

plan <- drake_plan(
  data = target(get_big_data_frame(), format = "fst"),
  model = target(fit_big_keras_model(data), format = "keras")
)

The internals in https://github.com/ropensci/drake/tree/971 will not need much adjustment. All we need is an assign_format() function to call the right return_*() function to wrap up the value. Then, store_outputs() can call assign_format() based on the format in the layout. We can rename the return_*() functions to assign_format.*() and make them S3 methods.
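
One possible shape for that dispatch (the stub-based UseMethod() trick and the class names are assumptions for illustration, not drake's code):

assign_format <- function(format, value) {
  # Dispatch on the format string recorded in the plan ("fst", "keras", ...).
  UseMethod("assign_format", structure(list(), class = format))
}

assign_format.fst <- function(format, value) {
  structure(list(value = value), class = "drake_format_fst")
}

assign_format.keras <- function(format, value) {
  structure(list(value = value), class = "drake_format_keras")
}

store_outputs() would then wrap each value with assign_format() according to the target's format in the layout before handing it to the cache.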

wlandau commented Aug 5, 2019

Remaining for this issue: use cache$hash_algorithm and cache$path internally instead of config$hash_algorithm and config$path. We can worry about encoders, encoding hash tables, and the progress hash table later on when we refactor those data structures.

@wlandau wlandau closed this as completed in 0925893 Aug 7, 2019
wlandau commented Aug 7, 2019

Now, most of the hash tables are in the decorated cache where they belong. We might put some of these in a dedicated encoder, but that's for #968.
