
Other potential backends #77

Open
richfitz opened this issue May 21, 2018 · 7 comments

Comments

@richfitz (Owner)

@wlandau (Contributor)

wlandau commented Mar 10, 2019

I think it would be extremely useful to have RDS-like drivers that overcome the serialization bottleneck. Would it be feasible to store unserialized binary blobs instead of RDS files? Can we leverage fst or Arrow for large data frames and still accommodate arbitrary data objects elsewhere in the same storr? Does thor make some of these points moot?

@richfitz (Owner, Author)

Unserialized binary blobs don't really exist: there is no linear memory map for all but the simplest structures. fst will only work if every object serialised is a data.frame. Arrow is a possibility (it's the technology behind feather, I think), but that would require waiting for a significant amount of work to be done.
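To illustrate the point above (a base-R sketch, not part of storr): only atomic vectors have a flat byte layout that `writeBin` can dump directly, so anything structured has to pay for the full serialisation machinery.

```r
# Atomic vectors map to flat bytes; structured objects do not.
x <- as.numeric(1:5)
f <- tempfile()
writeBin(x, f)  # fine: a double vector has a linear memory layout
identical(readBin(f, what = "double", n = 5), x)  # TRUE

# writeBin(list(1, "a"), f)  # errors: a list has no linear layout

# Arbitrary objects need serialize() instead:
bytes <- serialize(list(1, "a"), NULL)
is.raw(bytes)  # TRUE
```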

thor still requires serialising R objects. It's probably a little faster than rds with compression turned off, but it still has to pay the cost of serialising.

It's ultimately a performance/generality tradeoff. A storr backend that serialises only simple types (atomic types, lists, and therefore data.frames) will choke as soon as something adds an exotic attribute.

What could possibly be done, by someone sufficiently motivated, would be to write replacements for readRDS and saveRDS that dispatch on type to efficiently serialise the most common structures (though I suspect that only data.frame objects would see a big saving here). The reader would need to check the magic number of each file before reading it in. With that in place, a driver that directly extended the rds one would be trivial to write.
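A minimal sketch of such a reader (base R only; the registry layout and names are hypothetical, and the hdf5 entry is a stub): dispatch on the file's leading bytes, falling back to plain `readRDS`.

```r
# Hypothetical registry: leading bytes of a file -> deserialisation function.
magic_registry <- list(
  hdf5 = list(
    magic = as.raw(c(0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a)),
    read  = function(path) stop("would call keras::load_model_hdf5(path)")
  ),
  rds = list(
    magic = charToRaw("X\n"),  # uncompressed binary (XDR) RDS starts "X\n"
    read  = function(path) readRDS(path)
  )
)

read_dispatch <- function(path) {
  n <- max(vapply(magic_registry, function(e) length(e$magic), integer(1)))
  lead <- readBin(path, what = "raw", n = n)
  for (e in magic_registry) {
    k <- length(e$magic)
    if (length(lead) >= k && identical(lead[seq_len(k)], e$magic)) {
      return(e$read(path))
    }
  }
  readRDS(path)  # fallback also covers gzip-compressed RDS (magic 1f 8b)
}

path <- tempfile()
saveRDS(mtcars, path, compress = FALSE)
identical(read_dispatch(path), mtcars)  # TRUE
```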

@wlandau (Contributor)

wlandau commented Mar 25, 2019

What about an RDS-like driver with the ability to choose how individual objects are loaded and saved? We might be able to store the optional custom saving/loading methods in the key files. This could be especially useful in drake-powered deep learning workflows because Keras models require their own serialization. saveRDS(keras_model) unfortunately does not preserve the data correctly.

my_storr$set(
  key = "model",
  value = keras_model,
  save = function(value, file) keras::save_model_hdf5(value, file),
  load = function(file) keras::load_model_hdf5(file)
)

my_storr$get(key = "model")

With storr as it is now, we could theoretically just call my_storr$set("model", keras::serialize_model(keras_model)) and then keras::unserialize_model(my_storr$get("model")), but that would serialize a big object twice. We could try to skip base::serialize(), but then we would just end up calling base::unserialize() on an object that really needs keras::unserialize_model().
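To pin down the semantics being proposed, here is a toy file-backed store with pluggable per-object serialisation (base R only; `kv_set`/`kv_get` are hypothetical names, not the storr API, and `writeLines` stands in for something like `keras::save_model_hdf5`):

```r
# Toy file-backed key/value store with per-object save/load hooks.
kv_set <- function(dir, key, value,
                   save = function(value, file) saveRDS(value, file)) {
  save(value, file.path(dir, key))  # default: ordinary RDS
}

kv_get <- function(dir, key, load = function(file) readRDS(file)) {
  load(file.path(dir, key))
}

d <- tempfile()
dir.create(d)

# Default round trip through saveRDS/readRDS:
kv_set(d, "cars", mtcars)
identical(kv_get(d, "cars"), mtcars)  # TRUE

# Custom serialisation (plain text here, standing in for an hdf5 writer):
kv_set(d, "lines", c("a", "b"),
       save = function(value, file) writeLines(value, file))
kv_get(d, "lines", load = function(file) readLines(file))  # c("a", "b")
```

The open question this leaves is exactly the one raised above: the custom `load` has to be recoverable at read time, either from bookkeeping in the key files or from the file's magic number.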

@richfitz (Owner, Author)

This would be possible to implement. We would need to know for each special case:

  • the R class of the special case
  • a custom serialisation function that would return raw (keras::serialize_model here would be fine)
  • a set of "magic numbers" that can be used to associate the first few bytes of a file unambiguously with a deserialisation function (for hdf5 that's 89 48 44 46 0d 0a 1a 0a, and I've dug out the numbers for rds before)
  • a deserialisation function

This requires a bit of fiddling around with the current hash functions, but it could be possible.

The limitation would be that you'd pay a little extra I/O cost on each deserialisation, because you'd need to check the first few bytes and then read the whole thing. And if two things serialised down to formats with the same magic number but different contents, you'd be stuffed: for example, if keras saves models in one flavour of hdf5 and another package uses a slightly different hdf5 layout with a different load function, it just would not work.

It might be worth thinking about whether you just want to special-case these beasts, though. It's going to put extra complexity somewhere, and it's probably worth deciding whether you want to put that into a very fiddly configuration of the storr driver, or whether you want to just go "oh, you're doing keras stuff, let me save a copy of that into a special keras directory and return a reference to it".

@wlandau (Contributor)

wlandau commented Apr 4, 2019

Interesting. I was assuming we would need to store a deserialization reference somewhere else, like a key file, but it sounds like those first few bytes could save us some bookkeeping. Any reading material you would recommend on serialization internals?

> It might be worth thinking about whether you just want to special-case these beasts, though. It's going to put extra complexity somewhere, and it's probably worth deciding whether you want to put that into a very fiddly configuration of the storr driver, or whether you want to just go "oh, you're doing keras stuff, let me save a copy of that into a special keras directory and return a reference to it".

I have not decided whether to have drake automatically do this with Keras targets in the backend, but if it leads to a nice framework, drake might accommodate Arrow in the same way. For now, I am proposing this workaround on the user side.

@wlandau (Contributor)

wlandau commented Apr 27, 2019

Hmm... my comment just now is quite long and very specific. I will relocate it to a new issue.

@wlandau (Contributor)

wlandau commented Aug 4, 2019

@richfitz, I am coming back to your suggestion from the bottom of #77 (comment). I am proposing a decorated storr for drake: ropensci/drake#971 (comment). Is this something you think would be helpful for a user base beyond drake?
