Other potential backends #77

richfitz · 2018-05-21T06:54:26Z

git2r (Support git2r #1)
monetdb (monetdb? #21)
boltdb (boltdb #24)
arangodb (arangodb #30)
redis hashes (Option in redis driver store in hashes #35)
unqlite (unqlite #39)

wlandau · 2019-03-10T13:08:26Z

I think it would be extremely useful to have RDS-like drivers that overcome the serialization bottleneck. Would it be feasible to store unserialized binary blobs instead of RDS files? Can we leverage fst or Arrow for large data frames and still accommodate arbitrary data objects elsewhere in the same storr? Does thor make some of these points moot?

richfitz · 2019-03-10T17:58:24Z

Unserialized binary blobs don't really exist: there is no linear memory map for anything but the simplest structures. fst is only going to work if every object serialised is a data.frame. Arrow is a possibility (it's the technology behind feather, I think), but that would require waiting for a significant amount of work to be done.

thor still requires serialising R objects. It's probably a little faster than rds with compression turned off, but still has to pay the cost of serialising.

It's ultimately a performance/generality tradeoff. If there is a storr backend that serialises only simple types (atomic types, lists, and therefore data.frames), it will choke as soon as something adds an exotic attribute.

What could possibly be done by someone sufficiently motivated would be to write a replacement for readRDS and saveRDS that behaves differently based on type, so that the most common structures are serialised efficiently (though I suspect that only data.frame objects will see a big saving here). The reader would need to check the magic number of the files before reading them in. With that in place, a driver that directly extended the rds one would be trivial to write.
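The magic-number check described above can be sketched in base R. This is only an illustration, not storr code: `detect_format()` and `read_any()` are hypothetical names, and the sketch assumes files written by saveRDS() with either the default gzip compression (leading bytes 1f 8b) or `compress = FALSE` (leading bytes 58 0a, i.e. "X" and a newline). The HDF5 signature is included only to show a non-RDS branch.

```r
# Hypothetical helper: peek at the first bytes of a file to guess its format.
detect_format <- function(path) {
  header <- readBin(path, what = "raw", n = 8L)
  starts_with <- function(magic) {
    length(header) >= length(magic) &&
      identical(header[seq_along(magic)], magic)
  }
  if (starts_with(as.raw(c(0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a)))) {
    "hdf5"
  } else if (starts_with(as.raw(c(0x1f, 0x8b)))) {
    "rds-gzip"          # saveRDS() default compression
  } else if (starts_with(as.raw(c(0x58, 0x0a)))) {
    "rds-uncompressed"  # saveRDS(compress = FALSE), binary XDR format
  } else {
    "unknown"
  }
}

# Hypothetical reader: check the magic number first, then read the whole file.
read_any <- function(path) {
  format <- detect_format(path)
  if (format %in% c("rds-gzip", "rds-uncompressed")) {
    readRDS(path)
  } else {
    stop("no reader registered for format: ", format)
  }
}

tmp <- tempfile()
saveRDS(mtcars, tmp, compress = FALSE)
restored <- read_any(tmp)
```

Note that the file is opened twice here (once for the header, once for the full read), which is the extra I/O cost of this approach.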

wlandau · 2019-03-25T16:28:53Z

What about an RDS-like driver with the ability to choose how individual objects are loaded and saved? We might be able to store the optional custom saving/loading methods in the key files. This could be especially useful in drake-powered deep learning workflows because Keras models require their own serialization. saveRDS(keras_model) unfortunately does not preserve the data correctly.

# Proposed interface (not part of storr's current API): per-object save/load methods.
my_storr$set(
  key = "model",
  value = keras_model,
  save = keras::save_model_hdf5(value, file),
  load = keras::load_model_hdf5(file)
)

my_storr$get(key = "model")

With storr as it is now, we could theoretically just call my_storr$set("model", keras::serialize_model(keras_model)) and then keras::unserialize_model(my_storr$get("model")), but that would serialize a big object twice. We could try to skip base::serialize(), but then we would just end up calling base::unserialize() on an object that really needs keras::unserialize_model().

richfitz · 2019-03-25T17:20:44Z

This would be possible to implement. We would need to know for each special case:

the R class of the special case
a custom serialisation function that would return raw (keras::serialize_model here would be fine)
a set of "magic numbers" that can be used to associate the first few bytes of a file unambiguously with a deserialisation function (for hdf5 that's 89 48 44 46 0d 0a 1a 0a, and I've dug out the numbers for rds before)
a deserialisation function

This requires a bit of fiddling around with the current hash functions, but it could be possible.

The limitation would be that you'd pay a little extra I/O cost on each deserialisation because you'd need to check the first few bytes then read the whole thing, and if you had two things that serialised down to a format with the same magic number but different formats you'd be stuffed (so for example if keras saves models in an hdf5 format of one flavour and another thing in a slightly different hdf5 format with a different load function it just would not work).
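A minimal sketch of such a registry, in base R, might look like the following. Everything here is hypothetical: the "celsius" class and its "CELS" magic number are invented for illustration, and `to_bytes()` / `from_bytes()` are not storr functions. Each registry entry holds the four pieces listed above; writing dispatches on class, reading dispatches on the leading bytes.

```r
# Hypothetical registry of special cases (class, magic number, serialiser,
# deserialiser). The "celsius" format is made up for this example.
registry <- list(
  celsius = list(
    class = "celsius",
    magic = charToRaw("CELS"),
    serialise = function(x) {
      c(charToRaw("CELS"), writeBin(as.numeric(unclass(x)), raw()))
    },
    deserialise = function(bytes) {
      payload <- bytes[-(1:4)]  # drop the 4-byte magic number
      values <- readBin(payload, what = "double", n = length(payload) / 8L)
      structure(values, class = "celsius")
    }
  )
)

# Writing: use the custom serialiser if the class matches, otherwise fall
# back to base::serialize().
to_bytes <- function(x) {
  for (entry in registry) {
    if (inherits(x, entry$class)) {
      return(entry$serialise(x))
    }
  }
  serialize(x, connection = NULL)
}

# Reading: compare the leading bytes with each registered magic number,
# otherwise fall back to base::unserialize().
from_bytes <- function(bytes) {
  for (entry in registry) {
    n <- length(entry$magic)
    if (length(bytes) >= n && identical(bytes[seq_len(n)], entry$magic)) {
      return(entry$deserialise(bytes))
    }
  }
  unserialize(bytes)
}

temps <- structure(c(21.5, 30), class = "celsius")
roundtrip <- from_bytes(to_bytes(temps))
```

The first matching magic number wins, so two registered formats that share a magic number cannot be told apart on read, which is the limitation described above.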

It might be worth thinking about whether you just want to special-case these beasts, though; it's going to put extra complexity somewhere, and it's probably worth thinking about if you want to put that into a very fiddly configuration of the storr driver or if you want to just go "oh you're doing keras stuff, let me save a copy of that into a special keras directory and return a reference to it".

wlandau · 2019-04-04T14:47:43Z

Interesting. I was assuming we would need to store a deserialization reference somewhere else, like a key file, but it sounds like those first few bytes could save us some bookkeeping. Any reading material you would recommend on serialization internals?

> It might be worth thinking if you just want to special case these beasts though; it's going to put extra complexity somewhere and it's probably worth thinking about if you want to put that into a very fiddly configuration of the storr driver or if you want to just go "oh you're doing keras stuff, let me save a copy of that into a special keras directory and return a reference to it"

I have not decided whether to have drake automatically do this with Keras targets in the backend, but if it leads to a nice framework, drake might accommodate Arrow in the same way. For now, I am proposing this workaround on the user side.

wlandau · 2019-04-27T01:16:06Z

Hmm... my comment just now is quite long and very specific. I will relocate it to a new issue.

wlandau · 2019-08-04T03:42:21Z

@richfitz, I am coming back to your suggestion from the bottom of #77 (comment). I am proposing a decorated storr for drake: ropensci/drake#971 (comment). Is this something you think would be helpful for a user base beyond drake?

richfitz added the new-backend label May 21, 2018

wlandau mentioned this issue Nov 25, 2018: Best practices for collaborative work on a single RDS storr? #92 (Open)

This was referenced Mar 31, 2019: More edits to the workflow notebook sol-eng/tensorflow-w-r#8 (Merged); Workflow follow-up sol-eng/tensorflow-w-r#7 (Merged)

wlandau mentioned this issue Apr 27, 2019: Multiformat driver #103 (Open)

wlandau mentioned this issue May 15, 2019: Check the overhead incurred by drake sol-eng/tensorflow-w-r#9 (Merged)

This was referenced Jun 16, 2019: fst driver #108 (Open); Sketch fst driver #109 (Open)

This was referenced Aug 4, 2019: Reconsider drake's storage backend ropensci/drake#931 (Closed); Decorated cache and specialized data storage ropensci/drake#971 (Closed)

wlandau mentioned this issue Aug 8, 2019: Update to incorporate new features in drake sol-eng/tensorflow-w-r#11 (Merged)
