
fst for arbitrary data storage? #201

Closed
wlandau opened this issue Jun 16, 2019 · 10 comments

wlandau commented Jun 16, 2019

Inspired by advice from @eddelbuettel here, I am attempting to leverage fst for arbitrary data. Essentially, I would like to take an arbitrary (and arbitrarily large) data structure in memory, serialize it to a raw vector, save it in a one-column data frame with write_fst(), and retrieve it later with read_fst(). Have you tried this before? What would it take to make it work?
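For concreteness, this is roughly the round trip I have in mind (just a sketch; save_object() and load_object() are placeholder names, and the one-column data frame simply holds the serialized bytes):

library(fst)

# sketch: serialize any R object to a raw vector, wrap it in a
# one-column data frame, and write/read it with fst
save_object <- function(value, path, compress = 50) {
  wrapper <- data.frame(actual_data = serialize(value, NULL))
  write_fst(wrapper, path, compress)
}

load_object <- function(path) {
  wrapper <- read_fst(path)
  unserialize(wrapper$actual_data)
}

# round trip on a small object
obj <- list(a = runif(10), b = letters)
path <- tempfile()
save_object(obj, path)
identical(obj, load_object(path)) # should be TRUE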

Benchmarks for small-ish data are encouraging

library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#>    user  system elapsed 
#>   0.362   0.019   0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#>    user  system elapsed 
#>   0.314   1.340   1.689

but there are some roadblocks with big data / long vectors.

library(fst)
x <- data.frame(x = raw(2^32))
#> Warning in attributes(.Data) <- c(attributes(.Data), attrib): NAs
#> introduced by coercion to integer range
#> Error in if (mirn && nrows[i] > 0L) {: missing value where TRUE/FALSE needed
x <- list(x = raw(2^32))
as.data.frame(x)
#> Warning in attributes(.Data) <- c(attributes(.Data), attrib): NAs
#> introduced by coercion to integer range
#> Error in if (mirn && nrows[i] > 0L) {: missing value where TRUE/FALSE needed
class(x) <- "data.frame"
file <- tempfile()
write_fst(x, file) # No error here...
# read_fst(file)   # but I get a segfault here.

Created on 2019-06-16 by the reprex package (v0.3.0)

wlandau changed the title from "fst for arbitrary data storage" to "fst for arbitrary data storage?" on Jun 16, 2019
MarcusKlik (Collaborator) commented Jun 16, 2019

Hi @wlandau, thanks for your question and pointing me to the stackoverflow issue! Indeed, you can use fst to serialize a raw vector to disk and benefit from the multi-threaded (de-)compression to get smaller files and faster IO.

With two methods very similar to your code example:

write_raw <- function(x, path, compress = 50) {

  # create a list and add required attributes
  y <- list(X = x)
  attributes(y) <- c(attributes(y), class = "data.frame")

  # serialize and compress to disk
  fst::write_fst(y, path, compress)
}

read_raw <- function(path) {
  
  # read from disk
  z <- fst::read_fst(path)
  
  z$X
}

you can write your raw vector to disk and read it back without any overhead from (internal) copying:

# 2 GB raw vector
x <- rep(as.raw(0:255), 2^23 - 1)

microbenchmark::microbenchmark(
  write_raw(x, "arbitrary.fst", 70),
  times = 1
)
#> Unit: milliseconds
#>                               expr      min       lq     mean   median
#>  write_raw(x, "arbitrary.fst", 70) 739.9292 739.9292 739.9292 739.9292
#>        uq      max neval
#>  739.9292 739.9292     1

microbenchmark::microbenchmark(
  z <- read_raw("arbitrary.fst"),
  times = 1
)
#> Unit: seconds
#>                            expr     min      lq    mean  median      uq
#>  z <- read_raw("arbitrary.fst") 4.02801 4.02801 4.02801 4.02801 4.02801
#>      max neval
#>  4.02801     1

(Note that I'm using the microbenchmark package because system.time() doesn't handle multi-threaded code very well.)

This will give you a ~300 MB file with the compressed vector data, so a compression factor of 6-7 for this particular sample (and compression setting of 70).
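Both the round trip and the file size are easy to verify after the benchmarks above (untimed):

# the re-read vector should be byte-identical to the original
identical(x, z)

# compressed file size relative to the ~2 GB in-memory vector
file.size("arbitrary.fst") / length(x)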

This setup should scale to raw vectors with sizes up to 2^64 - 1, except, as your segfault already shows, it doesn't :-).

I will try to pinpoint and fix this problem as soon as possible and get back to you on that, thanks!

MarcusKlik self-assigned this on Jun 16, 2019
MarcusKlik added the bug label on Jun 16, 2019
MarcusKlik added this to the fst v0.9.2 milestone on Jun 16, 2019
MarcusKlik (Collaborator) commented:

Hi @wlandau, as a side-note, fst::compress_fst() and fst::decompress_fst() do work as expected and can take raw vectors larger than 2 GB:

# 4 GB raw vector
x <- rep(as.raw(0:255), 2^24)

microbenchmark::microbenchmark(
  y <- fst::compress_fst(x),
  times = 1
)
#> Unit: milliseconds
#>                       expr      min       lq     mean   median       uq
#>  y <- fst::compress_fst(x) 236.1515 236.1515 236.1515 236.1515 236.1515
#>       max neval
#>  236.1515     1

microbenchmark::microbenchmark(
  z <- fst::decompress_fst(y),
  times = 1
)
#> Unit: milliseconds
#>                         expr      min       lq     mean   median       uq
#>  z <- fst::decompress_fst(y) 413.8044 413.8044 413.8044 413.8044 413.8044
#>       max neval
#>  413.8044     1
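And the same kind of (untimed) check for the in-memory round trip and its compression ratio:

# the decompressed vector should match the 4 GB input exactly
identical(x, z)

# compressed size as a fraction of the original
length(y) / length(x)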

wlandau (Author) commented Jun 16, 2019

@MarcusKlik, I am so excited that you are willing to work on this! A solution would be such a boon to the stuff I work on, re ropensci/drake#907, richfitz/storr#103, and richfitz/storr#108. cc @richfitz and @nettoyoussef.

wlandau (Author) commented Jun 16, 2019

And thanks for the tip about compress_fst(). Both the compression and the parallel hashing are very relevant to richfitz/storr#108.
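For context, this is roughly how I picture using the parallel hashing in storr (just a sketch; the object being hashed is a placeholder, and I'm assuming hash_fst() can be applied directly to a serialized raw vector):

library(fst)

# serialize an arbitrary object to a raw vector, then hash the bytes
# with fst's multi-threaded hasher before storing them
payload <- serialize(runif(1e6), NULL)
hash_fst(payload)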

MarcusKlik (Collaborator) commented:

Hi @wlandau, drake and storr are very interesting projects, and it looks like there could be a lot of opportunity for fst to make a difference, very nice!

wlandau (Author) commented Jun 16, 2019

My thoughts exactly! drake is designed for big computation (long runtimes) and it currently struggles with big data. There is so much important ground to gain.

MarcusKlik (Collaborator) commented:

Hi @wlandau, I've tracked the problem to a downcast in the fstlib library (the usual suspect with problems around 2^31 values) and believe it to be fixed now.

However, as you already mention in your stackoverflow issue, long vectors are not (yet) supported in data.tables, tibbles, or data.frames.

Therefore, I think the best solution for reading datasets with >=2^31 rows is to return a named list object instead of a data.frame. That doesn't stop the solution above from returning your raw vector, so that should work as expected now!

MarcusKlik (Collaborator) commented:

As an example, we can store and retrieve a single-column 8 GB data.frame with >2^31 rows:

# create integer column with >2^31 rows
x <- list(X = rep(sample(0:2000000, 256), 2^23 + 10))
attributes(x) <- c(attributes(x), list(class = "data.frame"))

# printing does not work for a long vector data.frame without row names
x
#> [1] X
#> <0 rows> (or 0-length row.names)

# but the data is there
length(x$X)
#> [1] 2147486208

# serialize and compress to disk
fst::write_fst(x, "arbitrary.fst", 100)

# read from disk
z <- fst::read_fst("arbitrary.fst")

# a named list is returned now
str(z)
#> List of 1
#>  $ X: int [1:2147486208] 649321 1501443 1558355 1684020 1206196 1858874 1691882 492303 739807 1566262 ...

To illustrate some of the long vector support that is still missing in R at the moment:

# we cannot set row names so creating a data.frame is not possible
attr(z, "row.names") <- 1:length(z$X)
#> Error in attr(z, "row.names") <- 1:length(z$X): long vectors not supported yet: attrib.c:42

# also not with data.table methods
data.table::setattr(z, "row.names", 1:length(z$X))
#> Error in data.table::setattr(z, "row.names", 1:length(z$X)) : long vectors not supported yet: attrib.c:42

# and we cannot create a data.table
class(z) <- "data.table"
z
#> Error in dim.data.table(x): long vectors not supported yet: ../include/Rinlinedfuns.h:522

So until long vectors are supported here, I think we'd best stick with returning a named list.

wlandau (Author) commented Jun 19, 2019

Amazing! The class<- workaround no longer segfaults. Thank you for the patch!

library(digest)
library(fst)
x <- list(raw = serialize(runif(3e8), NULL, ascii = FALSE))
class(x) <- "data.frame"
length(x$raw) > 2^31
#> [1] TRUE
path <- tempfile()
write_fst(x, path)
y <- read_fst(path)
digest(x$raw, serialize = FALSE) == digest(y$raw, serialize = FALSE)
#> [1] TRUE

Created on 2019-06-18 by the reprex package (v0.3.0)
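(For completeness, recovering the original object from the column read back above is just an unserialize() call; a sketch, not part of the reprex timings:)

# reconstruct the original numeric vector from the raw column
original <- unserialize(y$raw)
length(original) # should be 3e8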

Now I can benchmark richfitz/storr#109 against richfitz/storr#111.

wlandau (Author) commented Jun 19, 2019

The benchmarks at richfitz/storr#109 (comment) are interesting. In the case of storr, it looks like we can save large-ish data faster with write_fst() on a faux data.frame, but we can read small data faster with the approach in richfitz/storr#109 (comment). Both choices uniformly and noticeably outperform storr's default compression setting. Either would be a huge help for drake.
