
fst for arbitrary data storage? #201

Closed
wlandau opened this issue Jun 16, 2019 · 10 comments

wlandau commented Jun 16, 2019

Inspired by advice from @eddelbuettel here, I am attempting to leverage fst for arbitrary data. Essentially, I would like to take an arbitrary (and arbitrarily large) data structure in memory, serialize it to a raw vector, save it in a one-column data frame with write_fst(), and retrieve it later with read_fst(). Have you tried this before? What would it take to make it work?
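For concreteness, this is roughly the round trip I have in mind (just a sketch; save_object() and load_object() are placeholder names, and the one-column data frame simply holds the serialized bytes):

library(fst)

# sketch: serialize any R object to a raw vector, wrap it in a
# one-column data frame, and write/read it with fst
save_object <- function(value, path, compress = 50) {
  wrapper <- data.frame(actual_data = serialize(value, NULL))
  write_fst(wrapper, path, compress)
}

load_object <- function(path) {
  wrapper <- read_fst(path)
  unserialize(wrapper$actual_data)
}

# round trip on a small object
obj <- list(a = runif(10), b = letters)
path <- tempfile()
save_object(obj, path)
identical(obj, load_object(path)) # should be TRUE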

Benchmarks for small-ish data are encouraging

library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#>    user  system elapsed 
#>   0.362   0.019   0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#>    user  system elapsed 
#>   0.314   1.340   1.689

but there are some roadblocks with big data / long vectors.

library(fst)
x <- data.frame(x = raw(2^32))
#> Warning in attributes(.Data) <- c(attributes(.Data), attrib): NAs
#> introduced by coercion to integer range
#> Error in if (mirn && nrows[i] > 0L) {: missing value where TRUE/FALSE needed
x <- list(x = raw(2^32))
as.data.frame(x)
#> Warning in attributes(.Data) <- c(attributes(.Data), attrib): NAs
#> introduced by coercion to integer range
#> Error in if (mirn && nrows[i] > 0L) {: missing value where TRUE/FALSE needed
class(x) <- "data.frame"
file <- tempfile()
write_fst(x, file) # No error here...
# read_fst(file)   # but I get a segfault here.

Created on 2019-06-16 by the reprex package (v0.3.0)

wlandau changed the title from "fst for arbitrary data storage" to "fst for arbitrary data storage?" on Jun 16, 2019
MarcusKlik (Collaborator) commented Jun 16, 2019

Hi @wlandau, thanks for your question and pointing me to the stackoverflow issue! Indeed, you can use fst to serialize a raw vector to disk and benefit from the multi-threaded (de-)compression to get smaller files and faster IO.

With two methods very similar to your code example:

write_raw <- function(x, path, compress = 50) {

  # create a list and add required attributes
  y <- list(X = x)
  attributes(y) <- c(attributes(y), class = "data.frame")

  # serialize and compress to disk
  fst::write_fst(y, path, compress)
}

read_raw <- function(path) {
  
  # read from disk
  z <- fst::read_fst(path)
  
  z$X
}

you can write your raw vector to disk and read it back without any overhead from (internal) copying:

# 2 GB raw vector
x <- rep(as.raw(0:255), 2^23 - 1)

microbenchmark::microbenchmark(
  write_raw(x, "arbitrary.fst", 70),
  times = 1
)
#> Unit: milliseconds
#>                               expr      min       lq     mean   median
#>  write_raw(x, "arbitrary.fst", 70) 739.9292 739.9292 739.9292 739.9292
#>        uq      max neval
#>  739.9292 739.9292     1

microbenchmark::microbenchmark(
  z <- read_raw("arbitrary.fst"),
  times = 1
)
#> Unit: seconds
#>                            expr     min      lq    mean  median      uq
#>  z <- read_raw("arbitrary.fst") 4.02801 4.02801 4.02801 4.02801 4.02801
#>      max neval
#>  4.02801     1

(Note that I'm using the microbenchmark package because system.time() doesn't handle multi-threaded code very well.)

This will give you a ~300 MB file with the compressed vector data, so a compression factor of 6-7 for this particular sample (and compression setting of 70).
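Both the round trip and the file size are easy to verify after the benchmarks above (untimed):

# the re-read vector should be byte-identical to the original
identical(x, z)

# compressed file size relative to the ~2 GB in-memory vector
file.size("arbitrary.fst") / length(x)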

This setup should scale to raw vectors with sizes up to 2^64 - 1, except, as your segfault already shows, it doesn't :-).

I will try to pinpoint and fix this problem as soon as possible and get back to you on that, thanks!

MarcusKlik self-assigned this on Jun 16, 2019
MarcusKlik added the bug label on Jun 16, 2019
MarcusKlik added this to the fst v0.9.2 milestone on Jun 16, 2019
MarcusKlik (Collaborator) commented:

Hi @wlandau, as a side-note, fst::compress_fst() and fst::decompress_fst() do work as expected and can take raw vectors larger than 2 GB:

# 4 GB raw vector
x <- rep(as.raw(0:255), 2^24)

microbenchmark::microbenchmark(
  y <- fst::compress_fst(x),
  times = 1
)
#> Unit: milliseconds
#>                       expr      min       lq     mean   median       uq
#>  y <- fst::compress_fst(x) 236.1515 236.1515 236.1515 236.1515 236.1515
#>       max neval
#>  236.1515     1

microbenchmark::microbenchmark(
  z <- fst::decompress_fst(y),
  times = 1
)
#> Unit: milliseconds
#>                         expr      min       lq     mean   median       uq
#>  z <- fst::decompress_fst(y) 413.8044 413.8044 413.8044 413.8044 413.8044
#>       max neval
#>  413.8044     1
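And the same kind of (untimed) check for the in-memory round trip and its compression ratio:

# the decompressed vector should match the 4 GB input exactly
identical(x, z)

# compressed size as a fraction of the original
length(y) / length(x)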

wlandau (Author) commented Jun 16, 2019

@MarcusKlik, I am so excited that you are willing to work on this! A solution would be such a boon to the stuff I work on, re ropensci/drake#907, richfitz/storr#103, and richfitz/storr#108. cc @richfitz and @nettoyoussef.

wlandau (Author) commented Jun 16, 2019

And thanks for the tip about compress_fst(). Both the compression and the parallel hashing are very relevant to richfitz/storr#108.
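For context, this is roughly how I picture using the parallel hashing in storr (just a sketch; the object being hashed is a placeholder, and I'm assuming hash_fst() can be applied directly to a serialized raw vector):

library(fst)

# serialize an arbitrary object to a raw vector, then hash the bytes
# with fst's multi-threaded hasher before storing them
payload <- serialize(runif(1e6), NULL)
hash_fst(payload)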

MarcusKlik (Collaborator) commented:

Hi @wlandau, drake and storr are very interesting projects, and it looks like there could be a lot of opportunity for fst to make a difference, very nice!

wlandau (Author) commented Jun 16, 2019

My thoughts exactly! drake is designed for big computation (long runtimes) and it currently struggles with big data. There is so much important ground to gain.

MarcusKlik (Collaborator) commented:

Hi @wlandau, I've tracked the problem to a downcast in the fstlib library (the usual suspect with problems around 2^31 values) and believe it to be fixed now.

However, as you already mention in your stackoverflow issue, long vectors are not (yet) supported in data.tables, tibbles, or data.frames.

Therefore, I think the best solution for reading datasets with >=2^31 rows is to return a named list object instead of a data.frame. That doesn't stop the solution above from returning your raw vector, so that should work as expected now!

MarcusKlik (Collaborator) commented:

As an example, we can store and retrieve a single-column 8 GB data.frame with >2^31 rows:

# create integer column with >2^31 rows
x <- list(X = rep(sample(0:2000000, 256), 2^23 + 10))
attributes(x) <- c(attributes(x), list(class = "data.frame"))

# printing does not work for a long vector data.frame without row names
x
#> [1] X
#> <0 rows> (or 0-length row.names)

# but the data is there
length(x$X)
#> [1] 2147486208

# serialize and compress to disk
fst::write_fst(x, "arbitrary.fst", 100)

# read from disk
z <- fst::read_fst("arbitrary.fst")

# a named list is returned now
str(z)
#> List of 1
#>  $ X: int [1:2147486208] 649321 1501443 1558355 1684020 1206196 1858874 1691882 492303 739807 1566262 ...

To illustrate some of the long vector support that is still missing in R at the moment:

# we cannot set row names so creating a data.frame is not possible
attr(z, "row.names") <- 1:length(z$X)
#> Error in attr(z, "row.names") <- 1:length(z$X): long vectors not supported yet: attrib.c:42

# also not with data.table methods
data.table::setattr(z, "row.names", 1:length(z$X))
#> Error in data.table::setattr(z, "row.names", 1:length(z$X)) : long vectors not supported yet: attrib.c:42

# and we cannot create a data.table
class(z) <- "data.table"
z
#> Error in dim.data.table(x): long vectors not supported yet: ../include/Rinlinedfuns.h:522

So until long vectors are supported here, I think we'd best stick with returning a named list.

wlandau (Author) commented Jun 19, 2019

Amazing! The class<- workaround no longer segfaults. Thank you for the patch!

library(digest)
library(fst)
x <- list(raw = serialize(runif(3e8), NULL, ascii = FALSE))
class(x) <- "data.frame"
length(x$raw) > 2^31
#> [1] TRUE
path <- tempfile()
write_fst(x, path)
y <- read_fst(path)
digest(x$raw, serialize = FALSE) == digest(y$raw, serialize = FALSE)
#> [1] TRUE

Created on 2019-06-18 by the reprex package (v0.3.0)
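(For completeness, recovering the original object from the column read back above is just an unserialize() call; a sketch, not part of the reprex timings:)

# reconstruct the original numeric vector from the raw column
original <- unserialize(y$raw)
length(original) # should be 3e8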

Now I can benchmark richfitz/storr#109 against richfitz/storr#111.

wlandau (Author) commented Jun 19, 2019

The benchmarks at richfitz/storr#109 (comment) are interesting. In the case of storr, it looks like we can save large-ish data faster with write_fst() on a faux data.frame, but we can read small data faster with the approach in richfitz/storr#109 (comment). Both choices uniformly and noticeably outperform storr's default compression setting. Either would be a huge help for drake.
