fst for arbitrary data storage? #201

Inspired by advice from @eddelbuettel here, I am attempting to leverage `fst` for arbitrary data. Essentially, I would like to take an arbitrary (and arbitrarily large) data structure in memory, serialize it to a raw vector, save it in a one-column data frame with `write_fst()`, and retrieve it later with `read_fst()`. Have you tried this before? What would it take to make it work? Benchmarks for small-ish data are encouraging, but there are some roadblocks with big data / long vectors.

Created on 2019-06-16 by the reprex package (v0.3.0)
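End to end, the workflow described above might look like the following sketch, modeled on the `write_raw()`/`read_raw()` helpers further down this thread; the names `save_object()` and `load_object()` are illustrative, not part of `fst`:

```r
# hypothetical helpers (not part of fst): store any R object by
# serializing it to a raw vector and writing that vector as the
# single column of a data frame
save_object <- function(value, path, compress = 50) {
  y <- list(X = serialize(value, NULL))
  attributes(y) <- c(attributes(y), class = "data.frame")
  fst::write_fst(y, path, compress)
}

load_object <- function(path) {
  unserialize(fst::read_fst(path)$X)
}

# round-trip an arbitrary object
tmp <- tempfile()
save_object(mtcars, tmp)
identical(mtcars, load_object(tmp))  # should be TRUE
```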
Comments
Hi @wlandau, thanks for your question and for pointing me to the stackoverflow issue! Indeed, you can use `fst` to serialize a raw vector to disk and benefit from the multi-threaded (de)compression to get smaller files and faster IO. With two methods quite similar to your code example:

```r
write_raw <- function(x, path, compress = 50) {
  # wrap the raw vector in a list and add the data.frame class attribute
  y <- list(X = x)
  attributes(y) <- c(attributes(y), class = "data.frame")

  # serialize and compress to disk
  fst::write_fst(y, path, compress)
}

read_raw <- function(path) {
  # read from disk and extract the raw column
  z <- fst::read_fst(path)
  z$X
}
```

you can write and re-read your raw vector to disk without any overhead from (internal) copying:

```r
# 2 GB raw vector
x <- rep(as.raw(0:255), 2^23 - 1)

microbenchmark::microbenchmark(
  write_raw(x, "arbitrary.fst", 70),
  times = 1
)
#> Unit: milliseconds
#>                                expr      min       lq     mean   median
#>   write_raw(x, "arbitrary.fst", 70) 739.9292 739.9292 739.9292 739.9292
#>         uq      max neval
#>   739.9292 739.9292     1

microbenchmark::microbenchmark(
  z <- read_raw("arbitrary.fst"),
  times = 1
)
#> Unit: seconds
#>                             expr     min      lq    mean  median      uq
#>   z <- read_raw("arbitrary.fst") 4.02801 4.02801 4.02801 4.02801 4.02801
#>       max neval
#>   4.02801     1
```

(Note that I'm using package …)

This will give you a ~300 MB file with the compressed vector data, so a compression factor of 6-7 for this particular sample (and a compression setting of 70). This setup should scale to raw vectors with sizes up to 2^64 - 1, except, as your segfault already shows, it doesn't :-). I will try to pinpoint and fix this problem as soon as possible and get back to you on that, thanks!
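As a quick sanity check on the numbers above (a minimal sketch; `arbitrary.fst`, `x`, and the ~300 MB figure come from the example):

```r
# confirm the on-disk size of the compressed file (expected: ~300 MB)
file.size("arbitrary.fst") / 1e6

# and confirm the round trip is lossless
identical(x, read_raw("arbitrary.fst"))
```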
---

Hi @wlandau, as a side-note, you can also (de)compress a raw vector entirely in memory with `compress_fst()` and `decompress_fst()`:

```r
# 4 GB raw vector
x <- rep(as.raw(0:255), 2^24)

microbenchmark::microbenchmark(
  y <- fst::compress_fst(x),
  times = 1
)
#> Unit: milliseconds
#>                       expr      min       lq     mean   median       uq
#>  y <- fst::compress_fst(x) 236.1515 236.1515 236.1515 236.1515 236.1515
#>       max neval
#>  236.1515     1

microbenchmark::microbenchmark(
  z <- fst::decompress_fst(y),
  times = 1
)
#> Unit: milliseconds
#>                         expr      min       lq     mean   median       uq
#>  z <- fst::decompress_fst(y) 413.8044 413.8044 413.8044 413.8044 413.8044
#>       max neval
#>  413.8044     1
```
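For reference, `compress_fst()` also takes `compressor` and `compression` arguments (LZ4 by default); a minimal sketch with illustrative settings:

```r
# ZSTD typically yields smaller output than the LZ4 default, at some
# cost in speed; the compression level here is illustrative
y_zstd <- fst::compress_fst(x, compressor = "ZSTD", compression = 50)

# compare the compressed size to the input and check the round trip
length(y_zstd) / length(x)
identical(x, fst::decompress_fst(y_zstd))
```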
---

@MarcusKlik, I am so excited that you are willing to work on this! A solution would be such a boon to the stuff I work on, re ropensci/drake#907, richfitz/storr#103, and richfitz/storr#108. cc @richfitz and @nettoyoussef.
---

And thanks for the tip about `compress_fst()` and `decompress_fst()`.
---

Hi @wlandau, …
---

My thoughts exactly!
---

Hi @wlandau, I've tracked the problem to a downcast in the underlying code. However, as you already mention in your stackoverflow issue, long vectors are not (yet) supported in data.tables, tibbles, or data.frames. Therefore, I think the best solution for reading datasets with >= 2^31 rows is to return a named list object instead of a data.frame. That doesn't stop the solution above from returning your raw vector, so that should work as expected now!
---

As an example, we can store and retrieve a single-column 8 GB data.frame with > 2^31 rows:

```r
# create an integer column with >2^31 rows
x <- list(X = rep(sample(0:2000000, 256), 2^23 + 10))
attributes(x) <- c(attributes(x), list(class = "data.frame"))

# printing does not work for a long-vector data.frame without row names
x
#> [1] X
#> <0 rows> (or 0-length row.names)

# but the data is there
length(x$X)
#> [1] 2147486208

# serialize and compress to disk
fst::write_fst(x, "arbitrary.fst", 100)

# read from disk
z <- fst::read_fst("arbitrary.fst")

# a named list is returned now
str(z)
#> List of 1
#>  $ X: int [1:2147486208] 649321 1501443 1558355 1684020 1206196 1858874 1691882 492303 739807 1566262 ...
```

To illustrate some of the missing long-vector support in base R and data.table:

```r
# we cannot set row names, so creating a data.frame is not possible
attr(z, "row.names") <- 1:length(z$X)
#> Error in attr(z, "row.names") <- 1:length(z$X): long vectors not supported yet: attrib.c:42

# also not with data.table methods
data.table::setattr(z, "row.names", 1:length(z$X))
#> Error in data.table::setattr(z, "row.names", 1:length(z$X)) : long vectors not supported yet: attrib.c:42

# and we cannot create a data.table
class(z) <- "data.table"
z
#> Error in dim.data.table(x): long vectors not supported yet: ../include/Rinlinedfuns.h:522
```

So until long vectors are supported there, I think we'd best stick with returning a named list.
---

Amazing! The fix works for me:

```r
library(digest)
library(fst)

x <- list(raw = serialize(runif(3e8), NULL, ascii = FALSE))
class(x) <- "data.frame"
length(x$raw) > 2^31
#> [1] TRUE

path <- tempfile()
write_fst(x, path)
y <- read_fst(path)
digest(x$raw, serialize = FALSE) == digest(y$raw, serialize = FALSE)
#> [1] TRUE
```

Created on 2019-06-18 by the reprex package (v0.3.0)

Now I can benchmark richfitz/storr#109 against richfitz/storr#111.
---

The benchmarks at richfitz/storr#109 (comment) are interesting. In the case of …