Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for disk.frame #1004

Closed
xiaodaigh opened this issue Sep 5, 2019 · 12 comments
Closed

Add support for disk.frame #1004

xiaodaigh opened this issue Sep 5, 2019 · 12 comments

Comments

@xiaodaigh
Copy link
Contributor

xiaodaigh commented Sep 5, 2019

drake has recently added support for fst. It would be great to add disk.frame as a storage medium as well. E.g. make it possible to do the below.

library(drake)
n <- 1e8 # Each target is 1.6 GB in memory.
plan <- drake_plan(
  data_fst = target(
    data.frame(x = runif(n), y = runif(n)),
    format = "disk.frame"
  ),
  data_old = data.frame(x = runif(n), y = runif(n))
)
@wlandau
Copy link
Member

wlandau commented Sep 5, 2019

Yeah, let's definitely explore this. disk.frame looks really promising at a glance, I will need to try it out and learn more about it. As we go forward, we might also be able to use this approach to reduce the memory consumption of the split transformation, i.e. drake_plan(x = target(..., transform = split(...))).

@wlandau
Copy link
Member

wlandau commented Sep 11, 2019

After reading the quick start guide and the article on ingesting data, and I think a disk.frame format is definitely possible. Another sketch:

library(nycflights13)
library(dplyr)
library(disk.frame)
library(data.table)
library(drake)

plan <- drake_plan(
  flights.df = target(
    as.disk.frame(
      flights, 
      outdir = file.path(tempdir(), "tmp_flights.df"),
      overwrite = TRUE
    ),
    format = "disk.frame"
  )
)

But there are some big caveats:

  • Let's say the target's return value is already a disk.frame. No matter where it initially lives in storage, drake needs to physically move it to a special subdirectory of the cache. (The cache is usually a hidden .drake/ folder in the working directory where drake::make() is called.) If the disk.frame were to remain external, it would be much safer and more reproducible to track it as an external directory using file_in() and/or file_out(), which is less convenient than target(format = "disk.frame"). If the disk.frame is already on the same drive as the cache, I suspect a simple file.rename() will be quick. Otherwise, the transfer will take a long time and may even bottleneck the pipeline.
  • If the target is not already a disk.frame, it must be coercible to one. I think we can handle data.frames and data.tables, but users who begin with CSV or zip files will probably need to ingest them manually.

Is all this reasonable?

@xiaodaigh
Copy link
Contributor Author

I probably need time to digest this. Apologies if I appear slow when I mull over this. Really appreciate your responsiveness here.

@wlandau
Copy link
Member

wlandau commented Sep 11, 2019

Sure, we can take our time. Let me know if you need clarification.

@kendonB
Copy link
Contributor

kendonB commented Sep 25, 2019

A benefit of disk.frame over the current split transform function is that you can shardby a variable if you want to keep a dimension of the data all together at the analysis stage. Or is there a way for drake's split to currently do that?

@wlandau
Copy link
Member

wlandau commented Sep 25, 2019

If shareby is similar to group_by(), then split() does not do this. However, I have plans for a dynamic group_map() when it comes time to extend #685.

@xiaodaigh
Copy link
Contributor Author

Would it work that when you run create a disk.frame inside a plan using as.disk.frame, the out directory is automatically in the .drake folder? So we basically disable outdir for the user when using drake? This will circumvent the copy to .drake issue with disk.frame. I think tighter integration like this will be awesome. Ultimately, for many use-cases, I don't think the user really cares where the data lives, as long as it's there.

@wlandau
Copy link
Member

wlandau commented Sep 27, 2019

Seems like the most expedient approach we have. A couple thoughts:

  1. We just need to bear in mind that the initial out directory is still not going to be the final location, even if it is inside .drake/. The disk.frame directory with the chunks ultimately needs to have a special name (just a hash) so drake can find it again. That said, moving the files from the initial location to the final location should be practically instantaneous because the whole .drake/ folder is on the same drive. All we need to do is rename all the chunks. Still sound reasonable?
  2. In the docs, maybe we should advise users to consider running frequent garbage collection on the cache. New disk.frames could be created rather often, and drake keeps them all for the sake of history and data recovery. drake_gc() removes all but the most current targets.

@wlandau
Copy link
Member

wlandau commented Sep 27, 2019

By the way, do you have ideas on how we disable outdir automatically? In as.disk.frame(), the default outdir is tempfile(fileext = ".df"). It may be simpler to require something like as.disk.frame(outdir = drake_cache_tempfile()) or as.disk.frame(outdir = drake_cache_staging_area()).

@xiaodaigh
Copy link
Contributor Author

Possible option is to have a disk.frame.temp.file() function and a global options disk.frame.use.drake = T then the function returns the drake_cache_tempfile()? This would annoying. Would be cool for plan to capture as.disk.frame() and just use meta-programming to overwrite the directory? That may be too intrusive. Firstly, I need to learn more about drake

@wlandau
Copy link
Member

wlandau commented Sep 27, 2019

Automatic metaprogramming is achievable if the affected commands in the plan contain as.disk.frame(), but things get tricky if people start using package or functions that return disk.frames and there is nothing in the plan to modify. And then there are cases when people want to define their own file_out() directories and control where disk.frames are saved. So I am not so sure. I hesitate to overfit the use cases we can imagine right now.

Personally, I am more concerned about clarity than automation. As long as people know how to use the disk.frame format correctly, a little extra work does not seem like a big deal. Correct me if I am wrong, but disk.frame seems like a tool where people are willing to think a little about file paths anyway.

To be clear, the only consequence of misspecifying outdir is the extra runtime it might take to copy files from one drive to another. For some users, tempdir() will be on the same drive as .drake/, so there is nothing special to do. Otherwise,drake should still be able to get the job done.

@wlandau wlandau mentioned this issue Sep 29, 2019
4 tasks
@wlandau
Copy link
Member

wlandau commented Sep 29, 2019

Implemented in #1022.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants