new transform function `split` to chunk a data.frame #833
Such a good idea! However, at this stage of development, I am resistant to building it into `drake` itself. That said, this is a fantastic technique. Perhaps it belongs in the manual, re #80. Maybe in a new example chapter?
Reopening: if we handle this in a more general way, we may be able to get most of what #685 is trying to solve.

EDIT, 2019-05-20: a digression on indices

@kendonB, I like your original suggestion of focusing on the indices. Transforming the indices instead of the data might help us avoid loading the entire dataset into memory. That alone is worth a special case! I have thought about something like this:

```r
drake_plan(
  x = target(
    read_csv(file_in("large.csv"), skip = i, n_max = 1e2),
    transform = split(i, .size = 1e6, .splits = 1e4)
  )
)
```

But it requires prior knowledge of the size of the data.
@kendonB, your sketch from #833 (comment) gets us almost there. Given the demand for this use case, I think we could add a special `split()` transform.

Sketch:

```r
drake_plan(
  iris_chunked = target(
    iris,
    transform = split(iris, .splits = 3, .dim = 2, .margin = 1)
  )
)
#> # A tibble: 3 x 2
#>   target         command
#>   <chr>          <expr>
#> 1 iris_chunked_1 drake_split(iris, .splits = 3, .dim = 2, .margin = 1, .index = 1)
#> 2 iris_chunked_2 drake_split(iris, .splits = 3, .dim = 2, .margin = 1, .index = 2)
#> 3 iris_chunked_3 drake_split(iris, .splits = 3, .dim = 2, .margin = 1, .index = 3)
```

where `drake_split()` returns the chunk of the data selected by `.index`.
You know what? It would save a lot of hard work in the implementation and documentation to just leverage the existing transforms:

```r
drake_plan(
  iris_chunked = target(
    drake_split(iris, splits = 3, margin = 1, index = i),
    transform = map(i = c(1, 2, 3))
  )
)
#> # A tibble: 3 x 2
#>   target         command
#>   <chr>          <expr>
#> 1 iris_chunked_1 drake_split(iris, splits = 3, margin = 1, index = 1)
#> 2 iris_chunked_2 drake_split(iris, splits = 3, margin = 1, index = 2)
#> 3 iris_chunked_3 drake_split(iris, splits = 3, margin = 1, index = 3)
```
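For intuition, here is a minimal base-R sketch of what a `drake_split()`-style helper could do for the row-wise (`margin = 1`) case. `split_chunk()` is a hypothetical name written for this comment, not part of `drake`; it assumes contiguous chunks whose sizes differ by at most one row.

```r
# Hypothetical sketch (not drake's actual implementation): return chunk
# `index` of `data`, split row-wise into `splits` contiguous chunks whose
# sizes differ by at most one row.
split_chunk <- function(data, splits, index) {
  n <- nrow(data)
  sizes <- rep(n %/% splits, splits)
  # Distribute the remainder over the first few chunks.
  sizes[seq_len(n %% splits)] <- sizes[seq_len(n %% splits)] + 1
  end <- cumsum(sizes)[index]
  start <- end - sizes[index] + 1
  data[start:end, , drop = FALSE]
}

nrow(split_chunk(iris, splits = 3, index = 1))  # 50 rows per chunk for iris
```

Each static target would then call `split_chunk()` with its own `index`, so no target ever needs more than one chunk in memory at a time (beyond the parent dataset itself).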
But we could probably use a better name than `drake_split()`.
In fact, I like using the existing transforms a lot more because it ends up being more flexible. One example that a dedicated `split()` transform could not easily express:

```r
drake_plan(
  iris_chunked = target(
    fn(drake_split(iris, splits = 3, margin = 1, index = i)),
    transform = cross(i = c(1, 2, 3), fn = c(analysis1, analysis2), .id = FALSE)
  )
)
#> # A tibble: 6 x 2
#>   target         command
#>   <chr>          <expr>
#> 1 iris_chunked   analysis1(drake_split(iris, splits = 3, margin = 1, index = 1))
#> 2 iris_chunked_2 analysis1(drake_split(iris, splits = 3, margin = 1, index = 2))
#> 3 iris_chunked_3 analysis1(drake_split(iris, splits = 3, margin = 1, index = 3))
#> 4 iris_chunked_4 analysis2(drake_split(iris, splits = 3, margin = 1, index = 1))
#> 5 iris_chunked_5 analysis2(drake_split(iris, splits = 3, margin = 1, index = 2))
#> 6 iris_chunked_6 analysis2(drake_split(iris, splits = 3, margin = 1, index = 3))
```
A couple more thoughts:

```r
library(parallel)

# Base-R alternative to parallel::splitIndices().
# Note: this quick sketch drops the remainder when n is not divisible
# by splits; the remainder handling is discussed below.
base_indices <- function(n, splits) {
  out <- list()
  delta <- floor(n / splits)
  for (i in seq_len(splits)) {
    out[[i]] <- seq.int(from = 1 + delta * (i - 1), to = delta * i, by = 1)
  }
  out
}

microbenchmark::microbenchmark(
  x = splitIndices(1e7, 1e4),
  y = base_indices(1e7, 1e4)
)
#> Unit: milliseconds
#>  expr        min         lq       mean     median         uq       max neval
#>     x 1704.36510 1828.32656 1892.58508 1869.34639 1940.80053 2159.7202   100
#>     y   20.16306   42.41099   53.84792   45.20207   48.65589  177.5154   100
```
Does your solution handle the last chunk being smaller than the rest? On my phone, so I can't check right now.
Yes, see lines 69 to 73 at commit 6e33d92. We also have a unit test to confirm that the chunk sizes differ by no more than one element (drake/tests/testthat/test-utils.R, lines 224 to 226 at 6e33d92).
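For reference, here is a base-R sketch of index splitting with that same guarantee (chunk lengths differ by at most one, and every index is covered exactly once). `balanced_indices()` is a hypothetical helper written for this comment, not drake's internal code, and it assumes `n >= splits`.

```r
# Hypothetical helper: split seq_len(n) into `splits` contiguous chunks
# whose lengths differ by at most one, covering every index exactly once.
# Assumes n >= splits (no zero-length chunks).
balanced_indices <- function(n, splits) {
  sizes <- rep(n %/% splits, splits)
  # The first (n %% splits) chunks absorb one extra element each.
  sizes[seq_len(n %% splits)] <- sizes[seq_len(n %% splits)] + 1
  ends <- cumsum(sizes)
  starts <- ends - sizes + 1
  Map(seq.int, starts, ends)
}

lengths(balanced_indices(10, 3))  # 4 3 3: the remainder is not dropped
```

Unlike the floor-based sketch in the benchmark above, this version never loses the trailing remainder when `n` is not divisible by `splits`.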
Neat. I think this is worth an email to r-devel, as (1) there may be some pitfall we're missing, and (2) if not, your approach could significantly improve `parallel::splitIndices()`.
Reopening to investigate #876 (comment).
From #876 (comment):
To elaborate: the DSL should turn this:

into this:

To avoid opening Pandora's box, let's assume one dataset per `split()` call.
A useful note:

```r
library(drake)
plan <- drake_plan(
  all_rows = file_in("huge.csv") %>%
    number_of_rows() %>%
    seq_len(),
  rows = target(
    all_rows,
    transform = split(all_rows, slices = 3)
  ),
  analysis = target(
    read_rows( # custom function
      file = file_in("huge.csv"),
      rows = rows
    ) %>%
      analyze_data(),
    transform = map(rows, .id = rows_index) # an internal trick
  )
)
drake_plan_source(plan)
#> drake_plan(
#>   all_rows = file_in("huge.csv") %>%
#>     number_of_rows() %>%
#>     seq_len(),
#>   rows_1 = drake_slice(data = all_rows, slices = 3, index = 1),
#>   rows_2 = drake_slice(data = all_rows, slices = 3, index = 2),
#>   rows_3 = drake_slice(data = all_rows, slices = 3, index = 3),
#>   analysis_1 = read_rows(file = file_in("huge.csv"), rows = rows_1) %>% analyze_data(),
#>   analysis_2 = read_rows(file = file_in("huge.csv"), rows = rows_2) %>% analyze_data(),
#>   analysis_3 = read_rows(file = file_in("huge.csv"), rows = rows_3) %>% analyze_data()
#> )
```

Created on 2019-05-22 by the reprex package (v0.3.0)
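`number_of_rows()`, `read_rows()`, and `analyze_data()` above are user-defined functions. As one possible (hypothetical) implementation, `read_rows()` could exploit the fact that the sliced row indices are contiguous and read only that block of the file:

```r
# Hypothetical read_rows() sketch: read a contiguous block of data rows
# from a CSV without loading the whole file. Assumes `rows` is a contiguous
# range of data-row indices (1 = first row after the header), as contiguous
# slicing would produce.
read_rows <- function(file, rows) {
  header <- read.csv(file, nrows = 1)  # just to recover the column names
  out <- read.csv(file, skip = min(rows), nrows = length(rows), header = FALSE)
  names(out) <- names(header)
  out
}
```

Because `skip` and `nrows` are passed straight to `read.csv()`, each `analysis_*` target only ever holds its own slice of the file in memory.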
@wlandau I was looking for a way to chunk up a data frame into 10 roughly even chunks and found that there was no way to do it with dynamic branching. The only way I can see is to use…
I agree, it would be a bit more convenient for some users. But in the case of dynamic branching in `drake`, …
Prework

- I agree to abide by `drake`'s code of conduct.

Description

I have found myself often chunking up a data.frame to run crunching code on the chunks like this:

Created on 2019-04-12 by the reprex package (v0.2.1)

A possible interface could be:

Precalculating the `chunks` variable is necessary, as it can be costly for large `nx` and `ncl`. Using `[` and not `slice` is also necessary, as the current `slice` is very slow. You could also potentially add groups to the `split` call.