Initial work on splitting #79
Conversation
This is the beginning of a parallel `do` for drake, allowing simultaneous parallel processing pipelines.
I really like what you have done so far. Tidyverse or not, it is super easy to split up a data frame. Some preliminary thoughts:

**Big datasets as targets**

What if the big dataset is a target later in the workflow? There is a little hiccup in the interface if it is not yet in the environment.

```r
drake_split(x, data_rows = 4, splits = 2)
## Error in inherits(x, "grouped_df") : object 'x' not found
x <- 1
drake_split(x, data_rows = 4, splits = 2)
##      target              command
## 1 slice_x_1 dplyr::slice(x, 1:2)
## 2 slice_x_2 dplyr::slice(x, 3:4)
```

And it may be expedient to specify the dataset with a quoted target name.

```r
drake_split("targ", data_rows = 4, splits = 2)
##           target                   command
## 1 slice_"targ"_1 dplyr::slice("targ", 1:2)
## 2 slice_"targ"_2 dplyr::slice("targ", 3:4)
```

You might look at the pattern in `readd()`:

```r
readd <- function(target, character_only = FALSE, ...){
  ...
  if (!character_only) target = as.character(substitute(target))
  ...
}
```

In this case, it may also be unwise to hard-code the row indices passed to […]

**Suggested number of splits**

The default […]
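The `readd()` pattern above can be sketched in isolation. Here, `split_target_name()` is a hypothetical helper name for illustration, not part of drake:

```r
# Hypothetical helper (not drake's API): resolve a target name the way
# readd() does, accepting either a bare symbol or a quoted string.
split_target_name <- function(target, character_only = FALSE) {
  if (!character_only) {
    # substitute() captures the unevaluated argument, so this works even
    # when the target object does not exist in the calling environment.
    target <- as.character(substitute(target))
  }
  target
}

split_target_name(big_data)                          # "big_data", even if undefined
split_target_name("big_data", character_only = TRUE) # "big_data"
```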
I agree. This is mostly something I have been using semi-interactively, so I've been using it on targets that already exist. The pipe operator fails outright if the object doesn't exist. I agree that I probably will end up making an additional target, like the […]
This will remove the need to know the number of rows ahead of time. Further, it will ensure that drake won't freak out if the split data frame is not yet in the environment. Also, TODO: include a `character_only` argument to pass a character name, rather than an actual object name.
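One way the runtime-index idea could look, as a rough sketch (`make_split_plan` is a hypothetical name): emit commands that compute their own row indices with `dplyr::n()` at build time, so the plan never needs to know `nrow` up front.

```r
# Hypothetical sketch: slice commands that defer row-counting to build time
# via dplyr::n(), so the data frame need not exist when the plan is created.
make_split_plan <- function(target, splits) {
  data.frame(
    target = sprintf("slice_%s_%d", target, seq_len(splits)),
    command = sprintf(
      "dplyr::slice(%s, seq(%d, dplyr::n(), by = %d))",
      target, seq_len(splits), splits
    ),
    stringsAsFactors = FALSE
  )
}
```

Note this interleaves rows across splits rather than slicing contiguous blocks; either scheme works as long as the recombining step does not assume row order.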
Previously, the data needed to exist for the split to work.
I forgot to address your point about the vertical positions of nodes. I struggled with this for 4.0.0, but could not get […]
Now that I'm using indices rather than counts, I want to determine which list is the shortest and push to that, rather than pushing to the one that has the smallest sum (which was appropriate when I was using counts of indices).
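That greedy rule can be sketched in base R (`assign_groups` is an illustrative name, not the branch's actual function):

```r
# Illustrative sketch of the greedy rule: each group's indices go to the
# split that currently holds the fewest indices.
assign_groups <- function(group_indices, splits) {
  bins <- replicate(splits, integer(0), simplify = FALSE)
  for (idx in group_indices) {
    shortest <- which.min(lengths(bins))
    bins[[shortest]] <- c(bins[[shortest]], idx)
  }
  bins
}
```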
Did some work on this branch today. Major updates: […]
Upon seeing the CI results, I'm noticing that I inadvertently put a dependency on […]. My suggestion would be either: […]
I don't have much of a preference, especially since dplyr is a super common package to have installed already, and if someone has dplyr, then they also have […]
I am not sure about (1) versus (2) just yet. (If (2), we should make sure to update the unfortunate […].)

The reason I am not sure is that you opened such a huge door with the idea of splitting. We could split evenly across the rows, use […]

I also wonder: if […]
(2) sounds extremely difficult, and it may not even be possible. It may even require a successor to […]
Splitting seems to me to be a pretty key piece of functionality, even if it's not baked right into […]. The latest push has support for splitting evenly either across rows or, if groups exist, on those instead, attempting to fill the splits as evenly as possible (one of the reasons that I picked the […]).

Extending […]
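For the row-wise case, near-even contiguous chunks can be computed with base R alone (`even_row_splits` is an illustrative name, not the branch's actual function):

```r
# Illustrative sketch: cut row indices into `splits` contiguous chunks whose
# sizes differ by at most one row.
even_row_splits <- function(n_rows, splits) {
  chunk <- ceiling(seq_len(n_rows) / (n_rows / splits))
  unname(split(seq_len(n_rows), chunk))
}
```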
Good points about auto-splitting. You have convinced me that we should choose the number of splits (and whether to split at all) in advance, particularly to maintain predictability and control for end users. What to put in those splits can be decided later. And this makes me a little less worried about […]. Also, I just saw […]
Also, I had assumed we would build these splitting methods into drake itself. The way our discussion is going makes me lean toward replacing […]
Thanks for pointing out […]
```r
)
}

drake_unsplit <- function(
```
As I previously mentioned, I think we could rely on `gather_plan()` here. Maybe that means removing `drake_unsplit()` entirely, I do not know.
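For reference, `gather_plan()` builds a single recombining target from a set of upstream targets; a base-R sketch of the same idea (`gather_slices` and the default names are illustrative, not drake's exact output):

```r
# Illustrative sketch of the gather step: one target whose command binds the
# slice targets back together, roughly what gather_plan() emits.
gather_slices <- function(slice_targets, target = "unsplit", gather = "rbind") {
  args <- paste(slice_targets, "=", slice_targets, collapse = ", ")
  data.frame(
    target = target,
    command = sprintf("%s(%s)", gather, args),
    stringsAsFactors = FALSE
  )
}
```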
```r
  expand = FALSE
)
}
```
```r
evaluate(
```
In preparation for #147, many functions have been deprecated and renamed due to potential name conflicts. For example, `evaluate()` is now `evaluate_plan()`.
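`evaluate_plan()` substitutes wildcards into plan commands; a minimal base-R sketch of the idea (`evaluate_wildcard` is a hypothetical stand-in, not drake's API):

```r
# Hypothetical sketch: expand each plan row once per wildcard value,
# substituting the value into the command, roughly what evaluate_plan() does.
evaluate_wildcard <- function(plan, wildcard, values) {
  expanded <- lapply(values, function(v) {
    data.frame(
      target = paste(plan$target, v, sep = "_"),
      command = gsub(wildcard, v, plan$command, fixed = TRUE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, expanded)
}
```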
@AlexAxthelm, since you gave permission, I am closing this PR in favor of #233. I'm glad GitHub retains these things because your ideas are great reference material.
This is a start on building the `do` functionality that I mentioned in #77. I expect this to fail on Travis and Appveyor, since I used `test_with_dir()` rather than the vanilla `test_that()`. Also, there is no documentation, but the last test shows an (impractical, but fast-running) example case for sequential work. The graph for the plan is shown here:

I have no idea why the parallel pipelines are graphed with a curve to them, but notice that they all stem from the (imported, blue) `mtcars` on the left, and reconnect at `mtc_cleaned` near the right. I chose to run the recombined data frame through an `lm`, so that I wouldn't have to worry about `test_that` complaining about differently ordered rows, even though they should be the same.