
Initial work on splitting #79

Closed
wants to merge 12 commits

Conversation

AlexAxthelm
Collaborator

This is a start on building the do functionality that I mentioned in #77.

I expect this to fail on Travis and AppVeyor, since I used test_with_dir() rather than the vanilla test_that(). Also there is no documentation, but the last test shows an (impractical, but fast-running) example case for sequential work. The graph for the plan is shown here:
[image: graph of the workflow plan]

I have no idea why the parallel pipelines are graphed with a curve to them, but notice that they all stem from the (imported blue) mtcars on the left, and reconnect at mtc_cleaned near the right. I chose to run the recombined data frame through an lm, so that I wouldn't have to worry about test_that complaining about differently ordered rows, even though they should be the same.

This is the beginning of a parallel `do` for drake, allowing
simultaneous parallel processing pipelines.
@wlandau-lilly
Collaborator

wlandau-lilly commented Aug 30, 2017

I really like what you have done so far. Tidyverse or not, it is super easy to split up a data frame. Some preliminary thoughts:

Big datasets as targets

What if the big dataset is a target later in the workflow? There is a little hiccup in the interface if it is not yet in the environment.

drake_split(x, data_rows = 4, splits = 2)
## Error in inherits(x, "grouped_df") : object 'x' not found
x <- 1
drake_split(x, data_rows = 4, splits = 2)
##      target              command
## 1 slice_x_1 dplyr::slice(x, 1:2)
## 2 slice_x_2 dplyr::slice(x, 3:4)

And it may be expedient to specify the dataset with a quoted target name.

drake_split("targ", data_rows = 4, splits = 2)
##           target                   command
## 1 slice_"targ"_1 dplyr::slice("targ", 1:2)
## 2 slice_"targ"_2 dplyr::slice("targ", 3:4)

You might look at readd() for a possible adjustment.

readd <- function(target, character_only = FALSE, ...){
   ...
   if(!character_only) target = as.character(substitute(target))
   ...
}
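A minimal, hypothetical illustration of that pattern outside of drake (the function name `grab` is made up here):

```r
# Hypothetical illustration of the character_only pattern from readd():
# substitute() captures the unevaluated argument, so the object named
# by `target` never has to exist in the calling environment.
grab <- function(target, character_only = FALSE) {
  if (!character_only) {
    target <- as.character(substitute(target))
  }
  target
}

grab(some_future_target)                           # works even if undefined
grab("some_future_target", character_only = TRUE)  # quoted name, same result
```

Applied to drake_split(), this would let users name a dataset that is itself a target built later during make().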

In this case, it may also be unwise to hard-code the row indices passed to dplyr::slice(), since the actual number of rows in the target may be unpredictable.

Suggested number of splits

The default splits = ceiling(nrow(.data) / 1e6) seems like a great choice based on my experience with R's performance, but I wonder if we are missing a more sophisticated metric or community standard. And to my knowledge, nobody has rigorously explored the performance tradeoff between target size and the number of targets for drake specifically. This point will not stop me from accepting the PR, just thinking out loud.

@AlexAxthelm
Collaborator Author

I agree. This is mostly something I have been using semi-interactively, so I've been using it on targets that already exist. The pipe operator fails outright if the object doesn't exist (kwyjibo %>% head()).

I agree that character_only is the way to go for passing objects that don't exist yet.

I'll probably end up making an additional target, like the grouplist target that I've used for the grouped frames, and I'll probably end up using base subsetting rather than dplyr: frame[rowlist[[X]], ]
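A rough sketch of that base-subsetting idea, assuming a precomputed list of row indices (the names `rowlist`, `chunk_3`, and `recombined` are illustrative, not part of drake):

```r
# Hypothetical sketch: split a data frame via a precomputed list of row
# indices and base `[` subsetting, instead of dplyr::slice().
n <- nrow(mtcars)
rowlist <- split(seq_len(n), rep(1:4, each = ceiling(n / 4), length.out = n))

# Each "split" target selects its rows with base subsetting:
chunk_3 <- mtcars[rowlist[[3]], ]

# Recombining the chunks is an rbind over the index list:
recombined <- do.call(rbind, lapply(rowlist, function(i) mtcars[i, ]))
```

Because only the index list needs to exist up front, the data frame itself can be a target that is built later during make().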

Alex Axthelm added 6 commits August 30, 2017 16:55
This will remove the need to know the number of rows ahead of time.
Further, it will ensure that if the split frame is not in the
environment, drake won't freak out about it. Also, TODO: include a
`character_only` argument to pass a character name, rather than an
actual object name.
Previously, data would need to exist for split to work.
@wlandau-lilly
Collaborator

I forgot to address your point about the vertical positions of nodes. I struggled with this for 4.0.0, but could not get visNetwork to respond to all of my instructions in the x, y, and level columns of the nodes data frame. I hope visNetwork solves this, I vaguely remember posting an issue. The layout argument of drake::plot_graph() and drake::render_graph() offers a tiny bit of flexibility in the meantime, but it is not a complete solution.

Now that I'm using indices rather than counts, I want to determine
which list is the shortest and push to that, rather than pushing to the
one that has the smallest sum (which was appropriate when I was using
counts of indices)
@AlexAxthelm AlexAxthelm mentioned this pull request Aug 31, 2017
@AlexAxthelm
Collaborator Author

Did some work on this branch today.

Major Updates:

  • Changed the structure of drake_split() plans, so they have the same structure regardless of whether the .data is grouped or not.
    • The actual subsetting/slicing is taken care of when make is run, rather than when the plan is made, meaning that the plan can be made before the data exists. (Must still manually define the number of splits in that case, so an estimated number of rows is still helpful to have.)
  • Created the terribly named analyses_split, which is a variant of analyses() that makes writing the parallel plans a bit easier.
    • seriously, the function name should be changed.
    • The magic_wildcards argument name should probably also be changed. It is a binary flag that takes the methods plan, auto-magically treats the commands as ordered, and builds the full plan with the splits inserted and dependencies on the previous line in methods.
      • weird to explain, but the example-test in the updated test-split.R has the flag enabled, allowing methods to just have ..datasets.., none of the weird step_one_..datasets.. that I needed before
  • Moved tests back to test_that() so that Travis can actually try them.
  • Switched to base selection, rather than dplyr::filter().
  • Changed phrasing from "slice" to "split" (partially done).

@AlexAxthelm
Collaborator Author

Upon seeing the CI results, I'm noticing that I inadvertently put a dependency on dplyr in here. I hadn't actually noticed that it wasn't in there before. It's kind of a hard requirement for checking on dplyr groups, which seem like a natural breakpoint for data frames.

My suggestion would be either:

  1. Add dplyr to Suggests, and don't check for groups if it's not installed, since the user couldn't have them anyway.
  2. Remove magrittr from Imports, replace it with dplyr, and import the pipe operator from there.

I don't have much of a preference, especially since dplyr is a super common package to have installed already, and if someone has dplyr, then they also have magrittr.

@wlandau-lilly
Collaborator

wlandau-lilly commented Sep 1, 2017

I am not sure about (1) versus (2) just yet. (If (2), we should make sure to update the unfortunate load_packages_parLapply(). Otherwise, parLapply functionality will fail on Windows R-devel.)

The reason I am not sure is that you opened such a huge door with the idea of splitting. We could split evenly across the rows, use dplyr::group_by(), leverage columns as split variables some other way, split things that are not data frames, etc. In fact, it would be really cool to formalize this API to allow different splitting methods to be added in modular fashion. Here, you might have separate functions for uniform row splitting vs group_by() row splitting, and they could share common standards in naming, style, behavior, etc.

I also wonder: if group_by() could dictate a split during make(), we might think about how to manage the number of splits differently. Here are some choices.

  1. Distribute the data over a fixed pre-specified number of splits. Here data chunks would not always correspond 1:1 with the targets. Some targets might get multiple data chunks, some would get nothing.
  2. Amend the workflow plan during a call to make(). We can only really know the best number of splits after we actually have a big dataset that needs splitting. What if we wait until then, modify the graph, and then proceed with the rest of the new workflow? This behavior could also help data analyses where a major strategic decision / pivot is made programmatically somewhere in the middle (choose your own adventure).

(2) sounds extremely difficult, and it may not even be possible. It may even require a successor to drake just as drake arose from some shortcomings of remake.

@AlexAxthelm
Collaborator Author

Splitting seems to me to be a pretty key piece of functionality, even if it's not baked right into drake. I don't know whether this should be a separate extension package (I doubt that's the way to go, though).

The latest push has support for splitting evenly either across rows or, if groups exist, it will split on those instead, and attempt to fill the splits as evenly as possible (one of the reasons that I picked the starwars dataset is that for hair_color there are a few groups that are very large, and many groups with 1-3 entries).

Extending drake_split()

I agree that we should have an extensible function that checks the type of the object being split (I haven't even thought about lists). I'll make some changes on my end to accommodate this. Currently, I'm treating the grouped and ungrouped dataframes, tibbles, etc. as mostly being the same though, and the split_list function outputs a list of row indices that I select upon. I suppose the same method could be used for splitting a list.

In extending the split_<thing>, it seems prudent to start putting a ... into the drake_split arguments, so we can include some non-standard ways of defining split criteria.

Auto-splitting

As for splitting being something that happens natively in make(), I am opposed to this. drake is friendly, but definitely a package for power-users. I think that splitting automatically would remove both transparency and plan customization from the end-user.

For starters, we would have to pick a default number of splits (number of jobs? number of jobs that aren't being used on this step? nrow()/1e6? something else?). Whatever default we pick for make() would end up being a bad choice for a lot of people, either because it leaves each split too large for a user's machine (my issue), or because it over-parallelizes and introduces too much overhead.

Additionally, I think that splitting needs to be explicitly requested by the user. This is especially true in the case of grouped_dfs. Ungrouped data won't change order, but because I'm trying to split the work evenly, I end up moving some groups around, so that when all of the splits are reunited, they are almost certainly not in the same order. starwars by hair_color will yield groups 10, 7, 4 first (since they are the largest), with everything else coming after them. This is not unexpected behavior for anybody working with groups (I believe dplyr::do also takes liberties with group ordering), but I certainly don't want it to be automatic. I know that I work with lead() and lag() frequently enough that the idea of anything changing my row ordering without me knowing makes me a bit twitchy.
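The balancing strategy described above (largest groups assigned first, each to the currently lightest split) might be sketched like this; the function name and the group sizes below are hypothetical, not taken from the actual branch:

```r
# Hypothetical sketch of greedy group balancing: assign the largest
# groups first, always to the split with the fewest rows so far.
balance_groups <- function(group_sizes, splits) {
  assignment <- vector("list", splits)  # group names per split
  loads <- integer(splits)              # running row counts per split
  for (g in names(sort(group_sizes, decreasing = TRUE))) {
    lightest <- which.min(loads)
    assignment[[lightest]] <- c(assignment[[lightest]], g)
    loads[lightest] <- loads[lightest] + group_sizes[[g]]
  }
  assignment
}

# Illustrative group sizes (not the real starwars hair_color counts):
sizes <- c(none = 37, brown = 18, black = 13, white = 4, blond = 3)
balance_groups(sizes, splits = 2)
```

Note that this necessarily reorders groups relative to the original row order, which is exactly why the behavior should be opt-in rather than automatic.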

@wlandau-lilly
Collaborator

wlandau-lilly commented Sep 1, 2017

Good points about auto-splitting. You have convinced me that we should choose the number of splits (and whether to split at all) in advance, particularly to maintain predictability and control for end users. What to put in those splits can be decided later. And this makes me a little less worried about drake's ability to keep up with the times.

Also, I just saw drake_unsplit(), and I forgot to mention that gather() does the same thing. In fact, summaries() optionally calls gather(), with an option to turn it off.

@wlandau-lilly
Collaborator

Also, I had assumed we would build these splitting methods into drake itself. The way our discussion is going makes me lean toward replacing magrittr with dplyr in Imports:, etc.

@AlexAxthelm
Collaborator Author

Thanks for pointing out gather(); I had forgotten that it exists.

drake_unsplit() does in fact have the same function as gather(gather = "rbind")

)
}

drake_unsplit <- function(
Collaborator


As I previously mentioned, I think we could rely on gather_plan() here. Maybe that means removing drake_unsplit() entirely, I do not know.

expand = FALSE
)
}
evaluate(
Collaborator

@wlandau-lilly Nov 27, 2017


In preparation for #147, many functions have been deprecated and renamed due to potential name conflicts. For example, evaluate() is now evaluate_plan().

@wlandau
Member

wlandau commented Feb 6, 2018

@AlexAxthelm, since you gave permission, I am closing this PR in favor of #233. I'm glad GitHub retains these things because your ideas are great reference material.

@wlandau wlandau closed this Feb 6, 2018
@krlmlr krlmlr mentioned this pull request Mar 6, 2018
3 participants