
Initial work on splitting #79

Closed
wants to merge 12 commits

Conversation

AlexAxthelm
Collaborator

This is a start on building the do functionality that I mentioned in #77.

I expect this to fail on Travis and AppVeyor, since I used test_with_dir() rather than the vanilla test_that(). Also there is no documentation, but the last test shows an (impractical, but fast-running) example case for sequential work. The graph for the plan is shown here:
[image: graph of the workflow plan]

I have no idea why the parallel pipelines are graphed with a curve to them, but notice that they all stem from the (imported blue) mtcars on the left, and reconnect at mtc_cleaned near the right. I chose to run the recombined data frame through an lm, so that I wouldn't have to worry about test_that complaining about differently ordered rows, even though they should be the same.

This is the beginning of a parallel `do` for drake, allowing
simultaneous parallel processing pipelines.
@wlandau-lilly
Collaborator

wlandau-lilly commented Aug 30, 2017

I really like what you have done so far. Tidyverse or not, it is super easy to split up a data frame. Some preliminary thoughts:

Big datasets as targets

What if the big dataset is a target later in the workflow? There is a little hiccup in the interface if it is not yet in the environment.

drake_split(x, data_rows = 4, splits = 2)
## Error in inherits(x, "grouped_df") : object 'x' not found
x <- 1
drake_split(x, data_rows = 4, splits = 2)
##      target              command
## 1 slice_x_1 dplyr::slice(x, 1:2)
## 2 slice_x_2 dplyr::slice(x, 3:4)

And it may be expedient to specify the dataset with a quoted target name.

drake_split("targ", data_rows = 4, splits = 2)
##           target                   command
## 1 slice_"targ"_1 dplyr::slice("targ", 1:2)
## 2 slice_"targ"_2 dplyr::slice("targ", 3:4)

You might look at readd() for a possible adjustment.

readd <- function(target, character_only = FALSE, ...){
   ...
   if(!character_only) target = as.character(substitute(target))
   ...
}
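A minimal, hypothetical illustration of that pattern outside of drake (the function name `grab` is made up here):

```r
# Hypothetical illustration of the character_only pattern from readd():
# substitute() captures the unevaluated argument, so the object named
# by `target` never has to exist in the calling environment.
grab <- function(target, character_only = FALSE) {
  if (!character_only) {
    target <- as.character(substitute(target))
  }
  target
}

grab(some_future_target)                           # works even if undefined
grab("some_future_target", character_only = TRUE)  # quoted name, same result
```

Applied to drake_split(), this would let users name a dataset that is itself a target built later during make().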

In this case, it may also be unwise to hard-code the row indices passed to dplyr::slice(), since the actual number of rows in the target may be unpredictable.

Suggested number of splits

The default splits = ceiling(nrow(.data) / 1e6) seems like a great choice based on my experience with R's performance, but I wonder if we are missing a more sophisticated metric or community standard. And to my knowledge, nobody has rigorously explored the performance tradeoff between target size and the number of targets for drake specifically. This point will not stop me from accepting the PR, just thinking out loud.

@AlexAxthelm
Collaborator Author

I agree. This is mostly something I have been using semi-interactively, so I've been using it on targets that already exist. The pipe operator fails outright if the object doesn't exist (kwyjibo %>% head()).

I agree that character_only is the way to go for passing objects that don't exist yet.

I'll probably end up making an additional target, like the grouplist target that I've used for the grouped frames, and I'll probably end up using base subsetting rather than dplyr: frame[rowlist[[X]], ]
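A rough sketch of that base-subsetting idea, assuming a precomputed list of row indices (the names `rowlist`, `chunk_3`, and `recombined` are illustrative, not part of drake):

```r
# Hypothetical sketch: split a data frame via a precomputed list of row
# indices and base `[` subsetting, instead of dplyr::slice().
n <- nrow(mtcars)
rowlist <- split(seq_len(n), rep(1:4, each = ceiling(n / 4), length.out = n))

# Each "split" target selects its rows with base subsetting:
chunk_3 <- mtcars[rowlist[[3]], ]

# Recombining the chunks is an rbind over the index list:
recombined <- do.call(rbind, lapply(rowlist, function(i) mtcars[i, ]))
```

Because only the index list needs to exist up front, the data frame itself can be a target that is built later during make().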

Alex Axthelm added 6 commits August 30, 2017 16:55
This will remove the need to know the number of rows ahead of time.
Further, it will ensure that if the split frame is not in the
environment, drake won't freak out about it. Also, TODO: include a
`character_only` argument to pass a character name, rather than an
actual object name.
Previously, data would need to exist for split to work.
@wlandau-lilly
Collaborator

I forgot to address your point about the vertical positions of nodes. I struggled with this for 4.0.0, but could not get visNetwork to respond to all of my instructions in the x, y, and level columns of the nodes data frame. I hope visNetwork solves this, I vaguely remember posting an issue. The layout argument of drake::plot_graph() and drake::render_graph() offers a tiny bit of flexibility in the meantime, but it is not a complete solution.

Now that I'm using indices rather than counts, I want to determine
which list is the shortest and push to that, rather than pushing to the
one that has the smallest sum (which was appropriate when I was using
counts of indices)
@AlexAxthelm AlexAxthelm mentioned this pull request Aug 31, 2017
@AlexAxthelm
Collaborator Author

Did some work on this branch today.

Major Updates:

  • Changed the structure of drake_split() plans, so they have the same structure regardless of whether the .data is grouped or not.
    • The actual subsetting/slicing is taken care of when make is run, rather than when the plan is made, meaning that the plan can be made before the data exists. (Must still manually define the number of splits in that case, so an estimated number of rows is still helpful to have.)
  • Created the terribly named analyses_split, which is a variant of analyses() that makes writing the parallel plans a bit easier.
    • seriously, the function name should be changed.
    • The magic_wildcards argument name should probably also be changed. It is a binary flag that takes the methods plan, auto-magically treats the commands as ordered, and builds the full plan with the splits inserted and dependencies on the previous line in methods.
      • weird to explain, but the example-test in the updated test-split.R has the flag enabled, allowing methods to just have ..datasets.., none of the weird step_one_..datasets.. that I needed before
  • Moved tests back to test_that() so that Travis can actually try them.
  • Switched to base selection, rather than dplyr::filter().
  • Changed phrasing from "slice" to "split" (partially done).

@AlexAxthelm
Collaborator Author

Upon seeing the CI results, I'm noticing that I inadvertently put a dependency on dplyr in here. I hadn't actually noticed that it wasn't in there before. It's kind of a hard requirement for checking on dplyr groups, which seem like a natural breakpoint for data frames.

My suggestion would be either:

  1. Add dplyr to Suggests, and don't check for groups if it's not installed, since the user couldn't have them anyway.
  2. Remove magrittr from Imports, replace it with dplyr, and import the pipe operator from there.

I don't have much of a preference, especially since dplyr is a super common package to have installed already, and if someone has dplyr, then they also have magrittr.

@wlandau-lilly
Collaborator

wlandau-lilly commented Sep 1, 2017

I am not sure about (1) versus (2) just yet. (If (2), we should make sure to update the unfortunate load_packages_parLapply(). Otherwise, parLapply functionality will fail on Windows R-devel.)

The reason I am not sure is that you opened such a huge door with the idea of splitting. We could split evenly across the rows, use dplyr::group_by(), leverage columns as split variables some other way, split things that are not data frames, etc. In fact, it would be really cool to formalize this API to allow different splitting methods to be added in modular fashion. Here, you might have separate functions for uniform row splitting vs group_by() row splitting, and they could share common standards in naming, style, behavior, etc.

I also wonder: if group_by() could dictate a split during make(), we might think about how to manage the number of splits differently. Here are some choices.

  1. Distribute the data over a fixed pre-specified number of splits. Here data chunks would not always correspond 1:1 with the targets. Some targets might get multiple data chunks, some would get nothing.
  2. Amend the workflow plan during a call to make(). We can only really know the best number of splits after we actually have a big dataset that needs splitting. What if we wait until then, modify the graph, and then proceed with the rest of the new workflow? This behavior could also help data analyses where a major strategic decision / pivot is made programmatically somewhere in the middle (choose your own adventure).

(2) sounds extremely difficult, and it may not even be possible. It may even require a successor to drake just as drake arose from some shortcomings of remake.

@AlexAxthelm
Collaborator Author

Splitting seems to me to be a pretty key piece of functionality, even if it's not baked right into drake. I don't know whether this should be a separate extension package (I doubt that's the way to go, though).

The latest push has support for splitting evenly either across rows or, if groups exist, it will split on those instead, and attempt to fill the splits as evenly as possible (one of the reasons that I picked the starwars dataset is that for hair_color there are a few groups that are very large, and many groups with 1-3 entries).

Extending drake_split()

I agree that we should have an extensible function that checks the type of the object being split (I haven't even thought about lists). I'll make some changes on my end to accommodate this. Currently, I'm treating the grouped and ungrouped dataframes, tibbles, etc. as mostly being the same though, and the split_list function outputs a list of row indices that I select upon. I suppose the same method could be used for splitting a list.

In extending the split_<thing>, it seems prudent to start putting a ... into the drake_split arguments, so we can include some non-standard ways of defining split criteria.

Auto-splitting

As for splitting being something that happens natively in make(), I am opposed to this. drake is friendly, but definitely a package for power-users. I think that splitting automatically would remove both transparency and plan customization from the end-user.

For starters, we would have to pick a default number of splits (number of jobs? number of jobs that aren't being used on this step? nrow()/1e6? something else?). Whatever default we pick for make() would end up being a bad choice for a lot of people, either because it leaves each split too large for a user's machine (my issue), or because it over-parallelizes and introduces too much overhead.

Additionally, I think that splitting needs to be explicitly requested by the user. This is especially true in the case of grouped_dfs. Ungrouped data won't change order, but because I'm trying to split the work evenly, I end up moving some groups around, so that when all of the splits are reunited, they are almost certainly not in the same order. starwars by hair_color will yield groups 10, 7, 4 first (since they are the largest), with everything else coming after them. This is not unexpected behavior for anybody working with groups (I believe dplyr::do also takes liberties with group ordering), but I certainly don't want it to be automatic. I know that I work with lead() and lag() frequently enough that the idea of anything changing my row ordering without me knowing makes me a bit twitchy.
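The balancing strategy described above (largest groups assigned first, each to the currently lightest split) might be sketched like this; the function name and the group sizes below are hypothetical, not taken from the actual branch:

```r
# Hypothetical sketch of greedy group balancing: assign the largest
# groups first, always to the split with the fewest rows so far.
balance_groups <- function(group_sizes, splits) {
  assignment <- vector("list", splits)  # group names per split
  loads <- integer(splits)              # running row counts per split
  for (g in names(sort(group_sizes, decreasing = TRUE))) {
    lightest <- which.min(loads)
    assignment[[lightest]] <- c(assignment[[lightest]], g)
    loads[lightest] <- loads[lightest] + group_sizes[[g]]
  }
  assignment
}

# Illustrative group sizes (not the real starwars hair_color counts):
sizes <- c(none = 37, brown = 18, black = 13, white = 4, blond = 3)
balance_groups(sizes, splits = 2)
```

Note that this necessarily reorders groups relative to the original row order, which is exactly why the behavior should be opt-in rather than automatic.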

@wlandau-lilly
Collaborator

wlandau-lilly commented Sep 1, 2017

Good points about auto-splitting. You have convinced me that we should choose the number of splits (and whether to split at all) in advance, particularly to maintain predictability and control for end users. What to put in those splits can be decided later. And this makes me a little less worried about drake's ability to keep up with the times.

Also, I just saw drake_unsplit(), and I forgot to mention that gather() does the same thing. In fact, summaries() optionally calls gather(), with an option to turn it off.

@wlandau-lilly
Collaborator

Also, I had assumed we would build these splitting methods into drake itself. The way our discussion is going makes me lean toward replacing magrittr with dplyr in Imports:, etc.

@AlexAxthelm
Collaborator Author

Thanks for pointing out gather(); I had forgotten that it exists.

drake_unsplit() does in fact have the same function as gather(gather = "rbind")

)
}

drake_unsplit <- function(
Collaborator


As I previously mentioned, I think we could rely on gather_plan() here. Maybe that means removing drake_unsplit() entirely, I do not know.

expand = FALSE
)
}
evaluate(
Collaborator

@wlandau-lilly Nov 27, 2017


In preparation for #147, many functions have been deprecated and renamed due to potential name conflicts. For example, evaluate() is now evaluate_plan().

@wlandau
Member

wlandau commented Feb 6, 2018

@AlexAxthelm, since you gave permission, I am closing this PR in favor of #233. I'm glad GitHub retains these things because your ideas are great reference material.

@wlandau wlandau closed this Feb 6, 2018
@krlmlr krlmlr mentioned this pull request Mar 6, 2018
3 participants