DSL based on dplyr-like verbs? #233

Closed · krlmlr opened this issue Feb 4, 2018 · 44 comments

@krlmlr (Collaborator) commented Feb 4, 2018

  • Native support for verbs like tibble(), crossing(), mutate(), group_by(), summarize()
    • Goal: Get rid of plan_analyses() and plan_summaries()

Sketch for "basic" example

We might want to use our own verbs; for now, this sketch uses the current tidyverse nomenclature to communicate the semantics.

drake_plan(
  small = simulate(48),
  large = simulate(64),
  datasets = tibble(dataset = list(small, large)),
  regressions = tibble(reg = list(reg1, reg2)),
  analyses = crossing(datasets, regressions) %>%
    rowwise() %>% 
    mutate(result = list(reg(dataset))),
  summary_funs = tibble(fun = list(coefficients, residuals)),
  summaries = crossing(analyses, summary_funs) %>%
    rowwise() %>% 
    mutate(summary = fun(dataset, result)),
  winners = summaries %>%
    group_by(dataset, fun) %>%
    summarize(winner = min(summary))
)

@krlmlr (Collaborator, author) commented Feb 5, 2018

@AlexAxthelm: Just to clarify: We use these verbs only to describe the workflow; the actual execution plan will be a full expansion. But if we know that e.g. all analyses have been built, we don't need to look at the full expansion at all. We still need to figure out how to handle all this internally.

@AlexAxthelm (Collaborator) commented:

You may not have a clear answer for this, but as a design question: would each element (row) inside one of these tibble targets show up separately in outdated() or on the graph? Or would all of analyses need to be rebuilt if a single one fell out of date? Not trying to be difficult, just having trouble imagining how this will play out in execution.

@krlmlr (Collaborator, author) commented Feb 5, 2018

These questions are great!

We have an internal graph where each row in analyses corresponds to an internal target with its own up-to-date status. We never materialize the full internal graph, but to build analyses we of course look at all internal dependencies. We don't need to update the rows in analyses that are already up to date, and we don't materialize the entire analyses data frame either unless the user asks for it. For visualization, we can add a "partially outdated" state if only some rows in analyses are stale.

@AlexAxthelm (Collaborator) commented:

That sounds awesome. I'm fully in support of this over #77, then. A feature I would find useful as a user would be an accessible estimate of how outdated a target is, e.g. 7 of 9 elements need to be rebuilt. Other than that, I think all my concerns are met.

@wlandau (Member) commented Feb 5, 2018

I am having trouble understanding this discussion, possibly because I do not use dplyr very much. Previously, I thought we were using the DSL to expand out a whole workflow plan data frame and then feed it to make() as usual. How long exactly will the expansion be delayed?

@krlmlr (Collaborator, author) commented Feb 5, 2018

That's a point I haven't stressed enough. We use the DSL to avoid plan expansion for as long as we possibly can.

@wlandau (Member) commented Feb 5, 2018

That would certainly gain us efficiency, and it would solve the catch-22 from #77 (group_by() without having the data yet). It's also a hard and delicate problem because we will need to make changes to our internal igraph object (config$graph) while make() is happening. All the current parallel backends require everything to be planned in advance, so I think we will need #227 first. Maybe we could work on the DSL interface and #227 at the same time. Then when #227 is solved, we could remove most of the other backends and delay expansion for as long as possible.

@wlandau (Member) commented Feb 5, 2018

Thankfully, none of this should affect the cache, which is the most sensitive component of all when it comes to backward compatibility.

@AlexAxthelm (Collaborator) commented:

@krlmlr, to confirm: will the drake_plan sketch you've outlined above yield a plan that looks like what we currently see when we call load_basic_example(), or will it give something more along the lines of what we see if we call that now (in the current version of drake)? I'm worried that there is some weirdness going on with mutate(result = list(reg(dataset))), which is intuitive, but I don't know how to look for reg aside from searching for column names in dependencies, and then that assumes everything is a tibble. I guess I'm back to being a bit unclear on whether the new plans are aimed at creating targets that are tibbles, or creating plans that are tibbles?

This seems like it has the potential for a lot of power, but I'm concerned about imposing a dplyr point of view on an otherwise agnostic tool.

@wlandau (Member) commented Feb 5, 2018

For the sake of slow deprecation, I think we will get the chance to see how well the two interfaces coexist. Ideally, I would like drake to still be able to easily handle a data frame of text target names and text commands. I believe this format is very approachable for people new to R, and parsing is usually easy.

There is nothing incorrect about wildcard templating, but I think we could eventually offload it to the wildcard package and extend wildcard to handle tibbles of language objects. See #240 and wlandau/wildcard#7.
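
As a concrete reminder of what that wildcard templating looks like, here is a minimal sketch using drake's own evaluate_plan() (the exported wildcard interface at the time); the target and value names are illustrative:

library(drake)

# A text-based template plan with a wildcard placeholder in the command.
template <- drake_plan(analysis = reg_fun(dataset__))

# Expand the wildcard into one target per value.
evaluate_plan(
  template,
  wildcard = "dataset__",
  values = c("small", "large")
)
# Yields targets analysis_small and analysis_large, with the wildcard
# replaced by each dataset name.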

@krlmlr (Collaborator, author) commented Feb 5, 2018

I'd say everything's a tibble, but I'm not sure about details here.

I need to take a closer look at wildcard.

@wlandau (Member) commented Feb 6, 2018

There's not actually that much to look at; it's a simple idea I took out of remakeGenerator.

@wlandau (Member) commented Feb 6, 2018

I would be in favor of turning ordinary workflow data frames into tibbles as early as possible. It really is about time.

Speaking of tibbles, there was a discussion somewhere about fixed column width printing, but I can't seem to find it. Workflow plan commands can get long.

@krlmlr (Collaborator, author) commented Feb 6, 2018

Character columns take as much space as they can get in tibble, but embedded newlines are not a problem. Does that answer your question?
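
For anyone bumping into truncated commands in practice, here is a small illustration; the plan below is made up and relies only on standard tibble printing:

library(tibble)

plan <- tibble(
  target = "analysis",
  command = "fit_model(clean(read_csv('data.csv')), method = 'lm', na.rm = TRUE)"
)

plan                      # long character columns are truncated to the console width
print(plan, width = Inf)  # print the full command text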

@AlexAxthelm (Collaborator) commented:

I think I finally figured out how to express my concerns about all targets being tibbles: What happens when I want to build something which is normally a tibble?

drake_plan(
  foo = mtcars %>% as.tibble() %>% filter(am == 1)
)

will this try to bring all of the parallelism to bear on each row of the tibble? Is there a way to separate a tibble that is "just a tibble" from one that is "a drake_plan™ tibble"? What happens if I want to use NSE to build my plan as a tibble outside of drake_plan()?

@krlmlr (Collaborator, author) commented Feb 6, 2018

We might need to invent verbs like drake_mutate() and drake_rowwise() etc. to avoid this confusion. Or we enter "drake" mode with another verb.

@violetcereza (Contributor) commented Feb 14, 2018

I am all for this! Dynamic branching ("delayed expansion"?) is very exciting and I recently ran into a problem with my data work that required it.

I think an entirely different set of functions should be used, so that it's clear which operations act on a tibble within a target and which involve many drake targets. I suppose these functions would only be valid inside a target definition, like n() or everything().

plan_drake(
  small = simulate(48),
  large = simulate(64),
  
  analyses = expand_target(
    reg(dataset),
    # crossing() is implicit
    reg = list(reg1, reg2),
    dataset = list(small, large)
  ),
  summaries = expand_target(
    # I wasn't sure why you had fun(dataset, result) here
    # Not yet sure how to include that in this proposal
    fun(result),
    fun = list(coefficients, residuals),
    result = analyses
  ),
  # gather_targets evaluates to a tibble
  # with a column for every expansion term
  # used previously I guess? & value column
  winners = gather_targets(summaries) %>%
    group_by(dataset, fun) %>%
    summarize(winner = min(value))
)

Expanding on this idea of special functions within target definitions, I've been imagining a feature where you could specify file targets & file dependencies from target definitions instead of messing around with the single-quotes vs. double-quotes thing. Or even triggers. EDIT: krlmlr had the same idea in #232, oops.

plan_drake(
  imported_data = read_csv(target_filedep("data.csv")),
  report.md = target_file(knit(x)),
  always_build = target_trigger("always", fun(x))
)
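
For reference, the interface drake eventually settled on covers the same ground with file_in(), file_out(), knitr_in(), and per-target trigger() settings. A rough sketch of the equivalent plan, assuming drake >= 7.0 semantics (fun() and x are placeholders carried over from the example above):

library(drake)

drake_plan(
  imported_data = read_csv(file_in("data.csv")),                          # file dependency declared in the command
  report = knit(knitr_in("report.Rmd"), output = file_out("report.md")),  # knitr report with a file output
  always_build = target(
    fun(x),
    trigger = trigger(condition = TRUE)                                   # rebuild this target on every make()
  )
)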

@wlandau (Member) commented Jan 13, 2019

Update: I started a GitHub project for this.

I have not worked on the DSL at all this past year, but I do care a lot about it.

From my end, the size and complexity of drake's code base have been a major roadblock. Downsizing is one of the primary development goals (with its own GitHub project). Because of #648 and issues like it, I think we are already well past apogee.

In the initial stages of the DSL, because of the sheer scope of @krlmlr's idea, I would prefer to treat the API as totally separate from how targets are declared and built. It is relatively easy to play around with how drake_plan() parses user input, but delayed specification presents a very different and very formidable set of challenges on the implementation side.

As drake goes through its downsizing phase, maybe we could work on the API piece in an external package first. To start off, we could make it use the proposed dplyr syntax to create the fully-expanded plans that drake currently requires. Once it matures enough, we could either migrate it into core drake or leave it as its own third-party API extension.

@wlandau (Member) commented Jan 13, 2019

@krlmlr, @AlexAxthelm, @dapperjapper, and @rkrug:

For #233 (comment), I think we can use something that

  1. Avoids creating extra rows for datasets, analyses, and summary_funs.
  2. Does not require any new special API functions.
  3. Requires minimal to no change to the drake_plan() function itself.
  4. Captures all the information we need.
  5. drake already understands.
library(drake)
drake_plan(
  small = target(simulate(48), data = small),
  large = target(simulate(64), data = large),
  reg = target(
    reg_fun(data),
    do = crossing,
    by = list(reg_fun, data),
    reg_fun = list(reg1, reg2)
  ),
  summary = target(
    sum_fun(data, reg),
    do = crossing,
    by = list(sum_fun, reg),
    sum_fun = list(coefficients, residuals)
  ),
  winners = target(
    min(summary),
    do = summarize,
    by = list(data, sum_fun)
  )
)
#> # A tibble: 5 x 7
#>   target  command      data  do      by         reg_fun    sum_fun         
#>   <chr>   <chr>        <chr> <chr>   <chr>      <chr>      <chr>           
#> 1 small   simulate(48) small <NA>    <NA>       <NA>       <NA>            
#> 2 large   simulate(64) large <NA>    <NA>       <NA>       <NA>            
#> 3 reg     reg_fun(dat… <NA>  crossi… list(reg_… list(reg1… <NA>            
#> 4 summary sum_fun(dat… <NA>  crossi… list(sum_… <NA>       list(coefficien…
#> 5 winners min(summary) <NA>  summar… list(data… <NA>       <NA>

Created on 2019-01-13 by the reprex package (v0.2.1)

As long as I am thinking out loud:

  • drake_plan(dsl = TRUE) could do the expansion up front, and the result should be compatible with the current make().
  • make(dsl = TRUE) could activate lazy expansion / dynamic branching.

...though dsl might not be a good choice for an argument name.

@wlandau (Member) commented Jan 13, 2019

A big thanks to @krlmlr for the target() function, which opened up more doors than I realized at the time.

@wlandau (Member) commented Jan 14, 2019

I will keep iterating on this. We might find a sweet spot with the right combination of language and custom columns. An improvement:

library(drake)
drake_plan(
  small = simulate(48),
  large = simulate(64),
  reg = target(
    reg_fun(data),
    transform = cross(reg_fun = c(reg1, reg2), data = c(small, large))
  ),
  summary = target(
    sum_fun(data, reg),
    transform = cross(sum_fun = c(coefficients, residuals), reg)
  ),
  winners = target(
    min(summary),
    transform = summarize(data, sum_fun)
  )
)
#> # A tibble: 5 x 3
#>   target  command           transform                                      
#>   <chr>   <chr>             <chr>                                          
#> 1 small   simulate(48)      <NA>                                           
#> 2 large   simulate(64)      <NA>                                           
#> 3 reg     reg_fun(data)     cross(reg_fun = c(reg1, reg2), data = c(small,…
#> 4 summary sum_fun(data, re… cross(sum_fun = c(coefficients, residuals), re…
#> 5 winners min(summary)      summarize(data, sum_fun)

Created on 2019-01-14 by the reprex package (v0.2.1)

@wlandau (Member) commented Jan 15, 2019

See #674 for an experimental API inspired by the proposed DSL. The implementation is lightweight, and because it relies on a custom "transform" column in the plan, it does not interfere with any other functionality (internals or API).

@wlandau (Member) commented Jan 16, 2019

New capability: define custom groupings with a group field in target():

library(drake)
plan <- drake_plan(
  small = simulate(48),
  large = simulate(64),
  reg1 = target(
    reg_fun(data),
    transform = cross(data = c(small, large)),
    group = reg
  ),
  reg2 = target(
    reg_fun(data),
    transform = cross(data = c(small, large)),
    group = reg
  ),
  winners = target(
    min(reg),
    transform = summarize(data),
    a = 1
  )
)
  
plan
#> # A tibble: 8 x 3
#>   target        command                                                   a
#>   <chr>         <chr>                                                 <dbl>
#> 1 small         simulate(48)                                             NA
#> 2 large         simulate(64)                                             NA
#> 3 reg1_small    reg_fun(small)                                           NA
#> 4 reg1_large    reg_fun(large)                                           NA
#> 5 reg2_large    reg1_large_fun(large)                                    NA
#> 6 reg2_small    reg1_small_fun(small)                                    NA
#> 7 winners_large min(reg1_large = reg1_large, reg2_large = reg2_large)     1
#> 8 winners_small min(reg1_small = reg1_small, reg2_small = reg2_small)     1

drake_plan_source(plan)
#> drake_plan(
#>   small = simulate(48),
#>   large = simulate(64),
#>   reg1_small = reg_fun(small),
#>   reg1_large = reg_fun(large),
#>   reg2_large = reg1_large_fun(large),
#>   reg2_small = reg1_small_fun(small),
#>   winners_large = target(
#>     command = min(reg1_large = reg1_large, reg2_large = reg2_large),
#>     a = 1
#>   ),
#>   winners_small = target(
#>     command = min(reg1_small = reg1_small, reg2_small = reg2_small),
#>     a = 1
#>   )
#> )

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-01-16 by the reprex package (v0.2.1)

@bpbond (Contributor) commented Jan 17, 2019

@wlandau What version is this work targeted for?

@wlandau (Member) commented Jan 17, 2019

The very next release: 7.0.0

@lorenzwalthert (Contributor) commented:

My two cents on this (FYI): so for now you have transform and grouping? Is other functionality likely in the future? I am not sure whether it overloads the target argument. Also, I am not sure how these are verbs or domain-specific (just mentioning this because that's how I understood the initial idea). Just from looking at the syntax, the earlier approach of separating plan creation with a wildcard and then "folding it up" was simpler for me to digest.

@wlandau (Member) commented Jan 18, 2019

My two cents on this (FYI): so for now you have transform and grouping?

Yes. It is experimental (as the documentation now indicates) but behavior seems correct so far.

Is other functionality likely in the future?

Dynamic branching is high on the list for long-term. But for this API specifically, hopefully we will not need more features. I would prefer to keep it simple, and it already seems to cover the vast majority of the use cases for the map/reduce functions and wildcards. But I could be convinced otherwise.

I am not sure whether it overloads the target argument.

The target() function was designed to allow any custom column to be added to the plan. Behind the scenes, transformation and grouping add and manipulate custom columns, so I believe this falls within scope.

Also, I am not sure how these are verbs or domain-specific (just mentioning this because that's how I understood the initial idea).

Yes, I agree. That does not bother me so much. At the interface level specifically, this still solves the same problem as the DSL.

Just from looking at the syntax, the earlier approach of separating plan creation with a wildcard and then "folding it up" was simpler for me to digest.

I plan to keep the wildcard functions around for a long time.

My personal experience with this new interface is actually more positive. It takes effort and bookkeeping to wrangle all those subplans and wildcards. I find it much easier to use transformations and grouping in a single call to drake_plan(), and I strongly suspect that it will be a breath of fresh air to users.
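
To make that comparison concrete, here is the same two-dataset expansion written both ways; a sketch assuming the wildcard interface available at the time and the transform syntax proposed above:

library(drake)

# Wildcard style: write a template plan, then expand it in a second step.
template <- drake_plan(reg = reg_fun(dataset__))
evaluate_plan(template, wildcard = "dataset__", values = c("small", "large"))

# Transform style: declare the expansion inline, in a single drake_plan() call.
drake_plan(
  reg = target(reg_fun(data), transform = map(data = c(small, large)))
)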

@lorenzwalthert (Contributor) commented:

Thanks @wlandau for the detailed answer, that makes sense.

@wlandau (Member) commented Jan 18, 2019

You are welcome.

After talking with @krlmlr in person yesterday at RStudio conf, I have decided to think of this approach as the proper DSL. We can open a different issue for dynamic branching.

Major changes needed to consider this issue solved:

  1. I need to read Hadley's chapter on DSLs and refactor the internals of managing and parsing inputs. For example, there should be S3 methods to parse the transform and command objects, and the return values should be detailed lists of the information content.
  2. Use substitute() instead of text replacement for editing commands.
command <- quote(reg_fun(x, "y", x, k))
eval(call("substitute", command, list(x = "str", k = quote(sym))))
#> reg_fun("str", "y", "str", sym)

Created on 2019-01-18 by the reprex package (v0.2.1)

@wlandau (Member) commented Jan 19, 2019

Also, we should add a map() transformation because of #235 (comment).
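
As a sketch of what such a map() transform could look like, following the cross()/summarize() syntax above (the expanded target names are illustrative):

library(drake)

drake_plan(
  x = target(f(i), transform = map(i = c(1, 2, 3)))
)
# Expands to one target per value of i, e.g. x_1 = f(1), x_2 = f(2), x_3 = f(3).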

@wlandau (Member) commented Jan 20, 2019

By the way, tidy evaluation works in the DSL. You can generate super large plans this way. A taste:

sms <- rlang::syms(letters)
drake::drake_plan(x = target(f(char), transform = map(char = !!sms)))
#> # A tibble: 26 x 2
#>    target command
#>    <chr>  <chr>  
#>  1 x_a    f(a)   
#>  2 x_b    f(b)   
#>  3 x_c    f(c)   
#>  4 x_d    f(d)   
#>  5 x_e    f(e)   
#>  6 x_f    f(f)   
#>  7 x_g    f(g)   
#>  8 x_h    f(h)   
#>  9 x_i    f(i)   
#> 10 x_j    f(j)   
#> # … with 16 more rows

Created on 2019-01-20 by the reprex package (v0.2.1)

@lorenzwalthert (Contributor) commented:

Cool. Should functions like map() be prefixed with drake_ in order to avoid namespace conflicts with purrr?

@wlandau (Member) commented Jan 20, 2019

Fortunately, we avoid namespace conflicts entirely because map() lives in the transform argument, not in the command. The DSL code is analyzed statically and not executed in the usual sense.
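
A quick sketch of that distinction, assuming the transform syntax above: purrr::map() inside the command runs when the target builds, while map() inside transform only shapes the plan and is never executed.

library(drake)

drake_plan(
  sims = target(
    purrr::map(seq_len(n), rnorm),  # ordinary purrr::map(), evaluated at build time
    transform = map(n = c(2, 3))    # drake's map(), parsed statically to expand the plan
  )
)
# Expands to sims_2 and sims_3, whose commands call purrr::map(seq_len(2), rnorm)
# and purrr::map(seq_len(3), rnorm), so the two map()s never collide.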

@lorenzwalthert (Contributor) commented:

OK, I did not know that. I think the average user may not know it either. If it is not an exported function, I guess there is also no exported documentation, i.e. ?map won't be helpful? I think it's similar to vars() in dplyr: you also use them exclusively within a dplyr call. However, vars() is exported and documented.

@wlandau (Member) commented Jan 20, 2019

vars() is a new one for me. Looks like it sets a good precedent.

In the specific case of drake, I am resistant to exporting map(), cross(), and reduce(). These symbols are not actually defined in the code base, and drake's API is already enormous.

Hopefully we can make the documentation friendly and thorough. I just pushed an update to the ?drake_plan help file and added a new section in the manual.
