Pass parameters to evaluate_plan through a grid, rather than a series of vectors #235
Comments
I see the general picture of what you're saying, and I'm trying to wrap my head around how we would solve it. It sounds like you want one wildcard for the expansion and the others to go along for the ride. How close is this to what you're after?
library(magrittr)
drake_plan(
credits = check_credit_hours("school_", "funding_"),
students = check_students("school_", "funding_"),
grads = check_graduations("school_", "funding_"),
public_funds = check_public_funding("school_", "funding_"),
strings_in_dots = "literals"
) %>% evaluate_plan(
wildcard = "school_",
values = c("schoolA", "schoolB", "schoolC"),
expand = TRUE
) %>%
evaluate_plan(
wildcard = "funding_",
values = c("public", "public", "private"),
expand = FALSE
)
#> target command
#> 1 credits_schoolA check_credit_hours("schoolA", "public")
#> 2 credits_schoolB check_credit_hours("schoolB", "public")
#> 3 credits_schoolC check_credit_hours("schoolC", "private")
#> 4 students_schoolA check_students("schoolA", "public")
#> 5 students_schoolB check_students("schoolB", "public")
#> 6 students_schoolC check_students("schoolC", "private")
#> 7 grads_schoolA check_graduations("schoolA", "public")
#> 8 grads_schoolB check_graduations("schoolB", "public")
#> 9 grads_schoolC check_graduations("schoolC", "private")
#> 10 public_funds_schoolA check_public_funding("schoolA", "public")
#> 11 public_funds_schoolB check_public_funding("schoolB", "public")
#> 12 public_funds_schoolC check_public_funding("schoolC", "private") |
This is perfect. This works well for a simple one-to-one matchup between targets, like above, and more complicated many-to-one matchups can be resolved using just the same pair of
rules_grid <- tibble(
school_ = c("schoolA", "schoolB", "schoolC"),
funding_ = c("public", "public", "private"),
) %>%
crossing(cohort_ = c("2012", "2013", "2014", "2015")) %>%
filter(!(school_ == "schoolB" & cohort_ %in% c("2012", "2013"))) %>%
print()
drake_plan(
credits = check_credit_hours("school_", "funding_", "cohort_"),
students = check_students("school_", "funding_", "cohort_"),
grads = check_graduations("school_", "funding_", "cohort_"),
public_funds = check_public_funding("school_", "funding_", "cohort_"),
strings_in_dots = "literals"
) %>% evaluate_plan(
wildcard = "school_",
values = rules_grid$school_,
expand = TRUE
) %>%
evaluate_plan(
wildcard = "funding_",
rules = rules_grid,
expand = FALSE
)
In the example above, I have Thanks! 👍 |
Do we have a "usage patterns" vignette or section where we could document this? |
I think the best practices vignette is the right place. Reopening because it's now a documentation issue. |
Thanks again @AlexAxthelm! Your example is great, and I have appended a section in the best practices vignette. |
Unfortunately, the plan generated here and documented in best practices is not a valid drake plan, as it contains duplicate target names. I took a stab at a version with unique names (appended year), but I'm not happy with the solution:
# Possible solution: #235 Modified to generate unique targets as required.
rules_grid <- tibble(
# The schools and their funding types.
# Note that this solution does not handle the case of a school switching type!
school_ = c("schoolA", "schoolB", "schoolC"),
funding_ = c("public", "public", "private"),
) %>%
# Generate the full cross product of (school,funding)x(years)
crossing(cohort_ = c("2012", "2013", "2014", "2015")) %>%
# Remove the two years school B didn't exist.
filter(!(school_ == "schoolB" & cohort_ %in% c("2012", "2013"))) %>%
# Confirm the correct plan template.
print()
plan <- drake_plan(
# Start with the four types of checks to perform
credits = check_credit_hours("school_", "funding_", "cohort_"),
students = check_students("school_", "funding_", "cohort_"),
grads = check_graduations("school_", "funding_", "cohort_"),
public_funds = check_public_funding("school_", "funding_", "cohort_"),
strings_in_dots = "literals"
) %>% expand_plan(
# Use a forced expansion with a target suffix defined by school_year.
# I don't really like this solution but I couldn't think of a better one :-(
# Note that this duplicates each target 10 times for a total of 40.
# However, no parameter substitution is done, that is fixed in the next step.
values = paste(rules_grid$school_, rules_grid$cohort_, sep = "_")
) %>% evaluate_plan(
# Finally, substitute the correct parameter values into the commands.
# Note that since each target is duplicated 10 times, they each get a full
# complement of parameter values which are used repeatedly a total of 4 times.
rules = rules_grid,
expand = FALSE
)
print(plan, n = 40)
# Confirm dependencies and parameter mappings.
config <- drake_config(plan)
vis_drake_graph(config) |
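The duplicate-name problem raised in this comment can be caught before a plan is ever built. Below is a minimal base-R sketch; `assert_unique_targets()` is a hypothetical helper written for illustration, not part of drake, and it assumes only that a plan is a data frame with a `target` column.

```R
# Hypothetical guard (not a drake function): stop early if a plan
# reuses a target name.
assert_unique_targets <- function(plan) {
  dups <- unique(plan$target[duplicated(plan$target)])
  if (length(dups) > 0) {
    stop("Duplicate targets: ", paste(dups, collapse = ", "), call. = FALSE)
  }
  invisible(plan)
}

school <- c("schoolA", "schoolA", "schoolB")
cohort <- c("2012", "2013", "2014")

# Suffixing by school alone repeats a name ...
bad <- data.frame(
  target = paste("credits", school, sep = "_"),
  stringsAsFactors = FALSE
)
# ... while suffixing by school and cohort does not.
good <- data.frame(
  target = paste("credits", school, cohort, sep = "_"),
  stringsAsFactors = FALSE
)

# Capture the error message instead of aborting the script.
bad_result <- tryCatch(assert_unique_targets(bad), error = conditionMessage)
print(bad_result)
assert_unique_targets(good)
```

Calling a guard like this right after generating the plan would surface the problem at plan-construction time rather than later in the workflow.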
Here is an updated solution that avoids applying public-only functions to private schools and allows schools to switch from public to private at any time.
# Possible solution: #235 Modified to generate unique targets as required.
# Version two: deal with avoiding public checks on private schools.
# Note that this solution can now handle the case of a school switching type.
rules_grid <- tibble(
# The schools and their funding types.
school_ = c("schoolA", "schoolB", "schoolC"),
funding_ = c("public", "public", "private"),
) %>%
# Generate the full cross product of (school,funding)x(years)
crossing(cohort_ = c("2012", "2013", "2014", "2015")) %>%
# Remove the two years school B didn't exist.
filter(!(school_ == "schoolB" & cohort_ %in% c("2012", "2013")))
# Make schoolC switch funding each year
rules_grid$funding_[rules_grid$school_ == "schoolC"] <-
c("public", "private", "public", "private")
# Confirm the correct plan template.
print(rules_grid)
plan_both <- drake_plan(
# Start with the three universal types of checks to perform (public or private)
credits = check_credit_hours("school_", "funding_", "cohort_"),
students = check_students("school_", "funding_", "cohort_"),
grads = check_graduations("school_", "funding_", "cohort_"),
# Leave this for later.
#public_funds = check_public_funding("school_", "funding_", "cohort_"),
strings_in_dots = "literals"
) %>% expand_plan(
# Use a forced expansion with a target suffix defined by school_year.
# I don't really like this solution but I couldn't think of a better one :-(
# Note that this duplicates each target 10 times for a total of 30.
# However, no parameter substitution is done, that is fixed in the next step.
values = paste(rules_grid$school_, rules_grid$cohort_, sep = "_")
) %>% evaluate_plan(
# Finally, substitute the correct parameter values into the commands.
# Note that since each target is duplicated 10 times, they each get a full
# complement of parameter values which are used repeatedly a total of 3 times.
rules = rules_grid,
expand = FALSE
)
print(plan_both, n = 30)
# Next get the rules for just the public schools. Note that a school could change
# from public to private or vice versa in any year and this still works.
public_rules_grid <- rules_grid %>% filter(funding_ == "public")
print(public_rules_grid)
# Build the public only plans
plan_public <- drake_plan(
# Include the public only checks that shouldn't be run on private schools.
public_funds = check_public_funding("school_", "funding_", "cohort_"),
strings_in_dots = "literals"
) %>% expand_plan(
# Use a forced expansion with a target suffix defined by school_year.
# I don't really like this solution but I couldn't think of a better one :-(
# Note that this duplicates each target 8 times for a total of 8.
# However, no parameter substitution is done, that is fixed in the next step.
values = paste(public_rules_grid$school_, public_rules_grid$cohort_, sep = "_")
) %>% evaluate_plan(
# Finally, substitute the correct parameter values into the commands.
# Note that since each target is duplicated 8 times, they each get a full
# complement of parameter values, each of which is used exactly once.
rules = public_rules_grid,
expand = FALSE
)
print(plan_public, n = 8)
# Combine the universal and public-only plans.
plan <- bind_plans(plan_both, plan_public)
# Note that no check_public_funding is ever performed on schoolC in odd years.
# Confirm dependencies and parameter mappings.
config <- drake_config(plan)
vis_drake_graph(config) |
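As a sanity check on the plan sizes in "Version two", the expected counts can be worked out in base R without calling drake. This is only a sketch: the grid is rebuilt by hand from the values in the comment above (with schoolC alternating funding each year), and it assumes one target per check per grid row.

```R
# Rules grid from "Version two", rebuilt by hand in base R.
rules_grid <- data.frame(
  school_  = rep(c("schoolA", "schoolB", "schoolC"), times = c(4, 2, 4)),
  funding_ = c(rep("public", 6), "public", "private", "public", "private"),
  cohort_  = c("2012", "2013", "2014", "2015", "2014", "2015",
               "2012", "2013", "2014", "2015"),
  stringsAsFactors = FALSE
)

# Only publicly funded school-years get the public funding check.
public_rules_grid <- rules_grid[rules_grid$funding_ == "public", ]

n_universal <- 3 * nrow(rules_grid)         # credits, students, grads
n_public    <- 1 * nrow(public_rules_grid)  # public_funds only
n_total     <- n_universal + n_public

print(c(universal = n_universal, public = n_public, total = n_total))
```

Under those assumptions the public-only grid has 8 rows, which agrees with the `print(plan_public, n = 8)` call in the comment.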
@jw5, glad you're helping us with slick ways to generate plans.
Are you talking about the plan at the end of this section? Because there, I think we're fine. Here's a reprex.
library(drake)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.5
#> ✔ tidyr 0.8.1 ✔ stringr 1.3.1
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ── Conflicts ──────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ tidyr::expand() masks drake::expand()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ tidyr::gather() masks drake::gather()
#> ✖ dplyr::lag() masks stats::lag()
# Generate the plan from the end of
# https://ropensci.github.io/drake/articles/best-practices.html#generating-workflow-plan-data-frames
rules_grid <- tibble::tibble(school_ = c("schoolA", "schoolB", "schoolC"), funding_ = c("public",
"public", "private"), ) %>% tidyr::crossing(cohort_ = c("2012", "2013",
"2014", "2015")) %>% dplyr::filter(!(school_ == "schoolB" & cohort_ %in%
c("2012", "2013"))) %>% print()
#> # A tibble: 10 x 3
#> school_ funding_ cohort_
#> <chr> <chr> <chr>
#> 1 schoolA public 2012
#> 2 schoolA public 2013
#> 3 schoolA public 2014
#> 4 schoolA public 2015
#> 5 schoolB public 2014
#> 6 schoolB public 2015
#> 7 schoolC private 2012
#> 8 schoolC private 2013
#> 9 schoolC private 2014
#> 10 schoolC private 2015
plan <- drake_plan(credits = check_credit_hours("school_", "funding_", "cohort_"),
students = check_students("school_", "funding_", "cohort_"), grads = check_graduations("school_",
"funding_", "cohort_"), public_funds = check_public_funding("school_",
"funding_", "cohort_"), strings_in_dots = "literals") %>% evaluate_plan(wildcard = "school_",
values = rules_grid$school_, expand = TRUE) %>% evaluate_plan(wildcard = "funding_",
rules = rules_grid, expand = FALSE)
# Do we have duplicate targets?
any(duplicated(plan$target))
#> [1] FALSE |
Well, when I cut and paste your reprex into a fresh RStudio session and
source it I get:
```R
# Do we have duplicate targets?
any(duplicated(plan$target))
[1] TRUE
drake_session()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux buster/sid
Matrix products: default
BLAS: /opt/R/3.4.4/lib/R/lib/libRblas.so
LAPACK: /opt/R/3.4.4/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 ggplot2_2.2.1 knitr_1.20 Ecdat_0.3-1 Ecfun_0.1-7
[6] drake_5.1.2
loaded via a namespace (and not attached):
[1] storr_1.1.3 tidyselect_0.2.4 purrr_0.2.4 listenv_0.7.0
[5] splines_3.4.4 lattice_0.20-35 colorspace_1.3-2 testthat_2.0.0
[9] htmltools_0.3.6 yaml_2.1.19 XML_3.98-1.11 rlang_0.2.0
[13] R.oo_1.22.0 pillar_1.2.2 glue_1.2.0 withr_2.1.2
[17] R.utils_2.6.0 CodeDepends_0.5-3 jpeg_0.1-8 bindr_0.1.1
[21] plyr_1.8.4 stringr_1.3.0 munsell_0.4.3 gtable_0.2.0
[25] R.methodsS3_1.7.1 visNetwork_2.0.3 future_1.8.1 htmlwidgets_1.2
[29] codetools_0.2-15 evaluate_0.10.1 parallel_3.4.4 Rcpp_0.12.16
[33] scales_0.5.0 backports_1.1.2 formatR_1.5 jsonlite_1.5
[37] TeachingDemos_2.10 digest_0.6.15 stringi_1.2.2 dplyr_0.7.4
[41] rprojroot_1.3-2 grid_3.4.4 tools_3.4.4 magrittr_1.5
[45] lazyeval_0.2.1 tibble_1.4.2 crayon_1.3.4 future.apply_0.2.0
[49] pkgconfig_2.0.1 MASS_7.3-50 fda_2.4.7 Matrix_1.2-14
[53] lubridate_1.7.4 assertthat_0.2.0 rstudioapi_0.7 R6_2.2.2
[57] globals_0.11.0 igraph_1.2.1 compiler_3.4.4
```
=====================================================================
I then cleaned up the formatting and added some print statements and nuked
the cache at the start.
Nuking the cache apparently broke drake_session().
=====================================================================
```R
library(drake)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.5
#> ✔ tidyr 0.8.1 ✔ stringr 1.3.1
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ── Conflicts ──────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ tidyr::expand() masks drake::expand()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ tidyr::gather() masks drake::gather()
#> ✖ dplyr::lag() masks stats::lag()
# Added to make it more reproducible:
clean(destroy = TRUE)
# Generate the plan from the end of
# https://ropensci.github.io/drake/articles/best-practices.html#generating-workflow-plan-data-frames
rules_grid <- tibble::tibble(
  school_ = c("schoolA", "schoolB", "schoolC"),
  funding_ = c("public", "public", "private"), ) %>%
  tidyr::crossing(cohort_ = c("2012", "2013", "2014", "2015")) %>%
  dplyr::filter(!(school_ == "schoolB" & cohort_ %in% c("2012", "2013"))) %>%
  print()
#> # A tibble: 10 x 3
#> school_ funding_ cohort_
#> <chr> <chr> <chr>
#> 1 schoolA public 2012
#> 2 schoolA public 2013
#> 3 schoolA public 2014
#> 4 schoolA public 2015
#> 5 schoolB public 2014
#> 6 schoolB public 2015
#> 7 schoolC private 2012
#> 8 schoolC private 2013
#> 9 schoolC private 2014
#> 10 schoolC private 2015
plan <- drake_plan(
credits = check_credit_hours( "school_", "funding_", "cohort_"),
students = check_students( "school_", "funding_", "cohort_"),
grads = check_graduations( "school_", "funding_", "cohort_"),
public_funds = check_public_funding("school_", "funding_", "cohort_"),
strings_in_dots = "literals") %>%
  evaluate_plan(wildcard = "school_", values = rules_grid$school_, expand = TRUE) %>%
  # Note that technically there shouldn't be a "wildcard" as it's overridden by rules.
  evaluate_plan(wildcard = "funding_", rules = rules_grid, expand = FALSE)
# Do we have duplicate targets?
print(plan, n = 100)
print(any(duplicated(plan$target)))
#> [1] FALSE
drake_session()
```
=====================================================================
sourcing this on a restarted R session yields:
=====================================================================
```R
Restarting R session...
source('~/Analyses/New/Drake/test2/reprex.R')
── Attaching packages ────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1 ✔ purrr 0.2.4
✔ tibble 1.4.2 ✔ dplyr 0.7.4
✔ tidyr 0.8.0 ✔ stringr 1.3.0
✔ readr 1.1.1 ✔ forcats 0.3.0
── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::contains() masks drake::contains()
✖ dplyr::ends_with() masks drake::ends_with()
✖ dplyr::everything() masks drake::everything()
✖ tidyr::expand() masks drake::expand()
✖ dplyr::filter() masks stats::filter()
✖ tidyr::gather() masks drake::gather()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::matches() masks drake::matches()
✖ dplyr::num_range() masks drake::num_range()
✖ dplyr::one_of() masks drake::one_of()
✖ dplyr::starts_with() masks drake::starts_with()
# A tibble: 10 x 3
school_ funding_ cohort_
<chr> <chr> <chr>
1 schoolA public 2012
2 schoolA public 2013
3 schoolA public 2014
4 schoolA public 2015
5 schoolB public 2014
6 schoolB public 2015
7 schoolC private 2012
8 schoolC private 2013
9 schoolC private 2014
10 schoolC private 2015
# A tibble: 40 x 2
target command
<chr> <chr>
1 credits_schoolA "check_credit_hours(\"schoolA\", \"public\", \"2012\")"
2 credits_schoolA "check_credit_hours(\"schoolA\", \"public\", \"2013\")"
3 credits_schoolA "check_credit_hours(\"schoolA\", \"public\", \"2014\")"
4 credits_schoolA "check_credit_hours(\"schoolA\", \"public\", \"2015\")"
5 credits_schoolB "check_credit_hours(\"schoolB\", \"public\", \"2014\")"
6 credits_schoolB "check_credit_hours(\"schoolB\", \"public\", \"2015\")"
7 credits_schoolC "check_credit_hours(\"schoolC\", \"private\", \"2012\")"
8 credits_schoolC "check_credit_hours(\"schoolC\", \"private\", \"2013\")"
9 credits_schoolC "check_credit_hours(\"schoolC\", \"private\", \"2014\")"
10 credits_schoolC "check_credit_hours(\"schoolC\", \"private\", \"2015\")"
11 students_schoolA "check_students(\"schoolA\", \"public\", \"2012\")"
12 students_schoolA "check_students(\"schoolA\", \"public\", \"2013\")"
13 students_schoolA "check_students(\"schoolA\", \"public\", \"2014\")"
14 students_schoolA "check_students(\"schoolA\", \"public\", \"2015\")"
15 students_schoolB "check_students(\"schoolB\", \"public\", \"2014\")"
16 students_schoolB "check_students(\"schoolB\", \"public\", \"2015\")"
17 students_schoolC "check_students(\"schoolC\", \"private\", \"2012\")"
18 students_schoolC "check_students(\"schoolC\", \"private\", \"2013\")"
19 students_schoolC "check_students(\"schoolC\", \"private\", \"2014\")"
20 students_schoolC "check_students(\"schoolC\", \"private\", \"2015\")"
21 grads_schoolA "check_graduations(\"schoolA\", \"public\", \"2012\")"
22 grads_schoolA "check_graduations(\"schoolA\", \"public\", \"2013\")"
23 grads_schoolA "check_graduations(\"schoolA\", \"public\", \"2014\")"
24 grads_schoolA "check_graduations(\"schoolA\", \"public\", \"2015\")"
25 grads_schoolB "check_graduations(\"schoolB\", \"public\", \"2014\")"
26 grads_schoolB "check_graduations(\"schoolB\", \"public\", \"2015\")"
27 grads_schoolC "check_graduations(\"schoolC\", \"private\", \"2012\")"
28 grads_schoolC "check_graduations(\"schoolC\", \"private\", \"2013\")"
29 grads_schoolC "check_graduations(\"schoolC\", \"private\", \"2014\")"
30 grads_schoolC "check_graduations(\"schoolC\", \"private\", \"2015\")"
31 public_funds_schoolA "check_public_funding(\"schoolA\", \"public\", \"2012\")"
32 public_funds_schoolA "check_public_funding(\"schoolA\", \"public\", \"2013\")"
33 public_funds_schoolA "check_public_funding(\"schoolA\", \"public\", \"2014\")"
34 public_funds_schoolA "check_public_funding(\"schoolA\", \"public\", \"2015\")"
35 public_funds_schoolB "check_public_funding(\"schoolB\", \"public\", \"2014\")"
36 public_funds_schoolB "check_public_funding(\"schoolB\", \"public\", \"2015\")"
37 public_funds_schoolC "check_public_funding(\"schoolC\", \"private\", \"2012\")"
38 public_funds_schoolC "check_public_funding(\"schoolC\", \"private\", \"2013\")"
39 public_funds_schoolC "check_public_funding(\"schoolC\", \"private\", \"2014\")"
40 public_funds_schoolC "check_public_funding(\"schoolC\", \"private\", \"2015\")"
[1] TRUE
Error in drake_session() : No drake::make() session detected.
```
I guess I'll do a devtools github install and see if that fixes things.
Darn, I really thought I had figured out the way to reason about the
various plan evaluations.
|
Well, good news and bad news. I did a github install of drake and the
results changed. This is sourcing the same file supplied in the last
message.
```R
Restarting R session...
> library(devtools)
> install_github("ropensci/drake")
Skipping install of 'drake' from a github remote, the SHA1 (aefa7a5) has not changed since last install.
Use `force = TRUE` to force installation
> source('~/Analyses/New/Drake/test2/reprex.R')
Attaching package: ‘drake’
The following object is masked from ‘package:devtools’:
check
── Attaching packages ────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1 ✔ purrr 0.2.4
✔ tibble 1.4.2 ✔ dplyr 0.7.5
✔ tidyr 0.8.0 ✔ stringr 1.3.0
✔ readr 1.1.1 ✔ forcats 0.3.0
── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::expand() masks drake::expand()
✖ dplyr::filter() masks stats::filter()
✖ tidyr::gather() masks drake::gather()
✖ dplyr::lag() masks stats::lag()
# A tibble: 10 x 3
school_ funding_ cohort_
<chr> <chr> <chr>
1 schoolA public 2012
2 schoolA public 2013
3 schoolA public 2014
4 schoolA public 2015
5 schoolB public 2014
6 schoolB public 2015
7 schoolC private 2012
8 schoolC private 2013
9 schoolC private 2014
10 schoolC private 2015
# A tibble: 12 x 2
target command
<chr> <chr>
1 credits_schoolA "check_credit_hours(\"schoolA\", \"public\", \"2012\")"
2 credits_schoolB "check_credit_hours(\"schoolB\", \"public\", \"2013\")"
3 credits_schoolC "check_credit_hours(\"schoolC\", \"public\", \"2014\")"
4 students_schoolA "check_students(\"schoolA\", \"public\", \"2015\")"
5 students_schoolB "check_students(\"schoolB\", \"public\", \"2014\")"
6 students_schoolC "check_students(\"schoolC\", \"public\", \"2015\")"
7 grads_schoolA "check_graduations(\"schoolA\", \"private\", \"2012\")"
8 grads_schoolB "check_graduations(\"schoolB\", \"private\", \"2013\")"
9 grads_schoolC "check_graduations(\"schoolC\", \"private\", \"2014\")"
10 public_funds_schoolA "check_public_funding(\"schoolA\", \"private\", \"2015\")"
11 public_funds_schoolB "check_public_funding(\"schoolB\", \"public\", \"2012\")"
12 public_funds_schoolC "check_public_funding(\"schoolC\", \"public\", \"2013\")"
[1] FALSE
Error in drake_session() : No drake::make() session detected.
```
So, yes, there are no dups now. However, the plan is not what was intended,
as you are "randomly" selecting between checks and years and funding based
on "wrapping" the years and funding rather than checking each year for each
school. Note how public/private is no longer fixed to the school as defined
in the tibble.
It really would be nice if drake_session() produced only a warning, rather
than failing, when no make() session is detected.
Jim
|
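The "wrapping" described in this comment can be reproduced without drake. Below is a minimal base-R sketch that assumes `expand = FALSE` simply recycles each column of the rules grid down the rows of the plan; that assumption is inferred from the 12-row output above, not taken from drake's source. The grid and target names are rebuilt by hand from the printed values.

```R
# The 10-row rules grid from the thread, rebuilt in base R.
rules_grid <- data.frame(
  school_  = rep(c("schoolA", "schoolB", "schoolC"), times = c(4, 2, 4)),
  funding_ = rep(c("public", "public", "private"), times = c(4, 2, 4)),
  cohort_  = c("2012", "2013", "2014", "2015", "2014", "2015",
               "2012", "2013", "2014", "2015"),
  stringsAsFactors = FALSE
)

# The 12 targets left after expanding over the three unique schools.
checks  <- c("credits", "students", "grads", "public_funds")
schools <- c("schoolA", "schoolB", "schoolC")
targets <- paste(rep(checks, each = 3), rep(schools, times = 4), sep = "_")

# Assumed behaviour: the 10 grid rows are recycled over the 12 targets,
# so targets 11 and 12 wrap around to grid rows 1 and 2.
recycled <- data.frame(
  target   = targets,
  funding_ = rep_len(rules_grid$funding_, length(targets)),
  cohort_  = rep_len(rules_grid$cohort_, length(targets)),
  stringsAsFactors = FALSE
)
print(recycled)
```

Under that assumption, row 7 pairs `grads_schoolA` with `"private"` and `"2012"`, which is the mismatch between school and funding type pointed out above.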
Sorry, responded by email and now no way to fix up the formatting. Bottom line, your best practices solution does pass the no-dups test, but yields:
# A tibble: 12 x 2
target command
<chr> <chr>
1 credits_schoolA "check_credit_hours(\"schoolA\", \"public\", \"2012\")"
2 credits_schoolB "check_credit_hours(\"schoolB\", \"public\", \"2013\")"
3 credits_schoolC "check_credit_hours(\"schoolC\", \"public\", \"2014\")"
4 students_schoolA "check_students(\"schoolA\", \"public\", \"2015\")"
5 students_schoolB "check_students(\"schoolB\", \"public\", \"2014\")"
6 students_schoolC "check_students(\"schoolC\", \"public\", \"2015\")"
7 grads_schoolA "check_graduations(\"schoolA\", \"private\", \"2012\")"
8 grads_schoolB "check_graduations(\"schoolB\", \"private\", \"2013\")"
9 grads_schoolC "check_graduations(\"schoolC\", \"private\", \"2014\")"
10 public_funds_schoolA "check_public_funding(\"schoolA\", \"private\", \"2015\")"
11 public_funds_schoolB "check_public_funding(\"schoolB\", \"public\", \"2012\")"
12 public_funds_schoolC "check_public_funding(\"schoolC\", \"public\", \"2013\")"
While I believe it should yield (from my original solution proposal):
# A tibble: 40 x 2
target command
<chr> <chr>
1 credits_schoolA_2012 "check_credit_hours(\"schoolA\", \"public\", \"2012\")"
2 credits_schoolA_2013 "check_credit_hours(\"schoolA\", \"public\", \"2013\")"
3 credits_schoolA_2014 "check_credit_hours(\"schoolA\", \"public\", \"2014\")"
4 credits_schoolA_2015 "check_credit_hours(\"schoolA\", \"public\", \"2015\")"
5 credits_schoolB_2014 "check_credit_hours(\"schoolB\", \"public\", \"2014\")"
6 credits_schoolB_2015 "check_credit_hours(\"schoolB\", \"public\", \"2015\")"
7 credits_schoolC_2012 "check_credit_hours(\"schoolC\", \"private\", \"2012\")"
8 credits_schoolC_2013 "check_credit_hours(\"schoolC\", \"private\", \"2013\")"
9 credits_schoolC_2014 "check_credit_hours(\"schoolC\", \"private\", \"2014\")"
10 credits_schoolC_2015 "check_credit_hours(\"schoolC\", \"private\", \"2015\")"
11 students_schoolA_2012 "check_students(\"schoolA\", \"public\", \"2012\")"
12 students_schoolA_2013 "check_students(\"schoolA\", \"public\", \"2013\")"
13 students_schoolA_2014 "check_students(\"schoolA\", \"public\", \"2014\")"
14 students_schoolA_2015 "check_students(\"schoolA\", \"public\", \"2015\")"
15 students_schoolB_2014 "check_students(\"schoolB\", \"public\", \"2014\")"
16 students_schoolB_2015 "check_students(\"schoolB\", \"public\", \"2015\")"
17 students_schoolC_2012 "check_students(\"schoolC\", \"private\", \"2012\")"
18 students_schoolC_2013 "check_students(\"schoolC\", \"private\", \"2013\")"
19 students_schoolC_2014 "check_students(\"schoolC\", \"private\", \"2014\")"
20 students_schoolC_2015 "check_students(\"schoolC\", \"private\", \"2015\")"
21 grads_schoolA_2012 "check_graduations(\"schoolA\", \"public\", \"2012\")"
22 grads_schoolA_2013 "check_graduations(\"schoolA\", \"public\", \"2013\")"
23 grads_schoolA_2014 "check_graduations(\"schoolA\", \"public\", \"2014\")"
24 grads_schoolA_2015 "check_graduations(\"schoolA\", \"public\", \"2015\")"
25 grads_schoolB_2014 "check_graduations(\"schoolB\", \"public\", \"2014\")"
26 grads_schoolB_2015 "check_graduations(\"schoolB\", \"public\", \"2015\")"
27 grads_schoolC_2012 "check_graduations(\"schoolC\", \"private\", \"2012\")"
28 grads_schoolC_2013 "check_graduations(\"schoolC\", \"private\", \"2013\")"
29 grads_schoolC_2014 "check_graduations(\"schoolC\", \"private\", \"2014\")"
30 grads_schoolC_2015 "check_graduations(\"schoolC\", \"private\", \"2015\")"
31 public_funds_schoolA_2012 "check_public_funding(\"schoolA\", \"public\", \"2012\")"
32 public_funds_schoolA_2013 "check_public_funding(\"schoolA\", \"public\", \"2013\")"
33 public_funds_schoolA_2014 "check_public_funding(\"schoolA\", \"public\", \"2014\")"
34 public_funds_schoolA_2015 "check_public_funding(\"schoolA\", \"public\", \"2015\")"
35 public_funds_schoolB_2014 "check_public_funding(\"schoolB\", \"public\", \"2014\")"
36 public_funds_schoolB_2015 "check_public_funding(\"schoolB\", \"public\", \"2015\")"
37 public_funds_schoolC_2012 "check_public_funding(\"schoolC\", \"private\", \"2012\"…
38 public_funds_schoolC_2013 "check_public_funding(\"schoolC\", \"private\", \"2013\"…
39 public_funds_schoolC_2014 "check_public_funding(\"schoolC\", \"private\", \"2014\"…
40 public_funds_schoolC_2015 "check_public_funding(\"schoolC\", \"private\", \"2015\"… |
In that particular example, since school C does not receive public funding, we should not actually be calling
By the way, I'm wrong for a different reason: the resulting data frame should be 10 rows, not 12. In 4e4cb98, which I will push soon, I patched the issue and updated the best practices vignette. The documentation website should update next time I rebuild it.
> plan <- drake_plan(
+ credits = check_credit_hours("school_", "funding_", "cohort_"),
+ students = check_students("school_", "funding_", "cohort_"),
+ grads = check_graduations("school_", "funding_", "cohort_"),
+ public_funds = check_public_funding("school_", "funding_", "cohort_"),
+ strings_in_dots = "literals"
+ )[c(rep(1, 4), rep(2, 2), rep(3, 4)), ] %>%
+ evaluate_plan(
+ rules = rules_grid,
+ expand = FALSE,
+ always_rename = TRUE
+ )
> plan
# A tibble: 10 x 2
target command
<chr> <chr>
1 credits "check_credit_hours(\"schoolA\", \"public\", \"2012\")"
2 credits "check_credit_hours(\"schoolA\", \"public\", \"2013\")"
3 credits "check_credit_hours(\"schoolA\", \"public\", \"2014\")"
4 credits "check_credit_hours(\"schoolA\", \"public\", \"2015\")"
5 students "check_students(\"schoolB\", \"public\", \"2014\")"
6 students "check_students(\"schoolB\", \"public\", \"2015\")"
7 grads "check_graduations(\"schoolC\", \"private\", \"2012\")"
8 grads "check_graduations(\"schoolC\", \"private\", \"2013\")"
9 grads "check_graduations(\"schoolC\", \"private\", \"2014\")"
10 grads "check_graduations(\"schoolC\", \"private\", \"2015\")"
I do want to think about better handling of custom grids and whether we should expand every matching command over the whole grid. My mind has not been on wildcards lately, though. |
I've been trying to come up with a better paradigm for the substitution rules in evaluate_plan. I note that you have added a new flag, "always_rename", which looks promising. It seems like the problem is made more difficult by trying to get consistent behavior for expand = TRUE and expand = FALSE. So for the moment, I'll ignore it. I'm also ignoring the wildcard/values args, as they are really a subset of a single rules list and could be deprecated.
When you have multiple parameters being substituted at the same time via rules = list(), the primary distinction (in my mind) is whether you are generating all combinations of those parameters (as currently coded with expand = TRUE), or taking them verbatim as "rowwise" tuples of parameter values and always treating each row as a unit. The latter might be the more natural interpretation of rules = data.frame, as rows are often seen as unique observations, while the former makes sense when the list contains vectors of different lengths.
This leads to the suggestion of enhancing the expand option beyond just TRUE/FALSE. Currently, FALSE indicates no replication of targets and just round-robin substitution of parameters. However, the actual substitution appears to depend on both the original targets (counts, ordering, and parameter usage) and the rules parameter counts. I'm not a fan, but this may need to be kept for backward compatibility. With expand = TRUE and a set of rules, the current combinatorial expansion would take place. Finally, with expand = "rowwise", each target would get expanded with each of the parameter tuples defined in a row (no combinatorics unless you did the expansion when generating the rules using, for example, expand.grid). Thus, if you had N targets and M rows in the rules, you would always end up with exactly N*M evaluated targets.
Note that in some sense the rowwise expansion is more fundamental than the current combinatorics, as the latter can easily be replicated using the former, but not vice versa.
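The proposed "rowwise" semantics can be sketched in base R. To be clear, `expand = "rowwise"` is only a suggestion in this comment, and `expand_rowwise()` below is a hypothetical helper written for illustration, not a drake function. It assumes a plan is a data frame with `target` and `command` columns, that the grid has one column per wildcard, and that target names are suffixed with every grid column (a naming choice made here, not taken from the thread).

```R
# Hypothetical helper: expand every target over every row of the grid,
# treating each grid row as one tuple of parameter values.
expand_rowwise <- function(plan, grid) {
  # All (grid row, target) pairs; the grid row index varies fastest.
  pairs <- expand.grid(
    row = seq_len(nrow(grid)),
    tgt = seq_len(nrow(plan))
  )
  out <- data.frame(
    target  = plan$target[pairs$tgt],
    command = plan$command[pairs$tgt],
    stringsAsFactors = FALSE
  )
  # Substitute each wildcard with the value from the paired grid row.
  for (wildcard in names(grid)) {
    values <- as.character(grid[[wildcard]])[pairs$row]
    out$command <- mapply(
      function(cmd, val) gsub(wildcard, val, cmd, fixed = TRUE),
      out$command, values, USE.NAMES = FALSE
    )
  }
  # Build a suffix from all grid columns so target names stay unique.
  suffix <- do.call(paste, c(unname(as.list(grid)), sep = "_"))
  out$target <- paste(out$target, suffix[pairs$row], sep = "_")
  out
}

plan <- data.frame(
  target  = c("credits", "students"),
  command = c('check_credit_hours("school_", "cohort_")',
              'check_students("school_", "cohort_")'),
  stringsAsFactors = FALSE
)
grid <- data.frame(
  school_ = c("schoolA", "schoolB", "schoolB"),
  cohort_ = c("2012", "2014", "2015"),
  stringsAsFactors = FALSE
)

result <- expand_rowwise(plan, grid)
print(result)
```

With N = 2 targets and M = 3 grid rows, the sketch yields N*M = 6 rows, and a full combinatorial expansion can be recovered by passing a grid built with `expand.grid()`.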
================================
In any event, it is only evaluating credits on schoolA, students on schoolB, and grads on schoolC, rather than each test on each school. I would have expected it to converge on a solution similar to my "Version two" proposal above (but without schoolC varying between public and private, as I added to the example code). This would generate the following 36-target plan:
# A tibble: 36 x 2
target command
<chr> <chr>
1 credits_schoolA_2012 "check_credit_hours(\"schoolA\", \"public\", \"2012\")"
2 credits_schoolA_2013 "check_credit_hours(\"schoolA\", \"public\", \"2013\")"
3 credits_schoolA_2014 "check_credit_hours(\"schoolA\", \"public\", \"2014\")"
4 credits_schoolA_2015 "check_credit_hours(\"schoolA\", \"public\", \"2015\")"
5 credits_schoolB_2014 "check_credit_hours(\"schoolB\", \"public\", \"2014\")"
6 credits_schoolB_2015 "check_credit_hours(\"schoolB\", \"public\", \"2015\")"
7 credits_schoolC_2012 "check_credit_hours(\"schoolC\", \"private\", \"2012\")"
8 credits_schoolC_2013 "check_credit_hours(\"schoolC\", \"private\", \"2013\")"
9 credits_schoolC_2014 "check_credit_hours(\"schoolC\", \"private\", \"2014\")"
10 credits_schoolC_2015 "check_credit_hours(\"schoolC\", \"private\", \"2015\")"
11 students_schoolA_2012 "check_students(\"schoolA\", \"public\", \"2012\")"
12 students_schoolA_2013 "check_students(\"schoolA\", \"public\", \"2013\")"
13 students_schoolA_2014 "check_students(\"schoolA\", \"public\", \"2014\")"
14 students_schoolA_2015 "check_students(\"schoolA\", \"public\", \"2015\")"
15 students_schoolB_2014 "check_students(\"schoolB\", \"public\", \"2014\")"
16 students_schoolB_2015 "check_students(\"schoolB\", \"public\", \"2015\")"
17 students_schoolC_2012 "check_students(\"schoolC\", \"private\", \"2012\")"
18 students_schoolC_2013 "check_students(\"schoolC\", \"private\", \"2013\")"
19 students_schoolC_2014 "check_students(\"schoolC\", \"private\", \"2014\")"
20 students_schoolC_2015 "check_students(\"schoolC\", \"private\", \"2015\")"
21 grads_schoolA_2012 "check_graduations(\"schoolA\", \"public\", \"2012\")"
22 grads_schoolA_2013 "check_graduations(\"schoolA\", \"public\", \"2013\")"
23 grads_schoolA_2014 "check_graduations(\"schoolA\", \"public\", \"2014\")"
24 grads_schoolA_2015 "check_graduations(\"schoolA\", \"public\", \"2015\")"
25 grads_schoolB_2014 "check_graduations(\"schoolB\", \"public\", \"2014\")"
26 grads_schoolB_2015 "check_graduations(\"schoolB\", \"public\", \"2015\")"
27 grads_schoolC_2012 "check_graduations(\"schoolC\", \"private\", \"2012\")"
28 grads_schoolC_2013 "check_graduations(\"schoolC\", \"private\", \"2013\")"
29 grads_schoolC_2014 "check_graduations(\"schoolC\", \"private\", \"2014\")"
30 grads_schoolC_2015 "check_graduations(\"schoolC\", \"private\", \"2015\")"
31 public_funds_schoolA_2012 "check_public_funding(\"schoolA\", \"public\", \"2012\")"
32 public_funds_schoolA_2013 "check_public_funding(\"schoolA\", \"public\", \"2013\")"
33 public_funds_schoolA_2014 "check_public_funding(\"schoolA\", \"public\", \"2014\")"
34 public_funds_schoolA_2015 "check_public_funding(\"schoolA\", \"public\", \"2015\")"
35 public_funds_schoolB_2014 "check_public_funding(\"schoolB\", \"public\", \"2014\")"
36 public_funds_schoolB_2015 "check_public_funding(\"schoolB\", \"public\", \"2015\")" |
I did some work since the last post, and those targets are no longer duplicated. Reprex:

library(drake)
library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1 ✔ purrr 0.2.5
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.5
#> ✔ tidyr 0.8.1 ✔ stringr 1.3.1
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ── Conflicts ─────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ tidyr::expand() masks drake::expand()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ tidyr::gather() masks drake::gather()
#> ✖ dplyr::lag() masks stats::lag()
rules_grid <- tibble::tibble(
  school_ = c("schoolA", "schoolB", "schoolC"),
  funding_ = c("public", "public", "private")
) %>%
  tidyr::crossing(cohort_ = c("2012", "2013", "2014", "2015")) %>%
  dplyr::filter(!(school_ == "schoolB" & cohort_ %in% c("2012", "2013")))
plan <- drake_plan(
  credits = check_credit_hours("school_", "funding_", "cohort_"),
  students = check_students("school_", "funding_", "cohort_"),
  grads = check_graduations("school_", "funding_", "cohort_"),
  public_funds = check_public_funding("school_", "funding_", "cohort_"),
  strings_in_dots = "literals"
)[c(rep(1, 4), rep(2, 2), rep(3, 4)), ] %>%
  evaluate_plan(rules = rules_grid, expand = FALSE, always_rename = TRUE) %>%
  print
#> # A tibble: 10 x 2
#> target command
#> <chr> <chr>
#> 1 credits_schoolA_public_2012 "check_credit_hours(\"schoolA\", \"public…
#> 2 credits_schoolA_public_2013 "check_credit_hours(\"schoolA\", \"public…
#> 3 credits_schoolA_public_2014 "check_credit_hours(\"schoolA\", \"public…
#> 4 credits_schoolA_public_2015 "check_credit_hours(\"schoolA\", \"public…
#> 5 students_schoolB_public_2014 "check_students(\"schoolB\", \"public\", …
#> 6 students_schoolB_public_2015 "check_students(\"schoolB\", \"public\", …
#> 7 grads_schoolC_private_2012 "check_graduations(\"schoolC\", \"private…
#> 8 grads_schoolC_private_2013 "check_graduations(\"schoolC\", \"private…
#> 9 grads_schoolC_private_2014 "check_graduations(\"schoolC\", \"private…
#> 10 grads_schoolC_private_2015 "check_graduations(\"schoolC\", \"private…

I will need some time to think about the rest of your comments about different modes of wildcard substitution and expansion. I am planning to put this functionality in the At this point, I see wildcards as a medium-term solution. Long-term, I still prefer to move to @krlmlr's proposed DSL interface (ref: #233, #304). |
Unfortunately, the proposed solution doesn't generate the correct answer. It pseudo-randomly combines the checks with the schools and generates only 10 results. What it should do is combine all 4 independent checks with all specified schools and years (cohorts) and generate 40 results (if you allow check_public_funding to be invoked on schoolC, or 36 if you don't).

# A tibble: 40 x 2
target command
<chr> <chr>
1 credits_schoolA_2012 "check_credit_hours(\"schoolA\", \"public\", \"2012\")"
2 credits_schoolA_2013 "check_credit_hours(\"schoolA\", \"public\", \"2013\")"
3 credits_schoolA_2014 "check_credit_hours(\"schoolA\", \"public\", \"2014\")"
4 credits_schoolA_2015 "check_credit_hours(\"schoolA\", \"public\", \"2015\")"
5 credits_schoolB_2014 "check_credit_hours(\"schoolB\", \"public\", \"2014\")"
6 credits_schoolB_2015 "check_credit_hours(\"schoolB\", \"public\", \"2015\")"
7 credits_schoolC_2012 "check_credit_hours(\"schoolC\", \"private\", \"2012\")"
8 credits_schoolC_2013 "check_credit_hours(\"schoolC\", \"private\", \"2013\")"
9 credits_schoolC_2014 "check_credit_hours(\"schoolC\", \"private\", \"2014\")"
10 credits_schoolC_2015 "check_credit_hours(\"schoolC\", \"private\", \"2015\")"
11 students_schoolA_2012 "check_students(\"schoolA\", \"public\", \"2012\")"
12 students_schoolA_2013 "check_students(\"schoolA\", \"public\", \"2013\")"
13 students_schoolA_2014 "check_students(\"schoolA\", \"public\", \"2014\")"
14 students_schoolA_2015 "check_students(\"schoolA\", \"public\", \"2015\")"
15 students_schoolB_2014 "check_students(\"schoolB\", \"public\", \"2014\")"
16 students_schoolB_2015 "check_students(\"schoolB\", \"public\", \"2015\")"
17 students_schoolC_2012 "check_students(\"schoolC\", \"private\", \"2012\")"
18 students_schoolC_2013 "check_students(\"schoolC\", \"private\", \"2013\")"
19 students_schoolC_2014 "check_students(\"schoolC\", \"private\", \"2014\")"
20 students_schoolC_2015 "check_students(\"schoolC\", \"private\", \"2015\")"
21 grads_schoolA_2012 "check_graduations(\"schoolA\", \"public\", \"2012\")"
22 grads_schoolA_2013 "check_graduations(\"schoolA\", \"public\", \"2013\")"
23 grads_schoolA_2014 "check_graduations(\"schoolA\", \"public\", \"2014\")"
24 grads_schoolA_2015 "check_graduations(\"schoolA\", \"public\", \"2015\")"
25 grads_schoolB_2014 "check_graduations(\"schoolB\", \"public\", \"2014\")"
26 grads_schoolB_2015 "check_graduations(\"schoolB\", \"public\", \"2015\")"
27 grads_schoolC_2012 "check_graduations(\"schoolC\", \"private\", \"2012\")"
28 grads_schoolC_2013 "check_graduations(\"schoolC\", \"private\", \"2013\")"
29 grads_schoolC_2014 "check_graduations(\"schoolC\", \"private\", \"2014\")"
30 grads_schoolC_2015 "check_graduations(\"schoolC\", \"private\", \"2015\")"
31 public_funds_schoolA_2012 "check_public_funding(\"schoolA\", \"public\", \"2012\")"
32 public_funds_schoolA_2013 "check_public_funding(\"schoolA\", \"public\", \"2013\")"
33 public_funds_schoolA_2014 "check_public_funding(\"schoolA\", \"public\", \"2014\")"
34 public_funds_schoolA_2015 "check_public_funding(\"schoolA\", \"public\", \"2015\")"
35 public_funds_schoolB_2014 "check_public_funding(\"schoolB\", \"public\", \"2014\")"
36 public_funds_schoolB_2015 "check_public_funding(\"schoolB\", \"public\", \"2015\")"
37 public_funds_schoolC_2012 "check_public_funding(\"schoolC\", \"private\", \"2012\"…
38 public_funds_schoolC_2013 "check_public_funding(\"schoolC\", \"private\", \"2013\"…
39 public_funds_schoolC_2014 "check_public_funding(\"schoolC\", \"private\", \"2014\"…
40 public_funds_schoolC_2015 "check_public_funding(\"schoolC\", \"private\", \"2015\"… |
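The three row counts in this exchange (10, 36 and 40) can be checked with a short base R sketch. It does not call drake at all; it only rebuilds the rules grid from the reprex with merge(by = NULL) as a cross join and counts rows. The checks vector and the check column are names made up for this illustration.

```r
# School/funding pairs are matched tuples, as in the reprex's rules_grid.
schools <- data.frame(
  school_ = c("schoolA", "schoolB", "schoolC"),
  funding_ = c("public", "public", "private"),
  stringsAsFactors = FALSE
)
cohorts <- data.frame(
  cohort_ = c("2012", "2013", "2014", "2015"),
  stringsAsFactors = FALSE
)

# merge() with by = NULL returns the Cartesian product of its inputs.
grid <- merge(schools, cohorts, by = NULL)
# schoolB has no 2012 or 2013 cohort.
grid <- grid[!(grid$school_ == "schoolB" & grid$cohort_ %in% c("2012", "2013")), ]
n_rules <- nrow(grid)

# Every check applied to every rule row (illustrative column name: check).
checks <- data.frame(
  check = c("credits", "students", "grads", "public_funds"),
  stringsAsFactors = FALSE
)
full <- merge(checks, grid, by = NULL)
n_full <- nrow(full)

# Drop the public funding check for the private school.
pruned <- full[!(full$check == "public_funds" & full$funding_ == "private"), ]
n_pruned <- nrow(pruned)
```

Here n_rules is the size of the rules grid (what expand = FALSE yields when the plan rows are replicated to match), n_full is every check on every rule row, and n_pruned excludes public_funds for schoolC.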
I think the 10-row data frame is really what we are going for here. (@AlexAxthelm, do you agree?) Setting |
https://github.com/tidyverse/glue may be a better solution to all this. Ref: #424. |
Coming back to #235 (comment), I thought of a much better solution to the original problem: just define a special wildcard for public schools.

library(drake)
library(magrittr)
drake_plan(
credits = check_credit_hours(all_schools__),
students = check_students(all_schools__),
grads = check_graduations(all_schools__),
public_funds = check_public_funding(public_schools__)
) %>%
evaluate_plan(
rules = list(
all_schools__ = c("schoolA", "schoolB", "schoolC"),
public_schools__ = c("schoolA", "schoolB")
)
)
#> # A tibble: 11 x 2
#> target command
#> <chr> <chr>
#> 1 credits_schoolA check_credit_hours(schoolA)
#> 2 credits_schoolB check_credit_hours(schoolB)
#> 3 credits_schoolC check_credit_hours(schoolC)
#> 4 students_schoolA check_students(schoolA)
#> 5 students_schoolB check_students(schoolB)
#> 6 students_schoolC check_students(schoolC)
#> 7 grads_schoolA check_graduations(schoolA)
#> 8 grads_schoolB check_graduations(schoolB)
#> 9 grads_schoolC check_graduations(schoolC)
#> 10 public_funds_schoolA check_public_funding(schoolA)
#> 11 public_funds_schoolB check_public_funding(schoolB)

Without that 12th row, this is the correct answer to the question posed at the top of the thread. And it only requires one call to |
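As a sanity check on why this yields 11 rows, the sketch below counts the expansion per command in base R. It does not use drake and is only an approximation of the idea: each command is expanded over the wildcards it actually mentions, so a wildcard with fewer values produces fewer targets. The commands vector and the counting logic are assumptions made for illustration.

```r
rules <- list(
  all_schools__ = c("schoolA", "schoolB", "schoolC"),
  public_schools__ = c("schoolA", "schoolB")
)

# Commands as plain strings, keyed by target name.
commands <- c(
  credits = "check_credit_hours(all_schools__)",
  students = "check_students(all_schools__)",
  grads = "check_graduations(all_schools__)",
  public_funds = "check_public_funding(public_schools__)"
)

# For each command, multiply the lengths of the wildcards it mentions.
counts <- vapply(commands, function(cmd) {
  mentioned <- vapply(names(rules), grepl, logical(1), x = cmd, fixed = TRUE)
  prod(lengths(rules[mentioned]))
}, numeric(1))

total_targets <- sum(counts)
```

The three all_schools__ commands contribute three targets each and public_funds contributes two, which matches the 11-row plan printed above.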
Edit: |
An issue that I keep running into with evaluate_plan() is that setting up incomplete multiples is kind of a pain. As an example, if I have three schools that I want to run an analysis on, I might have something along the lines

Except schoolC will throw an error on check_public_funds because they don't receive any. So at this point, I have a few options:

- Use dplyr::filter to prune away everything I don't want. Works okay for small numbers of exceptions, but doesn't scale well.
- Add a school_type__ argument to each of my functions, which returns a NULL when appropriate. Not a perfect solution, but it (ideally) makes the drake plan easy to maintain, and if I put the return(NULL) early in the function, it's not a huge time sink overall.

But, there isn't a great way to pass those arguments such that they match:

Ideally I would have something like this:

Currently, this evaluates to the same as very_wrong above. I'm not sure if the best option here would be to change default behaviors for rectangular objects passed to rules, or maybe add a matched_arguments flag in evaluate_plan, so that it can understand that not all expansions go with each other. Also, maybe I'm on a weird edge case, and a clarification on best practices around evaluate_plan would be helpful?

I think this is relevant for #228 and #233.
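The early-return option mentioned above could look roughly like the sketch below. The function body is a placeholder and the school_type argument is an assumption about how the school_type__ wildcard would be passed in; the real analysis functions are not shown in this thread.

```r
# Hypothetical stand-in for the real analysis function. school_type is an
# assumed extra argument that the school_type__ wildcard would fill in.
check_public_funding <- function(school, school_type) {
  # Bail out early so non-public schools cost almost nothing.
  if (!identical(school_type, "public")) {
    return(NULL)
  }
  # Placeholder for the real (expensive) check.
  paste("checked public funding for", school)
}

results <- list(
  schoolA = check_public_funding("schoolA", "public"),
  schoolC = check_public_funding("schoolC", "private")
)
```

The plan stays a full grid, and the target for a private school simply evaluates to NULL instead of throwing an error. It still depends on school_ and school_type__ being substituted as a matched pair, which is the problem this issue is about.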