A base R replacement to formatR::tidy_source() #563
Conversation
Codecov Report
```diff
@@           Coverage Diff           @@
##           master    #563    +/-   ##
========================================
+ Coverage   99.91%    100%   +0.08%
========================================
  Files          82      82
  Lines        6977    6961      -16
========================================
- Hits         6971    6961      -10
+ Misses          6       0       -6
```
It's a detail, but since the output of the function has changed, maybe a renaming would be appropriate to reflect the fact that it computes a hash?
It's a good point. However, now that you mention it, I am not totally settled on the output format.
In (2) we hash only once, but the serialization is applied to a larger object of varying size. We have a long time to decide. I plan to keep this PR open at least until we are sure the next release will be version 7.0.0.
Good point. Maybe you are right and we should not hash twice if possible. Can you quantify the speed benefits of (2) vs (1)?
A good thing to check. The times below look about the same between options (maybe because the secondary hashes are just hashes of other hashes, and those first hashes are small). I think option 1 is a good choice for now because it should take up less memory (smaller return values).

```r
library(magrittr)
library(microbenchmark)

# Return tokenization info for code string x.
standardize <- function(x){
  if (!length(x)){
    return(as.character(NA))
  }
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)[[1]]) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")]
}

# Try option 1: hash, serialize the hash, and then hash again.
option_1 <- function(x, key, cache){
  standardize(x) %>%
    lapply(FUN = paste, collapse = " ") %>%
    paste(collapse = " >> ") %>%
    digest::digest(algo = "sha256", serialize = FALSE) %>% # First hash
    cache$set(key = key) # Serialize and hash via `storr`
}

# Try option 2: let storr do all the serialization and hashing.
option_2 <- function(x, key, cache){
  standardize(x) %>%
    c() %>%
    cache$set(key = key) # serialization and hashing.
}

# Borrow and modify the body of standardize() as example code.
# Create text of variable sizes, up to 32 times the original.
random_code <- function(){
  text <- deparse(body(standardize))
  n <- sample.int(n = 5, size = 1)
  for (i in seq_len(n)){
    text <- c(text, text)
  }
  text <- c("{", text, "}")
  paste(text, collapse = "\n") %>%
    gsub(pattern = "ASSIGN", replacement = as.character(runif(1)))
}

# One benchmark
benchmark <- function(fun){
  set.seed(0)
  times <- 100
  keys <- as.character(seq_len(times))
  code <- replicate(times, random_code())
  set.seed(0)
  cache <- storr::storr_rds(tempfile())
  on.exit(cache$destroy())
  loop <- function(){
    for (i in seq_len(times)){
      fun(x = code[i], key = keys[i], cache = cache)
    }
  }
  microbenchmark(loop(), times = 100)
}

# Run the benchmarks
print(benchmark(option_1))
#> Unit: seconds
#>    expr      min      lq     mean   median       uq     max neval
#>  loop() 6.708124 6.89136 7.010094 6.949077 7.093694 7.58899   100
print(benchmark(option_2))
#> Unit: seconds
#>    expr      min       lq     mean   median       uq      max neval
#>  loop() 6.813503 6.916124 7.042026 6.998259 7.084101 7.605348   100
```

Created on 2018-10-30 by the reprex package (v0.2.1)
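As a quick sanity check of `standardize()` itself (my own addition, not part of the benchmark): assignment style and comments should not change the token stream. The version below inlines the same logic in base R only, without magrittr.

```r
# Base-R-only rewrite of standardize() for illustration; same logic as above.
standardize <- function(x) {
  # Round-trip through parse()/deparse() to drop comments and normalize layout.
  x <- paste(deparse(parse(text = x, keep.source = FALSE)[[1]]), collapse = "\n")
  info <- utils::getParseData(parse(text = x, keep.source = TRUE), includeText = TRUE)
  # Treat <- and = as the same assignment token.
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")]
}

a <- standardize("x <- 1")
b <- standardize("x = 1  # a comment")
identical(a$text, b$text)  # TRUE: both reduce to the same tokens
```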
Not sure if related, but I remember when I removed a lot of old big targets from the cache that I did not use anymore in my workflow plan, drake got considerably faster.
Hmm... I would be surprised if the sizes of targets mattered, but the number sure does. I bet there are still many linear-time lookups in …
Edit: size could also matter in the sense that immediate dependencies need to be loaded into memory. If there is more slowdown with respect to the sizes of targets than that can explain, something is probably wrong.
For the record:

```r
init_cache <- function(n){
  s <- storr::storr_rds(tempfile())
  keys <- as.character(seq_len(n))
  values <- lapply(keys, digest::digest)
  s$mset(key = keys, value = values)
  s
}
storrs <- lapply(10^seq_len(4), init_cache)

microbenchmark::microbenchmark(
  storrs[[4]]$get("1", use_cache = FALSE),
  storrs[[3]]$get("1", use_cache = FALSE),
  storrs[[2]]$get("1", use_cache = FALSE),
  storrs[[1]]$get("1", use_cache = FALSE)
)
#> Unit: microseconds
#>                                     expr    min      lq     mean  median
#>  storrs[[4]]$get("1", use_cache = FALSE) 91.015 92.0180 102.7160 92.9660
#>  storrs[[3]]$get("1", use_cache = FALSE) 90.391 92.0310 100.0994 92.7900
#>  storrs[[2]]$get("1", use_cache = FALSE) 90.807 91.9485 103.4817 92.5740
#>  storrs[[1]]$get("1", use_cache = FALSE) 90.234 91.8970 100.6319 92.8065
#>       uq     max neval cld
#>  94.0305 598.719   100   a
#>  93.6660 643.842   100   a
#>  93.5400 642.429   100   a
#>  93.5760 853.714   100   a

microbenchmark::microbenchmark(
  storrs[[4]]$set("1", runif(1), use_cache = FALSE),
  storrs[[3]]$set("1", runif(1), use_cache = FALSE),
  storrs[[2]]$set("1", runif(1), use_cache = FALSE),
  storrs[[1]]$set("1", runif(1), use_cache = FALSE)
)
#> Unit: microseconds
#>                                               expr     min       lq
#>  storrs[[4]]$set("1", runif(1), use_cache = FALSE) 215.088 220.9315
#>  storrs[[3]]$set("1", runif(1), use_cache = FALSE) 215.603 222.4475
#>  storrs[[2]]$set("1", runif(1), use_cache = FALSE) 211.361 217.7670
#>  storrs[[1]]$set("1", runif(1), use_cache = FALSE) 210.187 214.8310
#>      mean   median       uq       max neval cld
#>  574.8408 225.4380 244.4220 10302.004   100   a
#>  513.0485 228.6505 251.5465  5063.073   100   a
#>  600.5828 224.0225 249.5070 11543.353   100   a
#>  356.2935 221.6010 241.9280  2427.105   100   a
```

Created on 2018-10-30 by the reprex package (v0.2.1)
May I ask about the rationale behind replacing …?
On a related note: I don't know how this is handled now, but when I updated drake the last time, I had to rebuild the whole cache. I assume this won't be any different this time? I think a warning should be issued after installation so people could still go back to the old installation for any reason and use the built cache, just in case it is not possible to rebuild the cache for some reason, or it needs so much time that the user may want to defer it. If this is a thing, I think we best open a new issue.
If we go through with #562 and #563, yes. I am still wondering if the change is worth the removal of a single dependency, and I could be convinced either way. If we are going to do it, it is better sooner rather than later, especially since 7.0.0 will have large changes already. But it is also important to fulfill …
Seems like we would want to control the installation process: look at the user's prior version of …
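One way to sketch that installation-time check (purely hypothetical: `check_cache_version()`, the version strings, and the warning text are all my assumptions, not drake's actual API):

```r
# Hypothetical helper: compare the drake version recorded in a cache
# against the installed version and warn before any rebuild happens.
check_cache_version <- function(cache_version, installed_version = "7.0.0") {
  if (utils::compareVersion(cache_version, installed_version) < 0) {
    warning(
      "Cache was written by drake ", cache_version, " but drake ",
      installed_version, " is installed. Targets may be invalidated; ",
      "downgrade if you need the old cache as-is.",
      call. = FALSE
    )
  }
  invisible(cache_version)
}

check_cache_version("6.2.1")  # warns: cache predates the installed version
```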
Looking back at this issue, I have a couple comments:
A positive note on speed:

```r
text <- paste(deparse(digest::digest), collapse = "\n")
microbenchmark::microbenchmark(
  deparse(parse(text = text, keep.source = FALSE)[[1]]),
  formatR::tidy_source(
    source = NULL,
    comment = FALSE,
    blank = FALSE,
    arrow = TRUE,
    brace.newline = FALSE,
    indent = 4,
    output = FALSE,
    text = text,
    width.cutoff = 119
  )
)
#> Unit: microseconds
#> expr
#>   deparse(parse(text = text, keep.source = FALSE)[[1]])
#>   formatR::tidy_source(source = NULL, comment = FALSE, blank = FALSE,
#>     arrow = TRUE, brace.newline = FALSE, indent = 4, output = FALSE,
#>     text = text, width.cutoff = 119)
#>       min        lq      mean    median       uq      max neval
#>   551.514  566.5695  589.0592  572.3665  594.850  749.532   100
#>  3728.347 3897.0460 4218.4730 3996.0425 4161.814 7441.496   100
```

Created on 2019-01-01 by the reprex package (v0.2.1)
Hmm...
I am not sure I can follow. Am I still seeing a …?
Sorry for the confusion. That remaining call to `deparse()` is for imported functions, because hashes of functions are brittle: #345 (comment). My question: are the hashes of expressions, symbols, and language objects in general just as brittle? If `x` is a language object, is there anything that would make `digest(x)` unreliable? We can assume `attributes(x)` is `NULL`. Also cc @wch.
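For what it's worth, the brittleness for functions is easy to reproduce with base tools alone (my own illustration using `serialize()` rather than drake's code): a srcref attribute by itself changes the serialized bytes, and therefore any serialization-based hash.

```r
# Two functions with identical bodies, differing only in whether a srcref
# (source reference attribute) was kept at parse time.
f1 <- eval(parse(text = "function(x) x + 1", keep.source = TRUE))
f2 <- eval(parse(text = "function(x) x + 1", keep.source = FALSE))

identical(f1(2), f2(2))                             # TRUE: same behavior
identical(serialize(f1, NULL), serialize(f2, NULL)) # FALSE: srcref changes the bytes
```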
I actually found the answer to my question while refining #233 today. Strange things happen with language objects in R version 3.4.4. The two language objects below should be the same, but they have different internals because they were generated in different ways.

```r
x <- rlang::syms(c("a", "b"))
names(x) <- c("a", "b")
lang1 <- substitute(f(x), list(x = x))
lang2 <- quote(f(list(a = a, b = b)))
print(lang1)
#> f(list(a = a, b = b))
print(lang2)
#> f(list(a = a, b = b))

library(digest)
digest(lang1)
#> [1] "869aef282c5349417ab12590f1359de5"
digest(lang2)
#> [1] "b6630dff6ebb37dac41c6164381efcc7"

deparse(lang1)
#> [1] "f(structure(list(a = a, b = b), .Names = c(\"a\", \"b\")))"
deparse(lang2)
#> [1] "f(list(a = a, b = b))"

print(drake:::safe_deparse)
#> function (x)
#> {
#>     paste(deparse(x, control = c("keepInteger", "keepNA")), collapse = "\n")
#> }
#> <environment: namespace:drake>

drake:::safe_deparse(lang1)
#> [1] "f(list(a = a, b = b))"
drake:::safe_deparse(lang2)
#> [1] "f(list(a = a, b = b))"
digest(drake:::safe_deparse(lang1))
#> [1] "c19a8f963e567ebb171494483dfb1697"
digest(drake:::safe_deparse(lang2))
#> [1] "c19a8f963e567ebb171494483dfb1697"
```

Created on 2019-01-21 by the reprex package (v0.2.1)
Hmm... the same example on R 3.5.2 gives two different answers.

```r
x <- rlang::syms(c("a", "b"))
names(x) <- c("a", "b")
lang1 <- substitute(f(x), list(x = x))
lang2 <- quote(f(list(a = a, b = b)))
deparse(lang1, control = NULL)
#> [1] "f(list(a, b))"
deparse(lang2, control = NULL)
#> [1] "f(list(a = a, b = b))"
```

Created on 2019-01-21 by the reprex package (v0.2.1)

Hopefully deparsing prevents most other edge cases. Other remarks:
Summary

NOTE: this PR will invalidate all targets in all workflows, so I am waiting until the next major release (version 7.0.0) to merge it!

In the interest of reducing `drake`'s dependencies on other packages, this PR introduces a replacement for `formatR::tidy_source()` that only depends on the `base` and `utils` packages. Thanks to @hrbrmstr and @lorenzwalthert for their solutions in #562.

Related GitHub issues and pull requests

Checklist

- I have read `drake`'s code of conduct, and I agree to follow its rules.
- I have added `testthat` unit tests to `tests/testthat` to confirm that any new features or functionality work correctly.
- `devtools::check()` …