
A base R replacement to formatR::tidy_source() #563

Merged: 26 commits into master, Jan 2, 2019
Conversation

@wlandau (Member) commented Oct 28, 2018

Summary

NOTE: this PR will invalidate all targets in all workflows, so I am waiting until the next major release (version 7.0.0) to merge it!

In the interest of reducing drake's dependencies on other packages, this PR introduces a replacement to formatR::tidy_source() that only depends on packages base and utils. Thanks to @hrbrmstr and @lorenzwalthert for their solutions in #562.

Related GitHub issues and pull requests

Checklist

  • I have read drake's code of conduct, and I agree to follow its rules.
  • I have listed any substantial changes in the development news.
  • I have added testthat unit tests to tests/testthat to confirm that any new features or functionality work correctly.
  • I have tested this pull request locally with devtools::check().
  • This pull request is ready for review.
  • I think this pull request is ready to merge.

@codecov-io commented Oct 28, 2018

Codecov Report

Merging #563 into master will increase coverage by 0.08%.
The diff coverage is 100%.


@@           Coverage Diff            @@
##           master   #563      +/-   ##
========================================
+ Coverage   99.91%   100%   +0.08%     
========================================
  Files          82     82              
  Lines        6977   6961      -16     
========================================
- Hits         6971   6961      -10     
+ Misses          6      0       -6
Impacted Files Coverage Δ
R/test.R 100% <ø> (+6.74%) ⬆️
R/config.R 100% <ø> (ø) ⬆️
R/commands.R 100% <100%> (ø) ⬆️
R/utils.R 100% <100%> (ø) ⬆️

Last update 9ed50b7...5a82537.

@lorenzwalthert (Contributor) commented:

It's a detail, but since the output of the function has changed, maybe a renaming would be appropriate to reflect the fact that it computes a hash?

@wlandau (Member, Author) commented Oct 30, 2018

It's a good point. However, now that you mention it, I am not totally settled on the output format. drake already serializes and hashes (via storr) whatever gets returned from standardize_command() (which calls standardize_code() at the end). I am not sure if it is faster to

  1. Hash once and then hash and serialize the first hash (current behavior) or
  2. Return the table of tokens, which will then get serialized and hashed once.

In (2) we hash only once, but the serialization is applied to a larger object of varying size.

We have a long time to decide. I plan to keep this PR open at least until we are sure the next release will be version 7.0.0.

@lorenzwalthert (Contributor) commented:

Good point. Maybe you are right and we should not hash twice if possible. Can you quantify the speed benefits of (2) vs (1)?

@wlandau (Member, Author) commented Oct 30, 2018

A good thing to check. The times below look about the same between options (maybe because secondary hashes are just hashes of other hashes and those first hashes are small). I think option 1 is a good choice for now because it should take up less memory (smaller return values from the standardize_*() functions).

library(magrittr)
library(microbenchmark)

# Return tokenization info for code string x
standardize <- function(x){
  if (!length(x)){
    return(as.character(NA))
  }
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)[[1]]) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")]
}

# Try option 1: hash, serialize the hash, and then hash again.
option_1 <- function(x, key, cache){
  standardize(x) %>%
    lapply(FUN = paste, collapse = " ") %>%
    paste(collapse = " >> ") %>%
    digest::digest(algo = "sha256", serialize = FALSE) %>% # First hash
    cache$set(key = key) # Serialize and hash via `storr`
}

# Try option 2: let storr do all the serialization and hashing.
option_2 <- function(x, key, cache){
  standardize(x) %>%
    c() %>%
    cache$set(key = key) # serialization and hashing.
}

# Borrow and modify the body of standardize() as example code
# Create text of variable sizes, up to 32 times the original.
random_code <- function(){
  text <- deparse(body(standardize))
  n <- sample.int(n = 5, size = 1)
  for (i in seq_len(n)){
    text <- c(text, text)
  }
  text <- c("{", text, "}")
  paste(text, collapse = "\n") %>%
    gsub(pattern = "ASSIGN", replacement = as.character(runif(1)))
}

# One benchmark
benchmark <- function(fun){
  set.seed(0)
  times <- 100
  keys <- as.character(seq_len(times))
  code <- replicate(times, random_code())
  set.seed(0)
  cache <- storr::storr_rds(tempfile())
  on.exit(cache$destroy())
  loop <- function(){
    for (i in seq_len(times)){
      fun(x = code[i], key = keys[i], cache = cache)
    }
  }
  microbenchmark(loop(), times = 100)
}

# Run the benchmarks
print(benchmark(option_1))
#> Unit: seconds
#>    expr      min      lq     mean   median       uq     max neval
#>  loop() 6.708124 6.89136 7.010094 6.949077 7.093694 7.58899   100
print(benchmark(option_2))
#> Unit: seconds
#>    expr      min       lq     mean   median       uq      max neval
#>  loop() 6.813503 6.916124 7.042026 6.998259 7.084101 7.605348   100

Created on 2018-10-30 by the reprex package (v0.2.1)

@lorenzwalthert (Contributor) commented:

Not sure if this is related, but I remember that drake got considerably faster after I removed a lot of old, large targets from the cache that I no longer used in my workflow plan.

@wlandau (Member, Author) commented Oct 30, 2018

Hmm... I would be surprised if the sizes of targets mattered, but the number sure does. I bet there are still many linear-time lookups in drake's code base.
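The cost of linear-time lookups can be illustrated with base R alone. The sketch below is an aside, not drake's actual code: matching a key in a character vector scans linearly, while a lookup in a hashed environment is roughly constant-time regardless of how many keys are stored.

```r
# Illustration only (not drake code): vector lookups scan linearly,
# hashed environment lookups do not.
n <- 1e5
keys <- as.character(seq_len(n))
env <- new.env(hash = TRUE)
for (k in keys) assign(k, TRUE, envir = env)

# Worst case: the key sits at the end of the vector.
system.time(for (i in seq_len(100)) match("100000", keys))
system.time(for (i in seq_len(100)) exists("100000", envir = env, inherits = FALSE))
```

With many targets, replacing repeated `match()`-style scans with environment (or named-list-free) lookups is the usual fix for this class of slowdown.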

@wlandau (Member, Author) commented Oct 30, 2018

Edit: size could also matter, in the sense that immediate dependencies need to be loaded into memory. If target size causes more slowdown than that can explain, something is probably wrong.

@wlandau (Member, Author) commented Oct 31, 2018

For the record, $set() and $get() appear to be constant-time in storr. Encouraging.

init_cache <- function(n){
  s <- storr::storr_rds(tempfile())
  keys <- as.character(seq_len(n))
  values <- lapply(keys, digest::digest)
  s$mset(key = keys, value = values)
  s
}

storrs <- lapply(10^seq_len(4), init_cache)

microbenchmark::microbenchmark(
  storrs[[4]]$get("1", use_cache = FALSE),
  storrs[[3]]$get("1", use_cache = FALSE),
  storrs[[2]]$get("1", use_cache = FALSE),
  storrs[[1]]$get("1", use_cache = FALSE)
)
#> Unit: microseconds
#>                                     expr    min      lq     mean  median
#>  storrs[[4]]$get("1", use_cache = FALSE) 91.015 92.0180 102.7160 92.9660
#>  storrs[[3]]$get("1", use_cache = FALSE) 90.391 92.0310 100.0994 92.7900
#>  storrs[[2]]$get("1", use_cache = FALSE) 90.807 91.9485 103.4817 92.5740
#>  storrs[[1]]$get("1", use_cache = FALSE) 90.234 91.8970 100.6319 92.8065
#>       uq     max neval cld
#>  94.0305 598.719   100   a
#>  93.6660 643.842   100   a
#>  93.5400 642.429   100   a
#>  93.5760 853.714   100   a

microbenchmark::microbenchmark(
  storrs[[4]]$set("1", runif(1), use_cache = FALSE),
  storrs[[3]]$set("1", runif(1), use_cache = FALSE),
  storrs[[2]]$set("1", runif(1), use_cache = FALSE),
  storrs[[1]]$set("1", runif(1), use_cache = FALSE)
)
#> Unit: microseconds
#>                                               expr     min       lq
#>  storrs[[4]]$set("1", runif(1), use_cache = FALSE) 215.088 220.9315
#>  storrs[[3]]$set("1", runif(1), use_cache = FALSE) 215.603 222.4475
#>  storrs[[2]]$set("1", runif(1), use_cache = FALSE) 211.361 217.7670
#>  storrs[[1]]$set("1", runif(1), use_cache = FALSE) 210.187 214.8310
#>      mean   median       uq       max neval cld
#>  574.8408 225.4380 244.4220 10302.004   100   a
#>  513.0485 228.6505 251.5465  5063.073   100   a
#>  600.5828 224.0225 249.5070 11543.353   100   a
#>  356.2935 221.6010 241.9280  2427.105   100   a

Created on 2018-10-30 by the reprex package (v0.2.1)

@lorenzwalthert (Contributor) commented:

May I ask what the rationale behind replacing %>% in 69ff455 is? Is drake moving away from %>%?

@wlandau (Member, Author) commented Nov 4, 2018

Yes: #570, #571. Profiling the code with profvis is much easier this way. No more tall stacks of fseq() and freduce(). I am already learning a lot now that the clutter is gone.

@wlandau (Member, Author) commented Dec 6, 2018

@lorenzwalthert (Contributor) commented:

On a related note: I don't know how this is handled now, but when I updated drake the last time, I had to rebuild the whole cache. I assume this won't be any different this time? I think a warning should be issued after installation so people could still go back to the old installation for any reason and use the built cache, in case rebuilding the cache is impossible for some reason, or takes so much time that the user wants to defer it. If this is a thing, I think we had best open a new issue.

@wlandau (Member, Author) commented Dec 6, 2018

On a related note: I don't know how this is handled now, but when I updated drake the last time, I had to rebuild the whole cache. I assume this won't be any different this time?

If we go through with #562 and #563, yes. I am still wondering if the change is worth the removal of a single dependency, and I could be convinced either way. If we are going to do it, it is better sooner rather than later, especially since 7.0.0 will have large changes already. But it is also important to fulfill drake's promise of keeping the right targets up to date.

I think a warning should be issued after the installation so people could still go back to the old installation for any reason and use the built cache.

Seems like we would want to control the installation process: look at the user's prior version of drake, install the package, and give a warning if the old and new versions span a change in the cache format. Is that possible to do in a package? It's a good idea, and I have often wanted to add special installation commands for different reasons. Either way, I think it is important to look at individual projects too.
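A package cannot hook into its own installation, but it can run a check at load time. The sketch below is hypothetical: the plain-text version file and the `check_cache_version()` name are invented for illustration and do not reflect drake's real cache layout, which lives in a storr cache.

```r
# Hypothetical sketch: warn at load time if a project cache was written
# by a different version of the package. The "version" file inside the
# cache directory is an invented placeholder, not drake's actual format.
check_cache_version <- function(cache_dir = ".drake", current = "7.0.0") {
  version_file <- file.path(cache_dir, "version")
  if (!file.exists(version_file)) {
    return(invisible(NULL))  # no cache recorded, nothing to compare
  }
  recorded <- readLines(version_file, n = 1)
  if (!identical(recorded, current)) {
    warning(
      "Cache in ", cache_dir, " was written by version ", recorded,
      " but version ", current, " is installed. ",
      "Targets may need to be rebuilt.",
      call. = FALSE
    )
  }
  invisible(recorded)
}
```

A real implementation would presumably live in .onAttach() and compare against the version drake records when it initializes a cache.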

@wlandau (Member, Author) commented Jan 2, 2019

Looking back at this issue, I have a couple of comments:

  1. I think now is the time to move away from formatR. Because of Pre-encode file paths internally #630, Use manual base 64 encoding, file paths only. #633, and Cleaner encoding and case insensitivity #641 (all well worth it for the speed improvements), the next release will invalidate targets in old projects anyway.
  2. I no longer think we should coerce EQ_ASSIGN to LEFT_ASSIGN. The reason is consistency: functions already get fingerprinted by deparsing, which does not change = to <-. The behavior for commands in the plan should agree with the behavior for functions. So I think we should go with a simple deparse(parse(text = as.character(command), keep.source = FALSE)[[1]]) instead of utils::getParseData().
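To make point 2 concrete, here is a minimal sketch of the proposed standardization (the function name is illustrative, not drake's API): deparsing the parsed command normalizes whitespace and drops comments, but leaves `=` and `<-` distinct, matching how functions are already fingerprinted by deparsing.

```r
# Minimal sketch of the proposed command standardization.
# `standardize_command()` is an illustrative name, not drake's real API.
standardize_command <- function(command) {
  deparse(parse(text = as.character(command), keep.source = FALSE)[[1]])
}

standardize_command("f( x,y ) # comment")  # "f(x, y)": spacing and comments normalized
standardize_command("x = 1")               # "x = 1":  `=` is not coerced to `<-`
standardize_command("x <- 1")              # "x <- 1": and vice versa
```

Because keep.source = FALSE discards comments and formatting, two commands that differ only cosmetically hash the same, while a genuine change between `=` and `<-` still invalidates the target.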

@wlandau (Member, Author) commented Jan 2, 2019

A positive note on speed:

text <- paste(deparse(digest::digest), collapse = "\n")
microbenchmark::microbenchmark(
  deparse(parse(text = text, keep.source = FALSE)[[1]]),
  formatR::tidy_source(
    source = NULL,
    comment = FALSE,
    blank = FALSE,
    arrow = TRUE,
    brace.newline = FALSE,
    indent = 4,
    output = FALSE,
    text = text,
    width.cutoff = 119
  )
)
#> Unit: microseconds
#>                                                                                                                                                                             expr
#>                                                                                                                            deparse(parse(text = text, keep.source = FALSE)[[1]])
#>  formatR::tidy_source(source = NULL, comment = FALSE, blank = FALSE,      arrow = TRUE, brace.newline = FALSE, indent = 4, output = FALSE,      text = text, width.cutoff = 119)
#>       min        lq      mean    median       uq      max neval
#>   551.514  566.5695  589.0592  572.3665  594.850  749.532   100
#>  3728.347 3897.0460 4218.4730 3996.0425 4161.814 7441.496   100

Created on 2019-01-01 by the reprex package (v0.2.1)

@wlandau wlandau merged commit a1cd1b0 into master Jan 2, 2019
@wlandau wlandau deleted the 562 branch January 2, 2019 01:40
@wlandau (Member, Author) commented Jan 14, 2019

Hmm... deparse() is slow. In 8ab9d13, I decided to just remove all the attributes (including srcref, srcfile, and wholeSrcref) and return the language object without deparsing it. The speedup was about 2-3% for this workflow. @lorenzwalthert, is there anything wrong with this approach? Any hidden pitfalls I am missing?
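The attribute-stripping idea can be sketched in base R as follows (the function name is illustrative, not the code in 8ab9d13). When code is parsed with keep.source = TRUE, the source-reference metadata travels along as attributes, and dropping those attributes leaves just the language.

```r
# Illustrative sketch of dropping source references from parsed code.
strip_srcrefs <- function(expr) {
  attr(expr, "srcref") <- NULL
  attr(expr, "srcfile") <- NULL
  attr(expr, "wholeSrcref") <- NULL
  expr
}

code <- parse(text = "x + 1", keep.source = TRUE)
names(attributes(code))          # "srcref" "srcfile" "wholeSrcref"
attributes(strip_srcrefs(code))  # NULL: only the language object remains
```

Skipping deparse() avoids re-rendering the code as text, which is where the 2-3% savings would come from.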

@lorenzwalthert (Contributor) commented Jan 14, 2019

I am not sure I can follow. I am still seeing a deparse() in 8ab9d13, right? I think working with the language object can introduce additional problems downstream, because some functions have no methods for it, whereas methods for character vectors are much more common. On the other hand, why not defer converting to character for as long as we don't need it? However, I think I am not the right person to answer, because I lost track of what's going on here (and can't regain it from going through the code and conversation). :-( FYI, in styler we never work with the language object, so I don't have experience.

@wlandau (Member, Author) commented Jan 14, 2019

Sorry for the confusion. That remaining call to deparse() is for imported functions because hashes of functions are brittle: #345 (comment). My question: are the hashes of expressions, symbols, and language objects in general just as brittle? If x is a language object, is there anything that would make digest(x) unreliable? We can assume attributes(x) is NULL. Also cc @wch.

@wlandau (Member, Author) commented Jan 22, 2019

I actually found the answer to my question while refining #233 today. Strange things happen with language objects in R version 3.4.4. The two language objects below should be the same, but they have different internals because they were generated in different ways. drake now uses an internal safe_deparse() function to standardize commands (demonstrated below). The change invalidates targets again, but I think it is the right call, and I am glad I caught it before the next CRAN release.

x <- rlang::syms(c("a", "b"))
names(x) <- c("a", "b")

lang1 <- substitute(f(x), list(x = x))
lang2 <- quote(f(list(a = a, b = b)))

print(lang1)
#> f(list(a = a, b = b))
print(lang2)
#> f(list(a = a, b = b))

library(digest)
digest(lang1)
#> [1] "869aef282c5349417ab12590f1359de5"
digest(lang2)
#> [1] "b6630dff6ebb37dac41c6164381efcc7"

deparse(lang1)
#> [1] "f(structure(list(a = a, b = b), .Names = c(\"a\", \"b\")))"
deparse(lang2)
#> [1] "f(list(a = a, b = b))"

print(drake:::safe_deparse)
#> function (x) 
#> {
#>     paste(deparse(x, control = c("keepInteger", "keepNA")), collapse = "\n")
#> }
#> <environment: namespace:drake>

drake:::safe_deparse(lang1)
#> [1] "f(list(a = a, b = b))"
drake:::safe_deparse(lang2)
#> [1] "f(list(a = a, b = b))"

digest(drake:::safe_deparse(lang1))
#> [1] "c19a8f963e567ebb171494483dfb1697"
digest(drake:::safe_deparse(lang2))
#> [1] "c19a8f963e567ebb171494483dfb1697"

Created on 2019-01-21 by the reprex package (v0.2.1)

@wlandau (Member, Author) commented Jan 22, 2019

Hmm... the same example on R 3.5.2 gives 2 different answers.

x <- rlang::syms(c("a", "b"))
names(x) <- c("a", "b")

lang1 <- substitute(f(x), list(x = x))
lang2 <- quote(f(list(a = a, b = b)))

deparse(lang1, control = NULL)
#> [1] "f(list(a, b))"
deparse(lang2, control = NULL)
#> [1] "f(list(a = a, b = b))"

Created on 2019-01-21 by the reprex package (v0.2.1)

Hopefully deparsing prevents most other edge cases. Other remarks:

  1. For DSL based on dplyr-like verbs? #233, combine() should not try to produce named lists. What I mean is that in this example, the command for larger should always stay do.call(rbind, list(data_1_3, data_2_4)). Outputting do.call(rbind, list(data_1_3 = data_1_3, data_2_4 = data_2_4)) might seem more useful, but the names might get dropped in old versions of R. (Either that, or you get an unpredictable structure(..., .Names = ...).)
  2. At one point, @krlmlr suggested that drake might be better off using language objects for commands instead of character strings. But until we figure out a better way to handle language objects, I am actually kind of glad my naivete accidentally made drake a little more robust. Maybe we should even prohibit users from supplying a column of language objects as commands in the plan.
