
A base R replacement to formatR::tidy_source() #563

Merged: 26 commits into master, Jan 2, 2019
Conversation

@wlandau (Member) commented Oct 28, 2018

Summary

NOTE: this PR will invalidate all targets in all workflows, so I am waiting until the next major release (version 7.0.0) to merge it!

In the interest of reducing drake's dependencies on other packages, this PR introduces a replacement to formatR::tidy_source() that only depends on packages base and utils. Thanks to @hrbrmstr and @lorenzwalthert for their solutions in #562.

Related GitHub issues and pull requests

Checklist

  • I have read drake's code of conduct, and I agree to follow its rules.
  • I have listed any substantial changes in the development news.
  • I have added testthat unit tests to tests/testthat to confirm that any new features or functionality work correctly.
  • I have tested this pull request locally with devtools::check().
  • This pull request is ready for review.
  • I think this pull request is ready to merge.

@codecov-io commented Oct 28, 2018

Codecov Report

Merging #563 into master will increase coverage by 0.08%.
The diff coverage is 100%.


@@           Coverage Diff            @@
##           master   #563      +/-   ##
========================================
+ Coverage   99.91%   100%   +0.08%     
========================================
  Files          82     82              
  Lines        6977   6961      -16     
========================================
- Hits         6971   6961      -10     
+ Misses          6      0       -6
Impacted Files Coverage Δ
R/test.R 100% <ø> (+6.74%) ⬆️
R/config.R 100% <ø> (ø) ⬆️
R/commands.R 100% <100%> (ø) ⬆️
R/utils.R 100% <100%> (ø) ⬆️

Last update 9ed50b7...5a82537.

@lorenzwalthert (Contributor) commented:

It's a detail, but since the output of the function has changed, maybe a renaming would be appropriate to reflect the fact that it computes a hash?

@wlandau (Member, Author) commented Oct 30, 2018

It's a good point. However, now that you mention it, I am not totally settled on the output format. drake already serializes and hashes (via storr) whatever gets returned from standardize_command() (which calls standardize_code() at the end). I am not sure if it is faster to

  1. Hash once and then hash and serialize the first hash (current behavior) or
  2. Return the table of tokens, which will then get serialized and hashed once.

In (2) we hash only once, but the serialization is applied to a larger object of varying size.

We have a long time to decide. I plan to keep this PR open at least until we are sure the next release will be version 7.0.0.

@lorenzwalthert (Contributor) commented:

Good point. Maybe you are right and we should not hash twice if possible. Can you quantify the speed benefits of (2) vs (1)?

@wlandau (Member, Author) commented Oct 30, 2018

A good thing to check. The times below look about the same between options (maybe because secondary hashes are just hashes of other hashes and those first hashes are small). I think option 1 is a good choice for now because it should take up less memory (smaller return values from the standardize_*() functions).

library(magrittr)
library(microbenchmark)

# Return tokenization info for code string x
standardize <- function(x){
  if (!length(x)){
    return(as.character(NA))
  }
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)[[1]]) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")]
}

# Try option 1: hash, serialize the hash, and then hash again.
option_1 <- function(x, key, cache){
  standardize(x) %>%
    lapply(FUN = paste, collapse = " ") %>%
    paste(collapse = " >> ") %>%
    digest::digest(algo = "sha256", serialize = FALSE) %>% # First hash
    cache$set(key = key) # Serialize and hash via `storr`
}

# Try option 2: let storr do all the serialization and hashing.
option_2 <- function(x, key, cache){
  standardize(x) %>%
    c() %>%
    cache$set(key = key) # serialization and hashing.
}

# Borrow and modify the body of standardize() as example code
# Create text of variable sizes, up to 32 times the original.
random_code <- function(){
  text <- deparse(body(standardize))
  n <- sample.int(n = 5, size = 1)
  for (i in seq_len(n)){
    text <- c(text, text)
  }
  text <- c("{", text, "}")
  paste(text, collapse = "\n") %>%
    gsub(pattern = "ASSIGN", replacement = as.character(runif(1)))
}

# One benchmark
benchmark <- function(fun){
  set.seed(0)
  times <- 100
  keys <- as.character(seq_len(times))
  code <- replicate(times, random_code())
  set.seed(0)
  cache <- storr::storr_rds(tempfile())
  on.exit(cache$destroy())
  loop <- function(){
    for (i in seq_len(times)){
      fun(x = code[i], key = keys[i], cache = cache)
    }
  }
  microbenchmark(loop(), times = 100)
}

# Run the benchmarks
print(benchmark(option_1))
#> Unit: seconds
#>    expr      min      lq     mean   median       uq     max neval
#>  loop() 6.708124 6.89136 7.010094 6.949077 7.093694 7.58899   100
print(benchmark(option_2))
#> Unit: seconds
#>    expr      min       lq     mean   median       uq      max neval
#>  loop() 6.813503 6.916124 7.042026 6.998259 7.084101 7.605348   100

Created on 2018-10-30 by the reprex package (v0.2.1)

@lorenzwalthert (Contributor) commented:

Not sure if this is related, but I remember that drake got considerably faster after I removed a lot of old, large targets from the cache that I no longer used in my workflow plan.

@wlandau (Member, Author) commented Oct 30, 2018

Hmm... I would be surprised if the sizes of targets mattered, but the number sure does. I bet there are still many linear-time lookups in drake's code base.
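The cost of linear-time lookups can be illustrated with base R alone. The sketch below is an aside, not drake's actual code: matching a key in a character vector scans linearly, while a lookup in a hashed environment is roughly constant-time regardless of how many keys are stored.

```r
# Illustration only (not drake code): vector lookups scan linearly,
# hashed environment lookups do not.
n <- 1e5
keys <- as.character(seq_len(n))
env <- new.env(hash = TRUE)
for (k in keys) assign(k, TRUE, envir = env)

# Worst case: the key sits at the end of the vector.
system.time(for (i in seq_len(100)) match("100000", keys))
system.time(for (i in seq_len(100)) exists("100000", envir = env, inherits = FALSE))
```

With many targets, replacing repeated `match()`-style scans with environment (or named-list-free) lookups is the usual fix for this class of slowdown.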

@wlandau (Member, Author) commented Oct 30, 2018

Edit: size could also matter, in the sense that immediate dependencies need to be loaded into memory. If target size causes more slowdown than that can explain, something is probably wrong.

@wlandau (Member, Author) commented Oct 31, 2018

For the record, $set() and $get() appear to be constant-time in storr. Encouraging.

init_cache <- function(n){
  s <- storr::storr_rds(tempfile())
  keys <- as.character(seq_len(n))
  values <- lapply(keys, digest::digest)
  s$mset(key = keys, value = values)
  s
}

storrs <- lapply(10^seq_len(4), init_cache)

microbenchmark::microbenchmark(
  storrs[[4]]$get("1", use_cache = FALSE),
  storrs[[3]]$get("1", use_cache = FALSE),
  storrs[[2]]$get("1", use_cache = FALSE),
  storrs[[1]]$get("1", use_cache = FALSE)
)
#> Unit: microseconds
#>                                     expr    min      lq     mean  median
#>  storrs[[4]]$get("1", use_cache = FALSE) 91.015 92.0180 102.7160 92.9660
#>  storrs[[3]]$get("1", use_cache = FALSE) 90.391 92.0310 100.0994 92.7900
#>  storrs[[2]]$get("1", use_cache = FALSE) 90.807 91.9485 103.4817 92.5740
#>  storrs[[1]]$get("1", use_cache = FALSE) 90.234 91.8970 100.6319 92.8065
#>       uq     max neval cld
#>  94.0305 598.719   100   a
#>  93.6660 643.842   100   a
#>  93.5400 642.429   100   a
#>  93.5760 853.714   100   a

microbenchmark::microbenchmark(
  storrs[[4]]$set("1", runif(1), use_cache = FALSE),
  storrs[[3]]$set("1", runif(1), use_cache = FALSE),
  storrs[[2]]$set("1", runif(1), use_cache = FALSE),
  storrs[[1]]$set("1", runif(1), use_cache = FALSE)
)
#> Unit: microseconds
#>                                               expr     min       lq
#>  storrs[[4]]$set("1", runif(1), use_cache = FALSE) 215.088 220.9315
#>  storrs[[3]]$set("1", runif(1), use_cache = FALSE) 215.603 222.4475
#>  storrs[[2]]$set("1", runif(1), use_cache = FALSE) 211.361 217.7670
#>  storrs[[1]]$set("1", runif(1), use_cache = FALSE) 210.187 214.8310
#>      mean   median       uq       max neval cld
#>  574.8408 225.4380 244.4220 10302.004   100   a
#>  513.0485 228.6505 251.5465  5063.073   100   a
#>  600.5828 224.0225 249.5070 11543.353   100   a
#>  356.2935 221.6010 241.9280  2427.105   100   a

Created on 2018-10-30 by the reprex package (v0.2.1)

@lorenzwalthert (Contributor) commented:

May I ask what the rationale behind replacing %>% in 69ff455 is? Is drake moving away from %>%?

@wlandau (Member, Author) commented Nov 4, 2018

Yes: #570, #571. Profiling the code with profvis is much easier this way. No more tall stacks of fseq() and freduce(). I am already learning a lot now that the clutter is gone.

@wlandau (Member, Author) commented Dec 6, 2018

@lorenzwalthert (Contributor) commented:

On a related note: I don't know how this is handled now, but when I updated drake the last time, I had to rebuild the whole cache. I assume this won't be any different this time? I think a warning should be issued after installation so people could still go back to the old installation for any reason and use the built cache, in case rebuilding the cache is impossible for some reason, or takes so much time that the user wants to defer it. If this is a thing, I think we had best open a new issue.

@wlandau (Member, Author) commented Dec 6, 2018

On a related note: I don't know how this is handled now, but when I updated drake the last time, I had to rebuild the whole cache. I assume this won't be any different this time?

If we go through with #562 and #563, yes. I am still wondering if the change is worth the removal of a single dependency, and I could be convinced either way. If we are going to do it, it is better sooner rather than later, especially since 7.0.0 will have large changes already. But it is also important to fulfill drake's promise of keeping the right targets up to date.

I think a warning should be issued after the installation so people could still go back to the old installation for any reason and use the built cache.

Seems like we would want to control the installation process: look at the user's prior version of drake, install the package, and give a warning if the old and new versions span a change in the cache format. Is that possible to do in a package? It's a good idea, and I have often wanted to add special installation commands for different reasons. Either way, I think it is important to look at individual projects too.
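A package cannot hook into its own installation, but it can run a check at load time. The sketch below is hypothetical: the plain-text version file and the `check_cache_version()` name are invented for illustration and do not reflect drake's real cache layout, which lives in a storr cache.

```r
# Hypothetical sketch: warn at load time if a project cache was written
# by a different version of the package. The "version" file inside the
# cache directory is an invented placeholder, not drake's actual format.
check_cache_version <- function(cache_dir = ".drake", current = "7.0.0") {
  version_file <- file.path(cache_dir, "version")
  if (!file.exists(version_file)) {
    return(invisible(NULL))  # no cache recorded, nothing to compare
  }
  recorded <- readLines(version_file, n = 1)
  if (!identical(recorded, current)) {
    warning(
      "Cache in ", cache_dir, " was written by version ", recorded,
      " but version ", current, " is installed. ",
      "Targets may need to be rebuilt.",
      call. = FALSE
    )
  }
  invisible(recorded)
}
```

A real implementation would presumably live in .onAttach() and compare against the version drake records when it initializes a cache.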

@wlandau (Member, Author) commented Jan 2, 2019

Looking back at this issue, I have a couple of comments:

  1. I think now is the time to move away from formatR. Because of Pre-encode file paths internally #630, Use manual base 64 encoding, file paths only. #633, and Cleaner encoding and case insensitivity #641 (all well worth it for the speed improvements), the next release will invalidate targets in old projects anyway.
  2. I no longer think we should coerce EQ_ASSIGN to LEFT_ASSIGN. The reason is consistency: functions already get fingerprinted by deparsing, which does not change = to <-. The behavior for commands in the plan should agree with the behavior for functions. So I think we should go with a simple deparse(parse(text = as.character(command), keep.source = FALSE)[[1]]) instead of utils::getParseData().
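To make point 2 concrete, here is a minimal sketch of the proposed standardization (the function name is illustrative, not drake's API): deparsing the parsed command normalizes whitespace and drops comments, but leaves `=` and `<-` distinct, matching how functions are already fingerprinted by deparsing.

```r
# Minimal sketch of the proposed command standardization.
# `standardize_command()` is an illustrative name, not drake's real API.
standardize_command <- function(command) {
  deparse(parse(text = as.character(command), keep.source = FALSE)[[1]])
}

standardize_command("f( x,y ) # comment")  # "f(x, y)": spacing and comments normalized
standardize_command("x = 1")               # "x = 1":  `=` is not coerced to `<-`
standardize_command("x <- 1")              # "x <- 1": and vice versa
```

Because keep.source = FALSE discards comments and formatting, two commands that differ only cosmetically hash the same, while a genuine change between `=` and `<-` still invalidates the target.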

@wlandau (Member, Author) commented Jan 2, 2019

A positive note on speed:

text <- paste(deparse(digest::digest), collapse = "\n")
microbenchmark::microbenchmark(
  deparse(parse(text = text, keep.source = FALSE)[[1]]),
  formatR::tidy_source(
    source = NULL,
    comment = FALSE,
    blank = FALSE,
    arrow = TRUE,
    brace.newline = FALSE,
    indent = 4,
    output = FALSE,
    text = text,
    width.cutoff = 119
  )
)
#> Unit: microseconds
#>                                                                                                                                                                             expr
#>                                                                                                                            deparse(parse(text = text, keep.source = FALSE)[[1]])
#>  formatR::tidy_source(source = NULL, comment = FALSE, blank = FALSE,      arrow = TRUE, brace.newline = FALSE, indent = 4, output = FALSE,      text = text, width.cutoff = 119)
#>       min        lq      mean    median       uq      max neval
#>   551.514  566.5695  589.0592  572.3665  594.850  749.532   100
#>  3728.347 3897.0460 4218.4730 3996.0425 4161.814 7441.496   100

Created on 2019-01-01 by the reprex package (v0.2.1)

@wlandau wlandau merged commit a1cd1b0 into master Jan 2, 2019
@wlandau wlandau deleted the 562 branch January 2, 2019 01:40
@wlandau (Member, Author) commented Jan 14, 2019

Hmm... deparse() is slow. In 8ab9d13, I decided to just remove all the attributes (including srcref, srcfile, and wholeSrcref) and return the language object without deparsing it. The speedup was about 2-3% for this workflow. @lorenzwalthert, is there anything wrong with this approach? Any hidden pitfalls I am missing?
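The attribute-stripping idea can be sketched in base R as follows (the function name is illustrative, not the code in 8ab9d13). When code is parsed with keep.source = TRUE, the source-reference metadata travels along as attributes, and dropping those attributes leaves just the language.

```r
# Illustrative sketch of dropping source references from parsed code.
strip_srcrefs <- function(expr) {
  attr(expr, "srcref") <- NULL
  attr(expr, "srcfile") <- NULL
  attr(expr, "wholeSrcref") <- NULL
  expr
}

code <- parse(text = "x + 1", keep.source = TRUE)
names(attributes(code))          # "srcref" "srcfile" "wholeSrcref"
attributes(strip_srcrefs(code))  # NULL: only the language object remains
```

Skipping deparse() avoids re-rendering the code as text, which is where the 2-3% savings would come from.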

@lorenzwalthert (Contributor) commented Jan 14, 2019

I am not sure I can follow. I am still seeing a deparse() in 8ab9d13, right? I think working with the language object can introduce additional problems downstream, because some functions have no methods for it, whereas methods for character vectors are much more common. On the other hand, why not defer converting to character for as long as we don't need it? However, I think I am not the right person to answer, because I lost track of what's going on here (and can't regain it from going through the code and conversation). :-( FYI, in styler we never work with the language object, so I don't have experience.

@wlandau (Member, Author) commented Jan 14, 2019

Sorry for the confusion. That remaining call to deparse() is for imported functions because hashes of functions are brittle: #345 (comment). My question: are the hashes of expressions, symbols, and language objects in general just as brittle? If x is a language object, is there anything that would make digest(x) unreliable? We can assume attributes(x) is NULL. Also cc @wch.

@wlandau (Member, Author) commented Jan 22, 2019

I actually found the answer to my question while refining #233 today. Strange things happen with language objects in R version 3.4.4. The two language objects below should be the same, but they have different internals because they were generated in different ways. drake now uses an internal safe_deparse() function to standardize commands (demonstrated below). The change invalidates targets again, but I think it is the right call, and I am glad I caught it before the next CRAN release.

x <- rlang::syms(c("a", "b"))
names(x) <- c("a", "b")

lang1 <- substitute(f(x), list(x = x))
lang2 <- quote(f(list(a = a, b = b)))

print(lang1)
#> f(list(a = a, b = b))
print(lang2)
#> f(list(a = a, b = b))

library(digest)
digest(lang1)
#> [1] "869aef282c5349417ab12590f1359de5"
digest(lang2)
#> [1] "b6630dff6ebb37dac41c6164381efcc7"

deparse(lang1)
#> [1] "f(structure(list(a = a, b = b), .Names = c(\"a\", \"b\")))"
deparse(lang2)
#> [1] "f(list(a = a, b = b))"

print(drake:::safe_deparse)
#> function (x) 
#> {
#>     paste(deparse(x, control = c("keepInteger", "keepNA")), collapse = "\n")
#> }
#> <environment: namespace:drake>

drake:::safe_deparse(lang1)
#> [1] "f(list(a = a, b = b))"
drake:::safe_deparse(lang2)
#> [1] "f(list(a = a, b = b))"

digest(drake:::safe_deparse(lang1))
#> [1] "c19a8f963e567ebb171494483dfb1697"
digest(drake:::safe_deparse(lang2))
#> [1] "c19a8f963e567ebb171494483dfb1697"

Created on 2019-01-21 by the reprex package (v0.2.1)

@wlandau (Member, Author) commented Jan 22, 2019

Hmm... the same example on R 3.5.2 gives 2 different answers.

x <- rlang::syms(c("a", "b"))
names(x) <- c("a", "b")

lang1 <- substitute(f(x), list(x = x))
lang2 <- quote(f(list(a = a, b = b)))

deparse(lang1, control = NULL)
#> [1] "f(list(a, b))"
deparse(lang2, control = NULL)
#> [1] "f(list(a = a, b = b))"

Created on 2019-01-21 by the reprex package (v0.2.1)

Hopefully deparsing prevents most other edge cases. Other remarks:

  1. For DSL based on dplyr-like verbs? #233, combine() should not try to produce named lists. What I mean is that in this example, the command for larger should always stay do.call(rbind, list(data_1_3, data_2_4)). Outputting do.call(rbind, list(data_1_3 = data_1_3, data_2_4 = data_2_4)) might seem more useful, but the names might get dropped in old versions of R. (Either that, or you get an unpredictable structure(..., .Names = ...).)
  2. At one point, @krlmlr suggested that drake might be better off using language objects for commands instead of character strings. But until we figure out a better way to handle language objects, I am actually kind of glad my naivete accidentally made drake a little more robust. Maybe we should even prohibit users from supplying a column of language objects as commands in the plan.
