Fingerprint indices instead of entire data object #259

juliasilge · 2021-09-23T18:24:27Z

Closes #254

This PR changes what we use as the "fingerprint": instead of hashing the entire data object (which we then serialize many times), we now hash only the indices of which observations are in/out of each resample. This still gives an identical fingerprint for the same resamples of the same data object, which I believe will work for our use cases, for example in stacks and finetune. @simonpcouch would you mind taking a look at this and seeing if you have any concerns about this change?
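As a rough illustration of the idea (not the code in this PR), the sketch below hashes only the in/out row indices of some toy splits. `fingerprint_indices()` and `make_toy_splits()` are hypothetical helpers, and the splits are plain lists standing in for `rsplit` objects; only `rlang::hash()` is a real function here.

```r
library(rlang)

# Hypothetical helper: fingerprint a list of splits from their indices only
fingerprint_indices <- function(splits) {
  rlang::hash(list(
    lapply(splits, function(s) s$in_id),
    lapply(splits, function(s) s$out_id)
  ))
}

# Hypothetical helper: toy v-fold splits as plain lists (not real rsplit objects)
make_toy_splits <- function(n, v, seed) {
  set.seed(seed)
  folds <- sample(rep_len(seq_len(v), n))
  lapply(seq_len(v), function(k) {
    list(in_id = which(folds != k), out_id = which(folds == k))
  })
}

s1 <- make_toy_splits(100, 5, seed = 1)
s2 <- make_toy_splits(100, 5, seed = 1)
s3 <- make_toy_splits(100, 5, seed = 2)

# Same indices give the same fingerprint
identical(fingerprint_indices(s1), fingerprint_indices(s2))

# Different indices are expected to give a different fingerprint
identical(fingerprint_indices(s1), fingerprint_indices(s3))
```

Because the data never enters the hash in this sketch, two different data sets resampled with the same indices would get the same fingerprint, which is the tradeoff this change accepts.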

It does speed things up modestly for most data sets, as shown in the issue, but it especially speeds up the kind of resampling that Daniel reported in the original issue:

library(tidyverse)
library(rsample)
library(slider)

new_new_rset <-  function(splits, ids, attrib = NULL,
                          subclass = character()) {
  stopifnot(is.list(splits))
  if (!is_tibble(ids)) {
    ids <- tibble(id = ids)
  } else {
    if (!all(grepl("^id", names(ids)))) {
      rlang::abort("The `ids` tibble column names should start with 'id'.")
    }
  }
  either_type <- function(x)
    is.character(x) | is.factor(x)
  ch_check <- vapply(ids, either_type, c(logical = TRUE))
  if (!all(ch_check)) {
    rlang::abort("All ID columns should be character or factor vectors.")
  }
  
  if (!is_tibble(splits)) {
    splits <- tibble(splits = splits)
  } else {
    if (ncol(splits) > 1 || names(splits)[1] != "splits") {
      rlang::abort(
        "The `splits` tibble should have a single column named `splits`."
      )
    }
  }
  
  where_rsplits <- vapply(splits[["splits"]], rsample:::is_rsplit, logical(1))
  
  if (!all(where_rsplits)) {
    rlang::abort("Each element of `splits` must be an `rsplit` object.")
  }
  
  if (nrow(ids) != nrow(splits)) {
    rlang::abort("Split and ID vectors have different lengths.")
  }
  
  # Create another element to the splits that is a tibble containing
  # an identifier for each id column so that, in isolation, the resample
  # id can be known just based on the `rsplit` object. This can then be
  # accessed using the `labels` method for `rsplits`
  
  splits$splits <- map2(
    splits$splits,
    rsample:::split_unnamed(ids, rlang::seq2(1L, nrow(ids))),
    rsample:::add_id
  )
  
  res <- bind_cols(splits, ids)
  
  if (!is.null(attrib)) {
    if (any(names(attrib) == "")) {
      rlang::abort("`attrib` should be a fully named list.")
    }
    for (i in names(attrib)) {
      attr(res, i) <- attrib[[i]]
    }
  }
  
  if (length(subclass) > 0) {
    # `add_class()` is internal to rsample, so it needs `:::` outside the package
    res <- rsample:::add_class(res, cls = subclass)
  }
  
  # Fingerprint only the in/out row indices of each split, not the data itself
  fingerprint <- list(map(splits$splits, "in_id"), map(splits$splits, "out_id"))
  fingerprint <- rlang::hash(fingerprint)
  attr(res, "fingerprint") <- fingerprint
  
  res
}

df <- tibble(date = lubridate::make_date(1:500))

for (i in paste0("x",1:100)) {
  df[[i]] <- runif(500)
}

df <- df %>% sample_n(1e5, replace = TRUE) %>% arrange(date)
index <- df[["date"]]
seq <- vctrs::vec_seq_along(df)

id_in <- slider::slide_period(
  .x = seq,
  .i = index,
  .period = "year",
  .f = identity,
  .every = 1L,
  .origin = NULL,
  .before = 0L,
  .after = 0L,
  .complete = TRUE
)

id_out <- slider::slide_period(
  .x = seq,
  .i = index,
  .period = "year",
  .f = identity,
  .every = 1L,
  .origin = NULL,
  .before = -5L,
  .after = 20L,
  .complete = TRUE
)

indices <- rsample:::compute_complete_indices(id_in, id_out)

splits <- purrr::map(
  indices,
  ~ make_splits(.x, data = df, class = "sliding_period_split")
)

ids <- rsample:::names0(length(indices), prefix = "Slice")

bench::mark(iterations = 5, check = FALSE,
            old = new_rset(splits, ids),
            new = new_new_rset(splits, ids)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old           1.64s    1.65s     0.607     278KB    0.404
#> 2 new         11.08ms  11.18ms    88.9       466KB   22.2

Created on 2021-09-23 by the reprex package (v2.0.1)

With bigger data, the speedup is even larger.
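One way to see why the gap grows with data size is to time both kinds of hash on toy objects. This is only a sketch under assumptions: the splits are plain lists that each carry the full data (standing in for `rsplit` objects), the sizes are arbitrary, and the timings will vary by machine.

```r
library(rlang)

set.seed(123)
n_rows <- 2e4
n_splits <- 20

# Toy data: 50 numeric columns
big <- as.data.frame(matrix(runif(n_rows * 50), nrow = n_rows))

# Plain lists standing in for rsplit objects; each one carries the full data
toy_splits <- lapply(seq_len(n_splits), function(k) {
  list(data = big, in_id = seq_len(n_rows)[-k], out_id = k)
})

# Whole-object fingerprint: the data is serialized once per split
t_whole <- system.time(fp_whole <- rlang::hash(toy_splits))

# Index-only fingerprint: the data is dropped before hashing
indices_only <- lapply(toy_splits, function(s) s[c("in_id", "out_id")])
t_index <- system.time(fp_index <- rlang::hash(indices_only))

# Elapsed seconds for each approach
c(whole = t_whole[["elapsed"]], index = t_index[["elapsed"]])
```

In this toy setup the whole-object hash has to process the data once per split, so its cost grows with both the number of rows and columns, while the index-only hash grows only with the number of rows per split.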

simonpcouch · 2021-09-23T19:20:29Z

Thanks for the holler, @juliasilge! Just read through and briefly tested on {stacks}—thumbs up from me. :-)

I'll keep an eye on this repo for the next CRAN release, at which point I'll regenerate the example objects to bring along their new hashes.
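For reference, a minimal sketch of comparing the fingerprints stored on two resampling objects. It assumes an rsample version that stores the `"fingerprint"` attribute, as in the code above; `fp()` is a hypothetical convenience wrapper, and this is not the check that stacks itself runs.

```r
library(rsample)

# Same seed, same data, same resampling scheme
set.seed(2021)
folds_a <- vfold_cv(mtcars, v = 5)
set.seed(2021)
folds_b <- vfold_cv(mtcars, v = 5)

# Different seed, so the row assignments are expected to differ
set.seed(1)
folds_c <- vfold_cv(mtcars, v = 5)

# Hypothetical wrapper around the attribute set in the code above
fp <- function(x) attr(x, "fingerprint")

identical(fp(folds_a), fp(folds_b))
identical(fp(folds_a), fp(folds_c))
```

Objects saved under the old scheme keep their old hash, so they would not match freshly created resamples until they are regenerated.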

github-actions · 2021-10-12T01:14:09Z

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

juliasilge · 2021-11-10T15:26:40Z

@simonpcouch This change is now on CRAN if you want to regenerate the example objects in stacks, like you mentioned.

simonpcouch · 2021-11-10T16:53:05Z

Thanks for the ping! On it. :)

juliasilge added 2 commits September 23, 2021 11:50

Fingerprint indices instead of entire data object

3497c08

Only map through splits once, update NEWS

87ffd6d

juliasilge requested a review from topepo September 23, 2021 19:22

topepo approved these changes Sep 27, 2021

juliasilge merged commit ec6dcc4 into master Sep 27, 2021

juliasilge deleted the fingerprint-hash branch September 27, 2021 21:15

github-actions bot locked and limited conversation to collaborators Oct 12, 2021
