Fingerprint indices instead of entire data object #259

Merged 2 commits on Sep 27, 2021

juliasilge
Member

Closes #254

This PR changes what we use as the "fingerprint": instead of hashing the entire data object (which we then serialize many times), we now hash only the indices that record which observations are in and out of each resample. This still produces an identical fingerprint for the same data object, which I believe covers our use cases, for example in stacks and finetune. @simonpcouch would you mind taking a look at this and letting me know if you have any concerns about this change?
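
To make the idea concrete, here is a minimal sketch of hashing only the indices. It uses plain lists as stand-ins for `rsplit` objects, and `make_toy_split()` and `fingerprint_from_indices()` are hypothetical helper names for illustration, not part of rsample. The only package call is `rlang::hash()`, the same function used in the reprex below.

```r
# Stand-in for an rsplit: only the row indices matter for the fingerprint.
# (Hypothetical helper, not an rsample function.)
make_toy_split <- function(in_id, out_id) {
  list(in_id = in_id, out_id = out_id)
}

# Hypothetical helper: collect the in/out indices of every split and
# hash that list, without touching the underlying data.
fingerprint_from_indices <- function(splits) {
  in_ids <- lapply(splits, function(s) s$in_id)
  out_ids <- lapply(splits, function(s) s$out_id)
  rlang::hash(list(in_ids, out_ids))
}

# Two resamples built with the same indices, and one with different indices.
splits_a <- list(make_toy_split(1:8, 9:10), make_toy_split(3:10, 1:2))
splits_b <- list(make_toy_split(1:8, 9:10), make_toy_split(3:10, 1:2))
splits_c <- list(make_toy_split(1:7, 8:10), make_toy_split(4:10, 1:3))

same_indices <- identical(
  fingerprint_from_indices(splits_a),
  fingerprint_from_indices(splits_b)
)
different_indices <- !identical(
  fingerprint_from_indices(splits_a),
  fingerprint_from_indices(splits_c)
)
```

In this sketch the cost of the hash depends on the number and length of the index vectors rather than on the number of columns in the data.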

It does speed things up modestly for most data sets, as shown in the issue, but it especially speeds up the kind of resampling that Daniel reported in the original issue:

library(tidyverse)
library(rsample)
library(slider)

new_new_rset <-  function(splits, ids, attrib = NULL,
                          subclass = character()) {
  stopifnot(is.list(splits))
  if (!is_tibble(ids)) {
    ids <- tibble(id = ids)
  } else {
    if (!all(grepl("^id", names(ids)))) {
      rlang::abort("The `ids` tibble column names should start with 'id'.")
    }
  }
  either_type <- function(x) {
    is.character(x) || is.factor(x)
  }
  ch_check <- vapply(ids, either_type, c(logical = TRUE))
  if (!all(ch_check)) {
    rlang::abort("All ID columns should be character or factor vectors.")
  }
  
  if (!is_tibble(splits)) {
    splits <- tibble(splits = splits)
  } else {
    if (ncol(splits) > 1 || names(splits)[1] != "splits") {
      rlang::abort(
        "The `splits` tibble should have a single column named `splits`."
      )
    }
  }
  
  where_rsplits <- vapply(splits[["splits"]], rsample:::is_rsplit, logical(1))
  
  if (!all(where_rsplits)) {
    rlang::abort("Each element of `splits` must be an `rsplit` object.")
  }
  
  if (nrow(ids) != nrow(splits)) {
    rlang::abort("Split and ID vectors have different lengths.")
  }
  
  # Create another element to the splits that is a tibble containing
  # an identifier for each id column so that, in isolation, the resample
  # id can be known just based on the `rsplit` object. This can then be
  # accessed using the `labels` method for `rsplits`
  
  splits$splits <- map2(
    splits$splits,
    rsample:::split_unnamed(ids, rlang::seq2(1L, nrow(ids))),
    rsample:::add_id
  )
  
  res <- bind_cols(splits, ids)
  
  if (!is.null(attrib)) {
    if (any(names(attrib) == "")) {
      rlang::abort("`attrib` should be a fully named list.")
    }
    for (i in names(attrib)) {
      attr(res, i) <- attrib[[i]]
    }
  }
  
  if (length(subclass) > 0) {
    res <- rsample:::add_class(res, cls = subclass)
  }
  
  fingerprint <- list(map(splits$splits, "in_id"), map(splits$splits, "out_id"))
  fingerprint <- rlang::hash(fingerprint)
  attr(res, "fingerprint") <- fingerprint
  
  res
}

df <- tibble(date = lubridate::make_date(1:500))

for (i in paste0("x",1:100)) {
  df[[i]] <- runif(500)
}

df <- df %>% sample_n(1e5, replace = TRUE) %>% arrange(date)
index <- df[["date"]]
seq <- vctrs::vec_seq_along(df)

id_in <- slider::slide_period(
  .x = seq,
  .i = index,
  .period = "year",
  .f = identity,
  .every = 1L,
  .origin = NULL,
  .before = 0L,
  .after = 0L,
  .complete = TRUE
)

id_out <- slider::slide_period(
  .x = seq,
  .i = index,
  .period = "year",
  .f = identity,
  .every = 1L,
  .origin = NULL,
  .before = -5L,
  .after = 20L,
  .complete = TRUE
)

indices <- rsample:::compute_complete_indices(id_in, id_out)

splits <- purrr::map(
  indices,
  ~ make_splits(.x, data = df, class = "sliding_period_split")
)

ids <- rsample:::names0(length(indices), prefix = "Slice")

bench::mark(iterations = 5, check = FALSE,
            old = new_rset(splits, ids),
            new = new_new_rset(splits, ids)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old           1.64s    1.65s     0.607     278KB    0.404
#> 2 new         11.08ms  11.18ms    88.9       466KB   22.2

Created on 2021-09-23 by the reprex package (v2.0.1)

The speedup grows with the size of the data, because the old fingerprint hashed the entire data object while the new one hashes only the indices.

@simonpcouch
Contributor

Thanks for the holler, @juliasilge! Just read through and briefly tested on {stacks}—thumbs up from me. :-)

I'll keep an eye on this repo for the next CRAN release, at which point I'll regenerate the example objects to bring along their new hashes.

@juliasilge juliasilge merged commit ec6dcc4 into master Sep 27, 2021
@juliasilge juliasilge deleted the fingerprint-hash branch September 27, 2021 21:15
@github-actions

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 12, 2021
@juliasilge
Member Author

@simonpcouch This change is now on CRAN if you want to regenerate the example objects in stacks, like you mentioned.

@simonpcouch
Contributor

Thanks for the ping! On it. :)

Linked issue: Computing the fingerprint takes 95% of execution time