Add reshuffle_rset #329

Merged
merged 13 commits into from
Jul 7, 2022

mikemahoney218
Member

@mikemahoney218 mikemahoney218 commented Jul 1, 2022

This PR fixes #79 by adding a function to "reshuffle" rset objects, returning an object generated using the same arguments as the original but using the current random seed.

I also added withr as a dependency: while testing this, I hit an error installing the package because an analysis fold in rset_subclasses$group_bootstraps had 0 rows. Setting the seed explicitly should protect against that, I believe.

library(rsample)

set.seed(123)

(x <- group_vfold_cv(mtcars, cyl, 3))
#> # Group 3-fold cross-validation 
#> # A tibble: 3 × 2
#>   splits          id       
#>   <list>          <chr>    
#> 1 <split [21/11]> Resample1
#> 2 <split [18/14]> Resample2
#> 3 <split [25/7]>  Resample3

x |> reshuffle_rset()
#> # Group 3-fold cross-validation 
#> # A tibble: 3 × 2
#>   splits          id       
#>   <list>          <chr>    
#> 1 <split [21/11]> Resample1
#> 2 <split [25/7]>  Resample2
#> 3 <split [18/14]> Resample3

Created on 2022-07-01 by the reprex package (v2.0.1)

(There are only 3 levels of cyl, so the splits are the same but their order changes.)

@mikemahoney218 mikemahoney218 marked this pull request as ready for review July 1, 2022 17:13
@mikemahoney218 mikemahoney218 marked this pull request as draft July 1, 2022 17:27
@mikemahoney218
Member Author

Sorry -- pulled the trigger a little too fast on reviews. I need to go through and make sure this also works with non-default parameters (specifically stratification).

@mattwarkentin
Contributor

mattwarkentin commented Jul 4, 2022

Wondering if there's a reason not to make reshuffle() an S3 generic like reverse_splits(), with methods for both rsplit and rset objects? I haven't looked this over closely so it's possible/likely I'm missing something obvious. It seems like this might fall under the same paradigm of post hoc modifications to existing splits or sets.

@mikemahoney218
Member Author

Wondering if there's a reason not to make reshuffle() an S3 generic like reverse_splits(), with methods for both rsplit and rset objects?

The main reason not to support rsplits in the first draft is that we don't attach most of the relevant attributes at the rsplit level, only to the rset as a whole. It's also easy to reconstruct the rset, because the user has already specified every relevant behavior; because the user doesn't really construct rsplits directly, I think there's a good bit more undefined behavior. For instance, the expected behavior of shuffling a vfold split isn't obvious to me: even if you know what v is (which currently isn't stored anywhere in the split), how do you know which observations are assigned to assessment in the other splits? Shuffling rsets is easier, because we're just repeating the same steps that made the first one.
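
The idea of "repeating the same steps" can be sketched without rsample at all. The following is a toy illustration in base R, not rsample's actual implementation: `toy_vfold()` and `toy_reshuffle()` are hypothetical names, and the sketch assumes the constructor records its arguments as attributes on the returned object.

```r
# Toy constructor (hypothetical): assigns each row to one of v folds and
# records the arguments it was called with as attributes.
toy_vfold <- function(data, v = 3) {
  folds <- sample(rep_len(seq_len(v), nrow(data)))
  structure(
    list(data = data, folds = folds),
    v = v,
    class = "toy_rset"
  )
}

# Toy reshuffle (hypothetical): re-run the constructor with the stored
# arguments, drawing on whatever the current RNG state is.
toy_reshuffle <- function(x) {
  stopifnot(inherits(x, "toy_rset"))
  do.call(toy_vfold, list(data = x$data, v = attr(x, "v")))
}

set.seed(123)
x <- toy_vfold(mtcars, v = 4)
y <- toy_reshuffle(x)

attr(y, "v")     # same argument as the original object
table(y$folds)   # same fold sizes; rows are assigned to folds anew
```

Nothing comparable is possible for a single split in this sketch, because a lone split would carry neither `v` nor the assignments of the other folds.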

I had initially planned to make this an S3 method dispatching on the rset subclass so that, for instance, group_vfold_cv() could have different behavior than vfold_cv(), but then I realized it wasn't necessary for the classes inside rsample itself. If there's a use case elsewhere I think it'd be an easy thing to generic-ize (and please let me know if you have one), and this PR would just export the generic with a default and an rset method. But if there's no reason to make it generic yet, I personally think it's probably easiest not to make a new generic.
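
For readers unfamiliar with the alternative being weighed, a generic with a default and an rset method might look roughly like the sketch below. This is an assumption-laden illustration in base R: `reshuffle_sketch()` is a hypothetical name, the rset method body is a placeholder, and the PR as described exports a plain function instead.

```r
# Hypothetical generic; not part of rsample.
reshuffle_sketch <- function(x, ...) {
  UseMethod("reshuffle_sketch")
}

# Default method: refuse anything that is not a resampling set.
reshuffle_sketch.default <- function(x, ...) {
  stop("`x` must be an rset object, not <", class(x)[1], ">.", call. = FALSE)
}

# rset method: placeholder standing in for "rebuild from stored attributes".
reshuffle_sketch.rset <- function(x, ...) {
  message("rebuilding rset with the stored arguments")
  x
}

# A stand-in object carrying the class vector an rset subclass might have.
fake_rset <- structure(list(), class = c("vfold_cv", "rset"))

res <- reshuffle_sketch(fake_rset)   # dispatches to the rset method
err <- tryCatch(reshuffle_sketch(1:3), error = conditionMessage)
err
```

A subclass-specific method (for example one for a hypothetical `group_vfold_cv` class) could be added later without changing callers, which is the main thing a generic would buy.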

@mattwarkentin
Contributor

Thanks for the detailed response. Sorry, I should have made my question/comment more specific. I totally agree that it makes sense to "reshuffle" at the set level, and not at the split level within a set (what this would entail isn't obvious to me, either). I was just wondering if it makes sense to support reshuffling initial_split()/initial_time_split() objects also. I guess really those are the only two functions (other than make_splits()) whereby the user does construct rsplits directly, rather than rsets. Otherwise I agree with everything else you said.

@mikemahoney218
Member Author

mikemahoney218 commented Jul 5, 2022

Yeah, I think those are the only times users directly create rsplits -- and initial_time_split() is a weird one because reshuffling wouldn't do anything.

Supporting initial_split() has the same challenge that we currently don't store many attributes on individual splits:

> rsample::initial_split(mtcars) |> attributes()
$names
[1] "data"   "in_id"  "out_id" "id"    

$class
[1] "initial_split" "mc_split"      "rsplit"       

Beyond that, I think we might not actually want to directly support "shuffling" the initial split, because theoretically that assessment set should be completely reserved until you're done tuning and editing your model. It's not that hard to get around that and just call initial_split() again, if you're really determined to do a bad thing (that is, introduce data leakage into your pipeline), but I think we'd want to exclude it from "shuffling" as a bit of a guardrail.

So for those two reasons I don't think it makes sense to support rsplits directly with this first draft.

@mikemahoney218
Member Author

@juliasilge @hfrick Question for you:

Right now we usually set strata to !is.null(strata), meaning that we don't capture the actual column being used for stratification. That means we can't reconstruct stratified rsets after the fact (as best as I can tell).

I'm wondering how comfortable we'd be with changing this attribute, so that it contained the column to stratify on. This might break things, if people are checking for strata via identical(). An alternative is to add a new attribute, for instance strata_col, and rename it to "strata" inside reshuffle_rset(). Which feels better?
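
The trade-off can be made concrete with a small base R sketch. Both helpers are hypothetical stand-ins for how a constructor might record the `strata` attribute; using NULL for the unstratified case is an illustrative choice here, not something taken from the package.

```r
# Hypothetical: current behavior, which keeps only a logical flag.
strata_as_flag <- function(strata = NULL) {
  !is.null(strata)   # the column name is lost
}

# Hypothetical: proposed behavior, which keeps the column name itself.
strata_as_name <- function(strata = NULL) {
  if (is.null(strata)) NULL else as.character(strata)
}

strata_as_flag("cyl")   # TRUE
strata_as_name("cyl")   # "cyl"

# Downstream code comparing against the old logical value would stop
# matching once the attribute holds the column name.
identical(strata_as_name("cyl"), TRUE)   # FALSE
```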

@juliasilge
Member

@mikemahoney218 I think it is unlikely that folks are checking that, so I would vote for making the change (be sure to add it to NEWS) and then we can keep an eye out for this specifically when I do revdeps for rsample before the next release.

@mikemahoney218
Member Author

Alright, I'm going to go through and update rsample functions to change the attribute.

One place that I don't think we can update is caret2rsample, where this attribute is set to TRUE/FALSE based on the resampling method used:

rsample/R/caret.R

Lines 152 to 159 in 4b13e36

if (grepl("cv$", object$method)) {
  out <- list(
    v = object$number,
    repeats = ifelse(!is.na(object$repeats),
      object$repeats, 1
    ),
    strata = TRUE
  )

I don't fully know what's going on here, but I don't think we're going to have that many people converting from caret to rsample and then attempting to shuffle their splits, so I think we're probably fine to leave this one as is. I'll throw a friendlier error if strata is TRUE, though (which will also handle older saved objects loaded with a newer version of the package).
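
A guard of the kind described might look like the sketch below. It is a base R illustration only: `check_strata_attr()` and the error wording are hypothetical, and the objects are bare stand-ins rather than real rsets.

```r
# Hypothetical check: fail early when `strata` is a bare TRUE, since the
# stratification column cannot be recovered from a logical flag.
check_strata_attr <- function(x) {
  strata <- attr(x, "strata")
  if (isTRUE(strata)) {
    stop(
      "Cannot reshuffle: `strata` was stored as TRUE, so the stratification ",
      "column is unknown. Recreate the object with the original function.",
      call. = FALSE
    )
  }
  invisible(x)
}

# Stand-ins: an older (or caret-converted) object and a newer one.
old_object <- structure(list(), strata = TRUE, class = "rset")
new_object <- structure(list(), strata = "cyl", class = "rset")

check_strata_attr(new_object)   # passes silently
msg <- tryCatch(check_strata_attr(old_object), error = conditionMessage)
msg
```

Using `isTRUE()` keeps the check from firing on a character column name or on NULL.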

@@ -136,7 +136,7 @@ delayedAssign("rset_subclasses", {
   sliding_window = sliding_window(test_data()),
   sliding_index = sliding_index(test_data(), index),
   sliding_period = sliding_period(test_data(), index, "week"),
-  manual_rset = manual_rset(bootstraps(test_data())$splits[1:2], c("ID1", "ID2")),
+  manual_rset = manual_rset(list(initial_time_split(test_data()), initial_time_split(test_data())), c("ID1", "ID2")),
Member Author

Using bootstraps here means that manual_rset "spent" randomness, which had knock-on effects in testing (because we exclude manual_rset from things like reshuffling, but then don't have the same seed active by the time we're rebuilding permutations()). Changing this to initial_time_split() avoids the issue and didn't need any changes in testing.
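
The "spent randomness" effect is a general property of R's random number generator and can be shown in base R alone. In this sketch the extra `sample()` call merely stands in for a random constructor such as bootstraps(); the variable names are illustrative.

```r
# Two fixtures built in sequence after a single seed, as in a test setup.
set.seed(2022)
a1 <- sample(10)   # first fixture
b1 <- sample(10)   # second fixture

# Same seed, but an extra random draw happens in between.
set.seed(2022)
a2 <- sample(10)
invisible(sample(10))   # "spends" randomness before the next fixture
b2 <- sample(10)

identical(a1, a2)   # TRUE: nothing random ran before either draw
identical(b1, b2)   # compares draws taken from different RNG states
```

Replacing the random step with a deterministic one leaves the generator state untouched, so fixtures built afterwards match what they would be if the step were skipped.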

@mikemahoney218 mikemahoney218 marked this pull request as ready for review July 6, 2022 14:26
@mikemahoney218
Member Author

Ok, I think this is good for a review now.

Member

@juliasilge juliasilge left a comment

This looks good! I don't think that testing approach is too fancy or too confusing. 👍

I do have a question about handling strata in this way. A character vector or FALSE seems somewhat awkward/confusing to me. Can we do a character or NULL instead? It seems like it would work in my perusal here.

@mikemahoney218
Member Author

Yeah, NULL feels better -- my only reason to leave it as FALSE was to change as little behavior as possible. Just changed to NULL.

Member

@juliasilge juliasilge left a comment

This looks really nice! Thank you 🙏

@juliasilge juliasilge merged commit 86f56df into main Jul 7, 2022
@juliasilge juliasilge deleted the mike/reshuffle_rset branch July 7, 2022 17:56
@github-actions

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 22, 2022
Successfully merging this pull request may close these issues.

reshuffle rset objects
3 participants