-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add reshuffle_rset #329
Add reshuffle_rset #329
Conversation
Sorry -- pulled the trigger a little too fast on reviews. I need to go through and make sure this works with non-default parameters, as well (specifically stratification). |
Wondering if there's a reason not to make |
Main reason to not support rsplits in the first draft is because we don't attach most of the relevant attributes at the rsplit level, only to the rset as a whole. It's also easy to reconstruct the rset, because the user has already specified every relevant behavior; because the user doesn't really directly construct rsplits, I think there's a good bit more undefined behavior. For instance, the expected behavior of shuffling a vfold split isn't an obvious thing to me; even if you know what I had initially planned to make this an s3 method dispatching on the rset subclass so that, for instance, |
Thanks for the detailed response. Sorry, I should have made my question/comment more specific. I totally agree that it makes sense to "reshuffle" at the set level, and not at the split level within a set (what this would entail isn't obvious to me, either). I was just wondering if it makes sense to support reshuffling |
Yeah, I think those are the only times users directly create rsplits -- and Supporting > rsample::initial_split(mtcars) |> attributes()
$names
[1] "data" "in_id" "out_id" "id"
$class
[1] "initial_split" "mc_split" "rsplit" Beyond that, I think we might not actually want to directly support "shuffling" the initial split, because theoretically that assessment set should be completely reserved until you're done tuning and editing your model. It's not that hard to get around that and just call So for those two reasons I don't think it makes sense to support rsplits directly with this first draft. |
@juliasilge @hfrick Question for you: Right now we usually set I'm wondering how comfortable we'd be with changing this attribute, so that it contained the column to stratify on. This might break things, if people are checking for |
@mikemahoney218 I think it is unlikely that folks are checking that, so I would vote for making the change (be sure to add to NEWS) and then we can keep an eye for this specifically when I do revdeps for rsample before the next release. |
Alright, I'm going to go through and update rsample functions to change the attribute. One place that I don't think we can update is in Lines 152 to 159 in 4b13e36
I don't fully know what's going on here, but I don't think we're going to have that many people converting from caret to rsample and then attempting to shuffle their splits, so I think we're probably fine to leave this one as is. I'll throw a friendlier error if |
@@ -136,7 +136,7 @@ delayedAssign("rset_subclasses", { | |||
sliding_window = sliding_window(test_data()), | |||
sliding_index = sliding_index(test_data(), index), | |||
sliding_period = sliding_period(test_data(), index, "week"), | |||
manual_rset = manual_rset(bootstraps(test_data())$splits[1:2], c("ID1", "ID2")), | |||
manual_rset = manual_rset(list(initial_time_split(test_data()), initial_time_split(test_data())), c("ID1", "ID2")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using bootstraps
here means that manual_rset
"spent" randomness, which had knock on effects in testing (because we exclude manual_rset
from things like reshuffling, but then don't have the same seed active by the time we're rebuilding permutations()
). Changing this to initial_time_split()
avoids the issue and didn't need any changes in testing.
Ok, I think this is good for a review now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! I don't think that testing approach is too fancy or too confusing. 👍
I do have a question about handling strata in this way. A character vector or FALSE
seems somewhat awkward/confusing to me. Can we do a character or NULL
instead? It seems like it would work in my perusal here.
…nto mike/reshuffle_rset
Yeah, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really nice! Thank you 🙏
This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
This PR fixes #79 by adding a function to "reshuffle" rset objects, returning an object generated using the same arguments as the original but using the current random seed.
I also added
withr
as a dependency, as in the process of testing this I ran into an error installing the package because an analysis fold inrset_subclasses$group_bootstraps
had 0 rows. Setting the seed explicitly should protect against that, I believe.Created on 2022-07-01 by the reprex package (v2.0.1)
(There's only 3 levels of
cyl
, so the splits are the same but the ordering changes)