Stratification in grouped resampling #317
Posted on RStudio Community and Twitter.
Thank you for providing these grouping functions! The first use case is for survey statisticians working with complex sampling designs. For starters, {rsample} could allow stratification when strata are constant within each group, and throw an error otherwise. That alone would already be useful to survey statisticians, even if it wouldn't cover other use cases. Also in this situation, if some strata have too few groups (for ex. the user requests 5-fold CV but some strata only have 4 groups), I'd want the user to get an informative error insisting that they themselves should pool strata. The strata for survey designs are usually chosen carefully, with subject matter expertise, and it would not be meaningful for my software to pool them for me.
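To make those two checks concrete, here is a minimal sketch in base R (a hypothetical helper, not rsample code; `data` is assumed to have one row per unit, with columns named by `group` and `strata`):

```r
# Minimal sketch of the two suggested checks; hypothetical helper,
# not rsample code. `v` is the requested number of folds.
check_grouped_strata <- function(data, group, strata, v) {
  # (1) Strata must be constant within each group
  n_strata_per_group <- tapply(
    data[[strata]], data[[group]],
    function(x) length(unique(x))
  )
  if (any(n_strata_per_group > 1)) {
    bad <- names(n_strata_per_group)[n_strata_per_group > 1]
    stop("Strata are not constant within group(s): ",
         paste(bad, collapse = ", "))
  }
  # (2) Each stratum must contain at least `v` groups, so every
  # fold can receive at least one group from every stratum
  group_level <- unique(data[c(group, strata)])
  groups_per_stratum <- table(group_level[[strata]])
  if (any(groups_per_stratum < v)) {
    small <- names(groups_per_stratum)[groups_per_stratum < v]
    stop("Fewer than ", v, " groups in stratum/strata: ",
         paste(small, collapse = ", "),
         ". Please pool strata yourself, using subject-matter expertise.")
  }
  invisible(TRUE)
}
```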
A bit more detail: what {rsample} calls "groups" is analogous to what survey statisticians would call "clusters" or "PSUs" (primary sampling units). For ex., maybe we can't collect data by surveying a random sample of individual schoolchildren, because we don't have a complete list of children. Instead we might have a list of schools. So it may be more practical to take a random sample of schools, then survey all children within the sampled schools. Schools would be the PSUs, and children would be the ultimate sampling units.

Also, in typical survey designs, the PSUs are nested within strata. For ex., maybe the strata are rural vs urban schools; or maybe they are elementary, vs middle, vs high schools. We'd divide our list of schools into strata first, then sample PSUs separately within each.

In Wieczorek, Guerin, & McMahon (2022) https://doi.org/10.1002/sta4.454 we argue that sample splitting or CV with survey data should mimic the survey design. So CV folds should be stratified by the design strata, and grouped by the design PSUs. Within each stratum separately, we partition the PSUs into folds; and then we combine across strata for each fold. Here is how we approach it in our {surveyCV} R package: https://github.com/ColbyStatSvyRsch/surveyCV/blob/master/R/folds.svy.R

Finally, in terms of pooling (group x stratum) sets, it would usually be most meaningful to pool together a few strata based on their meaning, using subject matter expertise -- not based on metrics in the data. The user needs to think about: "If I hadn't been able to design a survey using these fine-grained strata, what coarser strata would I have used instead?"

Also paging @bschneidr who might have further suggestions.
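In outline, that stratify-then-group assignment could look like the following condensed illustration (not the actual `folds.svy()` code; it assumes PSU IDs are unique across strata):

```r
# Condensed illustration of design-mimicking fold assignment;
# not the actual surveyCV::folds.svy() implementation.
# `design` has one row per unit, with columns named by `psu`
# and `stratum`; PSU IDs are assumed unique across strata.
assign_design_folds <- function(design, psu, stratum, v) {
  # Collapse to one row per PSU (PSUs are nested within strata)
  psus <- unique(design[c(psu, stratum)])
  psus$fold <- NA_integer_
  # Within each stratum separately, partition its PSUs into v folds
  for (s in unique(psus[[stratum]])) {
    idx <- which(psus[[stratum]] == s)
    psus$fold[idx] <- sample(rep_len(seq_len(v), length(idx)))
  }
  # Combining across strata is implicit: fold labels are shared, so
  # fold k pools the k-th piece of every stratum. Map back to units:
  design$fold <- psus$fold[match(design[[psu]], psus[[psu]])]
  design
}
```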
Second use case: when your data are iid (so you don't need to mimic a survey design), and you are stratifying only because you have a categorical variable with rare categories. You want to ensure that each fold has enough data from even the rarest category. Otherwise you simply can't fit or evaluate a model that handles this category vs the others. In that case, I would want to find a way to stratify at the group level. Maybe I'd check: Which groups have at least one unit in the rare category? Then stratify on this. Randomly assign the "at least one rare unit" groups to folds first, then assign the remaining "no rare units" groups to folds next. Regardless of the algorithm, I'd want it to respect these two constraints: Personally, with iid data I would not want the software to try too hard to get equal rates in each category across folds. Since that kind of stratification wasn't in the original iid sampling design, you are artificially making your folds "too similar." If the goal of CV is to judge "how well does this algorithm tend to work on training sets like this one?"... But again, that's my view based on thinking about it from the survey sampling perspective. I would be curious to hear more about other reasons why people do stratified CV, and what benefits they believe it's providing them. |
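To make that group-level assignment concrete, here's a rough sketch (hypothetical helper in base R; a fuller version would also balance overall fold sizes across the two passes):

```r
# Rough sketch of the two-stage, group-level assignment described
# above; hypothetical helper, not rsample code. `data` has one row
# per unit; the column named by `rare` is TRUE for rare-category units.
assign_rare_aware_folds <- function(data, group, rare, v) {
  has_rare <- tapply(data[[rare]], data[[group]], any)
  rare_groups  <- names(has_rare)[has_rare]
  other_groups <- names(has_rare)[!has_rare]
  fold <- integer(0)
  # First spread the "at least one rare unit" groups across folds...
  fold[rare_groups]  <- sample(rep_len(seq_len(v), length(rare_groups)))
  # ...then assign the remaining "no rare units" groups
  fold[other_groups] <- sample(rep_len(seq_len(v), length(other_groups)))
  # Map the group-level fold labels back to the unit-level rows
  data$fold <- fold[as.character(data[[group]])]
  data
}
```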
Thank you @civilstat ! I appreciate your comments a lot. I'm personally still thinking through them (and the package is currently changing hands, so it might be a second before anyone else gets a chance to look), but I wanted to make sure you knew you weren't just shouting into the void 😄
This makes sense to me for a survey application, and might be a good place to start from. We often recommend stratifying based on the outcome variable, which is where you'd start running into non-constant strata within a group (for instance, you've got measurements for a number of lakes across the state, and want to group them by county and stratify by nitrogen content). But it makes sense that the most straightforward approach would still be useful.
Agreed 😄 Still thinking over the rest 😄
Thanks @mikemahoney218 and I apologize for leaving too-long comments---I didn't have time to make them shorter :)

I admit I had been thinking mostly about cross-validation within the training set. But I read the link you posted, and I agree that stratifying on the outcome makes sense for the initial split. If we are specifically setting aside a holdout set which we'll only touch once, we want that holdout set to cover a wide range of possible outcomes. Then stratifying makes sense: otherwise we might evaluate our final models only on moderate response values and never know how they perform on cases with high or low response values.

For cross-validation or bootstrapping within the training set, I still think we shouldn't stratify unless the full dataset was originally collected by stratifying. If our actual dataset was a simple random sample from the population, yet we choose the model that looks best across stratified splits (but not across simple random splits), then it may not work so well on future data.

In any case, thanks for thinking about this!
So here's a question: say you've got an extremely rare outcome (e.g., a binary classification problem where only 10% of observations are a "1"). If you're using a model where you can set class weights, then it seems to me like it'd still make sense to stratify while doing within-training-set CV; otherwise, you might not be able to estimate sensitivity/precision based on random splitting. In that situation it makes sense to me that you'd stratify your CV, even though you didn't collect the data following the same stratification. Is there a way that this would give you bad estimates (or at any rate, estimates that won't hold when using new data) of model performance?

I think it helps that, for non-grouped resampling, rsample does check (in several places) to make sure that stratification isn't trying to stratify using too-small bins; see lines 79 to 129 at commit 585b8fc.
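In that spirit, here is a simplified sketch of what such a too-small-bins check does (not the actual rsample code at that permalink):

```r
# Simplified sketch of a too-small-strata check; not rsample's
# actual code. Flags strata whose share of the data falls below
# `pool` (a cutoff proportion), which would need pooling.
check_strata_sizes <- function(strata, pool = 0.1) {
  props <- prop.table(table(strata))
  too_small <- names(props)[props < pool]
  if (length(too_small) > 0) {
    warning("Stratum/strata below ", pool * 100, "% of the data: ",
            paste(too_small, collapse = ", "),
            ". These would be pooled before splitting.")
  }
  invisible(props)
}
```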
That sounds like an example of my "Second use case" above. But now that you mention it, I think you could also justify it under the "mimic the sampling design" rule-of-thumb. If we have a rare response category, we wouldn't have tried to build & evaluate a model at all unless we had enough cases in the rare category. So we could justifiably say that our sampling design was: "Keep collecting data until there are enough of the rare cases for us to estimate sensitivity & precision." In that case, stratifying on the outcome isn't exactly what we did to get the original dataset, but it's a reasonable approximation. So it would be justifiable to stratify within-training-set CV.
It might honestly make sense in these situations to force the users to define bins for each group, and for the software to require that the bins are constant across groups. So the users themselves can assign each group of lakes, in this example, to a "high", "medium", or "low" stratum, and then pass that factor to the `strata` argument.
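Assuming the `strata` argument that the grouped functions later gained (see the merged PR below), and a hypothetical `lakes` data frame, that user-driven binning might look like:

```r
library(rsample)
library(dplyr)

# Hypothetical `lakes` data: one row per measurement, `county` is the
# group, `nitrogen` the outcome. The user bins each county into a
# group-constant stratum (cutpoints here are arbitrary placeholders):
lakes_binned <- lakes |>
  group_by(county) |>
  mutate(
    n_bin = cut(mean(nitrogen),
                breaks = c(-Inf, 1, 3, Inf),
                labels = c("low", "medium", "high"))
  ) |>
  ungroup()

# ...and then passes that group-constant factor as the strata:
folds <- group_vfold_cv(lakes_binned, group = county, v = 5, strata = n_bin)
```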
* Add a first draft of stratification with groups. Addresses #317
* Return strata
* Assign groups better
* Test with rate > pool
* removes the pillar hints again
* Reformat

Co-authored-by: Hannah Frick <hannah@rstudio.com>
Feature
As part of closing #207, we've recently implemented a number of grouping functions, with `group_mc_cv()` (#313), `group_initial_split()` and `group_validation_split()` (#315), and `group_bootstraps()` (#316). Right now, none of these functions support stratification -- which would be useful if, for instance, you had repeated measurements of a number of patients and needed to stratify by outcome. We haven't included this partly so that we could implement grouped resampling quickly, but also because we aren't exactly sure what people would expect stratification to do when resampling by groups. Specific questions include what stratification should even mean when strata are not constant within a group.
If anyone has any thoughts on what they'd expect stratification to do in grouping functions, let us know here!