
Stratification in grouped resampling #317

Closed · mikemahoney218 opened this issue Jun 28, 2022 · 9 comments · Fixed by #365
Labels: discussion, feature (a feature request or enhancement)

@mikemahoney218 (Member)
Feature

As part of closing #207, we've recently implemented a number of grouping functions: group_mc_cv() (#313), group_initial_split() and group_validation_split() (#315), and group_bootstraps() (#316).

Right now, none of these functions supports stratification -- which would be useful if, for instance, you had repeated measurements for a number of patients and needed to stratify by outcome. We held off partly so that we could implement grouped resampling quickly, but also because we aren't exactly sure what people would expect stratification to do when resampling by groups. Specific questions include:

  • How should strata be determined when the stratification variable isn't constant within a group? Median, mode, user-provided functions? What's a good default option? (One possible default is sketched after this list.)
  • What rules can we use to determine when a (group × stratum) combination needs to be pooled with others?
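
For the mode option, here is a minimal sketch of what a default might look like (the helper name group_strata() is hypothetical, not an rsample API): collapse the stratification variable to its most common value within each group, then stratify on the collapsed value.

  library(dplyr)

  # Hypothetical helper (not an rsample API): collapse a stratification
  # variable to one value per group by taking its within-group mode.
  group_strata <- function(data, group, strata) {
    data |>
      group_by({{ group }}) |>
      summarise(
        .strata = names(which.max(table({{ strata }}))),
        .groups = "drop"
      )
  }

  # Example: repeated measurements per patient, stratified by outcome
  set.seed(1)
  patients <- tibble(
    patient = rep(1:6, each = 4),
    outcome = sample(c("improved", "worsened"), 24, replace = TRUE)
  )
  group_strata(patients, patient, outcome)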

If anyone has any thoughts on what they'd expect stratification to do in grouping functions, let us know here!

@mikemahoney218 added the feature and discussion labels on Jun 28, 2022
@juliasilge (Member) commented Jun 29, 2022

Posted on RStudio Community and Twitter

@civilstat commented Aug 14, 2022

Thank you for providing these grouping functions!
As for stratification, I'd be interested in two use cases.

The first use case is for survey statisticians working with complex sampling designs.
(I'll put the second in a separate comment.)

For starters, {rsample} could allow stratification when strata are constant within each group, and throw an error otherwise. That alone would already be useful to survey statisticians, even if it wouldn't cover other use cases.
That's because for typical complex sampling survey designs, "groups" are usually completely nested within strata. So the case you're worried about (what if the stratification variable isn't constant within a group?) shouldn't usually happen.

Also in this situation, if some strata have too few groups (for ex. the user requests 5-fold CV but some strata only have 4 groups), I'd want the user to get an informative error insisting that they themselves should pool strata. The strata for survey designs are usually chosen carefully, with subject matter expertise, and it would not be meaningful for my software to pool them for me.
(For ex., tell users they could make a pooled-stratum variable using forcats::fct_other(). But do not silently do it for them using one of the forcats::fct_lump() approaches.)
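
A minimal sketch of that strict behavior (all names hypothetical, not rsample's API): error unless strata are constant within groups, and error unless every stratum has enough groups for the requested number of folds.

  # Hypothetical check, not rsample's API.
  check_group_strata <- function(group, strata, v) {
    # Each group must map to exactly one stratum
    strata_per_group <- tapply(strata, group, function(s) length(unique(s)))
    if (any(strata_per_group > 1)) {
      stop("`strata` must be constant within each value of `group`.")
    }
    # Each stratum must contain at least `v` groups
    groups_per_stratum <- table(unique(data.frame(group, strata))$strata)
    if (any(groups_per_stratum < v)) {
      stop(
        "Some strata have fewer than `v` groups. Pool strata yourself, ",
        "e.g. with forcats::fct_other(), rather than relying on the software."
      )
    }
    invisible(TRUE)
  }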

Finally, in this situation I'd want the software to give me a choice about whether the number of groups or the number of ultimate observations should be similar across folds. If the dataset came from a survey design in which we had sampled a certain number of groups (no matter how many observations were in each group), then CV should try to get equal numbers of groups per fold. But if instead the survey design was to keep sampling groups until a target number of final units was reached, then CV should try to get equal numbers of observations per fold.
Ah, I see that group_vfold_cv() already has a balance argument to do this. Great!
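
For instance (a sketch, assuming the balance argument accepts "groups" and "observations" as documented at the time of writing, with made-up data):

  library(rsample)

  set.seed(1)
  n_per_school <- sample(5:50, 10, replace = TRUE)
  schools <- data.frame(
    school = rep(letters[1:10], times = n_per_school),
    score  = rnorm(sum(n_per_school))
  )

  # Roughly equal numbers of groups (schools) per fold:
  group_vfold_cv(schools, group = school, v = 5, balance = "groups")

  # Roughly equal numbers of rows (children) per fold:
  group_vfold_cv(schools, group = school, v = 5, balance = "observations")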


A bit more detail: what {rsample} calls "groups" is analogous to what survey statisticians would call "clusters" or "PSUs" (primary sampling units). For ex., maybe we can't collect data by surveying a random sample of individual schoolchildren, because we don't have a complete list of children. Instead we might have a list of schools. So it may be more practical to take a random sample of schools, then survey all children within the sampled schools. Schools would be the PSUs, and children would be the ultimate sampling units.

Also, in typical survey designs, the PSUs are nested within strata. For ex., maybe the strata are rural vs urban schools; or maybe they are elementary, vs middle, vs high schools. We'd divide our list of schools into strata first, then sample PSUs separately within each.

In Wieczorek, Guerin, & McMahon (2022) https://doi.org/10.1002/sta4.454 we argue that sample splitting or CV with survey data should mimic the survey design. So CV folds should be stratified by the design strata, and grouped by the design PSUs. Within each stratum separately, we partition the PSUs into folds; and then we combine across strata for each fold.

Here is how we approach it in our {surveyCV} R package: https://github.com/ColbyStatSvyRsch/surveyCV/blob/master/R/folds.svy.R
It would be great to have similar features built into {rsample}.
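
In base R terms, the core of that procedure might look something like this (a rough sketch, not the actual surveyCV implementation):

  # Sketch: assign PSUs to v folds separately within each stratum; the
  # per-row fold IDs then automatically combine across strata.
  stratified_group_folds <- function(psu, stratum, v) {
    fold <- integer(length(psu))
    for (s in unique(stratum)) {
      in_s   <- stratum == s
      psus_s <- unique(psu[in_s])
      # Shuffle this stratum's PSUs and deal them out round-robin
      assignment <- sample(rep_len(seq_len(v), length(psus_s)))
      fold[in_s] <- assignment[match(psu[in_s], psus_s)]
    }
    fold
  }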

Finally, in terms of pooling (group x stratum) sets, it would usually be most meaningful to pool together a few strata based on their meaning, using subject matter expertise -- not based on metrics in the data. The user needs to think about: "If I hadn't been able to design a survey using these fine-grained strata, what coarser strata would I have used instead?"
For example if our strata were elementary, middle, and high schools, we could probably justify pooling together E+M or M+H, but E+H probably wouldn't make sense.
Or if our strata were US states, we could justify pooling together geographically-neighboring states, or perhaps states with similar rural/urban status. But I wouldn't want the software to pool together, say, New York + Kansas, even if their response values happened to be similar in my dataset.
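
For example, the analyst-driven pooling could be as simple as (hypothetical data):

  library(forcats)

  school_type <- factor(c("elementary", "middle", "high"))
  # Pool adjacent levels by meaning, chosen by the analyst:
  fct_collapse(school_type, elem_mid = c("elementary", "middle"))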

Also paging @bschneidr who might have further suggestions.

@civilstat

Second use case: when your data are iid (so you don't need to mimic a survey design), and you are stratifying only because you have a categorical variable with rare categories. You want to ensure that each fold has enough data from even the rarest category. Otherwise you simply can't fit or evaluate a model that handles this category vs the others.

In that case, I would want to find a way to stratify at the group level. Maybe I'd check: Which groups have at least one unit in the rare category? Then stratify on this. Randomly assign the "at least one rare unit" groups to folds first, then assign the remaining "no rare units" groups to folds next.
I guess the user will just tell you "stratify on this variable" but not which category is the rare one. If there are only 2 categories, just check empirically which one is rarer in the full dataset, and then assign to folds as above.
But I'm not sure how to generalize this to multi-category stratification variables.

Regardless of the algorithm, I'd want it to respect these two constraints:
(1) Ensure each fold has data from each category (not necessarily the same distribution of categories in each fold, just some data from each category), and
(2) Don't split up groups -- all data from a group should be assigned to the same fold.
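
Here is a rough sketch of that two-stage idea for the binary case, respecting both constraints (all names hypothetical):

  # Flag groups containing at least one unit of the rarer category,
  # then assign flagged and unflagged groups to folds separately.
  rare_aware_group_folds <- function(group, y, v) {
    rare_level <- names(which.min(table(y)))  # empirically rarer category
    has_rare   <- tapply(y, group, function(z) any(z == rare_level))
    deal_out   <- function(g) setNames(sample(rep_len(seq_len(v), length(g))), g)
    fold_of_group <- c(
      deal_out(names(has_rare)[has_rare]),   # rare-containing groups first
      deal_out(names(has_rare)[!has_rare])   # remaining groups next
    )
    # Constraint (2): every row of a group lands in that group's fold.
    # Constraint (1) holds whenever there are at least v rare-containing groups.
    unname(fold_of_group[as.character(group)])
  }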


Personally, with iid data I would not want the software to try too hard to get equal rates in each category across folds. Since that kind of stratification wasn't in the original iid sampling design, you are artificially making your folds "too similar."

If the goal of CV is to judge "how well does this algorithm tend to work on training sets like this one?"...
(meaning: sampled in this way, with roughly this sample size, from this population)...
then imposing too much regularity across folds will be counterproductive! It will tell you "how well does this algorithm tend to work on training sets sampled in a very regular way?" and will be too optimistic for your full dataset which was actually iid.

But again, that's my view based on thinking about it from the survey sampling perspective. I would be curious to hear more about other reasons why people do stratified CV, and what benefits they believe it's providing them.

@mikemahoney218 (Member Author)

Thank you @civilstat ! I appreciate your comments a lot. I'm personally still thinking through them (and the package is currently changing hands, so it might be a second before anyone else gets a chance to look), but I wanted to make sure you knew you weren't just shouting into the void 😄

> That's because for typical complex sampling survey designs, "groups" are usually completely nested within strata. So the case you're worried about (what if the stratification variable isn't constant within a group?) shouldn't usually happen.

This makes sense to me for a survey application, and might be a good place to start from. We often recommend stratifying based on the outcome variable, which is where you'd start running into the non-constant strata within a group (for instance, you've got measurements for a number of lakes across the state, and want to group them based on counties and stratify by nitrogen content). But it makes sense that the most straightforward approach would still be useful.

> Don't split up groups

Agreed 😄

Still thinking over the rest 😄

@civilstat

Thanks @mikemahoney218 and I apologize for leaving too-long comments---I didn't have time to make them shorter :)

I admit I had been thinking mostly about cross-validation within the training set. But I read the link you posted, and I agree that stratifying on the outcome makes sense for the initial split. If we are specifically setting aside a holdout set which we'll only touch once, we want that holdout set to cover a wide range of possible outcomes. Then stratifying makes sense: otherwise we might evaluate our final models only on moderate response values and never know how they perform on cases with high or low response values.

For cross-validation or bootstrapping within the training set, I still think we shouldn't stratify unless the full dataset was originally collected by stratifying. If our actual dataset was a simple random sample from the population, yet we choose the model that looks best across stratified splits (but not across simple random splits), then it may not work so well on future data.

In any case, thanks for thinking about this!

@mikemahoney218 (Member Author)

So here's a question: say you've got an extremely rare outcome (e.g., a binary classification problem where only 10% of observations are a "1"). If you're using a model where you can set class weights, then it seems to me like it'd still make sense to stratify while doing within-training-set CV; otherwise, you might not be able to estimate sensitivity/precision based on random splitting.

In that situation it makes sense to me that you'd stratify your CV, even though you didn't collect the data following the same stratification. Is there a way that this would give you bad estimates (or at any rate, estimates that won't hold when using new data) of model performance?
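
(As a quick illustration of the failure mode with unstratified random splits:)

  set.seed(123)
  y    <- rbinom(100, 1, 0.1)         # ~10% "1"s
  fold <- sample(rep_len(1:10, 100))  # unstratified 10-fold assignment
  table(fold, y)  # some folds can easily contain zero 1s, leaving
                  # sensitivity/precision undefined for those folds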

I think it helps that, for non-grouped resampling, rsample already checks (in several places) that stratification isn't trying to use too-small bins:

rsample/R/make_strata.R, lines 79 to 129 at 585b8fc:

  ## This should really be based on some combo of rate and number.
  if (all(pcts < pool)) {
    rlang::warn(c(
      "Too little data to stratify.",
      "Resampling will be unstratified."
    ))
    return(factor(rep("strata1", n)))
  }
  if (pool < default_pool & any(pcts < default_pool)) {
    rlang::warn(c(
      paste0(
        "Stratifying groups that make up ",
        round(100 * pool), "% of the data may be ",
        "statistically risky."
      ),
      "Consider increasing `pool` to at least 0.1"
    ))
  }
  ## Small groups will be randomly allocated to stratas at end
  ## These should probably go into adjacent groups but this works for now
  if (any(pcts < pool)) {
    x[x %in% names(pcts)[pcts < pool]] <- NA
  }
  ## The next line will also relevel the data if `x` was a factor
  out <- factor(as.character(x))
} else {
  if (breaks < 2) {
    rlang::warn(c(
      "The bins specified by `breaks` must be >=2.",
      "Resampling will be unstratified."
    ))
    return(factor(rep("strata1", n)))
  } else if (floor(n / breaks) < depth) {
    rlang::warn(c(
      paste0(
        "The number of observations in each quantile is ",
        "below the recommended threshold of ", depth, "."
      ),
      paste0("Stratification will use ", floor(n / depth), " breaks instead.")
    ))
  }
  breaks <- min(breaks, floor(n / depth))
  if (breaks < 2) {
    rlang::warn(c(
      "Too little data to stratify.",
      "Resampling will be unstratified."
    ))
    return(factor(rep("strata1", n)))
  }

@civilstat

That sounds like an example of my "Second use case" above. But now that you mention it, I think you could also justify it under the "mimic the sampling design" rule-of-thumb.

If we have a rare response category, we wouldn't have tried to build & evaluate a model at all unless we had enough cases in the rare category. So we could justifiably say that our sampling design was: "Keep collecting data until there are enough of the rare cases for us to estimate sensitivity & precision." In that case, stratifying on the outcome isn't exactly what we did to get the original dataset, but it's a reasonable approximation. So it would be justifiable if we stratify within-training-set CV.

@mikemahoney218 (Member Author)

> for instance, you've got measurements for a number of lakes across the state, and want to group them based on counties and stratify by nitrogen content

It might honestly make sense in these situations to require users to define bins for each group, and for the software to check that the bins are constant across groups. So the user themselves can assign each group of lakes, in this example, to a "high", "medium", or "low" stratum, and then pass that factor to strata instead.
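
A sketch of what that user-side workflow might look like (hypothetical data; the county-level bins are constant within each group by construction):

  library(dplyr)

  set.seed(42)
  lakes <- tibble(
    county   = rep(paste0("county_", 1:12), each = 5),
    nitrogen = rexp(60)
  )

  # Bin counties by median nitrogen into terciles
  county_bins <- lakes |>
    group_by(county) |>
    summarise(median_n = median(nitrogen), .groups = "drop") |>
    mutate(
      n_bin = cut(
        median_n,
        breaks = quantile(median_n, c(0, 1/3, 2/3, 1)),
        labels = c("low", "medium", "high"),
        include.lowest = TRUE
      )
    )

  lakes <- left_join(lakes, county_bins, by = "county")
  # ...then stratify on n_bin, which is constant within each county.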

mikemahoney218 added a commit that referenced this issue Aug 19, 2022
hfrick added a commit that referenced this issue Aug 26, 2022
* Add a first draft of stratification with groups

Addresses #317

* Return strata

* Assign groups better

* Test with rate > pool

* removes the pillar hints again

* Reformat

Co-authored-by: Hannah Frick <hannah@rstudio.com>
hfrick pushed a commit that referenced this issue Sep 26, 2022
* Implement balance_prop_strata()

* Add code comments
@github-actions bot locked and limited conversation to collaborators Oct 11, 2022