survey_mean produces incorrect standard errors when an expression is used in summarize #126

Closed
skolenik opened this issue Aug 27, 2021 · 3 comments

Comments

@skolenik

I had to work with Yes/No variables coded as integers 1/2, so I thought it would be a good idea to write summarize(prop_yes=2-survey_mean(badly_coded_variable)). The syntax worked, but as a side effect the expression inside summarize() was also applied to the standard errors.

library(survey)  # provides the api example data
library(srvyr)   # as_survey(), survey_mean()
library(dplyr)   # case_when()
data(api)
as_survey(apiclus2) %>% group_by(awards) %>% summarize(aw=survey_mean())
as_survey(apiclus2) %>% mutate(awards12=case_when(awards=="Yes"~1,TRUE~2)) %>% summarize(aw12=survey_mean(awards12), aw12inv=2-survey_mean(awards12))

Output:

> as_survey(apiclus2) %>% group_by(awards) %>% summarize(aw=survey_mean())
# A tibble: 2 x 3
  awards    aw  aw_se
  <fct>  <dbl>  <dbl>
1 No     0.341 0.0424
2 Yes    0.659 0.0424
> as_survey(apiclus2) %>% mutate(awards12=case_when(awards=="Yes"~1,TRUE~2)) %>% summarize(aw12=survey_mean(awards12), aw12inv=2-survey_mean(awards12))
     aw12    aw12_se   aw12inv aw12inv_se
1 1.34127 0.04240799 0.6587302   1.957592

I don't know what the right fix for this is. You cannot possibly parse the expressions inside summarize() to make sense like "Oh, this is a linear combination so the resulting standard error is a quadratic form" -- that would be annoying. I think a safe conservative fix is to forbid expressions under summarize(), and only allow the RHS to be the survey_whatever() functions, so that no smart asses would try anything weird, but I don't know the implementation details to gauge if that's technically possible.
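To make the linear-combination point concrete, here is a short derivation. The symbol \bar{y} is notation assumed here for the estimated mean of awards12; it does not appear in the report itself. For constants a and b, a shift leaves the variance alone and a scale factor enters squared, so the standard error of 2 - \bar{y} should equal that of \bar{y}. The reported aw12inv_se instead matches the expression applied directly to the standard error column:

```latex
\begin{align*}
\operatorname{Var}(a + b\,\bar{y}) &= b^{2}\,\operatorname{Var}(\bar{y}) \\
\operatorname{SE}(2 - \bar{y}) &= |{-1}|\,\operatorname{SE}(\bar{y}) = \operatorname{SE}(\bar{y}) \approx 0.0424 \\
2 - \texttt{aw12\_se} &= 2 - 0.04240799 = 1.95759201 \approx \texttt{aw12inv\_se}
\end{align*}
```

So the point estimate 0.6587302 is fine, and only the _se column is wrong.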

@gergness
Owner

Oh bummer, I hadn't really thought about that.

The earliest versions actually did allow only survey_* functions in summarize. Allowing arbitrary expressions was an accidental side effect of adapting to changes in dplyr. However, I can't think of an easy way to go back.

Plus, I've seen users take advantage of this, which is nice:

as_survey(apiclus2) %>% group_by(awards) %>% summarize(aw=100*survey_mean(vartype = "ci"))
#> # A tibble: 2 × 4
#>   awards    aw aw_low aw_upp
#>   <fct>  <dbl>  <dbl>  <dbl>
#> 1 No      34.1   25.7   42.5
#> 2 Yes     65.9   57.5   74.3

(It's been a while since my stats training, can you remind me if the standard error, variance, and coefficient of variation can also be multiplied by a scalar like this? I feel like at least one is wrong, but can't remember which one.)

I think I'll probably just add a note in documentation, but I'll think about it. Thanks for reporting!

Also, not sure if you already know this, but you can get the correct standard error for the recoded awards12 by moving the transformation inside the survey_mean() call, like so:

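# svy is assumed to be the design from the original report, with awards12 added
svy <- as_survey(apiclus2) %>% mutate(awards12 = case_when(awards == "Yes" ~ 1, TRUE ~ 2))
svy %>% summarize(v1 = survey_mean(2 - awards12), v2 = survey_mean(awards12 == 1))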
#>          v1      v1_se        v2      v2_se
#> 1 0.6587302 0.04240799 0.6587302 0.04240799

@gergness
Owner

oops, and also expressions are allowed in group_by(), so:

svy %>% group_by(awards = 2 - awards12) %>% summarize(aw = survey_mean())
#> # A tibble: 2 × 3
#>   awards    aw  aw_se
#>    <dbl> <dbl>  <dbl>
#> 1      0 0.341 0.0424
#> 2      1 0.659 0.0424

(definitely not as nice, but I think it's consistent with the general tidyverse philosophy that sometimes you've gotta tidy your data before you get clean code)

@skolenik
Author

Right, the "times 100" functionality is nice to have. (CV and standard errors are multiplicative like that; the variance has to be multiplied by the square of the factor.)
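For reference, the scaling rules can be written out as follows. Here \hat{\theta} is assumed notation for any survey estimate and c for a nonzero constant such as 100; neither symbol is from the thread.

```latex
\begin{align*}
\operatorname{SE}(c\,\hat{\theta}) &= |c|\,\operatorname{SE}(\hat{\theta}) \\
\operatorname{Var}(c\,\hat{\theta}) &= c^{2}\,\operatorname{Var}(\hat{\theta}) \\
\operatorname{CV}(c\,\hat{\theta}) &= \frac{|c|\,\operatorname{SE}(\hat{\theta})}{|c|\,|\hat{\theta}|} = \operatorname{CV}(\hat{\theta})
\end{align*}
```

Confidence limits scale like the standard error, which is why the "times 100" example with vartype = "ci" comes out right. The variance column would need c^2 rather than c. The CV is unchanged by rescaling the estimate, so multiplying a CV column by 100 only re-expresses it as a percentage.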

I would probably trust that whatever is inside survey_mean(expression) is done right... as your second example shows. The third example is a bit odd from the survey perspective: it relies on the implementation convention of empty survey_mean() (and the calling function is not supposed to be able to hijack the knowledge of what happens inside survey_mean()... decoupling and Code Clean, you know ;) ), and is not particularly helpful when you need to compute survey_total() in the last step.
