`survey_mean` produces incorrect standard errors when an expression is used in `summarize` #126

skolenik · 2021-08-27T14:38:05Z

I had to work with Yes/No variables coded as integers 1/2, so I thought it would be a good idea to write summarize(prop_yes=2-survey_mean(badly_coded_variable)). The syntax worked, but as a side effect the expression inside summarize() was also applied to the standard errors.

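library(survey)
library(dplyr)   # case_when()
library(srvyr)   # as_survey(), survey_mean()
data(api)
# Baseline: proportions by award status
as_survey(apiclus2) %>% group_by(awards) %>% summarize(aw=survey_mean())
# Recode to 1/2, then apply an expression to the result of survey_mean()
as_survey(apiclus2) %>% mutate(awards12=case_when(awards=="Yes"~1,TRUE~2)) %>% summarize(aw12=survey_mean(awards12), aw12inv=2-survey_mean(awards12))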

Output:

> as_survey(apiclus2) %>% group_by(awards) %>% summarize(aw=survey_mean())
# A tibble: 2 x 3
  awards    aw  aw_se
  <fct>  <dbl>  <dbl>
1 No     0.341 0.0424
2 Yes    0.659 0.0424
> as_survey(apiclus2) %>% mutate(awards12=case_when(awards=="Yes"~1,TRUE~2)) %>% summarize(aw12=survey_mean(awards12), aw12inv=2-survey_mean(awards12))
     aw12    aw12_se   aw12inv aw12inv_se
1 1.34127 0.04240799 0.6587302   1.957592

I don't know what the right fix for this is. You cannot possibly parse the expressions inside summarize() to make sense like "Oh, this is a linear combination so the resulting standard error is a quadratic form" -- that would be annoying. I think a safe conservative fix is to forbid expressions under summarize(), and only allow the RHS to be the survey_whatever() functions, so that no smart asses would try anything weird, but I don't know the implementation details to gauge if that's technically possible.
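A quick check on the output above, offered as a reading of the numbers rather than a statement about srvyr internals: the reported aw12inv_se is consistent with the expression 2 - x having been applied to the standard error column itself, whereas the standard error of a shifted, sign-flipped estimate should be unchanged. Here \hat{\mu} stands for the estimated mean of awards12 (notation introduced only for this note).

```latex
\begin{align*}
2 - \texttt{aw12\_se} &= 2 - 0.04240799 = 1.95759201 \approx 1.957592 = \texttt{aw12inv\_se}
  && \text{(what was reported)} \\
\operatorname{SE}(2 - \hat{\mu}) &= |-1| \cdot \operatorname{SE}(\hat{\mu}) = 0.04240799
  && \text{(what would be expected)}
\end{align*}
```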


gergness · 2021-08-27T15:13:09Z

Oh bummer, I hadn't really thought about that.

The earliest versions actually did only allow survey_* functions in summarize, and it was an accidental implementation detail of adapting to changes in dplyr when that changed. However, I'm not sure I can think of an easy way to go back.

Plus, I've seen users take advantage of this, which is nice:

as_survey(apiclus2) %>% group_by(awards) %>% summarize(aw=100*survey_mean(vartype = "ci"))
#> # A tibble: 2 × 4
#>   awards    aw aw_low aw_upp
#>   <fct>  <dbl>  <dbl>  <dbl>
#> 1 No      34.1   25.7   42.5
#> 2 Yes     65.9   57.5   74.3

(It's been a while since my stats training, can you remind me if the standard error, variance, and coefficient of variation can also be multiplied by a scalar like this? I feel like at least one is wrong, but can't remember which one.)

I think I'll probably just add a note in documentation, but I'll think about it. Thanks for reporting!

Also, not sure if you already know this, but you can get the correct variance for awards12 by moving the expression all the way inside the survey_mean() call, like so:

# svy is presumably as_survey(apiclus2) plus the awards12 column from the original report
svy %>% summarize(v1 = survey_mean(2 - awards12), v2 = survey_mean(awards12 == 1))
#>          v1      v1_se        v2      v2_se
#> 1 0.6587302 0.04240799 0.6587302 0.04240799

gergness · 2021-08-27T15:19:48Z

oops, and also expressions are allowed in group_by(), so:

svy %>% group_by(awards = 2 - awards12) %>% summarize(aw = survey_mean())
#> # A tibble: 2 × 3
#>   awards    aw  aw_se
#>    <dbl> <dbl>  <dbl>
#> 1      0 0.341 0.0424
#> 2      1 0.659 0.0424

(definitely not as nice, but I think it's consistent with the general tidyverse philosophy that sometimes you've gotta tidy your data before you get clean code)

skolenik · 2021-08-27T15:41:46Z

Right, the "times 100" functionality is nice to have. (CV and standard errors are multiplicative like that; the variance has to be multiplied by the square of the factor.)
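The scaling rules referred to here can be written out as follows, for an estimate \hat{\theta} and constants a and b (symbols introduced only for this note, not taken from srvyr):

```latex
\begin{align*}
\operatorname{Var}(a + b\,\hat{\theta}) &= b^{2}\,\operatorname{Var}(\hat{\theta}) \\
\operatorname{SE}(a + b\,\hat{\theta})  &= |b|\,\operatorname{SE}(\hat{\theta}) \\
\operatorname{CV}(b\,\hat{\theta})      &= \frac{|b|\,\operatorname{SE}(\hat{\theta})}{|b\,\hat{\theta}|}
                                         = \operatorname{CV}(\hat{\theta})
                                         \qquad (b \neq 0,\ \hat{\theta} \neq 0)
\end{align*}
```

So for a positive factor such as 100, the estimate, its standard error, and its confidence limits all scale by 100, while a variance would need a factor of 100^2. The CV of the rescaled estimate is the same as before, so a cv column multiplied by 100 is best read as that CV expressed in percent. An additive constant, like the 2 in 2 - survey_mean(...), should leave the standard error untouched.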

I would probably trust that whatever is inside survey_mean(expression) is done right... as your second example shows. The third example is a bit odd from the survey perspective: it relies on the implementation convention of empty survey_mean() (and the calling function is not supposed to be able to hijack the knowledge of what happens inside survey_mean()... decoupling and Code Clean, you know ;) ), and is not particularly helpful when you need to compute survey_total() in the last step.

gergness added a commit that referenced this issue Sep 28, 2021

Improve documentation about on-the-fly expressions, fixes #126

13868b0

gergness closed this as completed in 94d7294 Sep 28, 2021
