-
Notifications
You must be signed in to change notification settings - Fork 0
/
conf_intervals.qmd
475 lines (355 loc) · 18.5 KB
/
conf_intervals.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
# Confidence intervals {#sec-conf_intervals}
```{r}
#| include: false
library(fontawesome)
library(gghighlight)
library(tidyverse)
```
Obtaining an exact point estimate of the population parameter from just one random sample is almost unattainable. However, **interval estimation** allows us to provide a range of values where the parameter is expected to fall with a certain level of confidence. This can be achieved by constructing **confidence intervals**.
::: {.callout-caution icon="false"}
## `r fa("circle-dot", prefer_type = "regular", fill = "red")` Learning objectives
- Understand the concept of confidence intervals
- Calculate and interpret confidence intervals for mean
- Calculate and interpret confidence intervals for proportion
:::
## Confidence interval for mean
### The logic behind constructing a confidence interval
We will base the construction of a confidence interval on two key concepts:
1. The interval is **around the point estimate**, which represents our best estimate of the population parameter.
2. The **standard error** is utilized to quantify the extent of variability around the point estimate.
According to the Central Limit Theorem, the sampling distribution of the mean approaches a normal distribution (@sec-sampling). Furthermore, the standard deviation of this sampling distribution is the standard error of the mean, $\sigma_{\bar{x}}$. Consequently, it can be inferred that approximately 95% of the distribution of sample means lies within $\pm 1.96 \sigma_{\bar{x}}$ from the point estimate (the empirical rule; @sec-normal). This multiple of the standard error, such as $\pm 1.96 \sigma_{\bar{x}}$, is referred to as the **margin of error** in a confidence interval.
```{r}
#| echo: false
#| message: false
#| warning: false
#| fig-align: center
#| label: fig-sample_distribution95
#| fig-cap: Sampling distribution of mean and 95% CI.
#| fig-width: 7.0
#| fig-height: 5.0
# Create a data frame
df4 <- data.frame(x = seq(-3, 3, by = 0.01),
y = dnorm(seq(-3, 3, by = 0.01), mean = 0, sd = 1))
ggplot() +
geom_area(data = df4, mapping = aes(x = ifelse(x < -1.96, x, -3), y = y),
fill = "grey70") + # Area under the curve
geom_area(data = df4, mapping = aes(x = ifelse(x > 1.96, x, 3), y = y),
fill = "grey70") + # Area under the curve
stat_function(fun = dnorm, args = list(mean = 0, sd = 1), linewidth = 1) +
scale_x_continuous(expand = c(0.01, 0),
breaks = c(-3, -2, -1, 0, 1, 2, 3),
labels = expression(-3*sigma[bar(x)], -2*sigma[bar(x)],
-1*sigma[bar(x)], mu[bar(x)],
+1*sigma[bar(x)], +2*sigma[bar(x)],
+3*sigma[bar(x)])) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 0.45)) +
annotate("text", x = 0, y = 0.42,
label = "Sampling distribution", size = 5) +
annotate("text", x = 0, y = 0.2,
label = "Central 95% of the distribution", size = 4) +
annotate("text", x = 0, y = 0.02,
label = "95% CI", size = 3) +
annotate("text", x = -2.2, y = 0.02,
label = "2.5%", size = 3) +
annotate("text", x = 2.2, y = 0.02,
label = "2.5%", size = 3) +
geom_segment(aes(x = -1.96, y = 0.01, xend = 1.96, yend = 0.01),
arrow = arrow(length = unit(0.03, "npc"), ends = "both")) +
labs(y = "Probabiblity density") +
theme_classic(base_size = 14) +
theme(axis.title.x = element_blank())
```
In this case, the formula for the confidence interval (CI) of mean equals:
$$ 95\%CI=\mu_{\bar{x}} \ \pm 1.96 \ \sigma_{\bar{x}} = \mu_{\bar{x}} \ \pm 1.96 \frac{\sigma}{\sqrt{n}} $$ {#eq-CI}
When the sample size *n* is sufficiently large, the sample mean provides a good estimate of the population mean. Additionally, if the population standard deviation *σ* is unknown, we can estimate it by using the sample standard deviation *s*, and the formula becomes:
$$ 95\%CI=\bar{x} \ \pm 1.96 \ SE_{\bar{x}} = \bar{x} \ \pm 1.96 \frac{s}{\sqrt{n}} $$ {#eq-CI2}
::: content-box-yellow
`r fa("arrow-right", fill = "orange")` **Example**
The serum creatinine of a sample of 121 elderly men has a mean of 1.15 mg/dl with a standard deviation of 0.3 mg/dl. The 95% confidence interval for the mean creatinine of this population is calculated as follows:
**Lower limit of 95% CI**
$LL = 1.15 \ - 1.96 \frac{0.3}{\sqrt{121}} = 1.15 \ - 1.96 \frac{0.3}{11} = 1.15 \ - 0.0534 = 1.096$
**Upper limit of 95% CI**
$UL = 1.15 \ + 1.96 \frac{0.3}{\sqrt{121}} = 1.15 \ + 1.96 \frac{0.3}{11} = 1.15 \ + 0.0534 = 1.203$
We are 95% confident that the mean serum creatinine is between 1.1 mg/dl and 1.2 mg/dl.
:::
**In R:**
For a 95% confidence interval, each of the grey areas in @fig-sample_distribution95 equals 2.5% of the distribution because the total percentage of 5% (100-95) is equally divided between both sides of the normal distribution.
```{r}
#| results: hold
n <- 121
mean <- 1.15
s <- 0.3
z <- qnorm(0.025, lower.tail = FALSE)
# compute lower limit of 95% CI
lower_95CI <- mean - z*(s/sqrt(n))
lower_95CI
# compute upper limit of 95% CI
upper_95CI <- mean + z*(s/sqrt(n))
upper_95CI
```
### Confidence level
However, there is no particular reason for choosing a 95% confidence level for constructing confidence intervals other than convention; confidence levels of 90% or 99% are sometimes preferred depending on the context. For example, when a 99% confidence level is chosen, the margin of error for the mean becomes $\pm 2.58 \sigma_{\bar{x}}$ (the empirical rule; @sec-normal).
```{r}
#| echo: false
#| message: false
#| warning: false
#| fig-align: center
#| label: fig-sample_distribution99
#| fig-cap: Sampling distribution of mean and 99% CI.
#| fig-width: 7.0
#| fig-height: 5.0
# Create a data frame
df4 <- data.frame(x = seq(-3, 3, by = 0.01),
y = dnorm(seq(-3, 3, by = 0.01), mean = 0, sd = 1))
ggplot() +
geom_area(data = df4, mapping = aes(x = ifelse(x < -2.58, x, -3), y = y),
fill = "grey70") + # Area under the curve
geom_area(data = df4, mapping = aes(x = ifelse(x > 2.58, x, 3), y = y),
fill = "grey70") + # Area under the curve
stat_function(fun = dnorm, args = list(mean = 0, sd = 1), linewidth = 1) +
scale_x_continuous(expand = c(0, 0),
breaks = c(-2.58, -2, -1, 0, 1, 2, 2.58),
labels = expression(-2.58*sigma[bar(x)], -2*sigma[bar(x)],
-1*sigma[bar(x)],
mu[bar(x)], +1*sigma[bar(x)],
+2*sigma[bar(x)], -2.58*sigma[bar(x)])) +
scale_y_continuous(expand = c(0, 0.003), limits = c(0, 0.45)) +
annotate("text", x = 0, y = 0.42,
label = "Sampling distribution", size = 5) +
annotate("text", x = 0, y = 0.2,
label = "Central 99% of the distribution", size = 4) +
annotate("text", x = 0, y = 0.02,
label = "99% CI", size = 3) +
annotate("text", x = -2.7, y = 0.006,
label = "0.5%", size = 2.7) +
annotate("text", x = 2.7, y = 0.006,
label = "0.5%", size = 2.7) +
geom_segment(aes(x = -2.58, y = 0.005, xend = 2.58, yend = 0.005),
arrow = arrow(length = unit(0.03, "npc"), ends = "both")) +
labs(y = "Probabiblity density") +
theme_classic(base_size = 14) +
theme(axis.title.x = element_blank())
```
Now, each of the grey areas in @fig-sample_distribution99 equals 0.5% of the distribution because the total percentage of 1% (100-99) is equally divided between both sides of the normal distribution. In this instance, the 99% CI for the mean is:
```{r}
#| results: hold
z2 <- qnorm(0.005, lower.tail = FALSE)
# compute lower limit of 95% CI
lower_99CI <- mean - z2*(s/sqrt(n))
lower_99CI
# compute upper limit of 95% CI
upper_99CI <- mean + z2*(s/sqrt(n))
upper_99CI
```
We observe that a 99% CI provides a higher level of confidence but comes with a wider range (1.07-1.22), while a 95% confidence interval offers a narrower range (1.09-1.20) but with slightly less certainty. Therefore, the increased level of confidence comes at the expense of precision, especially with smaller datasets.
### Understanding the condidence interval
The intuitive meaning of "confidence" in a confidence interval might not be immediately clear. To understand what confidence truly represents, let's consider once more the example of a population consisting of 100,000 adults, with a mean blood pressure (BP) of μ = 126 mmHg and a standard deviation of σ = 10 (@fig-pop01).
```{r}
#| echo: false
#| message: false
#| warning: false
set.seed(46)
# Create a non-uniform population of 100,000 numbers between 1 and 100
# Create a non-uniform population of 100,000 numbers between 1 and 100
pop1 <- rnorm(70000, mean = 120, sd = 3)
pop2 <- rnorm(30000, mean = 140, sd = 6)
pop <- c(pop1, pop2)
mu <- mean(pop) #calculate the population mean
sigma <- sd(pop) #calculate the population standard deviation
popdf <- as.data.frame(pop)
```
```{r}
#| echo: false
#| message: false
#| warning: false
#| fig-align: center
#| label: fig-pop01
#| fig-cap: A hypothetical population of 100,000 observations. The dashed black line represents the population, μ.
#| fig-width: 7.0
#| fig-height: 3.5
# histogram
ggplot(popdf, aes(x = pop)) + geom_histogram(color = "black", fill = "#894ae0", alpha=0.3) +
geom_vline(xintercept = mu, linetype = "dashed", linewidth = 0.8) +
theme_classic() +
ggtitle("Histogram of Population") + xlab("x") +
theme(axis.title = element_text(hjust = 1))
```
We proceed by generating 100 random samples of size 10 from our population distribution and construct a 95% confidence interval for the mean of each sample.
```{r}
#| echo: false
#| message: false
#| warning: false
sample_sizes <- c(10)
Samps = list()
set.seed(47)
for (j in sample_sizes) {
Samps[[paste0("Sample size: ", j)]] = data.frame(sampsize=j, samp=sapply(1:100, function(i){ x <- sample(pop, j, replace = TRUE) }))
}
Samps %>%
map_df(as_tibble) %>%
gather(samp, value, samp.1:samp.100) -> Samps
```
```{r}
#| echo: false
#| message: false
#| warning: false
#| fig-align: center
#| label: fig-intervals1
#| fig-cap: 100 Sample Means of Size 10 (with 95% Intervals) from the Population.
#| fig-width: 9.5
#| fig-height: 12.0
Samps %>%
group_by(sampsize, samp) %>%
mutate(sampmean = mean(value),
se = sd(pop)/sqrt((sampsize)),
lb95 = sampmean - 1.96*se,
ub95 = sampmean + 1.96*se) %>%
distinct(sampsize, samp, sampmean, se, lb95, ub95) %>%
ungroup() %>%
mutate(sampsize = fct_inorder(paste0("Sample Size: ", sampsize)),
samp = as.numeric(str_replace(samp, "samp.", ""))) %>%
ggplot(aes(x = sampmean, y = as.factor(samp), xmax=ub95, xmin=lb95)) +
#theme_steve() + post_bg() +
geom_vline(xintercept = mean(pop), color = "black", linetype="dashed", size = 1) +
geom_pointrange(color = "red") +
gghighlight(lb95 > 126.002 | ub95 < 126.002,
unhighlighted_params = list(color = "steelblue")) +
scale_x_continuous(limits = c(112, 141)) +
labs(x = "Sample Mean (with 95% Intervals)",
y = "Sample Number [1:100]",
title = "100 Sample Means of Size 10 (with 95% Intervals) from the Population")
```
In @fig-intervals1, each blue horizontal bar is a confidence interval (CI), centered on a sample mean (point). The intervals all have the same length, but are centered on different sample means as a result of random sampling from the population. The five red confidence intervals do not cover the population mean (the vertical dashed line; $\mu$ = 126 mmHg). **This aligns with our expectations under a 95% confidence level, where roughly 95% of the intervals should include the population parameter.**
### Sample size and condidence interval
Next, we construct the 95% confidence intervals of 100 randomly generated samples of size 50 from our population (@fig-intervals2):
```{r}
#| echo: false
#| message: false
#| warning: false
sample_sizes <- c(50)
Samps = list()
set.seed(55)
for (j in sample_sizes) {
Samps[[paste0("Sample size: ", j)]] = data.frame(sampsize=j, samp=sapply(1:100, function(i){ x <- sample(pop, j, replace = TRUE) }))
}
Samps %>%
map_df(as_tibble) %>%
gather(samp, value, samp.1:samp.100) -> Samps
```
```{r}
#| echo: false
#| message: false
#| warning: false
#| fig-align: center
#| label: fig-intervals2
#| fig-cap: 100 Sample Means of Size 50 (with 95% Intervals) from the Population.
#| fig-width: 9.5
#| fig-height: 12.0
Samps %>%
group_by(sampsize, samp) %>%
mutate(sampmean = mean(value),
se = sd(pop)/sqrt((sampsize)),
lb95 = sampmean - 1.96*se,
ub95 = sampmean + 1.96*se) %>%
distinct(sampsize, samp, sampmean, se, lb95, ub95) %>%
ungroup() %>%
mutate(sampsize = fct_inorder(paste0("Sample Size: ", sampsize)),
samp = as.numeric(str_replace(samp, "samp.", ""))) %>%
ggplot(aes(x = sampmean, y = as.factor(samp), xmax=ub95, xmin=lb95)) +
#theme_steve() + post_bg() +
geom_vline(xintercept = mean(pop), color = "black", linetype="dashed", size = 1) +
geom_pointrange(color = "red") +
gghighlight(lb95 > 126.002 | ub95 < 126.002,
unhighlighted_params = list(color = "steelblue")) +
scale_x_continuous(limits = c(112, 141)) +
labs(x = "Sample Mean (with 95% Intervals)",
y = "Sample Number [1:100]",
title = "100 Sample Means of Size 50 (with 95% Intervals) from the Population")
```
Comparing the @fig-intervals1 and @fig-intervals2, we notice two key trends as the sample size increases from 10 to 50:
1. The sample statistic (points) gets closer to the population parameter (black dashed line).
2. The uncertainty around the estimate shrinks (confidence intervals become narrower).
::: content-box-green
A confidence interval is commonly expressed as 90% CI, 95% CI, or 99% CI, indicating the level of confidence associated with the estimate. The percentage reflects the proportion of intervals, constructed from repeated experiments, that would contain the population parameter (long-run interpretation).
Choosing an appropriate confidence level and sample size depend on the specific needs of the analysis and the trade-offs between certainty and precision.
:::
## Confidence interval for proportion (normal approximation)
Let X be a random variable representing the observed number of individuals in a sample with a binary characteristic, such as having a disease. Our best estimate of the population proportion, p, is given by the sample proportion $\hat{p} = \frac{X}{n}$, where n is the sample size. If we repeatedly draw samples of size *n* from our population and calculate the sample proportions $\hat{p_1} = \frac{X_1}{n}$, $\hat{p_2} = \frac{X_2}{n}$, $\hat{p_3} = \frac{X_3}{n}$, and so forth, then, under the assumption that the sample size is sufficiently large and satisfies the condition $min(np, n(1-p)) \geq 5$, the sampling distribution of the proportion would approximate a normal distribution, $N(\mu_{\hat{p}} = p, \sigma_{\hat{p}}^2 = \frac{p(1-p)}{n})$ (@fig-sample_distribution_pop).
```{r}
#| echo: false
#| message: false
#| warning: false
#| fig-align: center
#| label: fig-sample_distribution_pop
#| fig-cap: Sampling distribution of proportion and 95% CI.
#| fig-width: 7.0
#| fig-height: 5.0
# Create a data frame
df4 <- data.frame(x = seq(-3, 3, by = 0.01),
y = dnorm(seq(-3, 3, by = 0.01), mean = 0, sd = 1))
ggplot() +
geom_area(data = df4, mapping = aes(x = ifelse(x < -1.96, x, -3), y = y),
fill = "grey70") + # Area under the curve
geom_area(data = df4, mapping = aes(x = ifelse(x > 1.96, x, 3), y = y),
fill = "grey70") + # Area under the curve
stat_function(fun = dnorm, args = list(mean = 0, sd = 1), linewidth = 1) +
scale_x_continuous(expand = c(0, 0), breaks = c(-3, -2, -1, 0, 1, 2, 3),
labels = expression(-3*sigma[hat(p)], -2*sigma[hat(p)], -1*sigma[hat(p)],
mu[hat(p)], +1*sigma[hat(p)], +2*sigma[hat(p)], +3*sigma[hat(p)])) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 0.45)) +
annotate("text", x = 0, y = 0.42,
label = "Sampling distribution", size = 5) +
annotate("text", x = 0, y = 0.2,
label = "Central 95% of the distribution", size = 4) +
annotate("text", x = 0, y = 0.02,
label = "95% CI", size = 3) +
annotate("text", x = -2.2, y = 0.02,
label = "2.5%", size = 3) +
annotate("text", x = 2.2, y = 0.02,
label = "2.5%", size = 3) +
geom_segment(aes(x = -1.96, y = 0.01, xend = 1.96, yend = 0.01),
arrow = arrow(length = unit(0.03, "npc"), ends = "both")) +
labs(y = "Probabiblity density") +
theme_classic(base_size = 14) +
theme(axis.title.x = element_blank())
```
Similar to a confidence interval for the mean @eq-CI, a confidence interval for a proportion can be constructed as follows:
$$ 95\%CI= \mu_{\hat{p}} \ \pm 1.96 \ \sigma_{\hat{p}} = p \ \pm 1.96 \sqrt{\frac{p(1- p)}{n}} $$ {#eq-CI3}
and when the value of p is unknown, it is replaced with the sample proportion $\hat{p}$:
$$ 95\%CI= \hat{p} \ \pm 1.96 \ SE_{\hat{p}} = \hat{p} \ \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$ {#eq-CI4}
where the standard error for proportion is $SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.
::: content-box-yellow
`r fa("arrow-right", fill = "orange")` **Example**
Suppose a pulmonologist chooses a random sample of 317 patients from the patient register, and finds that 34 of them have a history of suffering from chronic obstructive pulmonary disease (COPD). The 95% confidence interval for the proportion of COPD is calculated as follows:
$\hat{p} = \frac{X}{n} = \frac{34}{317}=0.107 \ or \ 10.7\%$
Additionally, the condition $min(np, n(1-p)) \geq 5$ is satisfied:
np = 317 \* 0.107 = 33.9 \> 5
n(1-p) = 317 \* (1 - 0.107) = 317 \* 0.893 = 283 \> 5
**Lower limit of 95% CI**
$LL = 0.107 \ - 1.96 \sqrt{\frac{0.107(1-0.107)}{317}} = 0.107 \ - 0.034 = 0.073 \ or \ 7.3\%$
**Upper limit of 95% CI**
$UL = 0.107 \ + 1.96 \sqrt{\frac{0.107(1-0.107)}{317}} = 0.107 \ + 0.034 = 0.141 \ or \ 14.1\%$
Based on our random sample, we are 95% confident that the percentage of patients with COPD falls within the range of 7.3% to 14.1%.
:::
**In R:**
```{r}
#| results: hold
n = 317
x = 34
# calculate the proportion
p_hat <- x/n
p_hat
# check the assumption min(np, n(1-p)) ≥ 5
min(c(n*p_hat, n*(1 - p_hat)))
```
```{r}
#| results: hold
z <- qnorm(0.025, lower.tail = FALSE)
se <- sqrt(p_hat*(1 - p_hat)/n)
# compute lower limit of 95% CI
lower_95CI <- p_hat - z*se
lower_95CI
# compute upper limit of 95% CI
upper_95CI <- p_hat + z*se
upper_95CI
```