---
title: "5. Statistical Inference"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>
```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
include = TRUE, comment = NA, echo = TRUE,
message = FALSE, warning = FALSE, cache = TRUE
)
library(tidyverse)
```
```{r include=FALSE}
# Do not run during the lecture when we work interactively
set.seed(1314)
```
# Introduction
## Population
- The aim of a scientific study is to draw conclusions on the general population.
- Here, we study the effect of captopril on the blood pressure of patients with hypertension.
- We will collect data and model them
- Therefore, we always have to translate the research question into an effect size that can be estimated with a model parameter or a combination of model parameters.
- e.g. the population mean
$$E(X)=\mu$$
- Does the blood pressure decrease on average after administering captopril to patients with hypertension?
---
## Overview
- Experimental Design & Data Exploration
- Point estimators (Estimation)
- Interval estimators (Statistical inference)
- Hypothesis tests (Statistical inference)
---
# Experimental Design
- 15 patients were drawn at random from the population of patients with hypertension
- pre-test/post-test design: the systolic and diastolic blood pressures are measured before and after administering captopril
- Advantage: we can assess the effect of administering captopril on the blood pressure for each individual patient.
- Disadvantage?
```{r pop2Samp2Pop,fig.asp=.8, fig.align='center',echo=FALSE}
if ("pi" %in% ls()) rm("pi")
kopvoeter <- function(x, y, angle = 0, l = .2, cex.dot = .5, pch = 19, col = "black") {
angle <- angle / 180 * pi
points(x, y, cex = cex.dot, pch = pch, col = col)
lines(c(x, x + l * cos(-pi / 2 + angle)), c(y, y + l * sin(-pi / 2 + angle)), col = col)
lines(c(x + l / 2 * cos(-pi / 2 + angle), x + l / 2 * cos(-pi / 2 + angle) + l / 4 * cos(angle)), c(y + l / 2 * sin(-pi / 2 + angle), y + l / 2 * sin(-pi / 2 + angle) + l / 4 * sin(angle)), col = col)
lines(c(x + l / 2 * cos(-pi / 2 + angle), x + l / 2 * cos(-pi / 2 + angle) + l / 4 * cos(pi + angle)), c(y + l / 2 * sin(-pi / 2 + angle), y + l / 2 * sin(-pi / 2 + angle) + l / 4 * sin(pi + angle)), col = col)
lines(c(x + l * cos(-pi / 2 + angle), x + l * cos(-pi / 2 + angle) + l / 2 * cos(-pi / 2 + pi / 4 + angle)), c(y + l * sin(-pi / 2 + angle), y + l * sin(-pi / 2 + angle) + l / 2 * sin(-pi / 2 + pi / 4 + angle)), col = col)
lines(c(x + l * cos(-pi / 2 + angle), x + l * cos(-pi / 2 + angle) + l / 2 * cos(-pi / 2 - pi / 4 + angle)), c(y + l * sin(-pi / 2 + angle), y + l * sin(-pi / 2 + angle) + l / 2 * sin(-pi / 2 - pi / 4 + angle)), col = col)
}
par(mar = c(0, 0, 0, 0), mai = c(0, 0, 0, 0))
plot(0, 0, xlab = "", ylab = "", xlim = c(0, 10), ylim = c(0, 10), col = 0, xaxt = "none", yaxt = "none", axes = FALSE)
rect(0, 6, 10, 10, border = "red", lwd = 2)
text(.5, 8, "population", srt = 90, col = "red", cex = 2)
symbols(3, 8, circles = 1.5, col = "red", add = TRUE, fg = "red", inches = FALSE, lwd = 2)
grid <- seq(0, 1.3, .01)
for (i in 1:50)
{
angle1 <- runif(n = 1, min = 0, max = 360)
angle2 <- runif(n = 1, min = 0, max = 360)
radius <- sample(grid, prob = grid^2 * pi / sum(grid^2 * pi), size = 1)
kopvoeter(3 + radius * cos(angle1 / 180 * pi), 8 + radius * sin(angle1 / 180 * pi), angle = angle2)
}
text(7.5, 8, "Effect of captopril in population", col = "red", cex = 1.2)
rect(0, 0, 10, 4, border = "blue", lwd = 2)
text(.5, 2, "sample", srt = 90, col = "blue", cex = 2)
symbols(3, 2, circles = 1.5, col = "red", add = TRUE, fg = "blue", inches = FALSE, lwd = 2)
for (i in 0:2) {
for (j in 0:4)
{
kopvoeter(2.1 + j * (3.9 - 2.1) / 4, 1.1 + i)
}
}
text(7.5, 2, "Effect of captopril in sample", col = "blue", cex = 1.2)
arrows(3, 5.9, 3, 4.1, col = "black", lwd = 3)
arrows(7, 4.1, 7, 5.9, col = "black", lwd = 3)
text(1.5, 5, "EXP. DESIGN (1)", col = "black", cex = 1.2)
text(8.5, 5, "ESTIMATION &\nINFERENCE (3)", col = "black", cex = 1.2)
text(7.5, .5, "DATA EXPLORATION &\nDESCRIPTIVE STATISTICS (2)", col = "black", cex = 1.2)
```
---
# Data exploration and Descriptive Statistics
```{r}
captopril <- read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/captopril.txt")
head(captopril)
summary(captopril)
```
```{r}
captopril_tidy <- captopril %>% gather(type, bp, -id)
captopril_summary <- captopril_tidy %>%
group_by(type) %>%
summarize(
mean = mean(bp, na.rm = TRUE),
sd = sd(bp, na.rm = TRUE),
n = n()
) %>%
mutate(se = sd / sqrt(n))
captopril_summary
captopril_summary %>%
ggplot(aes(x = type, y = mean)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = .2) +
ylim(0, 210) +
ylab("blood pressure (mmHg)")
```
- This figure is not informative! It does not show the raw data.
We see that a lot of the space taken up by the barplot does not contain data!
```{r}
captopril_tidy %>%
ggplot(aes(x = type, y = bp)) +
geom_boxplot(outlier.shape = NA) +
geom_point(position = "jitter")
```
- This plot would have been informative if the data had been gathered on different individuals.
- However, the blood pressures are measured on the same subjects!
- We will make a plot by filtering the systolic blood pressures
```{r}
captopril_tidy %>%
filter(type %in% c("SBPa", "SBPb")) %>%
mutate(type = factor(type, levels = c("SBPb", "SBPa"))) %>%
ggplot(aes(x = type, y = bp)) +
geom_line(aes(group = id)) +
geom_point()
```
- We have paired data. So we might estimate the effect of the treatment directly by comparing the blood pressure after treatment to the blood pressure before the treatment.
```{r}
captopril$deltaSBP <- captopril$SBPa - captopril$SBPb
captopril %>%
ggplot(aes(x = "Systolic blood pressure", y = deltaSBP)) +
geom_boxplot(outlier.shape = NA) +
geom_point(position = "jitter") +
ylab("Difference (mm mercury)") +
xlab("")
```
```{r}
captopril %>%
summarize(
mean = mean(deltaSBP, na.rm = TRUE),
sd = sd(deltaSBP, na.rm = TRUE),
n = n()
) %>%
mutate(se = sd / sqrt(n))
```
- Pre-test/post-test design: Effect of captopril in sample using $X=\Delta_\text{after-before}$!
- How will we model $X=\Delta_\text{after-before}$ and estimate the effect of captopril?
```{r}
captopril %>%
ggplot(aes(sample = deltaSBP)) +
stat_qq() +
stat_qq_line()
```
The systolic blood pressure differences are approximately normally distributed.
---
# Estimation
- No substantial deviations from normality
- We can assume that the differences $X \sim N(\mu, \sigma^2)$.
- Effect of captopril in the population is captured by the average blood pressure difference $\mu$.
- The average blood pressure $\mu$ in the population can be estimated using the sample mean $\bar x$=`r round(mean(captopril$deltaSBP),2)`
- The standard deviation $\sigma$ with the sample standard deviation $\text{S}$=`r round(sd(captopril$deltaSBP),2)`.
- Is the effect that we observe in the sample large enough to conclude that there is an effect of the captopril treatment on the blood pressure at population level?
- Our estimates will change from sample to sample!
- How are the estimators $\bar X$ and $S$ distributed?
## Point estimator: the sample mean
- Suppose that $X$ is a random sample from the population and assume that $X \sim N(\mu,\sigma^2)$
- Estimate $\mu$ based on sample $X_1,...,X_n$, using the sample mean
$$\bar X = \frac{X_1+ X_2+ ... + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$ of random variables $X_1,X_2, ..., X_n$.
- Sample mean $\bar X$ is a random variable that varies from sample to sample
- Study the theoretical distribution of the sample mean to get insight
1. in how the sample mean can vary in a new similar study
2. how far $\bar X$ can be from the population mean $\mu$
---
### Overview
1. The sample mean is unbiased
2. Precision of sample mean
3. Distribution of sample mean
---
### The sample mean is unbiased
- We can generalize our observations from the sample towards the population if the estimate is a good approximation of the population value.
- A representative sample is required to generalize the results from the sample towards the population
- Avoid bias (so that the population mean is not systematically under or overestimated)
- Report how the sample is taken!
- Randomisation!
- Draw the subjects at random from population so every subject has the same probability to end up in the sample.
- Subjects with hypertension are sampled at random from the population
- Simple random sample: $X_1,...,X_n$ for characteristic $X$
- $X_1,...,X_n$ have same distribution
- They have same mean $\mu$ and variance $\sigma^2$
- $E(X_1)=...=E(X_n)=\mu$ and $\text{Var}(X_1)=...=\text{Var}(X_n)=\sigma^2$
- $\bar X$ is an *unbiased estimator* for $\mu$
<details><summary>Click to see proof</summary><p>
\begin{eqnarray*}
E(\bar X) &=& E \left(\frac{X_1+ X_2+ ... + X_n}{n}\right) \\
&= & \frac{E(X_1)+ E(X_2)+ ... + E(X_n)}{n} \\
&=& \frac{\mu + \mu + ... +\mu}{n} \\
&= & \mu
\end{eqnarray*}
</p></details>
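The proof can also be checked empirically: the average of many sample means approaches $\mu$. A minimal sketch with simulated data (the values of `mu`, `sigma` and `n` below are arbitrary assumptions, not the captopril values):

```{r}
# Empirical check of unbiasedness: average many sample means and
# compare to the population mean mu of the simulated distribution
set.seed(42)
mu <- 5
sigma <- 2
n <- 15
sampleMeans <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
mean(sampleMeans) # very close to mu = 5
```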
---
### Imprecision/standard error
- Also for representative samples the results are imprecise.
\vspace{10pt}
- Different samples from the same population give different results.
- We illustrated this by using the NHANES
- We will draw 15 females at random from the NHANES study and we will register their log2 direct cholesterol values
- We repeat this 50 times to assess the variation from sample to sample
- We will plot the boxplot for each sample and will indicate the mean
```{r}
library(NHANES)
fem <- NHANES %>%
filter(Gender == "female" & !is.na(DirectChol)) %>%
select("DirectChol")
n <- 15 # number of subjects per sample
nSim <- 50 # number of simulations
femSamp <- matrix(nrow = n, ncol = nSim)
for (j in 1:nSim) {
femSamp[, j] <- sample(fem$DirectChol, 15)
if (j < 4) {
p <- data.frame(log2(femSamp)) %>%
gather(key = "sample", value = "log2cholesterol") %>%
ggplot(aes(x = sample, y = log2cholesterol)) +
geom_boxplot(na.rm = TRUE) +
stat_summary(
fun = mean, geom = "point",
size = 3, color = "red", na.rm = TRUE
) +
geom_hline(yintercept = mean(fem$DirectChol %>% log2())) +
ylab("cholesterol (log2)")
print(p)
}
}
data.frame(log2(femSamp)) %>%
gather(key = "sample", value = "log2cholesterol") %>%
ggplot(aes(x = sample, y = log2cholesterol)) +
geom_boxplot() +
stat_summary(
fun = mean, geom = "point",
size = 3, color = "red", na.rm = TRUE
) +
geom_hline(yintercept = mean(fem$DirectChol %>% log2())) +
ylab("cholesterol (log2)")
```
We observe that the mean nicely fluctuates around the population mean.
Copy the code, increase the sample size to 100 subjects and observe what happens!
---
### How to do this based on a single sample?
- How close can we expect $\bar X$ to be to $\mu$?
- How does $\bar X$ vary from sample to sample?
- What is the variability of $\bar X$?
- We have to determine this based on a single sample!
- We need to make assumptions
- We assume that the random variables $X_1, X_2, ..., X_n$ originate from $n$ *independent* subjects.
- For the captopril study we had dependent observations.
- Blood pressure measurements before ($Y_{i,before}$) and after ($Y_{i,after}$) administering captopril for the same subject $i=1,\ldots,n$.
- We turned them into n independent measurements by taking the difference $Y_{i,after}-Y_{i,before}$
---
### Variance estimator for $\bar X$
$$\sigma^2_{\bar X}=\frac{\sigma^2}{n}$$
- The standard deviation of $\bar X$ around $\mu$ is $\sqrt{n}$ times smaller than the standard deviation of the original observations $X$.
- The more observations we have, the more precise $\bar X$ becomes.
<details><summary>Click to see proof</summary><p>
\begin{eqnarray*}
\text{Var}(\bar X)&=&\text{Var} \left(\frac{X_1+ X_2+ ... + X_n}{n}\right) \\
&= & \frac{\text{Var} (X_1+ X_2+ ... + X_n)}{n^2} \\
&\overset{*}{=} & \frac{\text{Var}(X_1)+ \text{Var}(X_2)+ ... + \text{Var}(X_n)}{n^2} \\
&=& \frac{\sigma^2 + \sigma^2 + ... + \sigma^2}{n^2} \\
&= & \frac{\sigma^2}{n}.
\end{eqnarray*}
- (*) this is based on the assumption of independence.
$$\text{Var}[X_1 + X_2] = \text{Var}[X_1] + \text{Var}[X_2] + 2 \text{Covar}[X_1,X_2]$$
- with $\text{Covar}[X_1,X_2]=0$ when $X_1$ and $X_2$ are independent.
</p></details>
**Definition: standard error**
The standard deviation of $\bar{X}$ is $\sigma/\sqrt{n}$ and is also referred to as the **standard error** of the mean.
Generally one refers to the standard deviation of an estimator for a particular parameter $\theta$ with the term **standard error** of the estimator, which is denoted as $SE$.
---
### Captopril example
- $n = 15$ differences in systolic blood pressure
- Suppose that the standard deviation of the blood pressure differences in the population is $\sigma = 9.0$ mmHg
- Then, the standard error (SE) of the average systolic blood pressure difference $\bar X$ becomes:
$$
SE = \frac{9.0}{\sqrt{15}} = 2.32\text{ mmHg.}
$$
- Generally $\sigma$, and thus the SE of the sample mean, is unknown.
- So we also have to estimate the standard deviation from the sample to obtain the standard error
- Estimator: $SE=S/\sqrt{n},$
- with $S^2$ the sample variance of $X_1,...,X_n$ and $S$ the sample standard deviation
- For the captopril example we obtain:
```{r}
n <- length(captopril$deltaSBP)
se <- sd(captopril$deltaSBP) / sqrt(n)
se
```
---
### Standard deviation vs standard error
#### Illustrate via repeated sampling
- Different sample sizes: 10, 50, 100
- Draw 1000 samples per sample size from the NHANES study, for each sample we calculate
- The mean
- The sample standard deviation
- The standard error
- We make a boxplot of the sample standard deviations and the standard errors for the different sample sizes
- Instead of using a for loop we will use the `replicate` function, which is more concise: it repeats an expression a given number of times and collects the results.
```{r}
set.seed(1)
femSamp10 <- replicate(1000, sample(fem$DirectChol, size = 10))
femSamp50 <- replicate(1000, sample(fem$DirectChol, size = 50))
femSamp100 <- replicate(1000, sample(fem$DirectChol, size = 100))
## Calculate log2 and convert to data.frame
femSamp10_log2 <- data.frame(log2(femSamp10))
femSamp50_log2 <- data.frame(log2(femSamp50))
femSamp100_log2 <- data.frame(log2(femSamp100))
## Custom function to calculate summary statistics
calculate_stats <- function(x) {
x %>%
gather(sample, log2Chol) %>%
group_by(sample) %>%
summarize(
median = median(log2Chol, na.rm = TRUE),
mean = mean(log2Chol, na.rm = TRUE),
sd = sd(log2Chol, na.rm = TRUE),
n = n()
) %>%
mutate(se = sd / sqrt(n))
}
res <- rbind(
calculate_stats(femSamp10_log2),
calculate_stats(femSamp50_log2),
calculate_stats(femSamp100_log2)
)
```
##### Means {-}
We first illustrate the impact of sample size on the distribution of the means of the different samples
```{r}
res %>%
ggplot(aes(x = as.factor(n), y = mean)) +
geom_boxplot() +
ylab("Direct cholesterol (log2)") +
xlab("sample size")
```
- Note that the variation of the sample means indeed decreases as the sample size increases. So the estimation becomes more precise with increasing sample size.
---
##### Standard deviation {-}
We now illustrate the impact of sample size on the distribution of the standard deviation of the different samples
```{r}
res %>%
ggplot(aes(x = as.factor(n), y = sd)) +
geom_boxplot() +
ylab("standard deviation") +
xlab("sample size")
```
- The standard deviation remains similar across sample sizes.
It is centred around the same value: the standard deviation in the population. Indeed, increasing the sample size does not affect the variability in the population!
- Again we see that the variability of the standard deviation reduces with increasing sample size. So the standard deviation can also be estimated more precisely with increasing sample size.
---
##### Standard error on the mean {-}
Finally, we illustrate the impact of sample size on the distribution of the standard error of the mean of the different samples
```{r}
res %>%
ggplot(aes(x = as.factor(n), y = se)) +
geom_boxplot() +
ylab("standard error") +
xlab("sample size")
```
- The standard error, the estimator for the precision of the sample mean, reduces considerably with increasing sample size, again confirming that the estimation of the sample mean becomes more precise.
---
### Normally distributed data
- For normally distributed data we have multiple estimators for the population mean $\mu$, e.g. the mean and the median.
- But, $\bar{X}$ is the unbiased estimator of $\mu$ with the smallest standard error
- $\bar{X}$ deviates less from the mean $\mu$ than the median
- We illustrate this for repeated sampling with sample size 10
```{r}
res %>%
filter(n == 10) %>%
select(mean, median) %>%
gather(type, estimate) %>%
ggplot(aes(x = type, y = estimate)) +
geom_boxplot() +
geom_hline(yintercept = mean(log2(fem$DirectChol))) +
ggtitle("10 subjects")
```
Next, we compare the distribution of mean and median in repeated samples of sample size 50.
```{r}
res %>%
filter(n == 50) %>%
select(mean, median) %>%
gather(type, estimate) %>%
ggplot(aes(x = type, y = estimate)) +
geom_boxplot() +
geom_hline(yintercept = mean(log2(fem$DirectChol))) +
ggtitle("50 subjects")
```
### Distribution of sample mean
- How does $\bar X$ vary from sample to sample?
- Distribution of $\bar X$?
- If $\bar X$ is normally distributed the standard error has a good interpretation: the s.e. is the standard deviation of the sample mean.
- If the data $X_i$ are normally distributed, the sample mean is also normally distributed.
$$X_i \sim N(\mu,\sigma^2) \rightarrow \bar X \sim N(\mu, \sigma^2/n)$$
---
#### NHANES: cholesterol
We illustrate this again with simulation using the NHANES study.
The log2 cholesterol levels were normally distributed.
```{r}
fem %>%
ggplot(aes(x = DirectChol %>% log2())) +
geom_histogram(aes(y = ..density.., fill = ..count..)) +
xlab("Direct cholesterol (log2)") +
stat_function(
fun = dnorm, color = "red",
args = list(mean = mean(log2(fem$DirectChol)), sd = sd(log2(fem$DirectChol)))
) +
ggtitle("All females in Nhanes study")
fem %>%
ggplot(aes(sample = log2(DirectChol))) +
stat_qq() +
stat_qq_line() +
ggtitle("All females in Nhanes study")
```
---
##### Evaluate distribution for samples with 5 subjects {-}
```{r}
set.seed(1)
femSamp5 <- replicate(1000, sample(fem$DirectChol, size = 5))
femSamp5_log2 <- data.frame(log2(femSamp5))
femSamp5_log2 %>%
ggplot(aes(x = X1)) +
geom_histogram(aes(y = ..density.., fill = ..count..), bins = 10) +
xlab("Direct cholesterol (log2)") +
stat_function(
fun = dnorm, color = "red",
args = list(mean = mean(femSamp5_log2$X1), sd = sd(femSamp5_log2$X1))
) +
ggtitle("5 random females") +
xlim(range(log2(fem$DirectChol)))
femSamp5_means <- colMeans(femSamp5_log2)
ggplot(data.frame(means = femSamp5_means), aes(x = means)) +
geom_histogram(aes(y = ..density.., fill = ..count..), bins = 15) +
xlab("Mean cholesterol (log2)") +
stat_function(
fun = dnorm, color = "red",
args = list(mean = mean(femSamp5_means), sd = sd(femSamp5_means))
) +
ggtitle("Means on 5 females")
data.frame(means = femSamp5_means) %>%
ggplot(aes(sample = means)) +
stat_qq() +
stat_qq_line() +
ggtitle("Means on 5 females")
```
---
##### Explore the distribution of the mean for samples of size 10 {-}
Now we explore the results for the sample size of 10.
We first illustrate the plot for the first sample.
```{r}
femSamp10_log2 %>%
ggplot(aes(x = X1)) +
geom_histogram(aes(y = ..density.., fill = ..count..), binwidth = 0.5) +
xlab("Direct cholesterol (log2)") +
stat_function(
fun = dnorm, color = "red",
args = list(mean = mean(femSamp10_log2$X1), sd = sd(femSamp10_log2$X1))
) +
ggtitle("10 random females") +
xlim(range(log2(fem$DirectChol)))
```
Next we look at the distribution of the sample mean over 1000 samples of sample size 10.
```{r}
femSamp10_means <- colMeans(femSamp10_log2)
ggplot(data.frame(means = femSamp10_means), aes(x = means)) +
geom_histogram(aes(y = ..density.., fill = ..count..), bins = 15) +
xlab("Mean cholesterol (log2)") +
stat_function(
fun = dnorm, color = "red",
args = list(mean = mean(femSamp10_means), sd = sd(femSamp10_means))
) +
ggtitle("Means on 10 females")
data.frame(means = femSamp10_means) %>%
ggplot(aes(sample = means)) +
stat_qq() +
stat_qq_line() +
ggtitle("Means on 10 females")
```
**So we confirmed that the mean is approximately normally distributed for studies with 5 and 10 females when the original data are approximately normally distributed.**
---
#### Captopril study
- For the captopril study the systolic blood pressure differences are approximately normally distributed.
- s.e. = 2.32 mmHg
- In 95 out of 100 studies with $n = 15$ subjects we expect the sample mean of the systolic blood pressure differences ($\bar X$) to lie within $2 \times 2.32 = 4.64$ mmHg of the true population mean of the blood pressure differences ($\mu$).
---
### Non-normally distributed data
- When individual observations do not have a normal distribution, $\bar X$ is still \textit{approximately} normally distributed
when the number of observations is large enough.
- How large does the sample need to be for the normal approximation to work?
- This depends on the skewness of the distribution!
---
#### NHANES: cholesterol {-}
- We can evaluate this in the NHANES study if we do not log2-transform the data.
```{r}
ggplot(fem, aes(x = DirectChol)) +
geom_histogram(aes(y = ..density.., fill = ..count..)) +
xlab("Direct cholesterol") +
stat_function(
fun = dnorm, color = "red",
args = list(mean = mean(fem$DirectChol), sd = sd(fem$DirectChol))) +
ggtitle("All females in Nhanes study")
ggplot(fem, aes(sample = DirectChol)) +
stat_qq() +
stat_qq_line() +
ggtitle("All females in Nhanes study")
```
The cholesterol data is clearly non-Normally distributed.
##### Distribution of the sample mean for different sample sizes {-}
```{r}
## Calculating the means of the simulated samples WITHOUT first log2-transforming
femSamp5_means <- colMeans(femSamp5)
femSamp10_means <- colMeans(femSamp10)
femSamp50_means <- colMeans(femSamp50)
femSamp100_means <- colMeans(femSamp100)
ggplot(data.frame(means = femSamp5_means), aes(x = means)) +
geom_histogram(aes(y = ..density.., fill = ..count..), bins = 15) +
xlab("Mean cholesterol") +
stat_function(
fun = dnorm, color = "red",
args = list(mean = mean(femSamp5_means), sd = sd(femSamp5_means))
) +
ggtitle("Means on 5 females")
data.frame(means = femSamp5_means) %>%
ggplot(aes(sample = means)) +
stat_qq() +
stat_qq_line() +
ggtitle("Means on 5 females")
ggplot(data.frame(means = femSamp10_means), aes(x = means)) +
geom_histogram(aes(y = ..density.., fill = ..count..), bins = 15) +
xlab("Mean cholesterol") +
stat_function(
fun = dnorm, color = "red",
args = list(mean = mean(femSamp10_means), sd = sd(femSamp10_means))
) +
ggtitle("Means on 10 females")
data.frame(means = femSamp10_means) %>%
ggplot(aes(sample = means)) +
stat_qq() +
stat_qq_line() +
ggtitle("Means on 10 females")
data.frame(means = femSamp50_means) %>%
ggplot(aes(sample = means)) +
stat_qq() +
stat_qq_line() +
ggtitle("Means on 50 females")
data.frame(means = femSamp100_means) %>%
ggplot(aes(sample = means)) +
stat_qq() +
stat_qq_line() +
ggtitle("Means on 100 females")
```
- **We observe that when the data are not normally distributed the distribution of the sample mean is not normally distributed in small samples**
- **For large samples, however, the sample mean of non normal data is still approximately normally distributed.**
---
### Central Limit Theorem
Let $X_1, \ldots, X_n$ be a sequence of random variables that are drawn independently from the same distribution (population).
As long as the sample size $n$ is sufficiently large, the sample mean $\bar X$ is approximately normally distributed, irrespective of the distribution of the observations $X_i$.
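A quick sketch of the central limit theorem with strongly skewed (exponential) data; the sample sizes and the choice of distribution are arbitrary assumptions for illustration:

```{r}
# CLT illustration: means of skewed exponential data (rate = 1, true mean = 1)
set.seed(42)
meansSmall <- replicate(1000, mean(rexp(5)))   # n = 5: distribution of means still skewed
meansLarge <- replicate(1000, mean(rexp(100))) # n = 100: approximately normal
par(mfrow = c(1, 2))
hist(meansSmall, main = "n = 5", xlab = "sample mean")
hist(meansLarge, main = "n = 100", xlab = "sample mean")
```

Note how the spread of the sample means also shrinks by roughly $\sqrt{100/5}$ when moving from $n=5$ to $n=100$, in line with $\sigma/\sqrt{n}$.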
### Overview on the distribution of the mean
https://www.nature.com/articles/nmeth.2613.pdf
---
# Interval estimators
- $\bar X$ varies around $\mu$
- Here we will develop an interval around $\bar X$ that will contain the value of $\mu$ with a probability of 95% for a random sample.
- We first assume $\sigma^2$ to be known and we will later relax this assumption.
---
## Normally distributed data with known variance
- $X\sim N(\mu,\sigma^2) \rightarrow \bar X\sim N\left(\mu,\frac{\sigma^2}{n}\right)$
- 95% reference-interval for sample mean
\begin{equation*}
\left[\mu - 1.96 \frac{\sigma}{\sqrt{n}},\mu + 1.96 \frac{\sigma}{\sqrt{n}}%
\right]
\end{equation*}
- The interval contains the sample mean of a random sample with a probability of 95%.
- We can not calculate it because $\mu$ is unknown.
- Estimate $\mu$ by $\bar X$.
\begin{equation*}
\left[\bar X - 1.96 \frac{\sigma}{\sqrt{n}},\bar X + 1.96 \frac{\sigma}{\sqrt{n}}\right]
\end{equation*}
- More useful interpretation:
\vspace{15pt}
- Rewrite $\mu - 1.96 \ \sigma/\sqrt{n} < \bar{X}$ as $\mu < \bar{X} + 1.96 \ \sigma/\sqrt{n}$.
So that we can write
\begin{eqnarray*}
95\% &=& P( \mu - 1.96 \ \sigma/\sqrt{n} < \bar{X} < \mu + 1.96 \ \sigma/\sqrt{n} ) \\
&=&P( \bar{X} - 1.96 \ \sigma/\sqrt{n} < \mu < \bar{X} + 1.96 \ \sigma/\sqrt{n} )
\end{eqnarray*}
---
**Definition of 95% confidence interval on mean**
For a random sample, the interval
\begin{equation}
[\bar{X} - 1.96 \ \sigma/\sqrt{n} , \bar{X} + 1.96 \ \sigma/\sqrt{n} ],
\end{equation}
contains the population mean $\mu$ with a probability of 95%.
---
- The probability that the CI for a random sample contains the population parameter $\mu$, i.e. 95%, is also referred to as the **confidence level**.
- Note that the lower and upper limits of the interval are also random variables that vary from sample to sample. Different samples indeed result in different confidence intervals because they are based on different observations.
- So they are *stochastic intervals*
- 95% of the samples will produce a 95% confidence interval that will contain the population mean $\mu$. The remaining 5% will produce intervals that do not contain the population mean.
- Based on one interval you cannot conclude that it contains the real population parameter, because its value is unknown.
- Generally the standard deviation is unknown and has to be estimated, e.g. by $S$
- For large $n$ $[\bar{X} - 1.96 \ s/\sqrt{n} , \bar{X} + 1.96 \ s/\sqrt{n} ]$ will contain the population mean with a probability of approximately 95%.
---
### NHANES log2 cholesterol example
#### One sample
```{r}
samp50 <- sample(fem$DirectChol, 50)
ll <- mean(samp50 %>% log2()) - 1.96 * sd(samp50 %>% log2()) / sqrt(50)
ul <- mean(samp50 %>% log2()) + 1.96 * sd(samp50 %>% log2()) / sqrt(50)
popMean <- mean(fem$DirectChol %>% log2())
c(ll = ll, ul = ul, popMean = popMean)
```
#### Repeated sampling
```{r}
res$ll <- res$mean - 1.96 * res$se
res$ul <- res$mean + 1.96 * res$se
mu <- fem$DirectChol %>%
log2() %>%
mean()
res$inside <- res$ll <= mu & mu <= res$ul
res$n <- as.factor(res$n)
res %>%
group_by(n) %>%
summarize(coverage = mean(inside)) %>%
spread(n, coverage)
```
- Note that the coverage in the samples with 10 observations is too low because we do not account for the uncertainty in the estimation of the standard deviation.
- If we look to the first 20 intervals, `r sum((!res[res$n==10,"inside"])[1:20])` out of 20 do not contain the population mean.
```{r}
res %>%
filter(n == 10) %>%
slice(1:20) %>%
ggplot(aes(x = sample, y = mean, color = inside)) +
geom_point() +
geom_errorbar(aes(ymin = mean - 1.96 * se, ymax = mean + 1.96 * se)) +
geom_hline(yintercept = mean(log2(fem$DirectChol))) +
ggtitle("20 CI for N=10") +
ylim(range(fem$DirectChol %>% log2()))
```
---
- For large sample sizes (e.g. 100) the coverage is fine because we can estimate the standard deviation with relatively high precision.
```{r}
res %>%
filter(n == 50) %>%
slice(1:20) %>%
ggplot(aes(x = sample, y = mean, color = inside)) +
geom_point() +
geom_errorbar(aes(ymin = mean - 1.96 * se, ymax = mean + 1.96 * se)) +
geom_hline(yintercept = mean(log2(fem$DirectChol))) +
ggtitle("20 CI for N=50") +
ylim(range(fem$DirectChol %>% log2()))
res %>%
filter(n == 100) %>%
slice(1:20) %>%
ggplot(aes(x = sample, y = mean, color = inside)) +
geom_point() +
geom_errorbar(aes(ymin = mean - 1.96 * se, ymax = mean + 1.96 * se)) +
geom_hline(yintercept = mean(log2(fem$DirectChol))) +
ggtitle("20 CI for N=100") +
ylim(range(fem$DirectChol %>% log2()))
```
- What have you observed for the interval width?
---
### Other confidence levels
- We can replace the $z_{2.5\%}=1.96$ by another quantile of the normal distribution $z_{\alpha/2}$ to obtain an interval with another confidence level $1-\alpha$.
- Confidence intervals are not only used for the mean but also for other population parameters.
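The corresponding normal quantiles $z_{\alpha/2}$ can be obtained with `qnorm`:

```{r}
# z quantiles for 90%, 95% and 99% confidence levels
alpha <- c(0.10, 0.05, 0.01)
round(qnorm(1 - alpha / 2), 2) # 1.64 1.96 2.58
```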
---
## Unknown variance
In real examples $\sigma$ is unknown and is estimated from the sample by the sample standard deviation $S$.
- The previous intervals were a bit too small because they did not account for the uncertainty in the estimation of $S$.
- When $n$ is large, $S$ is close to $\sigma$.
- Hence, ${(\bar{X} - \mu)}/{(S/\sqrt{n}) }$ is approximately standard normal and
\begin{equation*}
\left[\bar{X} - z_{\alpha/2} \ \frac{S}{\sqrt{n}} , \bar{X} + z_{\alpha/2} \
\frac{S}{\sqrt{n}}\right]
\end{equation*}
is an approximate $(1- \alpha)100\%$ CI for $\mu$.
- For small samples this no longer holds (e.g. n=10)
The estimation of $S$ introduces additional uncertainty in the standardized value ${(\bar{X} - \mu)}/{(S/\sqrt{n})}$. Its distribution
- is still symmetric but has heavier tails than the normal distribution
- how much heavier the tails are depends on $n$
- is a (Student) $t$-distribution with $n-1$ *degrees of freedom*.
---
### T-distribution
More formally: Let $X_1, X_2, ..., X_n$ be an independent random sample from a Normal distribution $N(\mu, \sigma^2)$, then $(\bar{X} - \mu)/(S/\sqrt{n})$ follows a $t$-distribution with $n-1$ degrees of freedom.
The density of a t-distribution can be calculated in R using the function `dt`. It has arguments `x` for the quantile and `df` for the degrees of freedom.
```{r eval=FALSE}
grid <- seq(-5, 5, .1)
densDist <- cbind(grid, dnorm(grid), sapply(c(2, 5, 10), dt, x = grid))
colnames(densDist) <- c("x", "normal", paste0("t", c(2, 5, 10)))
densDist %>%
as.data.frame() %>%
gather(dist, dens, -x) %>%
ggplot(aes(x = x, y = dens, color = dist)) +
geom_line() +
ylab("Density")
```
t-distributions have heavier tails than the normal distribution $\rightarrow$ larger quantiles, so broader intervals for the same confidence level.
- This captures the additional uncertainty for estimating $S$.
- If $n \rightarrow \infty$ then $t(df) \rightarrow N(0,1)$
- Quantiles of the $t$-distribution can be calculated in R using `qt`. e.g. 95%, 97.5%, 99.5% quantile for a t-distribution with 14 degrees of freedom:
```{r}
qt(.975, df = 14)
qt(c(.95, .975, .995), df = 14)
```
- These quantiles can be used to calculate 90%, 95% and 99% CI.
- 97.5% quantile `r round(qt(.975,df=14),2)` of a t-distribution with $n-1=14$ degrees of freedom is indeed larger than that of a standard Normal distribution `r round(qnorm(.975),2)`.
---
### Confidence interval based on the t-distribution
The $100\% (1-\alpha)$ CI for the mean $\mu$ of a
Normal distributed random variable $X$ with unknown variance is
\begin{equation*}
\left[\bar{X} - t_{n-1, \alpha/2} \frac{s}{\sqrt{n}} , \bar{X} + t_{n-1,
\alpha/2} \frac{s}{\sqrt{n}}\right]
\end{equation*}
- We simply replace the $(1-\alpha/2)100\%$ quantile of the Normal distribution by that of the t-distribution with $n-1$ degrees of freedom.
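As a sketch, this interval can be computed directly with `qt`. The values of `xbar` and `s` below are placeholders; in the captopril analysis you would use `mean(captopril$deltaSBP)` and `sd(captopril$deltaSBP)`:

```{r}
# 95% CI for a mean with unknown variance, based on the t-distribution
xbar <- -18.9 # placeholder sample mean
s <- 9.0      # placeholder sample standard deviation
n <- 15
alpha <- 0.05
tq <- qt(1 - alpha / 2, df = n - 1) # 2.14 for df = 14
c(lower = xbar - tq * s / sqrt(n), upper = xbar + tq * s / sqrt(n))
```

Because $t_{14, 0.025} = 2.14 > z_{0.025} = 1.96$, this interval is somewhat wider than the normal-quantile interval from the previous section, reflecting the extra uncertainty from estimating $S$.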