# Populations and samples {#sec-sampling}
```{r}
#| include: false
library(fontawesome)
# Install the tagger package from GitHub only if it is not already available
if (!requireNamespace("tagger", quietly = TRUE)) {
  devtools::install_github("eliocamp/tagger")
}
library(tagger)
library(tidyverse)
```
::: {.callout-caution icon="false"}
## `r fa("circle-dot", prefer_type = "regular", fill = "red")` Learning objectives
- Explain the difference between population, sample and sampling distributions
- Understand the properties of the distribution of sample means
- Understand the central limit theorem
:::
## Population, sample, and point estimation
### Population and parameters
In statistics, a **population** is a theoretical concept: the complete collection of individuals (not necessarily people) sharing specific defining characteristics. Examples include all patients with diabetes mellitus, all people with depression, or all middle-aged women.
Researchers are particularly interested in quantities such as the population mean and population variance of random variables (characteristics) within these populations. These values are typically not directly observable and are known as **parameters**. We use lowercase Greek letters for parameters, such as $\mu$ and $\sigma^2$, to denote the population mean and variance, respectively. For example, researchers may want to know what the mean depression score would be if every person with depression were treated with a new anti-depression treatment.
### Sample and sample statistics
In practice, researchers encounter constraints in resources and time, particularly when dealing with large or inaccessible populations, making it impractical to study each individual within the population (e.g., every individual with depression in the world). As a result, obtaining an exact value of a population parameter is typically unattainable. Instead, researchers analyze a **sample**, a subset of the population intended to be representative. In such cases, a point estimator is utilized to calculate an estimate of the unknown parameter based on the measurements obtained from the sample. For example, a **sample statistic** such as the sample mean can serve as an estimator for the population mean.
::: content-box-blue
In most cases, the best way to obtain a sample that accurately represents the population is to take a **random sample**. When selecting a random sample, each individual in the population has an **equal** and **independent** chance of being included in the sample.
:::
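For instance, a simple random sample can be drawn in R with the `sample()` function; the patient register below is hypothetical, a minimal sketch rather than study data:

```{r}
set.seed(1)                      # for reproducibility
patient_ids <- 1:1000            # a hypothetical register of 1,000 patients
sample(patient_ids, size = 10)   # each patient has an equal chance of selection
```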
The statistical framework in @fig-sample_frame illustrates the process of inferring population parameters using sample statistics.
![The statistical framework for population parameter estimation.](images/sample_frame.png){#fig-sample_frame fig-align="center" width="80%"}
::: content-box-green
**Point estimation**
Point estimation is a statistical method for estimating an unknown population parameter from data collected in a sample. The objective is to find a single best guess, or **estimate**, of the parameter's value. Common point estimators include the sample mean for the population mean and the sample variance for the population variance.
:::
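As a minimal sketch of point estimation (using simulated data, not a real study), the sample mean and sample variance estimate the corresponding population parameters:

```{r}
set.seed(2)
bp_pop <- rnorm(100000, mean = 126, sd = 10)  # a simulated population of BP values
samp <- sample(bp_pop, size = 50)             # a random sample of n = 50
mean(samp)  # point estimate of the population mean (mu = 126)
var(samp)   # point estimate of the population variance (sigma^2 = 100)
```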
### Error in the estimate
The difference between the point estimate and the population parameter is referred to as the **error** in the estimate. This is a "total" error comprising two components:
$$\text{total error} = \text{bias} + \text{sampling error}$$ {#eq-total_error}
- **Bias:** This refers to a tendency to overestimate or underestimate the true value of the population parameter. There are numerous sources of bias in a study, including measurement bias (i.e., errors in measuring exposure or disease), sampling bias (i.e., some members of a population are systematically more likely to be selected in a sample than others), recall bias (i.e., when participants in a research study do not accurately remember a past event or experience), and attrition bias (i.e., systematic differences between study groups in the number and the way participants are lost from a study). Bias can be minimized through thoughtful design of the study (e.g., a comprehensive protocol), careful data collection procedures, and the application of suitable statistical techniques [@brown2024].
- **Sampling error:** This measures the extent to which an estimate tends to vary from one sample to another due to random chance. Our objective is frequently to quantify and understand this variability in estimates. Standard error and confidence intervals, common measures of sampling error, are primarily influenced by both the sample size and the variability of the estimated characteristic.
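A small simulation can separate these two components; the +5 mmHg measurement offset below is assumed purely for illustration:

```{r}
set.seed(3)
true_mu <- 126
bp_pop <- rnorm(100000, mean = true_mu, sd = 10)
# unbiased estimates vary around true_mu through sampling error alone
unbiased <- replicate(1000, mean(sample(bp_pop, 25)))
# a hypothetical measurement instrument adds a systematic +5 mmHg offset
biased <- replicate(1000, mean(sample(bp_pop, 25) + 5))
mean(unbiased) - true_mu  # close to 0: no bias
mean(biased) - true_mu    # close to +5: the bias
sd(unbiased)              # sampling error: about 10 / sqrt(25) = 2
```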
::: {.callout-tip icon="false"}
## `r fa("comment", fill = "#1DC5CE")` Comment
The sampling error is the error caused by observing a sample instead of the whole population. There is no sampling error in a census because the calculations are based on the entire population.
:::
## Sampling distribution
### What is a sampling distribution?
Suppose a hospital is interested in the average blood pressure (BP) of its diabetic patients, but measuring every patient is impractical. Instead, the hospital randomly selects *n* patients from this group and measures their BP. The resulting average BP for this sample, say $\bar{x_1} = 130$ mmHg, represents one sample mean (@fig-sample1).
![The estimate of mean, $\bar{x_1} =130$, represents an observed sample mean from an already collected sample.](images/sample1.png){#fig-sample1 fig-align="center" width="25%"}
Consider repeating the sampling process by randomly selecting various samples, each consisting of *n* patients, and calculating their average blood pressure. This would yield a range of different sample means, such as $\bar{x_1} = 130$ mmHg, $\bar{x_2} = 128$ mmHg, $\bar{x_3} = 133$ mmHg, and so forth. This collection of multiple sample means (obtained from different samples) is the sampling distribution of the mean BP (@fig-many_samples).
![The collection of sample means derived from repeated sampling forms a sampling distribution of the mean BP.](images/many_samples.png){#fig-many_samples fig-align="center" width="70%"}
::: content-box-blue
The **sampling distribution** is a theoretical probability distribution that represents the possible values of a sample statistic, such as the sample mean, obtained from all possible samples of a specific size drawn from a population.
:::
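A sampling distribution can be approximated by simulation: `replicate()` repeats the sample-and-average step many times. The population below is simulated for illustration only:

```{r}
set.seed(4)
bp <- rnorm(100000, mean = 126, sd = 10)        # simulated population
xbars <- replicate(2000, mean(sample(bp, 30)))  # 2,000 sample means with n = 30
mean(xbars)  # close to the population mean, 126
sd(xbars)    # close to 10 / sqrt(30), approximately 1.83
```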
| Quantity | Population | Sample | Sampling distribution of the mean |
|------------------|:----------------:|:----------------:|:-----------------:|
| **Mean** | $\mu$ | $\bar{x}$ | $\mu_{\bar{x}}$ |
| **Standard deviation** | $\sigma$ | $s$ | $\sigma_{\bar{x}}$ |
: Notation for the mean and standard deviation of the population, the sample, and the sampling distribution {#tbl-symbols}
### Standard Error of the mean (SEM)
The standard deviation of a sampling distribution is known as the standard error (SE); its formula depends on the statistic whose sampling distribution we are describing. For the sample mean, the standard error of the mean (SEM) is the population standard deviation $\sigma$ divided by the square root of the sample size $n$:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$ {#eq-se}
However, we usually do not know the population parameter $\sigma$, so we substitute the sample standard deviation $s$, which is an estimator of the population standard deviation:
$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$ {#eq-se2}
The **standard error of the mean** (SEM) describes the variability of sample means within the **sampling distribution**. In practice, it quantifies the uncertainty associated with estimating the population mean from a sample, particularly when the sample size is small.
::: {.callout-tip icon="false"}
## `r fa("comment", fill = "#1DC5CE")` Comment
The standard error is essential in constructing **confidence intervals** around point estimates, a process known as interval estimation (@sec-conf_intervals).
:::
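As a quick preview of that idea (a sketch with hypothetical numbers; interval estimation is covered in @sec-conf_intervals), an approximate 95% confidence interval is the point estimate plus or minus 1.96 standard errors:

```{r}
xbar <- 120                  # a hypothetical sample mean (mmHg)
se   <- 2                    # a hypothetical standard error
xbar + c(-1.96, 1.96) * se   # approximate 95% confidence interval
```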
::: content-box-yellow
`r fa("arrow-right", fill = "orange")` **Example**
The CD4 count of a sample of 64 healthy individuals has a mean of 850 $counts/mm^3$ with a standard deviation of 240 $counts/mm^3$. The standard error of the mean for this sample is calculated as follows:
$SE_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{240}{\sqrt{64}} = \frac{240}{8} = 30 \ counts/mm^3$
The SEM provides a measure of the precision of our estimate of the population mean CD4 counts. If we were to repeat the sampling process numerous times and calculate the mean CD4 count each time, we would expect the calculated means to vary around the population mean by approximately 30 $counts/mm^3$.
:::
**In R:**
```{r}
sem <- 240 / sqrt(64)  # SEM = s / sqrt(n)
sem
```
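When raw measurements are available rather than summary statistics, the same calculation uses `sd()` and `length()` directly; the CD4 values below are made up for illustration:

```{r}
cd4 <- c(810, 900, 775, 880, 850, 935, 790, 860)  # hypothetical raw CD4 counts
sd(cd4) / sqrt(length(cd4))                       # SEM = s / sqrt(n)
```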
### Properties of the distribution of sample means
Consider a population of 100,000 adults with a mean blood pressure (BP) of $\mu = 126$ mmHg and a standard deviation of $\sigma = 10$ mmHg. Now, imagine that the distribution of their BP follows a bimodal pattern, as depicted in @fig-pop0.
```{r}
#| echo: false
#| message: false
#| warning: false
set.seed(46)
# Create a bimodal population of 100,000 BP values (a mixture of two normals)
pop1 <- rnorm(70000, mean = 120, sd = 3)
pop2 <- rnorm(30000, mean = 140, sd = 6)
pop <- c(pop1, pop2)
mu <- mean(pop) #calculate the population mean
sigma <- sd(pop) #calculate the population standard deviation
popdf <- as.data.frame(pop)
```
```{r}
#| echo: false
#| message: false
#| warning: false
#| fig-align: center
#| label: fig-pop0
#| fig-cap: A hypothetical population of 100,000 observations. The dashed black line represents the population mean, μ.
#| fig-width: 7.0
#| fig-height: 3.5
# histogram
ggplot(popdf, aes(x = pop)) + geom_histogram(color = "black", fill = "#894ae0", alpha=0.3) +
geom_vline(xintercept = mu, linetype = "dashed", linewidth = 0.8) +
theme_classic() +
ggtitle("Histogram of Population") + xlab("x") +
theme(axis.title = element_text(hjust = 1))
```
Let's consider sampling five individuals from the population and calculating their sample mean BP, denoted $\bar{x_1}$. Next, let's repeat this process by collecting a second sample of five individuals and calculating the sample mean again, denoted $\bar{x_2}$. Iterating this process 100 times ($N = 100$) generates the histogram of the sample means in @fig-simulation2 a. We then rerun the simulation with sample sizes of 10, 30, 50, and 100 (@fig-simulation2 b, c, d, e).
```{r}
#| echo: false
#| message: false
#| warning: false
n <- c(5, 10, 30, 50, 100) # sample sizes
t <- c(100)                # number of samples drawn per simulation
df <- data.frame()         # initialize an empty data frame
# Run the simulation: for each sample size, draw t random samples from the
# population and record each sample mean
for (i in n) {
  for (j in t) {
    value <- replicate(j, mean(sample(pop, i, replace = TRUE)))
    df <- rbind(df, data.frame(trial = 1:j, value = value, sdev = sd(value),
                               sample_size = i, No.samples = j))
  }
}
rm(i, j, n, t, value) # clean up
```
```{r}
#| echo: false
#| message: false
#| warning: false
df2 <- df |>
mutate(sample_size = factor(sample_size)) |>
group_by(sample_size) |>
summarize(my_mean = mean(value))
```
```{r}
#| echo: false
#| message: false
#| warning: false
#| fig-align: center
#| label: fig-simulation2
#| fig-cap: Distribution of sample means. Each panel represents a simulation of 100 random samples of size 5, 10, 30, 50, and 100 taken from the population data. The dashed black line represents the population mean, μ, and the yellow dashed line represents the mean of the sample means, $\mu_{\bar{x}}$.
#| fig-width: 7.0
#| fig-height: 12.0
# Create the faceted histogram of sample means
ggplot(df, aes(x = value)) +
geom_histogram(fill = "steelblue", binwidth = 0.8) +
geom_vline(data = df2, aes(xintercept = my_mean), color = "yellow", linetype = "dashed", linewidth = 0.8) +
geom_vline(xintercept = mu, linetype = "dashed", color = "gray30", linewidth = 0.7) +
xlab(expression(bar(x))) +
theme(axis.title = element_text(hjust = 1),
legend.position = "none") +
facet_grid(sample_size ~ No.samples, scales="free_y", labeller = label_both) +
tag_facets()
```
From @fig-simulation2, it is evident that as the sample size increases, the distribution of sample means tends to approximate a normal distribution, with the mean of this distribution, $\mu_{\bar{x}}$, approaching the population mean, $\mu$.
::: content-box-blue
**Properties of the distribution of sample means**
- As the sample size increases, the mean of a large number of sample means converges to the population mean. This property is known as the **law of large numbers**.
- The standard error of the mean (SEM) decreases as the sample size increases.
- As the sample size increases, the distribution of sample means tends to approximate a **normal** distribution.
:::
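These properties are easy to verify by simulation; the sketch below (independent of the chapter's figures) compares the empirical SEM with $\sigma/\sqrt{n}$ as the sample size grows:

```{r}
set.seed(5)
bp <- rnorm(100000, mean = 126, sd = 10)  # simulated population with sigma = 10
for (n in c(5, 20, 80)) {
  xbars <- replicate(2000, mean(sample(bp, n)))
  cat("n =", n, "| empirical SEM:", round(sd(xbars), 2),
      "| sigma/sqrt(n):", round(10 / sqrt(n), 2), "\n")
}
```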
## Central Limit Theorem (CLT) for sample means
::: content-box-green
The **Central Limit Theorem** (CLT) for sample means states that, given a sufficiently large sample size, the sampling distribution of the mean of a variable approximates a normal distribution **regardless** of the variable's underlying population distribution: $\overline{X} \sim N(\mu, \sigma^2/n)$.
*NOTE: The CLT can be applied in inferential statistics to various test statistics, such as the difference in means, the difference in proportions, and the slope of a linear regression model, under the assumption of large samples and the absence of extreme skewness.*
:::
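As a quick numerical check of the theorem before the fuller illustration below (a sketch using an arbitrary right-skewed population), the means of samples from an exponential distribution behave like draws from $N(\mu, \sigma^2/n)$:

```{r}
set.seed(6)
skewed_pop <- rexp(100000, rate = 0.2)  # right-skewed population: mu = 5, sigma = 5
n <- 30
xbars <- replicate(5000, mean(sample(skewed_pop, n)))
mean(xbars)  # close to mu = 5
sd(xbars)    # close to sigma / sqrt(n) = 5 / sqrt(30), approximately 0.91
# about 95% of sample means fall within mu plus or minus 1.96 * sigma / sqrt(n)
mean(abs(xbars - 5) < 1.96 * 5 / sqrt(n))
```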
To illustrate this, let's generate some data from a continuous uniform distribution (100,000 observations):
```{r}
#| echo: false
#| message: false
#| warning: false
set.seed(46)
#get data from uniform distribution
pop2 <- runif(100000, min = 0, max = 1)
mu2 <- mean(pop2) #calculate the population mean
sigma2 <- sd(pop2) #calculate the population standard deviation
popdf2 <- as.data.frame(pop2)
```
```{r}
#| echo: false
#| message: false
#| warning: false
#| fig-align: center
#| label: fig-pop
#| fig-cap: A hypothetical population of 100,000 observations. The dashed black line represents the population mean, μ.
#| fig-width: 7.0
#| fig-height: 3.5
# histogram
ggplot(popdf2, aes(x = pop2)) +
geom_histogram(color='black', fill = "#894ae0",
alpha=0.3, binwidth = 0.1, boundary = 0) +
geom_vline(xintercept = mu2, linetype = "dashed", linewidth = 1.2) +
theme_classic() +
ggtitle("Destribution of Simulated data for Population") +
xlab("x") +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
theme(axis.title = element_text(hjust = 1))
```
We can consider the data we've just created as the entire population (N = 100,000) from which to sample. We draw repeated sets of samples with different numbers of samples (50, 70, 100) and sample sizes (5, 10, 30), and generate histograms of the sample means (@fig-simulation).
```{r}
#| echo: false
#| message: false
#| warning: false
n <- c(5, 10, 30)   # sample sizes
t <- c(50, 70, 100) # numbers of samples drawn per simulation
df4 <- data.frame() # initialize an empty data frame
# Run the simulation: for each combination of sample size and number of
# samples, draw the samples from the population and record each sample mean
for (i in n) {
  for (j in t) {
    value <- replicate(j, mean(sample(pop2, i, replace = TRUE)))
    df4 <- rbind(df4, data.frame(trial = 1:j, value = value, sdev = sd(value),
                                 sample_size = i, No.samples = j))
  }
}
rm(i, j, n, t, value) # clean up
```
```{r}
#| echo: false
#| message: false
#| warning: false
#| fig-align: center
#| label: fig-simulation
#| fig-cap: Distribution of sample means. Each panel shows the sample means from 50, 70, or 100 random samples of size 5, 10, or 30 taken from the uniform population data. The dashed black line represents the population mean, μ.
#| fig-width: 8.0
#| fig-height: 10.0
# Create the faceted histogram of sample means
ggplot(df4, aes(x = value)) +
geom_histogram(fill = "steelblue", binwidth = 0.06) +
geom_vline(xintercept = mu2, linetype = "dashed", color = "gray30", linewidth = 0.7) +
xlab(expression(bar(x))) +
theme(axis.title = element_text(hjust = 1)) +
facet_grid(sample_size ~ No.samples, scales="free_y", labeller = label_both) +
tag_facets()
```
We observe that the distribution of sample means for samples of size five exhibits considerable variability (@fig-simulation a, b, c). As the number of sample means grows and the sample size increases to 30, the distributions become increasingly symmetric, show less variability, and tend toward approximate normality (@fig-simulation g, h, i).