-
Notifications
You must be signed in to change notification settings - Fork 0
/
chi_square.qmd
246 lines (164 loc) · 7.32 KB
/
chi_square.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
# Chi-sqaure test of independence {#sec-chi_square}
```{r}
#| include: false
library(fontawesome)
library(htmlwidgets)
```
If we want to see whether there's an association between two categorical variables we can use the Pearson's chi-square test, often called chi-square test of independence. This is an extremely elegant statistic based on the simple idea of comparing the frequencies we observe in certain categories to the frequencies we might expect to get in those categories by chance.
When we have finished this Chapter, we should be able to:
::: {.callout-caution icon="false"}
## Learning objectives
- Applying hypothesis testing
- Investigate the association between two categorical variables using the chi-square test
- Interpret the results
:::
## Research question and Hypothesis Testing
We will use the "Survival from Malignant Melanoma" dataset named "meldata". The data consist of measurements made on patients with malignant melanoma, a type of skin cancer. Each patient had their tumor removed by surgery at the Department of Plastic Surgery, University Hospital of Odense, Denmark, between 1962 and 1977.
Suppose we are interested in the association between tumor ulceration and death from melanoma.
::: {.callout-note icon="false"}
## Null hypothesis and alternative hypothesis
- $H_0$: There is no association between the two categorical variables (they are independent)
- $H_1$: There is association between the two categorical variables (they are dependent)
:::
**NOTE:** In practice, the null hypothesis of independence, for our particular question, is no difference in the proportion of patients with ulcerated tumors who die compared with non-ulcerated tumors ($p_{ulcerated} = p_{non-ucerated}$).
## Packages we need
We need to load the following packages:
```{r}
#| message: false
#| warning: false
library(rstatix)
library(ggsci)
library(patchwork)
library(here)
library(tidyverse)
```
## Preparing the data
We import the data *meldata* in R:
```{r}
#| warning: false
library(readxl)
meldata <- read_excel(here("data", "meldata.xlsx"))
```
```{r}
#| echo: false
#| label: fig-meldata
#| fig-cap: Table with data from "meldata" file.
DT::datatable(
meldata, extensions = 'Buttons', options = list(
dom = 'tip'
)
)
```
We inspect the data and the type of variables:
```{r}
glimpse(meldata)
```
The data set *meldata* has 250 patients (rows) and includes seven variables (columns). We are interested in the character (`<chr>`) `ulcer` variable and the character (`<chr>`) `status` variable which should be converted to factor (`<fct>`) variables using the `convert_as_factor()` function as follows:
```{r}
meldata <- meldata %>%
convert_as_factor(status, ulcer)
glimpse(meldata)
```
## Plot the data
We are interested in the association between tumor ulceration and death from melanoma. It is useful to plot the data as counts but also as percentages. It is percentages we are comparing, but we really want to know the absolute numbers as well.
```{r}
#| warning: false
#| label: fig-barplot1
#| fig-cap: Bar plot.
p1 <- meldata %>%
ggplot(aes(x = ulcer, fill = status)) +
geom_bar(width = 0.7) +
scale_fill_jco() +
theme_bw(base_size = 14) +
theme(legend.position = "bottom")
p2 <- meldata %>%
ggplot(aes(x = ulcer, fill = status)) +
geom_bar(position = "fill", width = 0.7) +
scale_y_continuous(labels=scales::percent) +
scale_fill_jco() +
ylab("Percentage") +
theme_bw(base_size = 14) +
theme(legend.position = "bottom")
p1 + p2 +
plot_layout(guides = "collect") & theme(legend.position = 'bottom')
```
Just from the plot, death from melanoma in the ulcerated tumor group is around 40% and in the non-ulcerated group around 13%. The number of patients included in the study is not huge, however, this still looks like a real difference given its effect size.
## Contigency table and Expected frequencies
First, we will create a contingency *2x2* table (two categorical variables with exactly two levels each) with the frequencies using the Base R.
```{r}
tb1 <- table(meldata$ulcer, meldata$status)
tb1
```
Next, we will also create a more informative table with row percentages and marginal totals.
::: {#exercise-joins .callout-tip}
## Table with row percentages and marginal totals
::: panel-tabset
## finalfit
Using the function `summary_factorlist()` which is included in `{finalfit}` package for obtaining row percentages and marginal totals:
```{r}
row_tb1 <- meldata %>%
finalfit::summary_factorlist(dependent = "status", add_dependent_label = T,
explanatory = "ulcer", add_col_totals = T,
include_col_totals_percent = F,
column = FALSE, total_col = TRUE)
knitr::kable(row_tb1)
```
## modelsummary
The contingency table using the `datasummary_crosstab()` from the `{modelsummary}` package:
```{r}
modelsummary::datasummary_crosstab(ulcer ~ status, data = meldata)
```
:::
:::
From the raw frequencies, there seems to be a large difference, as we noted in the plot we made above. The proportion of patients with ulcerated tumors who die equals to 45.6% compared with non-ulcerated tumors 13.9%.
## Assumptions
::: {.callout-note icon="false"}
## Check if the following assumption is satisfied
A commonly stated assumption of the chi-square test is the requirement to have an expected count of **at least 5 in each cell** of the `2x2` table.
For larger tables, all expected counts should be \> 1 and no more than 20% of all cells should have expected counts \< 5.
:::
We can calculate the **expected frequencies** for each cell using the `expected()` function from `{epitools}` package:
```{r}
epitools::expected(tb1)
```
Here, as we observe the assumption is fulfilled.
## Run Pearson's chi-square test
Finally, we run the chi-square test:
::: callout-tip
## chi-square test
::: panel-tabset
## Base R
```{r}
chisq.test(tb1)
```
## rstatix
```{r}
chisq_test(tb1)
```
:::
:::
There is evidence for an association between the ulcer and status (reject $H_0$). The proportion of patients with ulcerated tumors who died (45.6%) is significant larger compared with non-ulcerated tumors (13.9%) (p\<0.001).
## Risk Ratio and Odds ratio
**Risk ratio**
From the data in the following table
```{r}
epitools::table.margins(tb1)
```
\vspace{14pt}
we can calculate the risk ratio by hand: $$ Risk \ Ratio = \frac{\frac{41}{90}}{\frac{16}{115}} =\frac{0.4556}{0.1391} = 3.27$$
The risk ratio with the 95% CI using R:
```{r}
epitools::riskratio(tb1)$measure
```
The risk of dying is 3.27 (95% CI: 1.97, 5.4) times higher for patients with ulcerated tumors compared to non-ulcerated tumors.
**Odds ratio**
We can also calculate the odds ratio by hand: $$ Odds \ Ratio = \frac{\frac{41}{49}}{\frac{16}{99}} =\frac{0.837}{0.162} = 5.17$$ The odds ratio with the 95% CI using R:
```{r}
epitools::oddsratio(tb1, method = "wald")$measure
```
The odds of dying is 5.17 (95% CI: 2.65, 10.13) times higher for patients with ulcerated tumors compared to non-ulcerated tumors patients.
Finnaly, we can also reverse the odds ratio: $$ \frac{1}{OR} = \frac{1}{5.17} = 0.193$$
```{r}
epitools::oddsratio(tb1, method = "wald", rev = "rows")$measure
```
The non-ulcerated tumors patients have 0.193 (95% CI: 0.098, 0.378) times the odds (of dying) of the ulcerated tumors. This means that the non-ulcerated tumors patients have (0.193 - 1 = -0.807) 80.7% lower odds of dying than ulcerated tumors.