# Measures of reliability {#sec-reliability}
```{r}
#| include: false
library(fontawesome)
library(htmlwidgets)
```
In this chapter, we explore measures of relative and absolute reliability (or agreement) that are used to assess the consistency, stability, and reproducibility of measurements or judgments.
When we have finished this chapter, we should be able to:
::: {.callout-caution icon="false"}
## `r fa("circle-dot", prefer_type = "regular", fill = "red")` Learning objectives
- Understand the concepts of relative and absolute reliability
- Apply appropriate measures of reliability
- Interpret the results of reliability analysis
:::
## Relative and absolute reliability
Two distinct types of reliability are used: relative reliability and absolute reliability (agreement) [@kottner2011].
- **Relative Reliability** is defined as the ratio of variability between scores of the same subjects (e.g., by different raters or at different times) to the total variability of all scores in the sample. Reliability coefficients, such as the intra-class correlation coefficient for numerical data or Cohen's kappa for categorical data, are employed as suitable metrics for this purpose.
- **Absolute Reliability (or Agreement)** pertains to the assessment of whether scores, or judgments are identical or comparable, as well as the extent to which they might differ. Typical statistical measures employed to quantify this degree of error are the standard error of measurement (SEM) and the limits of agreement (LOA) for numerical data.
## Packages we need
We need to load the following packages:
```{r}
#| message: false
#| warning: false
# packages for graphs and tables
library(ggExtra)
library(janitor)
# packages for reliability analysis
library(irr)
library(irrCAC)
library(SimplyAgree)
library(blandr)
library(BlandAltmanLeh)
library(flextable)
library(vcd)
# packages for data import and manipulation
library(here)
library(tidyverse)
```
## Reliability for continuous measurements
### Research question
The parent version of the Gait Outcomes Assessment List (GOAL) questionnaire is a parent-reported outcome assessment of family priorities and functional mobility for ambulatory children with cerebral palsy. We aim to examine the test--retest reliability of the GOAL total score (score range: 0--100), which is an indicator of the stability of the questionnaire.
::: content-box-blue
**Test-Retest Reliability**
Test-retest reliability is used to assess the consistency and stability of a measurement tool (e.g. self-report survey instrument) over time on the same subjects under the same conditions. Specifically, assessing test-retest reliability involves administering the measurement tool to a group of individuals initially (time 1), subsequently reapplying it to the same group at a later time (time 2), and finally examining the correlation between the two sets of scores obtained.
| ID | Measurement 1 (time 1) | Measurement 2 (time 2) |
|:---:|:----------------------:|:----------------------:|
| 1 | $x_{11}$ | $x_{12}$ |
| 2 | $x_{21}$ | $x_{22}$ |
| ... | ... | ... |
| n | $x_{n1}$ | $x_{n2}$ |
: Measurements in two time points Time 1 and Time 2. {#tbl-data_str}
:::
### Preparing the data
The GOAL questionnaire was completed twice, 30 days apart, in a prospective cohort study of 127 caregivers of children with cerebral palsy and the data were recorded as follows:
```{r}
#| warning: false
library(readxl)
goal <- read_excel(here("data", "goal.xlsx"))
```
```{r}
#| echo: false
#| label: fig-goal
#| fig-cap: Table with data from "goal" file.
DT::datatable(
goal, extensions = 'Buttons', options = list(
dom = 'tip',
columnDefs = list(list(className = 'dt-center', targets = "_all"))
)
)
```
We will begin our investigation into the association between the first (GOAL1) and the second (GOAL2) measurement by generating a scatter plot:
```{r}
#| warning: false
#| label: fig-correlation
#| fig-cap: Scatter plot of the association between individual GOAL scores on the test and retest measurements (n = 127).
#| fig-width: 9.0
#| fig-height: 6.5
ggplot(goal, aes(GOAL1, GOAL2)) +
geom_point(color = "black", size = 2) +
lims(x = c(0, 100), y = c(0,100)) +
geom_abline(intercept = 0, slope = 1, linewidth = 1.0, color = "blue") +
coord_fixed(ratio = 1)
```
The scatter plot compares the GOAL1 and GOAL2 total scores. The solid blue diagonal line is the line of equality (i.e. the reference line: Y = X), which represents perfect agreement between the two measurements.
### True measurement and the measurement error
The observed scores (X) from an instrument are thought to be composed of the underlying true score (T) and the total (systematic and random) error of a measurement (E):
$$
X = True + Error
$$ {#eq-score}
If the true measurement and the error term are uncorrelated, the measurement variance, $\sigma^2_X$, is given by:
$$
\sigma^2_X = \sigma^2_{True} + \sigma^2_{Error}
$$ {#eq-sigma}
::: {.callout-tip icon="false"}
## `r fa("comment", fill = "#1DC5CE")` Comment
The total variability can be broken down into smaller pieces based on characteristics of the study design.
:::
### Intra-class correlation coefficient (ICC)
Test-retest data of continuous measurements are often assessed using the intra-class correlation coefficient $ρ_{ICC}$. The $ρ_{ICC}$ is a ratio generally defined as:
$$
ρ_{ICC} =\frac{\sigma^2_{True}}{\sigma^2_{X}}
$$ {#eq-ricc}
The $ρ_{ICC}$ correlation coefficient ranges from 0 to 1, where higher values indicate better test-retest reliability. The population intra-class correlation coefficient $ρ_{ICC}$ can be estimated using the ICC index. There are three different types of ICC, representing different mathematical models:
- one-way random effects model ICC(1)
- two-way random effects model ICC(A,1)
- two-way mixed-effects model ICC(C,1)
where **A** stands for "Agreement" and **C** stands for "Consistency".
The choice of the appropriate ICC model depends on several factors, including how the data were collected, which variance components are considered relevant, and the specific type of reliability (agreement or consistency) we intend to assess [@liljequist2019].
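For orientation, these three models correspond to different argument combinations of the `icc()` function from the `{irr}` package. The sketch below is not evaluated here; the actual call for our data appears later in this section.
```{r}
#| eval: false
# sketch: how the three ICC models map to arguments of irr::icc()
# (the default unit = "single" corresponds to single-measurement ICCs)
icc(goal, model = "oneway")                        # ICC(1)
icc(goal, model = "twoway", type = "agreement")    # ICC(A,1)
icc(goal, model = "twoway", type = "consistency")  # ICC(C,1)
```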
In the context of our example, it is recommended to use the **"two-way"** model rather than the "one-way" model and **"agreement"** rather than "consistency", as systematic differences in the individual scores on the GOAL instrument over time are of interest [@qin2018].
For this model the @eq-score becomes:
$$
X = μ + ID + M + Residual
$$
where $μ$ is a constant (the mean value of X for the whole population), the *ID* term is the difference due to subjects, *M* is the difference due to measurements (in our case, different time points), and *Residual* is the random error.
The variance $\sigma^2_X$ for this model consists of three variance components:
$$
\sigma^2_X=\sigma^2_{ID} +\sigma^2_{M} +\sigma^2_{Residual}
$$
In the population, the intra-class correlation coefficient, $ρ_{A, 1}$, is defined as:
$$
ρ_{A, 1} = \frac{\sigma^2_{ID}}{\sigma^2_{ID} +\sigma^2_{M} +\sigma^2_{Residual}}
$$ {#eq-rA1}
where $\sigma^2_{ID}$, $\sigma^2_{M}$, and $\sigma^2_{Residual}$ are the variances of subjects (ID), measurements (M) and error (Residual), respectively.
A statistical estimate of the $ρ_{A, 1}$ for agreement is given by the ICC(A,1) formula [@mcgraw1996; @koo2016; @qin2018; @liljequist2019]:
$$
ICC(A, 1) = \frac{MSB-MSE}{MSB + (k-1) MSE + \frac{k}{n}(MSM-MSE)}
$$ {#eq-iccA1}
where $MSB$ = Mean Square Between subjects; $MSE$ = Mean Square Error; $MSM$ = Mean Square (Between) Measurements; $n$ = number of subjects; $k$ = number of measurements.
::: {.callout-tip icon="false"}
## `r fa("comment", fill = "#1DC5CE")` Comment
In test-retest reliability analysis there are only two measurements (Measurement 1 and Measurement 2; see @tbl-data_str) at different time points, so k = 2.
:::
The @tbl-icc provides a categorization of ICC index [@koo2016], yet the interpretation of ICC values can be somewhat arbitrary.
| ICC | Level of agreement |
|:-----------:|:------------------:|
| \<0.50 | Poor |
| 0.50 - 0.74 | Moderate |
| 0.75 - 0.89 | Good |
| 0.90 - 1.00 | Very good |
: Interpretation of intra-class correlation coefficient (ICC). {#tbl-icc}
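If we want to apply the categorization of @tbl-icc programmatically, a small helper can be written (a sketch; `icc_level()` is a hypothetical function, not part of any package):
```{r}
# hypothetical helper that maps ICC values to the levels of @tbl-icc
icc_level <- function(icc) {
  cut(icc,
      breaks = c(-Inf, 0.50, 0.75, 0.90, Inf),
      labels = c("Poor", "Moderate", "Good", "Very good"),
      right = FALSE)
}
icc_level(c(0.42, 0.68, 0.96))
```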
**In R:**
First, we convert data from a wide format to a long format using the `pivot_longer()` function:
```{r}
# convert data into long format
goal_long <- goal |>
mutate(ID = row_number()) |>
pivot_longer(cols = c("GOAL1", "GOAL2"),
names_to = "M", values_to = "score") |>
mutate(ID = as.factor(ID),
M = factor(M))
head(goal_long)
```
Then, a two-way analysis of variance is applied with `ID` (subjects) and `M` (measurements) as factors:
```{r}
aov.goal <- aov(score ~ ID + M, data = goal_long)
s.aov <- summary(aov.goal)
s.aov
```
The statistical results are arranged within a matrix and we extract the mean squares of interest:
```{r}
stats <- matrix(unlist(s.aov), ncol = 5)
stats
MSB <- stats[1,3]
MSM <- stats[2,3]
MSE <- stats[3,3]
```
Finally, we calculate the ICC(A, 1) for agreement based on the @eq-iccA1:
```{r}
n <- dim(goal)[1]
k <- dim(goal)[2]
iccA1 <- (MSB - MSE)/(MSB + (k - 1) * MSE + k/n * (MSM - MSE))
iccA1
```
Next, we present R functions that carry out all the tasks of a reliability analysis.
::: {#exercise-joins .callout-important icon="false"}
## `r fa("r-project", fill = "#0008C1")` Summary statistics of reliability analysis
::: panel-tabset
## irr
```{r}
icc(goal, model = "twoway", type = "agreement")
```
## SimplyAgree
```{r}
SimplyAgree::jmvreli(
data = goal,
vars = vars(GOAL1, GOAL2),
desc = TRUE)
```
:::
:::
The test--retest relative reliability obtained between the initial measurement and the measurement recorded 30 days later was very good according to the intra-class correlation coefficient, ICC(A,1) = 0.96 (95% CI = 0.94--0.97).
::: {.callout-tip icon="false"}
## `r fa("comment", fill = "#1DC5CE")` Comment
The result of the significance test is not of interest here, because a high correlation between two repeated measurements obtained from the same individuals is expected.
:::
### Standard Error of Measurement (SEM)
The standard error of measurement (SEM; not to be confused with the standard error of the mean) is considered a measure that quantifies the amount of measurement error within an instrument, thus serving as an absolute estimate of the instrument's reliability. The SEM is defined as the square root of the error variance [@lek2018]:
$$
SEM =\sqrt{\sigma^2_{Error}}
$$ {#eq-sem_definition}
Now, let's see how the SEM and $\rho_{ICC}$ are mathematically associated [@tesio2012]. From @eq-sigma, the error variance of randomly selected subjects is simply the difference between the total variance of the observed scores and the true-score variance:
$$
\sigma^2_{Error} = \sigma^2_X - \sigma^2_{True}
$$
We multiply and divide the right-hand side of the equation by $\sigma^2_X$:
$$
\begin{aligned}
\sigma^2_{Error} &= \sigma^2_X \cdot \frac{(\sigma^2_X - \sigma^2_{True})}{\sigma^2_X} \\
&= \sigma^2_X \cdot (1 - \frac{\sigma^2_{True}}{\sigma^2_X})
\end{aligned}
$$ {#eq-sem0}
and, using @eq-ricc, it becomes:
$$
\begin{aligned}
\sigma^2_{Error} &= \sigma^2_X \cdot (1 - \rho_{ICC})
\end{aligned}
$$ {#eq-sem10}
Finally, the SEM is obtained by taking the square root of both sides of @eq-sem10:
$$
SEM = \sigma_X \cdot \sqrt{1-\rho_{ICC}}
$$ {#eq-sem1}
If $\rho_{ICC}$ equals 0, the SEM will equal the standard deviation of the observed test scores. Conversely, if $\rho_{ICC}$ equals 1, the SEM is zero.
::: {.callout-tip icon="false"}
## `r fa("comment", fill = "#1DC5CE")` Comment
In practice, $\sigma_X$ is estimated from the sample data. Typically, in a test-retest analysis, it is either the standard deviation of the baseline score [@beninato2011] or the higher standard deviation among the two scores [@goldberg2011]. Using the score with the higher standard deviation decreases the chances of underestimating the minimum detectable change (MDC) (see @eq-mdc below). Therefore, by using the $ICC(A,1)$ for agreement, the @eq-sem1 becomes:
$$
SEM = SD_X \cdot \sqrt{1-ICC(A,1)}
$$ {#eq-sem}
:::
```{r}
# choose the higher standard deviation between GOAL1 and GOAL2
SD <- max(c(sd(goal$GOAL1), sd(goal$GOAL2)))
# calculate the Standard Error of Measurement for agreement
sem <- SD * sqrt(1 - iccA1)
sem
```
Note that the higher the ICC, the smaller the SEM, which implies higher reliability because the observed scores are closer to the true scores. The SEM estimates the variability of the observed scores likely to be obtained given an individual's true score [@mcmanus2012; @harvill1991]. In our example, the SEM is approximately 3 units. This means that there is a probability of 0.95 that an individual's observed score lies within -6 to +6 units of the true score (Observed score = True score ± 2SEM), assuming normally distributed errors.
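As a quick numerical illustration of this interpretation, the 95% range of observed scores around a given true score can be computed from the `sem` value estimated above (the true score of 70 is an arbitrary value chosen for demonstration):
```{r}
# arbitrary true score used only for illustration
true_score <- 70
# 95% range of observed scores implied by the SEM (True score +/- 2*SEM)
c(lower = true_score - 2 * sem, upper = true_score + 2 * sem)
```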
::: {.callout-tip icon="false"}
## `r fa("comment", fill = "#1DC5CE")` Comment
In a test-retest analysis, where there are two measurements at different time points, the SEM is often approximated by the formula:
$$
SEM = \frac{SD_{dif}}{\sqrt{2}}
$$
where $SD_{dif}$ is the standard deviation of the differences in scores.
```{r}
#calculate the differences of scores
dif <- goal$GOAL1 - goal$GOAL2
sd(dif)/sqrt(2)
```
:::
### Minimum Detectable Change (MDC)
Now, the question that arises is whether a given change in score between test and retest is likely to be beyond the expected level of measurement error (real difference). In this case, the key measure to consider is the minimum detectable change (MDC) defined as [@goldberg2011]:
$$
MDC = \sqrt{2} \cdot z_{a/2} \cdot SEM
$$ {#eq-mdc}
where $\sqrt{2}$ is introduced because we are interested in the change between two measurements (test and retest), and $z_{a/2}$ is the z-score associated with the 95% confidence level.
```{r}
# calculate the z-value for a/2 = 0.05/2 = 0.025
z <- qnorm(0.025, mean = 0, sd = 1, lower.tail = FALSE)
# calculate the minimum detectable change
mdc <- sqrt(2) * z * sem
mdc
```
Both random and systematic errors are taken into account in the MDC. In our example, an individual change in score smaller than 9 units can be due to measurement error and may not be a real change.
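As a simple illustration, an observed change in an individual's score can be compared with the MDC computed above (the change of 12 units is an arbitrary example value):
```{r}
# arbitrary example of an observed change between test and retest
observed_change <- 12
# a change larger than the MDC is likely to reflect a real change
abs(observed_change) > mdc
```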
::: {.callout-tip icon="false"}
## `r fa("comment", fill = "#1DC5CE")` Comment
To define a 95% CI outside which one could be confident that a retest score reflects a real change, we should use a more complicated approach that uses the standard error of prediction (SEP; $SEP = SD_X \cdot \sqrt{1-ICC^2}$) instead of the SEM. This method can yield a more accurate result [@weir2005].
:::
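Following the comment above, the standard error of prediction can be estimated from quantities already computed; the sketch below simply applies the SEP formula using the `SD` and `iccA1` objects from earlier:
```{r}
# standard error of prediction: SEP = SD * sqrt(1 - ICC^2)
sep <- SD * sqrt(1 - iccA1^2)
# the SEP is somewhat larger than the SEM computed earlier
c(SEM = sem, SEP = sep)
```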
### Limits of Agreement (LOA) and Bland-Altman Plot
- **Limits of Agreement (LOA)**
If the differences between the two scores (GOAL1 - GOAL2) follow a normal distribution, we expect that approximately 95% of the differences will fall within the following range (see @sec-normal):
$$
95\%\ LOA = \overline{dif} \pm 1.96 \cdot SD_{dif}
$$ {#eq-loa}
where $\overline{dif}$ is the mean of the differences (bias), and $SD_{dif}$ is the standard deviation of the differences that measures random fluctuations around this mean.
```{r}
# calculate the mean of the differences (MD)
MD <- mean(dif)
MD
# compute lower limit of 95% LOA
lower_LOA <- mean(dif) - z * sd(dif)
lower_LOA
# compute upper limit of 95% LOA
upper_LOA <- mean(dif) + z * sd(dif)
upper_LOA
```
In our example, the mean of the differences (bias) is -0.73 (i.e. the scores at retest are on average 0.73 units higher than the scores at first test), which is a small difference. Additionally, the limits of agreement indicate that 95% of the differences lie in the range of −9 units to 8 units.
- **The Bland-Altman method**
::: content-box-blue
The Bland-Altman method uses a scatter plot to quantify the measurement bias and a range of agreement by constructing 95% limits of agreement (LOA). The basic assumption of the Bland-Altman method is that the differences are normally distributed.
:::
```{r}
#| warning: false
#| label: fig-BA_plot
#| fig-cap: Plot of differences between GOAL1 and GOAL2 vs. the mean of the two measurements. The bias of 0.73 units is denoted by the gap (purple arrow) between the X axis, corresponding to a zero difference (black dashed line), and the solid blue line of the mean.
#| fig-width: 9.0
#| fig-height: 6.5
#calculate the mean of GOAL1 and GOAL2
mean_goal12 <- (goal$GOAL1 + goal$GOAL2)/2
# create a data frame with the means and the differences
dat_BA <- data.frame(mean_goal12, dif)
# Bland-Altman plot
BA_plot <- ggplot(dat_BA, aes(x = mean_goal12, y = dif)) +
geom_point() +
geom_hline(yintercept = 0, color = "black", linewidth = 0.5, linetype = "dashed") +
geom_hline(yintercept = MD, color = "blue", linewidth = 1.0) +
geom_hline(yintercept = lower_LOA, color = "red", linewidth = 0.5) +
geom_hline(yintercept = upper_LOA, color = "red", linewidth = 0.5) +
labs(title = "Bland-Altman plot of test-retest reliability",
x = "Mean of GOAL1 and GOAL2",
y = "GOAL1 - GOAL2") +
annotate("text", x = 90, y = 8.6, label = "Upper LOA", color = "red", size = 4.0) +
annotate("text", x = 90, y = -8.6, label = "Lower LOA", color = "red", size = 4.0) +
annotate("text", x = 20, y = -1.5, label = "MD", color = "blue", size = 4.0) +
annotate("text", x = 31.2, y = -0.2, label = "bias", color = "#EA74FC", size = 4) +
geom_segment(x = 28, y = 0.1, xend = 28, yend = -0.85, linewidth = 0.8, colour = "#EA74FC",
arrow = arrow(length = unit(0.07, "inches"), ends = "both"))
# add a marginal density plot of the differences on the right-hand side of the graph
ggMarginal(BA_plot, type = "density", margins = 'y',
yparams = list(fill = "#61D04F"))
```
For each pair of measurements, the *difference* between the two measurements is plotted on the Y axis, and the *mean* of the two measurements on the X axis. We can check the distribution of the differences by examining the marginal green density plot on the right-hand side of the graph. In our example, the normality assumption is met; however, if the distribution is skewed or has very long tails, the assumption of normality might not hold.
The *mean of the differences* (MD), represented by the solid blue line, is an estimate of the systematic bias between the two measurements (@fig-BA_plot). In our case, the magnitude of the bias (purple arrow) is small (-0.73 units). The lower and upper red horizontal lines represent the lower and upper 95% limits of agreement (LOA), respectively. Under the normality assumption of the differences (green density plot), nearly 95% of the data points are expected to lie within the LOA. In our example, most of the points are randomly scattered around the zero dashed line within the limits of agreement (−9 to 8 units), and 7 out of 127 (5.5%) data points fall outside these limits, close to the 5% expected.
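The count of points outside the limits of agreement reported above can be verified directly; the sketch below uses the `dif`, `lower_LOA`, and `upper_LOA` objects computed earlier:
```{r}
# count the differences that fall outside the 95% limits of agreement
outside <- sum(dif < lower_LOA | dif > upper_LOA)
# report the count and the corresponding percentage of all pairs
c(n_outside = outside, percent = round(100 * outside / length(dif), 1))
```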
If we want to add confidence intervals for the MD and for the lower and upper limits of agreement (Lower LOA, Upper LOA) in @fig-BA_plot, we can use the `bland.altman.stats()` function from the `{BlandAltmanLeh}` package which provides these intervals:
```{r}
# get the confidence intervals for the MD, Low and Upper LOA
ci_lines <- bland.altman.stats(goal$GOAL1, goal$GOAL2)$CI.lines
ci_lines
```
```{r}
#| warning: false
#| label: fig-BA_plot2
#| fig-cap: Bland-Altman plot with confidence intervals for the MD and for the lower and upper limits of agreement (Lower LOA, Upper LOA).
#| fig-width: 9.0
#| fig-height: 6.5
# define the color of the lines of the confidence intervals
ci_colors <- c("red", "red", "blue", "blue", "red", "red")
# Bland-Altman plot
BA_plot2 <- BA_plot +
geom_hline(yintercept = ci_lines, color = ci_colors, linewidth = 0.5, linetype = "dashed")
# add a marginal density plot of the differences on the right-hand side of the graph
ggMarginal(BA_plot2, type = "density", margins = 'y',
yparams = list(fill = "#61D04F"))
```
::: {.callout-tip icon="false"}
## `r fa("comment", fill = "#1DC5CE")` Comment
If many data points in the Bland-Altman plot are close to the zero dashed line, it indicates that there is a good level of agreement between the two measurements.
:::
The MD (bias) can be considered insignificant, as the zero dashed line of equality lies inside the confidence interval for the mean difference (-1.5, 0.03). We can also test this with a paired t-test:
```{r}
t.test(goal$GOAL1, goal$GOAL2, paired = T)
```
There are several packages on CRAN that include functions for creating Bland-Altman plots, such as `{blandr}` and `{BlandAltmanLeh}`:
::: {.callout-important icon="false"}
## `r fa("r-project", fill = "#0008C1")` Bland-Altman plots
::: panel-tabset
## blandr
```{r}
#| warning: false
#| label: fig-BA_plot3
#| fig-width: 9.0
#| fig-height: 6.5
blandr.draw(goal$GOAL1, goal$GOAL2) +
annotate("text", x = 90, y = 8.6, label = "Upper LOA", color = "red", size = 4.0) +
annotate("text", x = 90, y = -8.6, label = "Lower LOA", color = "red", size = 4.0) +
annotate("text", x = 20, y = -1.5, label = "MD", color = "blue", size = 4.0)
```
## BlandAltmanLeh
```{r}
#| warning: false
#| label: fig-BA_plot4
#| fig-width: 9.0
#| fig-height: 6.5
bland.altman.plot(goal$GOAL1, goal$GOAL2, graph.sys = "ggplot2", conf.int=0.95) +
annotate("text", x = 90, y = 8.6, label = "Upper LOA", color = "red", size = 4.0) +
annotate("text", x = 90, y = -8.6, label = "Lower LOA", color = "red", size = 4.0) +
annotate("text", x = 20, y = -1.5, label = "MD", color = "blue", size = 4.0)
```
:::
:::
## Reliability for categorical measurements
### Research question
A screening questionnaire for Parkinson's disease was administered twice in a community, two weeks apart. We aim to examine the test--retest reliability of the following questions included in the questionnaire:
::: content-box-yellow
**Q1:** Do you have trouble buttoning buttons? (yes/no)
**Q2:** Do you have trouble arising from a chair? (yes/no)
**Q3:** Do your feet suddenly seem to freeze in door-ways? (yes/no)
**Q4:** Is your handwriting smaller than it once was? (yes/no)
:::
### Preparing the data
The file *questions* contains the data for the four questions, each of which required a 'yes' or 'no' response. The same set of questions was administered to 2000 individuals aged over 65 years at two different time points (T1 and T2) with a 14-day gap in between.
```{r}
#| warning: false
library(readxl)
qs <- read_excel(here("data", "questions.xlsx"))
```
```{r}
#| echo: false
#| label: fig-questions
#| fig-cap: Table with data from "questions" file.
DT::datatable(
qs, extensions = 'Buttons', options = list(
dom = 'tip',
columnDefs = list(list(className = 'dt-center', targets = "_all"))
)
)
```
### Contingency table
The test-retest data are arranged in a 2x2 contingency table as follows:
| | | $\textbf{Time 2 (T2)}$ | | |
|:----------------------:|:------:|:----------------------:|:---:|:------:|
| | | yes | no | Totals |
| $\textbf{Time 1 (T1)}$ | yes | a | b | g1 |
| | no | c | d | g2 |
| | Totals | f1 | f2 | N |
: Test-retest data arranged in a 2x2 contingency table. {#tbl-table}
The **marginal totals** (f1, f2, g1, g2) provide important summary information about the distribution of the responses at each time point.
::: content-box-green
**Symmetry and balance in a contingency table** [@dettori2020]
- **Symmetrical:** The distribution across f1 and f2 is the same as g1 and g2
- **Asymmetrical:** The distribution across f1 and f2 is in the opposite direction to g1 and g2
- **Balanced:** The proportion of the total number of objects in f1 and g1 is equal to 0.5
- **Imbalance:** The proportion of the total number of objects in f1 and g1 is not equal to 0.5
:::
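These properties can be inspected directly from the marginal proportions of a 2x2 table. The sketch below uses a hypothetical helper, `marginal_props()`, and made-up counts; the actual tables for our questions are built in the examples that follow:
```{r}
# hypothetical helper returning the marginal proportions of a 2x2 table
marginal_props <- function(tab) {
  N <- sum(tab)
  list(time1 = rowSums(tab) / N,   # g1, g2 (Time 1 margins)
       time2 = colSums(tab) / N)   # f1, f2 (Time 2 margins)
}
# toy example of a symmetrical but imbalanced table (made-up counts)
toy <- matrix(c(80, 5, 5, 10), nrow = 2, byrow = TRUE,
              dimnames = list(T1 = c("yes", "no"), T2 = c("yes", "no")))
marginal_props(toy)
```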
### Reliability coefficients for categorical data
#### *Cohen's kappa (unweighted)*
Cohen's kappa (κ), also known as the kappa statistic, is used to measure relative reliability for categorical items. It quantifies the degree of agreement beyond what would be expected by chance alone [@cohen1960].
Cohen's kappa statistic is defined as follows:
$$
k = \frac{p_o-p_{e\kappa}}{1-p_{e\kappa}}
$$
where $p_o = (a+d)/N$ is the proportion of observed overall agreement and $p_{e\kappa} = (f_1 \cdot g_1 + f_2 \cdot g_2)/N^2$ is the proportion of agreement that would be expected by chance.
Kappa can range from negative values (no agreement) to +1 (perfect agreement).
| Cohen's kappa | Level of agreement |
|:-------------:|:------------------:|
| \<0.20 | Poor |
| 0.21 - 0.40 | Fair |
| 0.41 - 0.60 | Moderate |
| 0.61 - 0.80 | Good |
| 0.81 - 1.00 | Very good |
: Interpretation of Cohen's kappa statistic. {#tbl-kappa}
#### *Gwet's AC1*
Gwet's AC1 is another measurement of relative reliability [@gwet2008]. The AC1 statistic is given by the formula:
$$
AC1 = \frac{p_o-p_{eAC1}}{1-p_{eAC1}}
$$
$p_o$ remains the same, but $p_{eAC1} = 2q \cdot (1 - q)$ is the proportion of agreement that would be expected by chance, where $q = (f_1 + g_1)/(2N)$.
### Examples
::: content-box-yellow
`r fa("arrow-right", fill = "orange")` **Q1: Do you have trouble buttoning buttons?** (symmetrical and balanced distribution of marginal totals)
:::
The first step consists of creating the contingency table with marginal totals as follows:
```{r}
qs$Q1_T1 <- factor(qs$Q1_T1, levels=c("yes", "no"))
qs$Q1_T2 <- factor(qs$Q1_T2, levels=c("yes", "no"))
table1 <- table(Q1_T1 = qs$Q1_T1, Q1_T2 = qs$Q1_T2)
tb1 <- addmargins(table1)
tb1
```
The proportions for question Q1 are:
```{r}
addmargins(prop.table(table1))
```
We observe that the distribution of marginal totals of question Q1 is **symmetrical** (f1 = 0.46 is close to g1 = 0.49; f2 = 0.54 is close to g2 = 0.51) and approximately **balanced** (f1 = 0.46 and f2 = 0.54 are close to 0.5; g1 = 0.49 and g2 = 0.51 are close to 0.5).
- **kappa**
First, we extract the elements of the `tb1`:
```{r}
a <- tb1[1,1]
b <- tb1[1,2]
c <- tb1[2,1]
d <- tb1[2,2]
f1 <- tb1[3,1]
f2 <- tb1[3,2]
g1 <- tb1[1,3]
g2 <- tb1[2,3]
N <- tb1[3,3]
```
The proportion of observed overall agreement is:
```{r}
po <- (a + d)/N
po
```
The proportion of agreement that would be expected by chance is:
```{r}
pek <- (f1 * g1 + f2 * g2)/N^2
pek
```
Finally, we calculate the Cohen's kappa statistic:
```{r}
k <- (po-pek)/(1-pek)
k
```
There are several packages on CRAN that include functions to calculate Cohen's kappa statistic, such as `{irr}` and `{vcd}`:
::: {.callout-important icon="false"}
## `r fa("r-project", fill = "#0008C1")` Cohen's kappa
::: panel-tabset
## irr
```{r}
Q1 <- qs[1:2]
kappa2(Q1)
```
**Hypothesis Testing**
::: {.callout-note icon="false"}
## `r fa("angles-right", fill = "#1DC5CE")` Null hypothesis and alternative hypothesis for kappa
- $H_0$: The agreement is the same as chance agreement. ($k = 0$)
- $H_1$: The agreement is different from chance agreement. ($k \neq 0$)
:::
The p-value is less than 0.05 (reject $H_0$); the two assessments agreed more than would be expected by chance. Note that measurements taken from the same individuals at two different time points are closely associated by nature.
## vcd
```{r}
k_Q1 <- Kappa(table1)
k_Q1
```
::: callout-warning
Note that, in the above results, ASE is the asymptotic standard error of the kappa value, which is used for calculating the confidence intervals.
:::
```{r}
confint(k_Q1)
```
:::
:::
Therefore, the relative agreement for question Q1 is good (k = 0.70, 95% CI: 0.67 to 0.73).
- **AC1**
For AC1 statistic, the proportion of agreement that would be expected by chance is:
```{r}
q <- (f1 + g1)/(2*N)
peAC1 <- 2*q*(1-q)
peAC1
```
The AC1 statistic is as follows:
```{r}
AC1 <- (po-peAC1)/(1-peAC1)
AC1
```
We can also use the `{irrCAC}` package to get the relevant statistic with confidence intervals:
```{r}
Q1 <- qs[1:2]
gwet.ac1.raw(Q1)$est
```
We conclude that AC1 is 0.7, with a 95% confidence interval ranging from 0.67 to 0.73, consistent with the result obtained using Cohen's kappa statistic.
::: content-box-yellow
`r fa("arrow-right", fill = "orange")` **Q2: Do you have trouble arising from a chair?** (symmetrical and imbalanced distribution of marginal totals)
:::
```{r}
qs$Q2_T1 <- factor(qs$Q2_T1, levels=c("yes", "no"))
qs$Q2_T2 <- factor(qs$Q2_T2, levels=c("yes", "no"))
table2 <- table(Q2_T1 = qs$Q2_T1, Q2_T2 = qs$Q2_T2)
tb2 <- addmargins(table2)
tb2
```
The proportions for question Q2 are:
```{r}
addmargins(prop.table(table2))
```
The proportion of observed overall agreement is 0.80 + 0.05 = 0.85. The distribution of marginal totals of question Q2 is **symmetrical** (f1 = 0.85 is close to g1 = 0.90; f2 = 0.15 is close to g2 = 0.10) but **imbalanced** (f1 = 0.85 and f2 = 0.15 deviate greatly from 0.5; g1 = 0.90 and g2 = 0.10 are far from 0.5).
- **kappa**
```{r}
Q2 <- qs[3:4]
kappa2(Q2)
```
According to the value of the kappa statistic, k = 0.32, there appears to be fair agreement between test and retest for question Q2.
::: content-box-green
**The first kappa paradox**
Although questions Q1 and Q2 have the same high percentage of agreement ($p_o$), the imbalanced distribution of the marginal totals in Q2 affects the kappa statistic, creating what appears to be a 'paradox'--- **high percent agreement and low kappa** [@feinstein1990; @cicchetti1990].
:::
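The paradox can be reproduced with a synthetic 2x2 table (the counts below are made up for illustration and are not data from our study): the observed agreement is high, but the heavily imbalanced marginal totals drive the chance agreement up and the kappa value down.
```{r}
# synthetic table: 90 yes/yes, 5 yes/no, 5 no/yes, 0 no/no (made-up counts)
paradox1 <- matrix(c(90, 5, 5, 0), nrow = 2, byrow = TRUE)
po_syn <- sum(diag(paradox1)) / sum(paradox1)
pe_syn <- sum(rowSums(paradox1) * colSums(paradox1)) / sum(paradox1)^2
k_syn <- (po_syn - pe_syn) / (1 - pe_syn)
# high observed agreement (0.90) but kappa close to zero
c(observed_agreement = po_syn, kappa = round(k_syn, 3))
```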
- **AC1**
```{r}
gwet.ac1.raw(Q2)$est
```
The AC1 value is 0.81, which shows that this measure is robust against what is known as the 'kappa paradox'.
::: content-box-yellow
`r fa("arrow-right", fill = "orange")` **Q3: Do your feet suddenly seem to freeze in door-ways?** (symmetrical and imbalanced distribution of marginal totals)
:::
```{r}
qs$Q3_T1 <- factor(qs$Q3_T1, levels=c("yes", "no"))
qs$Q3_T2 <- factor(qs$Q3_T2, levels=c("yes", "no"))
table3 <- table(Q3_T1 = qs$Q3_T1, Q3_T2 = qs$Q3_T2)
tb3 <- addmargins(table3)
tb3
```
The proportions for question Q3 are:
```{r}
addmargins(prop.table(table3))
```
The proportion of observed overall agreement is 0.45 + 0.15 = 0.60. The distribution of marginal totals of question Q3 is relatively **symmetrical** (same direction) but **imbalanced** (f1 = 0.70 and f2 = 0.30 deviate greatly from 0.5).
- **kappa**
```{r}
Q3 <- qs[5:6]
kappa2(Q3)
```
According to the value of the kappa statistic, k = 0.13, there appears to be poor agreement between test and retest for question Q3.
- **AC1**
```{r}
gwet.ac1.raw(Q3)$est
```
::: content-box-yellow
`r fa("arrow-right", fill = "orange")` **Q4: Is your handwriting smaller than it once was?** (asymmetrical and imbalanced distribution of marginal totals)
:::
```{r}
qs$Q4_T1 <- factor(qs$Q4_T1, levels=c("yes", "no"))
qs$Q4_T2 <- factor(qs$Q4_T2, levels=c("yes", "no"))
table4 <- table(Q4_T1 = qs$Q4_T1, Q4_T2 = qs$Q4_T2)
tb4 <- addmargins(table4)
tb4
```
The proportions for question Q4 are:
```{r}
addmargins(prop.table(table4))
```
The proportion of observed overall agreement is 0.25 + 0.35 = 0.60. The distribution of marginal totals of question Q4 is **asymmetrical** (opposite direction) and **imbalanced** (f1 = 0.30 and f2 = 0.70 deviate greatly from 0.5).
- **kappa**
```{r}
# Q4: Is your handwriting smaller than it once was?
Q4 <- qs[7:8]
kappa2(Q4)
```
According to the value of the kappa statistic, k = 0.26, there appears to be fair agreement between test and retest for question Q4.
The asymmetrical imbalance in Q4 results in a higher value of k (0.26) compared to the more symmetrical imbalance in Q3, which yielded a lower k value (0.13).
::: content-box-green
**The second kappa paradox**
The k values for the same percentage of agreement ($p_o$) can be unexpectedly **raised** when imbalances in the marginal totals are not perfectly symmetrical [@feinstein1990; @cicchetti1990].
:::
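This effect can also be demonstrated with synthetic counts (made up for illustration): the two tables below have the same observed agreement, yet the asymmetrically imbalanced one yields a noticeably higher kappa.
```{r}
# small helper computing Cohen's kappa from a 2x2 table of counts
kappa_2x2 <- function(tab) {
  po <- sum(diag(tab)) / sum(tab)
  pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
  (po - pe) / (1 - pe)
}
# symmetrical imbalance: both margins lean towards "yes"
sym  <- matrix(c(45, 15, 25, 15), nrow = 2, byrow = TRUE)
# asymmetrical imbalance: the margins lean in opposite directions
asym <- matrix(c(25, 35, 5, 35), nrow = 2, byrow = TRUE)
# same observed agreement (0.60) in both tables, different kappa values
round(c(kappa_symmetric = kappa_2x2(sym), kappa_asymmetric = kappa_2x2(asym)), 2)
```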
- **AC1**
```{r}
gwet.ac1.raw(Q4)$est
```