```{r ch200-Setup, include=FALSE}
# load these libraries so that renv snapshots them
# and so they can be used in the bibliography
library(rstanarm)
library(arm)
```
# (APPENDIX) Appendix {-}
# Supplementary Code {#code}
**Eight Schools Model**
\setstretch{1.0}
```
data {
  int<lower=0> J;          // number of schools
  real y[J];               // estimated treatment effects
  real<lower=0> sigma[J];  // standard error of effect estimates
}
parameters {
  real mu;                 // population treatment effect
  real<lower=0> tau;       // standard deviation in treatment effects
  vector[J] eta;           // unscaled deviation from mu by school
}
transformed parameters {
  vector[J] theta = mu + tau * eta;  // school treatment effects
}
model {
  target += normal_lpdf(eta | 0, 1);       // prior log-density
  target += normal_lpdf(y | theta, sigma); // log-likelihood
}
generated quantities {
  vector[J] log_lik;
  for (j in 1:J) {
    log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
  }
}
```
\setstretch{2.0}
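For reference, this program could be compiled and fit from `R` with `rstan` along the lines of the sketch below; the file path is an assumption, and the data are the classic eight schools values.
\setstretch{1.0}
```{r ch200-Eight-Schools-Fit, echo=TRUE, eval=FALSE}
library(rstan)

# classic eight schools data; the file path below is an assumption and should
# point to wherever the program above is saved
schools_data <- list(
  J = 8,
  y = c(28, 8, -3, 7, -1, 1, 18, 12),
  sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
)

fit <- stan(file = "models/eight_schools.stan", data = schools_data,
            chains = 4, cores = 4)
```
\setstretch{2.0}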
\clearpage
**Model with Lapse and Subject-level Parameters**
\setstretch{1.0}
```{r ch210-Minimum Dagger, comment=NA}
writeLines(readLines("models/universal_pf.stan"))
```
\setstretch{2.0}
\clearpage
**Stan Algebraic Solver**
\setstretch{1.0}
```
functions {
  vector system(vector y, vector theta, real[] x_r, int[] x_i) {
    vector[2] z;
    z[1] = exp(y[1] + y[2]^2 / 2) - theta[1];
    z[2] = 0.5 + 0.5 * erf(-y[1] / (sqrt(2) * y[2])) - theta[2];
    return z;
  }
}
transformed data {
  vector[2] y_guess = [1, 1]';
  real x_r[0];
  int x_i[0];
}
transformed parameters {
  vector[2] theta = [0.100, 0.99]';
  vector[2] y;
  y = algebra_solver(system, y_guess, theta, x_r, x_i);
}
```
\setstretch{2.0}
# Developing a Model {#model-dev}
Our final modeling strategy is an evolution of earlier attempts. The development proceeded through the iterations described in [chapter 4](#application), but that account does not tell the full story. We learn more from others when they share what did not work alongside the final path that did. There is knowledge to be gained from failed experiments, because each one is one more way not to do something, just as a failing outcome reduces the variance of the Beta distribution.
In the first attempt at modeling, we used a classical GLM to get a baseline understanding of the data, but the fact that the estimates for certain subjects failed due to complete separation reinforced our adoption of non-classical techniques. Our first Bayesian model was derived from @lee2014bayesian, which used nested loops to iterate over subjects and SOA values. The data had to be stored in a complicated structure that made the model difficult to comprehend and extend.
We moved on to `arm::bayesglm` to remove the convergence issues, but were met with other limitations, such as a strictly linear parameterization and the lack of hierarchical modeling. The book Statistical Rethinking [@mcelreath2020statistical] offers a great first introduction to Bayesian multilevel modeling. McElreath's `rethinking` package accompanies the book and offers a compact yet expressive syntax for specifying models that are then translated into Stan programs. A model with age group and block can be written using `rethinking::ulam` as
\setstretch{1.0}
```{r ch220-Rubber Modern, echo=TRUE, eval=FALSE}
rethinking::ulam(alist(
  k ~ binomial_logit(n, p),
  p <- exp(b + bG[G] + bT[trt]) * (x - (a + aG[G] + aT[trt])),
  a ~ normal(0, 0.06),
  aG[G] ~ normal(0, sd_aG),
  aT[trt] ~ normal(0, sd_aT),
  b ~ normal(3, 1),
  bG[G] ~ normal(0, sd_bG),
  bT[trt] ~ normal(0, sd_bT),
  c(sd_aG, sd_aT, sd_bG, sd_bT) ~ half_cauchy(0, 5)
), data = df, chains = 4, cores = 4, log_lik = TRUE)
```
\setstretch{2.0}
While learning about multilevel models, we tried writing a package that generates a Stan program from `R` formula syntax. At the time we only vaguely understood the concepts of no pooling, complete pooling, and partial pooling, and the package was plagued by the same lack of flexibility that `rstanarm` and `brms` have. We later discovered that `brms` and `rstanarm` already did what we were trying to do, but the programming experience was invaluable.
We also tried `lme4`, `rstanarm`, and `brms`, and learned more about the concepts of fixed and random effects. We noticed that parameterization can have a significant effect on the efficiency of a model and on the inferential power of the estimated parameters. When fitting a classical model there is little numerical difference between estimating `a + bx` and `d(x - c)`, since the latter expands to `-cd + dx`, which is essentially the first parameterization, but there is a practical difference in the interpretation of the parameters. The second parameterization implies a dependence among the parameters that can be factored out. In the context of psychometric functions, there is a stronger connection between the PSS and `c`, and between the JND and `d`. This parameterization made it easier to specify priors and also increased the efficiency of the model. Of the modeling tools mentioned, only `rethinking` and `Stan` allow for arbitrary parameterization.
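To make the relationship concrete, here is a small sketch with simulated values (none of which come from the experiment) showing how a slope-intercept fit `a + bx` can be converted to the location-scale form `d(x - c)`:
\setstretch{1.0}
```{r ch220-Parameterization-Sketch, echo=TRUE, eval=FALSE}
# simulated data for illustration only; true c = 0.05 and d = 8
set.seed(42)
x <- seq(-0.5, 0.5, length.out = 100)
y <- rbinom(100, 1, plogis(8 * (x - 0.05)))

fit <- glm(y ~ x, family = binomial)  # classical fit of a + b*x
a_hat <- coef(fit)[1]
b_hat <- coef(fit)[2]

# recover d*(x - c): d = b and c = -a/b, so c estimates the PSS directly and
# the JND is proportional to 1/d
c_hat <- -a_hat / b_hat
d_hat <- b_hat
```
\setstretch{2.0}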
We finally arrived at a model that worked well, but learned that using a binary indicator variable for the treatment comes with the assumption of higher uncertainty for one of the conditions. The linear model that we arrived at is displayed in equation \@ref(eq:badlinearmodel).
\begin{equation}
\theta = \exp(\beta + \beta_G +(\beta_T + \beta_{TG})\times trt) \left[x - (\alpha + \alpha_G + (\alpha_T + \alpha_{TG})\times trt)\right]
(\#eq:badlinearmodel)
\end{equation}
Using an indicator variable in this fashion also introduced an interaction effect into the model that we almost failed to account for after switching to a factor variable. Interaction effects between factors are handled by creating a new factor that is essentially the cross-product of the other factor variables (a short `R` sketch follows the illustration below), e.g. for factor variables $x$ and $y$
\setstretch{1.0}
$$
x = \begin{bmatrix}
a \\
b \\
c
\end{bmatrix}, y = \begin{bmatrix}
i \\
j
\end{bmatrix}\Longrightarrow x\times y =
\begin{bmatrix}
ai & aj \\
bi & bj \\
ci & cj
\end{bmatrix}
$$
\setstretch{2.0}
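For reference, a brief `R` sketch of this crossing, using the toy levels from the illustration above rather than the experimental factors:
\setstretch{1.0}
```{r ch220-Factor-Interaction-Sketch, echo=TRUE, eval=FALSE}
# toy levels only; interaction() builds the factor whose levels are the
# cross-product of the levels of x and y
df <- expand.grid(x = c("a", "b", "c"), y = c("i", "j"))
df$xy <- interaction(df$x, df$y, sep = "")
levels(df$xy)  # the six combinations: ai, bi, ci, aj, bj, cj
```
\setstretch{2.0}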
The final round of reparameterization came in the form of adopting the non-centered parameterization for more efficient models. To us, $Z \sim N(0, 1^2);\quad X = 3 + 2Z$ is the same as $X \sim N(3, 2^2)$, but to a computer, sampling $X$ directly can be more difficult than sampling $Z$ and transforming it (discussed in [chapter 2](#methods)).
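A quick simulation sketch of this equivalence (illustration only, not the thesis model):
\setstretch{1.0}
```{r ch220-Noncentered-Sketch, echo=TRUE, eval=FALSE}
set.seed(1)
z <- rnorm(1e5)                  # Z ~ N(0, 1)
x_noncentered <- 3 + 2 * z       # X = 3 + 2Z
x_centered <- rnorm(1e5, 3, 2)   # X ~ N(3, 2^2)

c(mean(x_noncentered), sd(x_noncentered))  # approximately 3 and 2
c(mean(x_centered), sd(x_centered))        # approximately 3 and 2
```
\setstretch{2.0}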
# Reproducible Results {#reproduce}
Data doesn't always come in a nice tidy format, and we had to turn the raw experimental data into a clean data set that is ready for modeling. Sometimes the process is quick and straightforward, but other times, as with this psychometric data, it takes more effort and clever techniques. There is academic value in describing the steps we took, and doing so reduces headaches later.
To begin, there has been a strong push in recent years for reproducible data science. Scientific methods and results should be replicable by other researchers, and part of that is being able to replicate the process that takes the raw data and produces the tidy data that is ready for analysis. Tidy data is described by @wickham2014tidy and can be summed up by three principles:
\setstretch{1.0}
1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table
\setstretch{2.0}
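As a toy illustration of these principles, a wide table in which each row holds two observations can be reshaped so that each observation forms its own row; the values below are made up and unrelated to the experimental data.
\setstretch{1.0}
```{r ch230-Tidy-Sketch, echo=TRUE, eval=FALSE}
library(tidyr)

wide <- data.frame(
  subject  = c("s1", "s2"),
  baseline = c(0.12, 0.09),
  adapt    = c(0.08, 0.11)
)

# one row per observation, with the block recorded as a variable
tidy <- pivot_longer(wide, cols = c(baseline, adapt),
                     names_to = "block", values_to = "estimate")
```
\setstretch{2.0}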
One problem is having data in a spreadsheet, modifying it, and then having no way of recovering the original data. Spreadsheets are a convenient way to organize, transform, and lightly analyze data, but problems can quickly arise unless there is a good backup or snapshot system in place. In computer science, mutability is the property of a data structure whose contents can be modified in place; immutability means that an object cannot be modified without first making a copy. Data is immutable, or at least that is the mindset that researchers must adopt in order to have truly reproducible workflows. The raw data that is collected or produced by a measurement device should never be modified without first being copied, even for trivial reasons such as correcting a spelling mistake. If a change is made to the raw data, it should be carefully documented and reversible.
To begin the data cleaning journey, we introduce the directory structure that we were given to work with. Each task is separated into its own folder, and within each folder is a subdirectory for each age group.
```{r ch230-Lama-Everyday, out.width="30%", fig.cap="Top-level data directory structure."}
knitr::include_graphics("figures/data_dir.png")
```
Within each age group subdirectory are subdirectories for each subject, named by their initials, which in turn contain the experimental data as Matlab files.
```{r ch230-Third-Needless-Antique, out.width="35%", fig.cap="Subdirectory structure."}
knitr::include_graphics("figures/data_subdir.png")
```
The data appears manageable. The directory structure encodes information such as task, age group, and initials, and the file name carries information about the experimental block. There is also an Excel file, which we were given later, that contains additional subject information such as age and sex, though that information is not used in the model. The columns of a Matlab file depend on the task, but they generally contain an SOA value and a response. The files contain no column or row names; that information was provided by the researcher who collected the data.
We then created a table of metadata: information extracted from the directory structure and file names, combined with the subject data and the file path. Regular expressions can be used to extract patterns from a string. With a list of all Matlab files within the `RecalibrationData` folder, we attempted to extract the task, age group, initials, and block using the regular expression:
```
"^(\\w+)/(\\w+)/(\\w+)/[A-Z]{2,3}_*[A-Z]*(adapt[0-9]|baseline[0-9]*).*"
```
The `^(\\w+)/` matches any word characters at the start of the string and before the next slash. Since the directory is structured as `Task/AgeGroup/Subject/file.mat`, the regular expression should match three words separated by slashes. The file name generally follows the pattern `Initials__block#__MAT.mat`, so `[A-Z]{2,3}_*[A-Z]*` should match the initials and `(adapt[0-9]|baseline[0-9]*)` should match the block (baseline or adapt). This method works for $536$ of the $580$ individual records; where it failed, the cause was generally a misspelling or irregular capitalization of "baseline" or "adapt".
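As an illustration, the capture groups can be pulled out in `R` as follows; the path shown here is hypothetical and only mimics the expected structure.
\setstretch{1.0}
```{r ch230-Regex-Sketch, echo=TRUE, eval=FALSE}
pattern <- "^(\\w+)/(\\w+)/(\\w+)/[A-Z]{2,3}_*[A-Z]*(adapt[0-9]|baseline[0-9]*).*"
path <- "Audiovisual/MiddleAge/JM/JM__baseline1__MAT.mat"  # hypothetical path

groups <- regmatches(path, regexec(pattern, path))[[1]]
groups[2:5]  # task, age group, subject initials, and block
```
\setstretch{2.0}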
Since there is only a handful of irregular block names, they can be handled by a separate regular expression that properly extracts the block information. Another challenge in cleaning the data is handling subjects who share initials, because filtering by a subject's initials is then not guaranteed to return a unique subject. Furthermore, there are two middle age subjects with the same initials, "JM", so one was also identified by their sex ("JM_F"). The solution is to create a unique subject identifier (labeled SID) that combines age group, sex, and initials. For an experiment identifier (labeled RID), the task and block are prepended to the SID. Each of these IDs uniquely identifies a subject or an experimental record, making it easier to filter and search.
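A minimal sketch of how these identifiers might be constructed is shown below; the metadata object and its column names are assumptions for illustration.
\setstretch{1.0}
```{r ch230-Identifier-Sketch, echo=TRUE, eval=FALSE}
library(dplyr)

metadata <- metadata |>
  mutate(
    sid = paste(age_group, sex, initials, sep = "-"),  # unique subject ID
    rid = paste(task, block, sid, sep = "-")           # unique experiment ID
  )
```
\setstretch{2.0}
The resulting identifiers appear in the metadata summarized below.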
\setstretch{1.0}
```{r ch230-Remote Monkey, echo=TRUE}
library(dplyr)
FangPsychometric::multitask |>
select(rid, sid, task, block, age_group, age, sex) |>
distinct() |>
arrange(rid) |>
glimpse(width = 60)
```
\setstretch{2.0}
Then, with the table of clean metadata, the task is simply to loop through each row, read the Matlab file given by `path`, add the unique ID as a column, and join the experimental data with the metadata to create a data set that is ready for model fitting and data exploration. The full code used to generate the clean data is not yet available online, but it can be shared with the committee.
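A rough sketch of that loop is given below, assuming the files are read with `R.matlab::readMat` and that the first element of the returned list holds the trial matrix; both the structure and the column names are assumptions.
\setstretch{1.0}
```{r ch230-Assembly-Sketch, echo=TRUE, eval=FALSE}
library(dplyr)
library(purrr)
library(R.matlab)

raw <- purrr::map_dfr(seq_len(nrow(metadata)), function(i) {
  row <- metadata[i, ]
  mat <- R.matlab::readMat(row$path)     # read the raw Matlab file
  trials <- as.data.frame(mat[[1]])      # columns depend on the task
  names(trials) <- c("soa", "response")  # assumed column ordering
  trials$rid <- row$rid                  # attach the unique record ID
  trials
})

clean <- dplyr::left_join(raw, metadata, by = "rid")  # join with the metadata
```
\setstretch{2.0}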
The benefit of writing a script to generate the data is that others can look over the code and verify that it does what it is intended to do, and we can go back to any step in the process to make changes if the need arises. Another tool that contributed to reproducibility is the version control software Git. With Git we can take a snapshot of the changes made and revert them if necessary. This thesis is also hosted on GitHub, where the entire history of its development can be viewed.