---
title: "Data Analysis"
output:
  html_document: default
  word_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
### Read in Data
First, we will read in the data:
```{r, warning=F, message=F}
#install.packages("here") # BCS changed
library(here)
d <- read.csv(here("data","summarized_data.csv"))
d_save<-d
```
<br>
The summarized dataset contains `r nrow(d)` observations, but let's look at the first handful of rows in the data to see what we're dealing with:
```{r}
head(d, n = 5)
```
<br>
Most of these variables (columns) are self-explanatory. However, for completeness, the variables are:
+ **subject_ID**: a unique ID for each participant
+ **p_cor_sex**: the proportion of correct responses for the sex question
+ **p_cor_lang**: the proportion of correct responses for the language question
+ **p_cor_age**: the proportion of correct responses for the age question
+ **time_mins**: the time participants took to complete the study, in minutes
+ **fin_consent_mins**: the time participants took to complete the consent form, in minutes
+ **consent_to_fin_mins**: the time participants took to complete the survey (excluding the consent form), in minutes
+ **max_RT_exp_trials**: the longest reaction time (RT) during the main experimental trials
+ **childcare**: the amount of childcare experience the participant has, in months
+ **childcare**: the amount of caregiver experience the participant has, in months
+ **age**: the age of the participant in years
+ **gender**: the gender of the participant
+ **gender_text**: the text the participant entered if they selected the "other" option
+ **country**: the participant's country of origin
+ **country_text**: the text the participant entered if they selected the "other" option
+ **hearing**: whether the participant reported normal hearing
+ **eng_first**: whether English is the participant's first language
+ **know_corp_lang**: whether the participant knows any of the languages in the corpus other than English
+ **monolingual**: whether the participant is monolingual
+ **n_attention_checks**: the number of attempts the participant needed to pass the attention check
+ **n_audio_checks**: the number of attempts the participant needed to pass the audio check
+ **var_sex**: the variance of the participant's responses for the sex question; a score of 0 means the participant gave the same response to every question in the sex block
+ **var_lang**: the variance of the participant's responses for the language question; a score of 0 means the participant gave the same response to every question in the language block
+ **var_age**: the variance of the participant's responses for the age question; a score of 0 means the participant gave the same response to every question in the age block (a sketch of how such a variance could be derived follows this list)
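In case it helps to see where the `var_*` columns could come from, here is a minimal sketch assuming a hypothetical trial-level data frame `trials` with a `subject_ID` column and a numerically coded `response_sex` column (the summarized CSV does not include trial-level data, so this chunk is not evaluated):
```{r, eval=FALSE}
# Illustrative sketch only: `trials` and `response_sex` are assumed names,
# not part of summarized_data.csv. var() of a constant vector is 0, so
# var_sex == 0 flags participants who gave the same response on every
# trial in the sex block.
library(dplyr)
trials %>%
  group_by(subject_ID) %>%
  summarise(var_sex = var(as.numeric(response_sex)))
```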
<br>
### Data cleaning
First, we can look at the distribution of the number of attention-check attempts each participant required:
```{r message=FALSE}
library(ggplot2)
ggplot(d, aes(n_attention_checks)) + geom_histogram() +
geom_vline(xintercept = 5, linetype = "solid", color = "darkgrey", size = 1.5)
```
The x axis is the number of attempts a participant needed to pass the attention check, and the y axis shows the counts of those attempts. The grey line is the exclusion criterion we pre-determined (i.e., exclude any participant who needs more than 5 attempts).
We can remove participants based on that criterion:
```{r}
# get the number of people who would be excluded from the full sample, without prior exclusions
atten_exclude <- sum(d_save$n_attention_checks > 5)
# actually exclude from the data
d <- d[d$n_attention_checks <= 5, ]
```
From this we can see that `r atten_exclude` participants were removed based on the attention check.
<br>
Next, we can look at the distribution of the number of audio-check attempts each participant required:
```{r message=FALSE}
library(ggplot2)
ggplot(d, aes(n_audio_checks)) + geom_histogram() +
geom_vline(xintercept = 5, linetype = "solid", color = "darkgrey", size = 1.5)
```
The x axis is the number of attempts a participant needed to pass the audio check, and the y axis shows the counts of those attempts. The grey line is the exclusion criterion we pre-determined (i.e., exclude any participant who needs more than 5 attempts).
We can remove participants based on that criterion:
```{r}
# get the number of people who would be excluded from the full sample, without prior exclusions
audio_exclude <- sum(d_save$n_audio_checks > 5)
# actually exclude from the data
d <- d[d$n_audio_checks <= 5, ]
```
From this we can see that `r audio_exclude` participants were removed based on the audio check.
<br>
We can also have a look at the variables involved in our other exclusion criteria:
```{r}
ggplot(d, aes(gender)) + geom_bar() +
xlab("Gender")
```
```{r}
ggplot(d, aes(country)) + geom_bar() +
xlab("Country")
```
```{r}
ggplot(d, aes(eng_first)) + geom_bar() +
xlab("English as first language")
```
```{r}
ggplot(d, aes(know_corp_lang)) + geom_bar() +
xlab("Knowledge or corpora language (Other than English)")
```
We'll get rid of participants who specified a gender other than female or male, reside in a country other than Canada or the US, do not have English as a first language, have significant familiarity with any of the Babble Cor languages other than English, or gave the same response for every question in any of the sex, language, or age blocks:
```{r}
# get the number of people who would be excluded from the full sample, without prior exclusions
gen_exclude <- length(d_save$gender) - sum(d_save$gender %in% c("Female", "Male"))
coun_exclude <- length(d_save$country) - sum(d_save$country %in% c("Canada", "USA"))
eng_exclude <- length(d_save$eng_first) - sum(d_save$eng_first == "Yes")
know_corp_exclude <- length(d_save$know_corp_lang) - sum(d_save$know_corp_lang == "None")
same_resp_exclude <- length(d_save$var_sex) - sum(d_save$var_sex > 0 & d_save$var_lang > 0 & d_save$var_age > 0)
# actually exclude from the sample
d <- d[d$gender %in% c("Female", "Male"), ]
d <- d[d$country %in% c("Canada", "USA"), ]
d <- d[d$eng_first == "Yes", ]
d <- d[d$know_corp_lang == "None", ]
d <- d[d$var_sex > 0 & d$var_lang > 0 & d$var_age > 0, ]
```
Number excluded based on:
- gender: `r gen_exclude`
- country: `r coun_exclude`
- English as a first language: `r eng_exclude`
- knowing some of the corpus languages: `r know_corp_exclude`
- giving the same response for all questions in a block: `r same_resp_exclude`
The data now contains `r nrow(d)` observations. Some participants were excluded for multiple reasons, so the counts above can sum to more than the total number removed. Some of our sample's demographic breakdown can be seen at the end of this document.
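As a quick sanity check (not part of our main analysis), the total number of participants removed across all criteria, counting each person once, is just the difference in row counts between the saved and cleaned data:
```{r}
# total removed across all exclusion criteria, counting each participant once
nrow(d_save) - nrow(d)
```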
With the cleaning out of the way, let's now see whether the data support our hypotheses.
<br>
### Hypothesis 1: Participants will be able to identify the infant’s sex significantly above chance (50%)
Presumably, some participants will do well and some will not. Let's plot each participant's proportion of correct responses when judging the babies' sex:
```{r message=FALSE}
ggplot(d, aes(p_cor_sex)) + geom_histogram() +
geom_vline(xintercept = 0.5, linetype="solid", color = "darkgrey", size=1.5) +
geom_vline(xintercept = mean(d$p_cor_sex), linetype="solid", color = "red", size=1.5) +
xlab("Proportion correct for sex judgement")
```
The x axis is the proportion of correct responses for the sex question and the y axis is the number of participants.
As we can see, some people at the far left of the plot do terribly, scoring 0% correct, while others do quite well (80% correct).
However, the average proportion correct is `r round(mean(d$p_cor_sex), 2)` (sd = `r round(sd(d$p_cor_sex), 2)`), which is represented by the red vertical line.
The grey vertical line represents 0.50 (50%). This is how well we would expect people to do on average if they were just guessing or had no ability to discriminate between audio clips from male and female babies.
We would like to see the red line far to the right of the grey line, which would indicate that participants could discriminate the sex of the babies in the clips better than chance. However, we see that participants are actually performing worse than chance.
For completeness, let's run a one-sample t-test of whether the participants' average of `r round(mean(d$p_cor_sex), 2)` is significantly different from 0.5:
```{r}
t.test(d$p_cor_sex, mu = 0.5)
eff_size <- (mean(d$p_cor_sex) - 0.5) / sd(d$p_cor_sex) # Cohen's d: standardized difference from chance
```
From this output we can see that the results are significant at the $\alpha$ = 0.01 level, but in the wrong direction. This indicates that participants performed worse than we would expect by chance. The Cohen's d for this difference is `r round(eff_size, 2)`.
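As a side note, the one-sample t statistic in the output above and Cohen's d are linked by a factor of $\sqrt{n}$ (since $t = (\bar{x} - \mu) / (s / \sqrt{n})$ and $d = (\bar{x} - \mu) / s$), so we can verify the reported t value by hand:
```{r}
# one-sample t = (mean - mu) / (sd / sqrt(n)) = Cohen's d * sqrt(n)
n <- nrow(d)
(mean(d$p_cor_sex) - 0.5) / (sd(d$p_cor_sex) / sqrt(n)) # should match t.test() above
eff_size * sqrt(n)                                      # same value via Cohen's d
```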
<br>
### Hypothesis 2: Participants will be able to identify whether the infant is acquiring English or another language above chance (50%)
As with hypothesis 1, let's have a look at the distribution of performance for the language hypothesis:
```{r message=FALSE}
library(ggplot2)
ggplot(d, aes(p_cor_lang)) + geom_histogram() +
geom_vline(xintercept = 0.5, linetype = "solid", color = "darkgrey", size = 1.5) +
geom_vline(xintercept = mean(d$p_cor_lang), linetype="solid", color = "red", size=1.5) +
xlab("Proportion correct for language judgement")
```
The x axis is the proportion of correct responses for the language question and the y axis is the number of participants.
The average proportion correct is `r round(mean(d$p_cor_lang), 2)` (sd = `r round(sd(d$p_cor_lang), 2)`), which is represented by the red vertical line. Again, the grey vertical line represents 0.50 (50%).
Here, participants are performing (numerically) better than chance.
Let's run a one-sample t-test to determine whether participants are performing significantly better than chance:
```{r}
t.test(d$p_cor_lang, mu = 0.5)
eff_size <- (mean(d$p_cor_lang) - 0.5) / sd(d$p_cor_lang) # Cohen's d: standardized difference from chance
```
From this output we can see that the results are significant at the 0.01 level. For those unfamiliar with the scientific notation of the p value reported in the output above, the p value is `r format(t.test(d$p_cor_lang, mu = 0.5)$p.value, scientific = FALSE)`. This indicates that participants can identify which language the babies are acquiring better than would be expected by chance. However, the effect size is quite small, with a Cohen's d of `r round(eff_size, 2)`.
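Since we report plain-notation p values more than once, a small helper can do this formatting in one place (`report_p` is our own convenience name, not from any package):
```{r}
# format a p value in plain (non-scientific) notation,
# keeping a fixed number of significant digits
report_p <- function(p, digits = 6) {
  format(signif(p, digits), scientific = FALSE)
}
report_p(t.test(d$p_cor_lang, mu = 0.5)$p.value)
```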
<br>
### Hypothesis 3: Participants will be able to identify the infant’s age range above chance (33%)
Let's have a look at the distribution of performance for the age hypothesis:
```{r message=FALSE}
library(ggplot2)
ggplot(d, aes(p_cor_age)) + geom_histogram() +
geom_vline(xintercept = 1/3, linetype = "solid", color = "darkgrey", size = 1.5) +
geom_vline(xintercept = mean(d$p_cor_age), linetype="solid", color = "red", size=1.5) +
xlab("Proportion correct for age judgement")
```
The x axis is the proportion of correct responses for the age question and the y axis is the number of participants.
The average proportion correct is `r round(mean(d$p_cor_age), 2)` (sd = `r round(sd(d$p_cor_age), 2)`), which is represented by the red vertical line. In contrast to the first two hypotheses, the grey vertical line now represents 0.33 (33%). This is because we would expect participants who were guessing, or unable to discriminate between the three age groups (0-7, 8-18, and 19-36 months), to be correct 33% of the time on average. As we can see, participants perform appreciably better than chance.
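To sanity-check that chance level, here is a quick guessing simulation (illustrative only; the 60 questions per block is an assumed number, not the study's actual trial count). Random guesses matched against random ground truth should land near 1/3:
```{r}
# simulate 1000 guessing participants, each answering an assumed 60
# three-option age questions at random; expected proportion correct = 1/3
set.seed(1)
sim <- replicate(1000, mean(sample(1:3, 60, replace = TRUE) ==
                              sample(1:3, 60, replace = TRUE)))
mean(sim)
```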
Let's run a one-sample t-test to determine whether participants are performing significantly better than chance:
```{r}
t.test(d$p_cor_age, mu = 1/3)
eff_size <- (mean(d$p_cor_age) - (1/3)) / sd(d$p_cor_age) # Cohen's d: standardized difference from chance
```
From this output we can see that the results are significant at the 0.01 level. Again, for those unfamiliar with the scientific notation of the p value reported in the output above, the p value is `r format(t.test(d$p_cor_age, mu = 1/3)$p.value, scientific = FALSE)`. This indicates that participants can discriminate which age group the babies belong to better than would be expected by chance. The effect size for this difference is quite strong, with a Cohen's d of `r round(eff_size, 2)`.
<br>
Given these significant results (people were able to discriminate both the babies' age group and the language the babies are acquiring), we have several follow-up hypotheses.
<!--
<br>
### Hypothesis 5a: Participants who identify as females will be able to identify infant’s **language** significantly better than other participants
To test this hypothesis, we can also start by plotting the data:
```{r}
ggplot(d, aes(p_cor_lang, fill=gender)) + geom_density(alpha=0.4) +
geom_vline(xintercept = mean(d$p_cor_lang[d$gender == "Female"]), linetype="solid", color = "pink", size=1.5) +
geom_vline(xintercept = mean(d$p_cor_lang[d$gender == "Male"]), linetype="solid", color = "blue", size=1.5) + xlab("Proportion of correct responses for language question")
```
Here we have a density plot (basically a smoothed histogram). The x axis shows the proportion of correct responses for the language question colored by gender (female: pink; male: blue). The vertical lines represent the mean proportion correct for both groups.
We can see that numerically the females and males perform almost identically. However, we can conduct an ANOVA to test whether the difference between the groups is significant:
```{r}
summary(aov(p_cor_lang ~ gender, data=d))
```
We can see that there is not a significant difference in performance between males and females.
<br>
### Hypothesis 5b: Participants who identify as females will be able to identify infant’s **age** significantly better than other participants
To test this hypothesis, we can also start by plotting the data:
```{r}
ggplot(d, aes(p_cor_age, fill=gender)) + geom_density(alpha=0.4) +
geom_vline(xintercept = mean(d$p_cor_age[d$gender == "Female"]), linetype="solid", color = "pink", size=1.5) +
geom_vline(xintercept = mean(d$p_cor_age[d$gender == "Male"]), linetype="solid", color = "blue", size=1.5) + xlab("Proportion of correct responses for age question")
```
Here we have a density plot (basically a smoothed histogram). The x axis shows the proportion of correct responses for the age question colored by gender (female: pink; male: blue). The vertical lines represent the mean proportion correct for both groups.
We can see that numerically the males perform better than the females. However, we can conduct an ANOVA to test whether this difference (even though it is not the difference we predicted) is significant:
```{r}
summary(aov(p_cor_age ~ gender, data=d))
```
We can see that there is not a significant difference in performance between males and females.
-->
<br>
<br>
<br>
<br>
<br>
### Demographic Breakdown
Below are some counts and descriptive statistics for our dataset:
```{r}
library(plyr)
count(d, 'gender')
count(d, 'gender_text')
count(d, 'country')
count(d, 'country_text')
count(d, 'hearing') # no exclusion based on this?
count(d, 'eng_first')
count(d, 'know_corp_lang')
count(d, 'monolingual') # no exclusion based on this?
library(psych)
describe(d)
```
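A maintenance note: plyr has since been superseded by dplyr, so equivalent counts are also available there if you prefer (a sketch; note that dplyr's `count()` takes unquoted column names rather than plyr's quoted strings):
```{r, eval=FALSE}
# equivalent counts with dplyr; not evaluated here to avoid
# masking plyr::count(), which is already loaded above
library(dplyr)
count(d, gender)
count(d, country)
```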