
awesome-replicability-data

This repository collects publicly available datasets for replicability analysis. We curate a collection of multi-site individual-level replication studies, paired individual-level datasets of original and replication studies, and one-sided pairs with individual-level data for the replication study only. We are non-selective in collecting these datasets, i.e., both successful and failed replications are included as long as their data are available.

This repository accompanies the following papers:

  1. "Diagnosing the role of observable distribution shift in scientific replications" by Ying Jin, Kevin Guo and Dominik Rothenhäusler. [Reference]
  2. "Beyond reweighting: On the predictive role of covariate shift in effect generalization" by Ying Jin, Naoki Egami and Dominik Rothenhäusler. [Reference] [Replication code]

💡 Update Dec 2024: Accompanying our new "Beyond" paper, we have added processing scripts for two large-scale, multi-site replication studies: the Pipeline project and the ManyLabs 1 project!

Please feel free to contact us at yjin[at]hcp[dot]med[dot]harvard[dot]edu, or open an issue if you have suggestions for replication datasets not collected here!

Related resources

Reproduction code. Code reproducing the analysis of distribution shift and generalization in the two multi-site, multi-hypothesis replication projects of Paper 2 is available in a separate GitHub repository [predictive-shift].

R package. Also accompanying Paper 1, our R package repDiagnosis provides statistical tools for estimating the contribution of observable distribution shifts, such as covariate and mediation shifts, in paired replication studies. Paired datasets 1, 3, and 8 below are cleaned and pre-loaded in the R package.

Interactive diagnosis app. Try the interactive analysis tools from Paper 1 in our online R Shiny app! Get started quickly with the pre-loaded datasets in the app (paired datasets 1, 3, and 8 below). You can also diagnose your own replication study, or probe the generalizability of a single study.

Example diagnosis. In analysis.html, we provide an analysis report for other paired studies not elaborated on in the "diagnosing" Paper 1.

Contents

1. Multi-site, multi-hypothesis datasets. Data list, Data details.

2. Complete, paired datasets. Data list, Data details.

3. One-sided datasets. Data list, Data details.

List of multi-site, multi-hypothesis replication datasets

| Name | Original paper | Original data/repo | Processing code |
| --- | --- | --- | --- |
| 1. Pipeline | Schweinsberg, et al., 2016 | OSF link | Folder link |
| 2. ManyLabs1 | Klein, et al., 2014 | OSF link | Folder link |

List of complete, paired datasets

Below we list papers and datasets for original-replication pairs where both studies have publicly available individual-level data. The Processed column links to the folder in this repository (if any) containing data we processed from the public sources. Clicking a link in the Name column jumps to a summary of the study.

| Name | Original paper | Original data/repo | Replication paper | Replication data/repo | Processed |
| --- | --- | --- | --- | --- | --- |
| 1. Covid information | Pennycook, et al., 2020 | OSF link | Roozenbeek, et al., 2021 | OSF link | Folder link |
| 2. Empathy and SES | Côté, et al., 2013 | no data | Babcock, et al., 2017 (two reps) | OSF link | Folder link |
| 3. EMDR and misinformation | Houben, et al., 2018 | OSF link | Calvillo and Emami, 2019 | OSF link | Folder link |
| 4. Self-centrality and mind-body practice | Gebauer, et al., 2018 | yoga, meditation, analysis | Vaughan-Johnston, et al., 2021 | yoga, meditation | Folder link |
| 5. Queueing design | Shunko, et al., 2018 | data zipfile | Long, et al. | data zipfile | Folder link |
| 6. Multi-lab disgust and moral judgement | Eskine, et al., 2011 | no data | Ghelfi, et al., 2020 | OSF link (to all studies) | Folder link |
| 7. Pain and cooperation | Bastian, et al., 2014 | OSF link | Prochazka, et al., 2022 | OSF link | Folder link |
| 8. Cleanliness and moral judgement | Schnall, et al., 2008 | OSF link | Johnson, et al., 2014 | OSF link | Folder link |
| 9. Lie and foreign language | Suchotzki and Gamer, 2018 | OSF link | Frank, et al., 2019 | OSF link | Folder link |
| 10. Multi-lab ego depletion | Rep 1: Hagger, et al., 2016 | OSF link | Rep 2: Dang, et al., 2020 | OSF link | |
| 11. Honesty and time | Shalvi, et al., 2012 | data in replication OSF link | Van der Cruyssen, et al., 2020 | Rep 1, Rep 2 | Folder link |

List of one-sided datasets

Below we collect one-sided original-replication study pairs, i.e., pairs where the replication study has individual-level data while the original study has only summary statistics available. We include such datasets when the original paper reports rich summary statistics. These summary statistics, together with the individual-level data of the replication study, are processed and stored at the links in the Processed column; a sketch of how such pairs can be compared follows the table. Clicking a link in the Name column jumps to a summary of the study.

| Name | Original paper | Replication paper | Replication data/repo | Processed |
| --- | --- | --- | --- | --- |
| 1. Climate change misinformation | van der Linden, et al., 2015 | Williams and Bond, 2020 | OSF link | Folder link |
| 2. Pain-tolerance metaphor | Sierra, et al., 2016 | Pendrous, et al., 2020 | OSF link | Folder link |
| 3. Body dissatisfaction | Martijn, et al., 2010 | Glashouwer, et al., 2019 | Database link | Folder link |
| 4. Priming and exercise | Pottratz, et al., 2021 | Timme, et al., 2022 | OSF link | Folder link |
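Because only summary statistics are available on the original side, a typical comparison reconstructs the original effect from the reported group means, standard deviations, and sample sizes, and contrasts it with the same effect recomputed from the replication's unit-level data. Below is a minimal sketch of this in Python; all numbers, file names, and column names are hypothetical placeholders, not values from any dataset listed here.

```python
# Sketch: contrast an original effect reconstructed from reported
# summary statistics with the same effect recomputed from the
# replication's unit-level data. All values and names are hypothetical.
import numpy as np
import pandas as pd

# Original study: only summary statistics are available.
m1, s1, n1 = 4.20, 1.10, 110   # treated mean, sd, n (as reported)
m0, s0, n0 = 3.80, 1.20, 105   # control mean, sd, n (as reported)
orig_effect = m1 - m0
orig_se = np.sqrt(s1**2 / n1 + s0**2 / n0)

# Replication study: unit-level data are available.
rep = pd.read_csv("replication_units.csv")  # hypothetical file
t = rep.loc[rep["treatment"] == 1, "outcome"]
c = rep.loc[rep["treatment"] == 0, "outcome"]
rep_effect = t.mean() - c.mean()
rep_se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))

print(f"original    {orig_effect:.2f} +/- {1.96 * orig_se:.2f}")
print(f"replication {rep_effect:.2f} +/- {1.96 * rep_se:.2f}")
```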

Details of multi-site, multi-hypothesis studies

1. Pipeline project dataset

  • Background. The Pipeline project is a pre-publication, pre-registered collaborative project in which 25 labs across the world replicated 10 experiments on various moral judgement effects. In our analysis, we focus on the replicability of average treatment effect estimates obtained with either a two-group t-test or a paired t-test (see the sketch after this list).
  • Experiment protocols. Replicator labs were explicitly selected for their expertise and access to subject populations that were theoretically expected to exhibit the original effects. All teams followed the same experimental protocol with locally recruited participants, recording the same set of variables. This provides an opportunity to study variation in results, especially unexpected distribution shifts, under these controlled conditions.
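As a reference for how these per-site estimates are formed, here is a minimal sketch in Python; the file and column names (site, treatment, outcome, score_a, score_b) are hypothetical placeholders, and the repository's processing scripts define the actual variable names for each experiment.

```python
# Minimal sketch of the two per-site analyses described above.
# All file and column names are hypothetical placeholders.
import pandas as pd
from scipy import stats

df = pd.read_csv("pipeline_experiment.csv")  # hypothetical file

for site, g in df.groupby("site"):
    # Two-group design: difference in means between treated and control.
    treated = g.loc[g["treatment"] == 1, "outcome"]
    control = g.loc[g["treatment"] == 0, "outcome"]
    ate_hat = treated.mean() - control.mean()
    t_stat, p_val = stats.ttest_ind(treated, control, equal_var=False)
    print(f"{site}: ATE = {ate_hat:.3f}, t = {t_stat:.2f}, p = {p_val:.3f}")

    # Paired design: within-participant difference between two conditions,
    # assuming the two measurements are stored as score_a and score_b.
    t_pair, p_pair = stats.ttest_rel(g["score_a"], g["score_b"])
    print(f"{site}: paired t = {t_pair:.2f}, p = {p_pair:.3f}")
```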

2. ManyLabs1 dataset

  • Background. The ManyLabs1 project is a renowned pre-registered collaborative project in which 36 labs replicated 13 experiments in psychological science, with more than 6000 participants in total. In our analysis, we focus on the replicability of average treatment effect estimates obtained with a two-group t-test; site-level estimates can then be pooled, as sketched after this list.
  • Experiment protocols. Replicator labs participated in the project voluntarily. All teams followed the same experimental protocol with locally recruited participants, recording the same set of variables. Here, sites were selected by convenience, i.e., "naturally" rather than with explicit intention.
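With many sites estimating the same effect, per-site results are commonly combined by inverse-variance weighting. The sketch below shows simple fixed-effect pooling of site-level estimates; it is a generic illustration, not the specific analysis of Paper 2, and the input numbers are made up.

```python
# Inverse-variance (fixed-effect) pooling of site-level estimates.
# `estimates` and `ses` are hypothetical per-site effect estimates
# and their standard errors.
import numpy as np

def pool_fixed_effect(estimates, ses):
    w = 1.0 / np.asarray(ses, dtype=float) ** 2   # inverse-variance weights
    pooled = np.sum(w * np.asarray(estimates)) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    return pooled, pooled_se

est, se = pool_fixed_effect([0.21, 0.05, 0.13], [0.08, 0.07, 0.10])
print(f"pooled estimate: {est:.3f} +/- {1.96 * se:.3f}")
```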


Details of paired studies and datasets

1. Covid information study dataset

  • Background. This study investigates the effect of a 'nudge' toward considering the truthfulness of information on truth discernment when sharing COVID-related news. Treated participants were asked to rate the accuracy of several headlines, and all participants rated how likely they were to share them on social media.

  • Sample sizes. The original study by Pennycook et al. recruited n = 1145 participants, while the replication study by Roozenbeek et al. had sample size N = 1583.

  • Variables. The outcome variable is ratings, the rated willingness to share the headlines. In addition, both studies measured demographic information including age, gender, education, and ethnicity. Other measures include cognitive reflection (crt), science knowledge (sciknow), the medical maximizer-minimizer scale (mms), etc. The binary treatment is encoded in the treatment column, and real is a binary indicator of whether the information is correct.

  • Results. The original study found a statistically significant interaction of treatment and news truthfulness, i.e., treated participants were less willing to share headlines perceived as less accurate. The replication study failed to detect such an effect in its first stage with N = 701, but found a significant, smaller effect after collecting a second round of data, with pooled N = 1583.
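For reference, the headline quantity in both studies is the treatment-by-truthfulness interaction. Below is a minimal sketch of that regression, assuming the column names documented above; the file name is a hypothetical placeholder.

```python
# Sketch of the treatment-by-truthfulness interaction analysis,
# using the column names documented above; the file name is a
# hypothetical placeholder.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("covid_replication.csv")
# The coefficient on treatment:real estimates how the nudge shifts
# the gap in sharing intentions between true and false headlines.
fit = smf.ols("ratings ~ treatment * real", data=df).fit()
print(fit.summary())
```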

2. Empathy and SES dataset

  • Background. Babcock et al. conducted two replications of a study from Côté et al. on the effect of inducing empathy on utilitarian moral judgment across socioeconomic status (SES). Treated participants received an empathy nudge, and then all participants completed an allocation task.

  • Sample sizes. The original sample size was n = 91. The first replication study had sample size N1 = 230, and the second had N2 = 300.

  • Variables. The primary outcome is Decision_DV, i.e., how many dollars participants would take away from the 'lose' member in the allocation task, as a measure of utilitarian moral judgment. Control variables including age, gender, ethnicity, income, religiosity, political orientation, etc., were also collected, as were intermediate outcomes on how compassionate, moved, and sympathetic participants felt towards the 'lose' member. We clean the datasets for the two replication studies separately.

  • Results. The original study found a significant interaction of experimental condition and SES. The first replication study did not replicate this result, while the second did.

3. EMDR and misinformation dataset

  • Background. This study concerns the effect of eye movements on susceptibility to false memories. These eye movements are a core component of "eye movement desensitization and reprocessing" (EMDR), a standard intervention for posttraumatic stress disorder.

  • Sample sizes. The original study by Houben et al. had sample size n = 82, while the direct replication by Calvillo et al. had sample size N = 120.

  • Variables. The outcome variables are the total number of correct answers and the total amount of misinformation after the experiment. In addition, both studies collected gender, age, and pre- and post-intervention vividness of memory and emotionality; the depression measure differs between the two studies (BDI versus BDI-II).

  • Results. The original study found a statistically significant effect of eye movement on increasing false memories, while the replication study did not.

4. Self-centrality and mind-body practice dataset

  • Background. This study investigates whether mind-body practices (yoga in experiment 1 and meditation in experiment 2) increase self-enhancement. In experiment 1, waves of local yoga participants were randomly assigned to treatment and control by week. In experiment 2, participants were recruited from an undergraduate psychology subject pool, with two waves completed offline and two online.

  • Sample sizes. The original study had n1 = 93 for experiment 1 and n2 = 162 for experiment 2 (potentially with repeated measures over a few weeks). The replication study had N1 = 97 and N2 = 300 for the two experiments.

  • Variables. There are several outcome variables, including self-centrality, self-enhancement, self-esteem, etc. In our folder, we provide cleaned datasets with easier-to-understand column names, along with the data cleaning scripts (adapted from the data sources) for reproducibility.

  • Results. Experiment 1 showed no significant effect of yoga on self-centrality, but did (largely) replicate the effects on self-enhancement, self-esteem, and communal narcissism; Vaughan-Johnston et al. attributed the discrepancy to sampling differences. Experiment 2 showed no significant effect of meditation on self-centrality; frequentist and Bayesian analyses disagreed regarding self-enhancement; however, the replication found much stronger evidence for well-being effects than the original study.

5. Queueing design and service time dataset

  • Background. This study investigates the impact of queue design on worker productivity in service systems with human servers, varying the design between multiple parallel queues and a single pooled queue.

  • Sample sizes. The original study recruited n1 = 248 participants from a public university in the US and n2 = 481 participants on MTurk. The replication study recruited N1 = 246 and N2 = 252 participants for its two rounds.

  • Variables. The outcome variable is the median speed. The treatment variable is the structure of the queue. Other baseline variables were also measured, including age, gender, the device used in the experiment, and the participant's managerial experience.

  • Results. The original study found that the single-queue structure slows down servers, while the replication study failed to find such an effect.

6. Multi-lab disgust and moral judgement dataset

  • Background. This is a multi-lab replication of an original study by Eskine et al. (2011); to our knowledge, unit-level data for the original study are not publicly available. The studies examine the effect of gustatory disgust on moral judgement: participants were randomly assigned to bitter, neutral (control), or sweet beverages, and then judged the moral wrongness of six vignettes. We follow the ordering on OSF to clean the datasets and preserve common demographic, manipulation check, and outcome variables.

  • Sample sizes. The original study had sample size n = 57, while the replication studies had N = 1137 participants in total across k = 11 studies.

  • Variables. The outcome variable is the average moral rating of the six vignettes. The treatment variable is condition, coded as dummysweet, dummybitter, and dummywater in the cleaned datasets. Baseline covariates include religiosity, gender, age, years in college, major, ethnicity, political orientation, etc.; we preserve gender, age, and political orientation for consistency in the cleaned data. Manipulation checks evaluating the intended effect of the beverages on subjective ratings (bitter, disgusting, neutral, and sweet) are also included, named check_... in the cleaned data.

  • Results. The original study showed that gustatory disgust triggers a significantly heightened sense of moral wrongness. In the multi-lab replication, the effect size estimates were all smaller than in the original study; some were in the opposite direction; all had 95% confidence intervals containing zero (see the sketch after this list).
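Since each lab is summarized by a standardized effect size with a 95% confidence interval, here is a minimal sketch of one such computation for a single lab, using a pooled-SD Cohen's d with a common normal-approximation interval; the file name and the outcome column name moral_rating are hypothetical placeholders, while the dummy-coded condition columns follow the description above.

```python
# Cohen's d (bitter vs. water) with a normal-approximation 95% CI
# for one lab. The file name and outcome column are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("ghelfi_lab01.csv")
bitter = df.loc[df["dummybitter"] == 1, "moral_rating"]
water = df.loc[df["dummywater"] == 1, "moral_rating"]

n1, n2 = len(bitter), len(water)
sp = np.sqrt(((n1 - 1) * bitter.var(ddof=1) +
              (n2 - 1) * water.var(ddof=1)) / (n1 + n2 - 2))
d = (bitter.mean() - water.mean()) / sp
se = np.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
print(f"d = {d:.3f}, 95% CI [{d - 1.96 * se:.3f}, {d + 1.96 * se:.3f}]")
```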

7. Pain and cooperation dataset

  • Background. Experiment 2 of Bastian et al. (2014) studied the effect of sharing painful experience on intergroup cooperation. Small groups of participants (2-6 people each) performed either two painful or two painless tasks and then played an economic game. Prochazka et al. (2022) conducted a non-preregistered pilot direct replication and a second, preregistered direct replication, with group sizes fixed at three.

  • Sample sizes. The original study had sample size n = 62. The pilot replication had N1 = 153 participants from the Czech Republic, and the second, preregistered replication had N2 = 158 students from Slovakia.

  • Variables. The outcome variable is cooperation, the average score from the six games. The treatment variable is condition. We cleaned the datasets by preserving overlapping variables, while the original data additionally contains group size information. Baseline covariates include age and gender. After the experiments, intermediate outcomes such as the level of pain and unpleasantness of sensations were measured as a manipulation check.

  • Results. The original study found that shared pain increases cooperation among group members. Both replication studies failed to replicate this finding.

8. Cleanliness and moral judgement dataset

  • Background. This study investigates the impact of physical cleanliness on the severity of moral judgement. Participants were randomly assigned to be primed with the concept of cleanliness (Exp. 1) or to wash their hands after experiencing disgust (Exp. 2), and then rated six moral vignettes.

  • Sample sizes. The original study had n1 = 40 for Exp.1 and n2 = 44 for Exp.2. The replication study had N1 = 219 for Exp.1 and N2 = 132 for Exp.2.

  • Variables. We cleaned the datasets and preserved covariates common to both studies. The outcome variable is vignette, the mean rating across all vignettes. The treatment variable is condition, with treatment coded as 1. Other variables include emotionality measures collected after the experiments.

  • Results. The original study found statistically significant effects in both experiments, while Johnson et al. failed to replicate either of them.

9. Lie and language dataset

  • Background. This study investigates the impact of foreign versus native language on lying. In the original study, German-speaking participants took a lie test in which questions were presented randomly in German or English, and they answered truthfully or deceptively in the different languages. In the replication study, participants were Dutch-speaking.

  • Sample sizes. The original study had n = 41 participants, and the replication study had N = 63.

  • Variables. The measured outcome is the response time for truth- or lie-telling answers in both languages. In our cleaned data, each row contains a participant's (indicated by ID) mean response time for questions in each combination of Emotionality, Veracity (Lie or Truth), and Language, as well as the participant's emotionality rating for each (Emotionality times Language) category; a small aggregation sketch follows this list. Due to limited access, only the replication data contains demographic features, including age, gender, major, and language proficiency, as introduced in Frank, et al., 2019.

  • Results. The original study showed smaller reaction time differences between lying and truth-telling in the foreign compared to the native language condition, mostly driven by prolonged truth responses. The replication study reached a statistically significant conclusion in the same direction, yet with a smaller effect size.
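As a small illustration of working with the cleaned long-format data, the sketch below recomputes the quantity discussed above, the lie-minus-truth response-time gap in each language. The file name and the response-time column RT are hypothetical placeholders; ID, Veracity, and Language follow the column names described in the Variables bullet.

```python
# Average lie-minus-truth response-time gap per language, computed
# from the cleaned long-format data. The file name and the RT column
# are hypothetical; other columns follow the description above.
import pandas as pd

df = pd.read_csv("lie_language_clean.csv")
mean_rt = (df.groupby(["ID", "Language", "Veracity"])["RT"]
             .mean()
             .unstack("Veracity"))  # columns: Lie, Truth
gap = (mean_rt["Lie"] - mean_rt["Truth"]).groupby(level="Language").mean()
print(gap)
```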

10. Multi-lab ego depletion dataset

There are two multi-lab replications: Hagger, et al. [2016] failed to replicate the original effect, while Dang, et al. [2020] succeeded. Dang, et al. [2020] also pointed out that inconsistent implementation of the intervention may be a potential reason for the replication failure in Hagger, et al. [2016]. Both OSF links contain datasets for each lab, which include individual-level characteristics.

11. Honesty and time dataset

  • Background. This study investigates the impact of time pressure on cheating. In the original study, participants privately rolled a die and were paid according to the number they reported (which did not have to be truthful). The reported number is used as the outcome.

  • Sample sizes. The original study had n = 72. The replication study consisted of two experiments: the first had larger sessions with N1 = 426, and the second had the same session size as the original study with N2 = 297.

  • Variables. The outcome of interest is the reported die number. The treatment variable (= 1) indicates time pressure (i.e., having to report the die number within a short time). The original study's data contain only gender as demographic information. The replication study's data contain age, gender, education, etc., as demographics, as well as ratings of participants' belief in the financial incentive and the anonymity of their die roll. The original study and the first replication experiment collected participants' positive and negative feelings after the experiment; we preserve all such columns and place the common ones first.

  • Results. The original study found that time pressure increases cheating, while neither of the replication studies replicated this conclusion.

Details of one-sided studies and datasets

1. Climate change misinformation dataset

  • Background. This study investigates the impact of information communication on protecting against misinformation about climate change.

  • Sample sizes. The original study had n = 2167. The replication study had N = 792.

  • Variables. The outcome of interest is the perceived consensus, and there are multiple treatment conditions. We clean the replication dataset (with unit-level data) and record the sample means of demographic information from the original dataset, with the processing script included for reproducibility.

  • Results. The original study tested multiple hypotheses; the replication study replicated a subset of them.

2. Pain-tolerance metaphor dataset

  • Background. This study investigates the impact of common physical properties (such as 'cold') within a perseverance metaphor on increasing pain tolerance. Participants completed a cold pressor task before and after a randomly allocated metaphor-exercise intervention.

  • Sample sizes. The original study had n = 87. The replication study had N = 89.

  • Variables. The outcome of interest is the difference in pain tolerance. We save the replication dataset (with unit-level data) and the sample means of demographic information from the original dataset.

  • Results. The original study found that physical metaphor increases pain tolerance, while the replication study did not replicate this result.

3. Body dissatisfaction dataset

  • Background. This study investigates the impact of a computer-based evaluative conditioning (EC) procedure using positive social feedback on enhancing body satisfaction.

  • Sample sizes. The original study had n = 54. The replication study had N = 129.

  • Variables. The outcome of interest is the difference in body satisfaction and self-esteem before and after the intervention. We save the replication dataset (unit-level) and the sample means of demographic information from the original dataset.

  • Results. The conclusion in the original study was not successfully replicated.

4. Priming and exercise dataset

  • Background. This study investigates the impact of affective priming as a behavioral intervention on the enhancement of exercise-related affect.

  • Sample sizes. The original study had n = 54. The replication study had N = 53.

  • Variables. The outcome of interest is the change in exercise-related affect before and after the intervention. We save the replication dataset (unit-level) and the sample means of demographic information from the original dataset.

  • Results. The conclusion of the original study was not successfully replicated. The replication report emphasized heterogeneity across individuals as a potential factor in the failure.

Reference

Please use the following citations if you use this collection in your study, or if you use our software for analyzing replication studies.

@article{jin2023diagnosing,
  title={Diagnosing the role of observable distribution shift in scientific replications},
  author={Jin, Ying and Guo, Kevin and Rothenh{\"a}usler, Dominik},
  journal={arXiv preprint arXiv:2309.01056},
  year={2023}
}

@article{jin2024beyond,
  title={Beyond reweighting: On the predictive role of covariate shift in effect generalization},
  author={Jin, Ying and Egami, Naoki and Rothenh{\"a}usler, Dominik},
  journal={arXiv preprint arXiv:2412.08869},
  year={2024}
}
