This repository contains the code for the paper "Correcting batch effects in single-cell RNA sequencing data by matching mutual nearest neighbours" by Haghverdi et al. (2018).
Note: Further updates and development of the analysis and simulation code will take place at https://github.com/MarioniLab/FurtherMNN2018. If you have general questions regarding the code (i.e., not specifically involving the manuscript), please post your issues at the above repository instead.
- To generate the simulation figure in the main text (uneven composition of cell types) and in the supplement (identical composition), enter the `Simulations` directory. First run the source file `simulateBatches.R`, then run the source file `plotCorrections.R`.
- To generate the haematopoietic data figures in the main text, enter the `Haematopoiesis` directory. First run the source file `prepareData.R`, then the `plotCorrections.R` script.
- For the pancreas figures:
  - Download gene expression data from the four public data sets (gene-by-cell matrices, metadata and highly variable gene lists) by running the bash script `DownloadData.sh`. This will download a zipped file containing the raw count matrices for GSE86473. The remaining data sets are downloaded directly in the data processing scripts for the appropriate studies (denoted by the GEO/ArrayExpress accession numbers from the manuscript).
  - To run the data processing and normalization, move to the `pancreas` directory and execute the script `normalizePancreas.R`.
  - To calculate the highly variable genes, execute (or source) the script `findHighlyVariableGenes.R`.
  - To assign cell type labels according to the approaches described for each study, run the script `assignCellTypeLabels.R`.
  - To generate any of the pancreas data set results, you must first run the source file `PancreasProcessingCorrection.R` in the Pancreas folder. You will need to create a directory called 'results', into which all figures and batch-corrected data will be saved.
  - To correct the batch effects and generate the t-SNE plots and silhouette boxplots for the pancreas data sets (Figure 4 and Supplementary Figure 5), run the source file `PancreasCorrectionComparison.R` in the Pancreas folder. This will also generate the pancreas PCA plots and the entropy-of-mixing boxplots in the supplement (Supplementary Figure 5).
  - To compare the performance of MNN with locally variable batch effects versus a global batch effect setting (Supplementary Figure 6), run the source file `local_global_batchvect.R` in the Pancreas folder.
  - Differential expression testing figures can be generated by running the R Markdown document `PancreasDE_analysis.Rmd`, contained in the `PancreasDE` directory. The static version of the R notebook is also available as an HTML document that can be opened with any internet browser.
- The scripts to download and normalize the 10X droplet data can be found in `Droplet/`, specifically `pbmc_normalisation.R` for the 68,000 PBMCs and `tcell_4K_normalisation.R` for the 4,000 T cells. Please note that normalising 68,000 cells on a local machine requires substantial memory and CPU resources; it is recommended that the scripts in `Droplet/` are executed on an appropriate high-performance computing cluster. t-SNE and cluster assignment using community detection on the uncorrected data can be performed by running `uncorrected_68k_tSNE.R` and `assign_cell_types_68kPBMC.R`. To perform the equivalent tasks to generate the panels of Figure 5, run `combine_10X.R`, `pbmc68k_tSNE.R`, `PBMC_68k_plotting.R`, `assign_cell_types_68kPBMC_corrected.R` and `Corrected_PBMC_68K_assignCellLabels.R`.
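
The script ordering described above for the simulation, haematopoiesis and pancreas correction steps could be sketched as the bash script below. It is only an illustration, not part of the repository: it assumes that `Rscript` is on the `PATH`, that the script is started from the repository root, and that the 'results' directory belongs inside `Pancreas/`. The names `run_step` and `DRY_RUN` are hypothetical. By default it only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Sketch only: run the R scripts in the order given in this README.
# Assumptions (not confirmed by the repository): Rscript is on PATH and
# this script is started from the repository root.
set -euo pipefail

# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 actually runs them.
DRY_RUN="${DRY_RUN:-1}"

# run_step DIR SCRIPT...: run each R script, in order, from inside DIR.
# (Hypothetical helper, introduced here for illustration.)
run_step() {
    local dir="$1"; shift
    local script
    for script in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "(cd $dir && Rscript $script)"
        else
            # The subshell keeps the caller's working directory unchanged.
            (cd "$dir" && Rscript "$script")
        fi
    done
}

# Simulation figures: simulate the batches first, then plot.
run_step Simulations simulateBatches.R plotCorrections.R

# Haematopoietic data figures: prepare the data first, then plot.
run_step Haematopoiesis prepareData.R plotCorrections.R

# Pancreas correction: a 'results' directory must exist beforehand.
# Placing it inside Pancreas/ is an assumption.
if [ "$DRY_RUN" = "1" ]; then
    echo "mkdir -p Pancreas/results"
else
    mkdir -p Pancreas/results
fi
run_step Pancreas PancreasProcessingCorrection.R PancreasCorrectionComparison.R
```

Saving this under a name of your choice and running it with `DRY_RUN=0` would execute the steps instead of printing them. Because of `set -e`, a failing script stops the sequence, so later steps do not run on incomplete inputs.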