Note: the repository has been renamed to sc_mixology. The old link redirects to the current repository.
sc_mixology uses three human lung adenocarcinoma cell lines, HCC827, H1975 and H2228, which were cultured separately and then processed in three different ways. Firstly, single cells from each cell line were mixed in equal proportions, with libraries generated using three different protocols: CEL-seq2, Drop-seq (with Dolomite equipment) and 10X Chromium. Secondly, single cells were sorted from the three cell lines into 384-well plates, with an equal number of cells per well in different combinations (generally 9 cells, but with some 90-cell population controls). Thirdly, RNA was extracted in bulk from each cell line, mixed in 7 different proportions, diluted to single cell equivalent amounts ranging from 3.75 pg to 30 pg, and processed using CEL-seq2 and SORT-seq. ERCC spike-in controls were present in samples processed using the two plate-based technologies (CEL-seq2 and SORT-seq).
Raw data from this series of experiments are available under GEO accession number GSE118767. The processed count data obtained from scPipe are stored in R objects of the SingleCellExperiment class. Below are instructions for getting the count data and metadata (including annotations) for each dataset. All data have passed sample quality control; no gene filtering has been applied.
You can find the R object files in the `data` folder. For example:
load("data/sincell_with_class.RData")
This will create three variables: `sce10x_qc`, `sce4_qc` and `scedrop_qc_qc`. Each contains the read counts after quality control processing for one platform: `sce10x_qc` for the 10X platform, `sce4_qc` for the CEL-seq2 platform and `scedrop_qc_qc` for the Drop-seq platform.
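To check which objects the file actually defines, one option is to load it into a separate environment and list what arrives. The sketch below assumes the working directory is the repository root and that the SingleCellExperiment package is installed; the variable names `rdata_path`, `env` and `loaded` are illustrative only.

```r
# Load the RData file into its own environment so the objects it defines
# can be inspected without touching the global workspace.
rdata_path <- "data/sincell_with_class.RData"  # path used in the load() call above

if (file.exists(rdata_path)) {
  library(SingleCellExperiment)  # needed so dim() works on the loaded objects
  env <- new.env()
  loaded <- load(rdata_path, envir = env)  # load() returns the object names
  print(loaded)
  # Report the class and dimensions (genes x cells) of each loaded object.
  for (nm in loaded) {
    obj <- get(nm, envir = env)
    cat(nm, ":", class(obj)[1], "with", nrow(obj), "genes and",
        ncol(obj), "cells\n")
  }
} else {
  # Assumption: the script is run from the repository root.
  message("File not found: ", rdata_path)
}
```

Loading into an environment also makes it easy to pick out a single dataset, for example `env$sce10x_qc`, without keeping the other two in the global workspace.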
The true label is stored in `colData()`. For single cells, the label is in the column `cell_line_demuxlet`. For single cell mixtures, the ground truth is the combination of the three cell lines, stored in the columns `H1975`, `H2228` and `HCC827`; you can merge these and use the combination as the label, for example `paste(sce_SC1_qc$H1975,sce_SC1_qc$H2228,sce_SC1_qc$HCC827,sep="_")`. Similarly, the ground truth for the RNA mixture is the proportion of RNA from each cell line, stored in the columns `H2228_prop`, `H1975_prop` and `HCC827_prop`, which can be merged into one column and used as the label, for example `paste(sce2_qc$H2228_prop,sce2_qc$H1975_prop,sce2_qc$HCC827_prop,sep="_")`.
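The label merging described above can be tried on a small stand-in table before running it on the real objects. This is a minimal sketch: `toy_coldata` and its values are invented for illustration and are not taken from the datasets, and a plain `data.frame` is used in place of the real `colData()`.

```r
# Toy stand-in for the colData() of a cell-mixture dataset.
# The values below are made up for illustration only.
toy_coldata <- data.frame(
  H1975  = c(3, 9, 0, 3),
  H2228  = c(3, 0, 9, 3),
  HCC827 = c(3, 0, 0, 3)
)

# Merge the three per-cell-line columns into one ground-truth label.
toy_coldata$group <- paste(toy_coldata$H1975, toy_coldata$H2228,
                           toy_coldata$HCC827, sep = "_")

# Count how many samples share each combination.
print(table(toy_coldata$group))
```

On a SingleCellExperiment object the same pattern should apply directly, since `$` reads from and writes to `colData()`; the column name `group` is an arbitrary choice.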
To access count data from a SingleCellExperiment object, use the `counts(sce)` function:
counts(sce10x_qc)[1:5, 1:5]
To access sample information from a SingleCellExperiment object, use the `colData(sce)` function:
head(colData(sce10x_qc))
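For readers without the data files to hand, the two accessors can also be tried on a tiny object built from scratch. The sketch below assumes the SingleCellExperiment package is installed; `toy_counts`, `toy_sce` and all values in them are invented, and only the column name `cell_line_demuxlet` comes from the description above.

```r
library(SingleCellExperiment)

# Toy 4-gene x 3-cell count matrix; names and values are invented.
toy_counts <- matrix(c(5, 0, 2, 1,
                       0, 3, 0, 4,
                       1, 1, 6, 0),
                     nrow = 4,
                     dimnames = list(paste0("gene", 1:4),
                                     paste0("cell", 1:3)))

# Build a small SingleCellExperiment with one annotation column.
toy_sce <- SingleCellExperiment(
  assays  = list(counts = toy_counts),
  colData = DataFrame(cell_line_demuxlet = c("HCC827", "H1975", "H2228"),
                      row.names = colnames(toy_counts))
)

print(counts(toy_sce)[1:2, 1:2])          # subset of the count matrix
print(colData(toy_sce))                   # per-cell annotations
print(colSums(counts(toy_sce)))           # total counts per cell
print(table(toy_sce$cell_line_demuxlet))  # cells per annotated cell line
```

The last two lines show typical follow-up summaries: total counts per cell, and the number of cells assigned to each cell line.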
You can find an R notebook named `data_explore_mixture.Rmd` in the `script/data_QC_visualization` folder; it includes code for analysing the cell mixture and RNA mixture datasets.
The [script] folder contains scripts that can reproduce the analysis and figures from our paper: Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments.
Note: The `ggtern` package, which was used to generate the ternary plots, has known issues with recent versions of `ggplot2`, and the relevant code may break if you have updated the `ggplot2` package.
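Before running the ternary plot code, it may help to check which versions of the two packages are installed. This is a small sketch using base R only; the helper name `installed_version` is hypothetical, and no particular version pairing is implied to work or fail.

```r
# Return the installed version of a package as a string,
# or NA if the package is not installed.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# Print the versions of the two packages involved in the ternary plots.
for (pkg in c("ggplot2", "ggtern")) {
  cat(pkg, ":", installed_version(pkg), "\n")
}
```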