
2. Input preparation

Catalina Vallejos edited this page Jun 7, 2020 · 12 revisions

‼️ THIS WIKI IS NO LONGER MAINTAINED - PLEASE REFER TO THE VIGNETTE INSTEAD ‼️

The input dataset for BASiCS must be stored as a SingleCellExperiment object (see the SingleCellExperiment Bioconductor package). To use BASiCS on an existing SingleCellExperiment object, see the section "Using an existing SingleCellExperiment object" below.

Using the newBASiCS_Data function

The newBASiCS_Data function can be used to create the required SingleCellExperiment object based on the following information:

  • Counts: a matrix of raw expression counts with dimensions $q \times n$ ($q$ genes in rows, $n$ cells in columns). Within this matrix, $q_0$ rows must correspond to biological genes and $q-q_0$ rows must correspond to technical spike-in genes. Gene names must be stored as rownames(Counts).

  • Tech: a vector of TRUE/FALSE elements with length $q$. If Tech[i] is FALSE, gene i is biological; otherwise it is a spike-in gene. The elements must follow the same gene order as the rows of the Counts matrix.

  • SpikeInfo: a data.frame with $q-q_0$ rows. The first column must contain the names of the spike-in genes (as in rownames(Counts)). The second column must contain the input number of molecules for the spike-in genes (amount per cell).

  • BatchInfo (optional argument): a vector of length $n$ indicating the batch structure when cells have been processed in multiple batches.
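The constraints listed above can be checked before building the object. The following is a minimal base R sketch, not part of BASiCS: the helper check_basics_inputs and the toy inputs are hypothetical names introduced here for illustration only.

```r
# Hypothetical helper (NOT a BASiCS function): verifies that the inputs
# are mutually consistent before they are passed to newBASiCS_Data.
check_basics_inputs <- function(Counts, Tech, SpikeInfo, BatchInfo = NULL) {
  stopifnot(
    is.matrix(Counts),
    !is.null(rownames(Counts)),               # gene names are required
    is.logical(Tech),
    length(Tech) == nrow(Counts),             # one flag per gene (q)
    nrow(SpikeInfo) == sum(Tech),             # q - q_0 spike-in rows
    setequal(SpikeInfo[, 1], rownames(Counts)[Tech]),
    is.null(BatchInfo) || length(BatchInfo) == ncol(Counts)  # one label per cell (n)
  )
  invisible(TRUE)
}

# Toy inputs: 3 biological genes, 2 spike-in genes, 4 cells
ToyCounts <- matrix(1:20, nrow = 5,
                    dimnames = list(c("Gene1", "Gene2", "Gene3",
                                      "Spike1", "Spike2"), NULL))
ToyTech <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
ToySpikeInfo <- data.frame(SpikeID = c("Spike1", "Spike2"),
                           SpikeInput = c(10, 25))

InputsOK <- check_basics_inputs(ToyCounts, ToyTech, ToySpikeInfo,
                                BatchInfo = c(1, 1, 2, 2))
```

The helper stops with an error at the first violated condition, which makes it easier to see which input needs attention.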

For example, the following code simulates a dataset with 50 genes (40 biological and 10 spike-in) and 40 cells.

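library(BASiCS)

# Simulate raw counts: 50 genes (rows) x 40 cells (columns)
set.seed(1)
Counts <- matrix(rpois(50*40, 2), ncol = 40)
rownames(Counts) <- c(paste0("Gene", 1:40), paste0("Spike", 1:10))

# The first 40 genes are biological, the last 10 are spike-ins
Tech <- c(rep(FALSE, 40), rep(TRUE, 10))

# Simulated input number of molecules for each spike-in gene
set.seed(2)
SpikeInput <- rgamma(10, 1, 1)
SpikeInfo <- data.frame("SpikeID" = paste0("Spike", 1:10), "SpikeInput" = SpikeInput)

# No batch effect
DataExample <- newBASiCS_Data(Counts, Tech, SpikeInfo)

# With batch effect (two batches of 20 cells each)
DataExample <- newBASiCS_Data(Counts, Tech, SpikeInfo,
                              BatchInfo = rep(c(1,2), each = 20))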

Single-cell RNA sequencing data typically require filtering (quality control) before analysis, in order to remove cells and/or transcripts with very low expression counts. The function BASiCS_Filter can be used to perform this filtering. For examples, refer to help(BASiCS_Filter). Additional tools for this purpose can be found in the scater Bioconductor package.
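To illustrate the kind of filtering meant here, the sketch below removes low-count cells and genes from a toy matrix using base R only. It is not the implementation of BASiCS_Filter, and the thresholds MinCountsPerCell and MinCountsPerGene are illustrative assumptions rather than package defaults.

```r
# Illustrative thresholds (assumptions, NOT the defaults of BASiCS_Filter)
MinCountsPerCell <- 2
MinCountsPerGene <- 2

# Toy count matrix: 4 genes (rows) x 3 cells (columns)
ToyCounts <- matrix(c(5, 0, 3,
                      0, 0, 1,
                      2, 0, 4,
                      7, 0, 6),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("Gene", 1:4),
                                    paste0("Cell", 1:3)))

# Drop cells with very low total counts first
KeepCells <- colSums(ToyCounts) >= MinCountsPerCell

# Then drop genes with very low total counts across the remaining cells
KeepGenes <- rowSums(ToyCounts[, KeepCells, drop = FALSE]) >= MinCountsPerGene

FilteredCounts <- ToyCounts[KeepGenes, KeepCells, drop = FALSE]
```

If a filter like this is applied by hand, Tech and SpikeInfo would need to be subset with the same gene index, and BatchInfo with the same cell index, so that all inputs stay aligned.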

NOTE: The input number of molecules for each spike-in gene should be calculated using experimental information. For each spike-in gene $i$, we use

$$ \mu_{i} = C_i \times 10^{-18} \times (6.022 \times 10^{23}) \times V \times D$$

where,

  • $C_i$ is the concentration of the spike $i$ in the ERCC mix (see here)
  • $10^{-18}$ converts attomoles to moles
  • $6.022 \times 10^{23}$ is Avogadro's number (mol $\rightarrow$ molecules)
  • $V$ is the volume added into each chamber (in nL)
  • $D$ is a dilution factor

For example, for the 96-well plate in the Fluidigm C1 system, $V = 6.7 \times 10^{-3}$ (see here).
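As a worked illustration of the formula, the base R sketch below computes $\mu_i$ for three spike-in genes. The concentrations and the dilution factor are made-up values chosen for illustration (they are not real ERCC concentrations), and the units of $V$ are assumed to match those of the concentration table being used.

```r
# Hypothetical concentrations, for illustration only (NOT real ERCC values)
Concentration <- c(Spike1 = 1000, Spike2 = 250, Spike3 = 0.5)

Avogadro <- 6.022e23   # molecules per mole
V <- 6.7e-3            # volume value quoted above for the C1 96-well plate
D <- 1 / 50000         # hypothetical dilution factor

# mu_i = C_i * 10^-18 * Avogadro * V * D
SpikeInput <- Concentration * 1e-18 * Avogadro * V * D

# Two-column layout expected for SpikeInfo: names first, molecules second
SpikeInfo <- data.frame(SpikeID = names(SpikeInput),
                        SpikeInput = unname(SpikeInput))
```

Because the formula is linear in $C_i$, spike-in genes keep the same relative abundances after the conversion; only the overall scale depends on $V$ and $D$.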

Using an existing SingleCellExperiment object

To convert an existing SingleCellExperiment object (Data) into one that can be used within BASiCS, meta-information must be stored in the object.

With Spikes

  • SingleCellExperiment::isSpike(Data, SpikeType) <- Tech: the logical vector indicating biological/technical genes (see above) must be stored in the isSpike slot. SpikeType is a string containing the name of the spike-ins used (e.g. "ERCC"). Note: If Data contains more than one type of spike-ins (length(SingleCellExperiment::spikeNames(Data)) > 1), unused spike-in types should be removed (see help(isSpike, package = SingleCellExperiment)).

  • colData(Data)$BatchInfo <- BatchInfo: the vector indicating the batch structure (see above) must be stored in the colData slot.

  • metadata(Data) <- list(SpikeInput = SpikeInfo[,2]): the input number of molecules for the spike-in genes (second column of SpikeInfo, see above) must be stored in the metadata slot.

Once this additional information is included, the object can be used within BASiCS.

Without Spikes

In many cases (e.g. droplet-based scRNA-Seq data), spike-in genes are not present. To run BASiCS on such a SingleCellExperiment object, only the BatchInfo metadata needs to be stored in the object. Here is an example of how to create a SingleCellExperiment object that does not contain spike-in genes:

set.seed(1)
Counts <- matrix(rpois(50*40, 2), ncol = 40)
rownames(Counts) <- paste0("Gene", 1:50)

# Create SingleCellExperiment object containing batch information
library(SingleCellExperiment)
DataExample <- SingleCellExperiment(assays = list(counts = Counts),
                                   colData = data.frame(BatchInfo = rep(c(1,2), each = 20)))