updated documentation relative to the Howitt et al preprint

plger · Oct 18, 2024 · f772d3a
1 parent 8373453
commit f772d3a

Showing 3 changed files with 12 additions and 2 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: scDblFinder
 Type: Package
 Title: scDblFinder
-Version: 1.19.6
+Version: 1.19.7
 Authors@R: c(
     person("Pierre-Luc", "Germain", email="pierre-luc.germain@hest.ethz.ch", role=c("cre","aut"), comment=c(ORCID="0000-0003-3418-4218")),
     person("Aaron", "Lun", email="infinite.monkeys.with.keyboards@gmail.com", role="ctb"))

diff --git a/vignettes/introduction.Rmd b/vignettes/introduction.Rmd
@@ -30,7 +30,7 @@ For a more general introduction to the topic of doublets, refer to the [OCSA boo
 
 All methods require as an input either a matrix of counts or a `r Biocpkg("SingleCellExperiment")` containing count data. With the exception of [findDoubletClusters](findDoubletClusters.html), which operates at the level of clusters (and consequently requires clustering information), all methods try to assign each cell a score indicating its likelihood (broadly understood) of being a doublet.
 
-The approaches described here are _complementary_ to doublets identified via cell hashes and SNPs in multiplexed samples: while hashing/genotypes can identify doublets formed by cells of the same type (homotypic doublets) from two samples, which are often nearly undistinguishable from real cells transcriptionally (and hence generally unidentifiable through the present package), it cannot identify doublets made by cells of the same sample, even if they are heterotypic (formed by different cell types). Instead, the methods presented here are primarily geared towards the identification of heterotypic doublets, which for most purposes are also the most critical ones.
+The approaches described here are _complementary_ to doublets identified via cell hashes and SNPs in multiplexed samples: while hashing/genotypes can identify doublets formed by cells of the same type (homotypic doublets) from two samples, which are often nearly indistinguishable from real cells transcriptionally (and hence generally unidentifiable through the present package), it cannot identify doublets made by cells of the same sample, even if they are heterotypic (formed by different cell types). Indeed, recent evidence suggests that doublets are, for instance, a serious and strongly underestimated issue in 10x Flex datasets (see [Howitt et al., 2024](https://www.biorxiv.org/content/10.1101/2024.10.03.616596v2)). The methods presented here are instead primarily geared towards the identification of heterotypic doublets, which for most purposes are also the most critical ones.
 
 <br/>
 

diff --git a/vignettes/scDblFinder.Rmd b/vignettes/scDblFinder.Rmd
@@ -213,6 +213,16 @@ Like other similar tools, scDblFinder focuses on identifying heterotypic doublet
 
 However, should you for some reason also try to identify homotypic doublets with scDblFinder, be sure not to use the cluster-based approach, and to set `removeUnidentifiable=FALSE`. Otherwise, scDblFinder removes artificial doublets likely to be homotypic from training, thereby focusing the task on heterotypic doublets, but at the expense of homotypic ones (which are typically deemed relatively harmless).
 
+### What is a sample exactly? Usage with barcoded and 10X Flex data.
+
+As indicated above, the `samples` argument should be used to indicate different captures.
+For multiplexed samples, this is expected to be the batch of cells processed together, rather than the actual samples.
+
+In highly multiplexed datasets such as those produced by the 10X Flex kit (especially 16-plex), this can cause two kinds of problems.
+First, the whole logic of the Flex approach is that inter-sample doublets can be resolved into separate cells, and while a large number of unresolvable intra-sample doublets will remain (see [Howitt et al., 2024](https://www.biorxiv.org/content/10.1101/2024.10.03.616596v2)), the expected remaining doublet rate will not be the same as for a classical 10X experiment. For this reason, we recommend setting a higher `dbr.sd` in such circumstances, e.g. `dbr.sd=1` to base the thresholding entirely on the classification accuracy.
+
+Another, more practical problem is that, with such kits, the very large number of cells in a single capture might translate into very large computational demands when running `scDblFinder`. To circumvent this problem, one can split a batch of cells into more reasonably sized chunks and process the chunks separately, so long as each chunk is representative of the whole batch in terms of cell heterogeneity.
+
 ### How can I make this reproducible?
 
 Because it relies on the partly random generation of artificial doublets, running scDblFinder multiple times on the same data will yield slightly different results.
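
The chunking advice added to `vignettes/scDblFinder.Rmd` above could be applied roughly as follows. This is only a sketch under stated assumptions, not code from the package: `assign_chunks()` is a hypothetical helper, the `max_cells` cap of 25000 is an arbitrary illustration, and `sce$capture` is an assumed `colData` column holding the capture (batch) of each cell. Random assignment is used here as one simple way to keep each chunk representative of the whole batch; passing the chunk labels through `samples` is likewise an assumption about how one might process the chunks separately.

```r
# Hypothetical helper (not part of scDblFinder): randomly assign the cells of
# each capture to roughly equal-sized chunks of at most `max_cells` cells.
# Random assignment keeps each chunk representative of its capture on average.
assign_chunks <- function(capture, max_cells = 25000L) {
  capture <- as.character(capture)
  chunk <- character(length(capture))
  for (cap in unique(capture)) {
    idx <- which(capture == cap)
    n_chunks <- ceiling(length(idx) / max_cells)
    # Balanced chunk labels, shuffled so that chunk membership is random
    lab <- sample(rep(seq_len(n_chunks), length.out = length(idx)))
    chunk[idx] <- paste(cap, lab, sep = "_chunk")
  }
  chunk
}

# Usage sketch; runs only if scDblFinder is installed and an object `sce`
# (a SingleCellExperiment with an assumed `capture` column) exists.
if (requireNamespace("scDblFinder", quietly = TRUE) && exists("sce")) {
  sce$chunk <- assign_chunks(sce$capture)
  # dbr.sd = 1 follows the recommendation above for Flex-like data
  sce <- scDblFinder::scDblFinder(sce, samples = sce$chunk, dbr.sd = 1)
}
```

Calling `set.seed()` before `assign_chunks()` makes the chunk assignment itself repeatable.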