scTagger matches cellular barcodes between the short and long reads of single-cell RNA-seq experiments. This makes it possible to relate gene expression (from the short reads) to RNA splicing (from the long reads) at the level of individual cells.
scTagger is available as a Conda package:
conda create -n sctagger-env -c bioconda sctagger
conda activate sctagger-env
scTagger.py -h
We provide a simple Snakefile alongside a config.yaml file that runs the three stages of scTagger as well as Cell Ranger (this assumes Cell Ranger is on your PATH).
scTagger is a single Python script that contains the different functions needed to match long-read and short-read barcodes.
The whole pipeline consists of three steps, each of which can be run separately.
The first step of the scTagger pipeline extracts, from each long read, the segment where a barcode is most likely to be found. To run this step, use the following command:
./scTagger.py extract_lr_bc -r "path/to/long/read/fastq" -o "path/to/output/file" -p "path/to/output/plots"
Arguments
- -r: Space-separated paths to reads in FASTQ format
- -g: Space-separated ranges where the SR adapter should be found on the LRs (Optional, Default: Detect from data)
- -z: Indicate input is gzipped (Optional, Default: Assume input is gzipped if it ends with ".gz")
- -t: Number of threads (Optional, Default: 1)
- -sa: Short-read adapter (Optional, Default: CTACACGACGCTCTTCCGATCT)
- --num-bp-afte: Number of bases after the end of the SR adapter alignment to generate (Optional, Default: 20)
- -o: Path to output file
- -p: Path to plot file (Optional, Default: No plotting)
Inputs
- A list of FASTQ files of long reads
Outputs
- A TSV file:
  - First column is the read ID
  - Second column is the best edit distance to the short-read adapter
  - Third column is the position in the long read where the adapter match starts
  - Fourth column is the extracted long-read segment
- A plot of the optimal alignment locations of the short-read adapter on the long reads
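The idea behind this step can be illustrated with a small, self-contained sketch. It is not scTagger's implementation: it uses a plain dynamic-programming alignment, ignores read orientation and the search ranges, and reports the end of the adapter match rather than its start. The function names `align_adapter` and `extract_segment`, the parameter `num_bp_after`, and the example read are all hypothetical. The sketch finds where the short-read adapter aligns best inside a long read and returns the bases that follow the alignment.

```python
def align_adapter(adapter, read):
    """Best edit distance of `adapter` against any substring of `read`.

    Returns (distance, end), where `end` is the read position just past
    the best-matching substring (the first such position on ties).
    """
    prev = [0] * (len(read) + 1)  # the alignment may start anywhere in the read
    for i, a in enumerate(adapter, start=1):
        curr = [i]  # i adapter bases aligned against an empty read prefix
        for j, r in enumerate(read, start=1):
            curr.append(min(
                prev[j - 1] + (a != r),  # match or substitution
                prev[j] + 1,             # adapter base missing from the read
                curr[j - 1] + 1,         # extra base in the read
            ))
        prev = curr
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end


def extract_segment(adapter, read, num_bp_after=20):
    """Return (distance, end, up to `num_bp_after` bases after the adapter)."""
    distance, end = align_adapter(adapter, read)
    return distance, end, read[end:end + num_bp_after]


adapter = "CTACACGACGCTCTTCCGATCT"  # default short-read adapter listed above
read = "GGTT" + adapter + "ACGTACGTACGTACGTTTTT"  # made-up read for illustration
print(extract_segment(adapter, read, num_bp_after=16))
```

Taking a few more bases than the barcode length leaves room for insertions and deletions near the adapter boundary, which is presumably why the default segment length (20) exceeds the default barcode length (16).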
The second step extracts the most frequent short-read barcodes, which together cover most of the reads:
./scTagger.py extract_sr_bc -i "path/to/bam/file" -o "path/to/output/file" -p "path/to/output/plot"
Arguments
- -i: Input file
- -o: Path to output file
- -p: Path to plot file (Optional, Default: No plotting)
- --thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
- --step-size: Number of barcodes processed at a time and whose sum is checked against the threshold (Optional, Default: 1000)
- --max-barcode-cnt: Maximum number of barcodes to keep (Optional, Default: 25000)
Input
- A BAM file of short-read data
Output
- A TSV file:
  - First column is the barcode
  - Second column is the number of appearances of the barcode
- A cumulative plot of SR coverage with batches of 1,000 barcodes
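One way to read the three selection arguments is sketched below. This is an interpretation of the argument descriptions, not the tool's code: it assumes the threshold is a fraction of all reads (so 0.005 means 0.5%), that barcodes are ranked by count, and that selection stops at the first batch whose share falls below the threshold. The function name `select_top_barcodes` and the toy counts are hypothetical.

```python
from collections import Counter


def select_top_barcodes(counts, thresh=0.005, step_size=1000, max_barcode_cnt=25000):
    """Keep the most frequent barcodes, one batch of `step_size` at a time.

    A batch is kept only while its share of all reads is at least `thresh`
    (assumed to be a fraction) and the cap has not been reached.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    # Rank by descending count; break ties alphabetically for reproducibility.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    selected = []
    for start in range(0, len(ranked), step_size):
        batch = ranked[start:start + step_size]
        share = sum(n for _, n in batch) / total
        if share < thresh or len(selected) >= max_barcode_cnt:
            break
        selected.extend(batch)
    return selected[:max_barcode_cnt]


# Toy counts; real runs would have many thousands of barcodes.
toy_counts = Counter({"A": 50, "B": 30, "C": 10, "D": 5, "E": 3, "F": 2})
print(select_top_barcodes(toy_counts, thresh=0.1, step_size=2))
```

In the toy example the third batch (E and F) accounts for only 5% of the reads, which is below the 10% threshold, so selection stops after the first four barcodes.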
This is an alternative to the second step that avoids using the short reads altogether and instead builds a whitelist of cellular barcodes directly from the long reads.
This is done by looking for exact matches of barcodes from the 10x Chromium cellular barcode list in the long-read barcode segments.
The barcodes are sorted by frequency, and the most frequent barcodes are kept using the same strategy as the extract_sr_bc module.
./scTagger.py extract_sr_bc_from_lr -i "path/to/long-read-segments" -wl "/path/to/10x-barcode-list.txt" -o "path/to/output.txt"
Arguments
- -i: Input TSV file containing the long-read segments generated by the extract_lr_bc step
- -o: Path to output file
- -wl: Path to the 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz). Accepts both .txt.gz and .txt files.
- --thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
- --step-size: Number of barcodes processed at a time and whose sum is checked against the threshold (Optional, Default: 1000)
- --max-barcode-cnt: Maximum number of barcodes to keep (Optional, Default: 25000)
Input
- The output file of the extract_lr_bc step
- The 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz)
Output
- A TSV file:
  - First column is the barcode
  - Second column is the number of appearances of the barcode
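The exact-match counting described for this module can be sketched as follows. This is an illustration under assumptions, not the tool's code: it assumes 16-base barcodes, counts each barcode at most once per segment, and does not consider reverse complements. The function name `count_whitelist_hits` and the toy data are hypothetical.

```python
from collections import Counter


def count_whitelist_hits(segments, whitelist, barcode_length=16):
    """Count, per whitelisted barcode, the segments that contain it exactly."""
    hits = Counter()
    for segment in segments:
        # Every substring of barcode length in this segment.
        kmers = {segment[i:i + barcode_length]
                 for i in range(len(segment) - barcode_length + 1)}
        # Each whitelisted barcode present in the segment gets one count.
        hits.update(kmers & whitelist)
    return hits


# Toy whitelist and segments; a real whitelist holds millions of barcodes.
toy_whitelist = {"AAAACCCCGGGGTTTT", "TTTTGGGGCCCCAAAA"}
toy_segments = [
    "TTAAAACCCCGGGGTTTTAA",  # contains the first barcode
    "AAAACCCCGGGGTTTT",      # is exactly the first barcode
    "GGGG",                  # shorter than a barcode
    "ACGTACGTACGTACGTACGT",  # no whitelisted barcode
]
print(count_whitelist_hits(toy_segments, toy_whitelist).most_common())
```

Holding the whitelist in a set keeps each lookup constant-time, so the cost is driven by the number of segments rather than the size of the whitelist.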
The last step matches the long-read segments against the selected short-read barcodes:
./scTagger.py match_trie -lr "path/to/output/extract/long-read/segment" -sr "path/to/output/extract/top/short-read" -o "path/to/output/file" -t "number of threads"
Arguments
- -lr: Long-read segments TSV file
- -sr: Short-read barcode list TSV file
- -mr: Maximum number of errors allowed for barcode matching (Optional, Default: 2)
- -m: Maximum number of GB of RAM to be used (Optional, Default: 16.0)
- -bl: Length of barcodes (Optional, Default: 16)
- -t: Number of threads to use for searching (Optional, Default: 16)
- -p: Path of plot file
- -o: Path to output file. Output file is gzipped
Inputs
- The long-read segments file from the first step and the selected top barcodes file from the second step
Outputs
- A TSV file:
  - First column is the read ID
  - Second column is the minimum edit distance
  - Third column is the number of short-read barcodes that match the long read
  - Fourth column is the long-read segment
  - Fifth column is a list of all short-read barcodes with the minimum edit distance
- A bar plot that shows the number of long reads by the minimum edit distance of their matched barcode
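The matching idea can be sketched with a small trie search under a bounded edit distance. This is a minimal illustration and may differ from how match_trie is actually implemented: it handles one segment at a time, uses a nested-dict trie, and ignores the memory and threading options. The names `build_trie` and `match_segment` and the toy barcodes are hypothetical.

```python
def build_trie(barcodes):
    """Nested-dict trie; the key "$" stores the barcode that ends at a node."""
    root = {}
    for bc in barcodes:
        node = root
        for ch in bc:
            node = node.setdefault(ch, {})
        node["$"] = bc
    return root


def match_segment(trie, segment, max_error=2):
    """Return (min_distance, sorted barcodes at that distance) or (None, []).

    A barcode may align to any substring of `segment`. Each trie edge adds
    one row of the edit-distance table, shared by all barcodes below it.
    """
    best = {"dist": max_error + 1, "barcodes": []}

    def visit(node, prev_row):
        for ch, child in node.items():
            if ch == "$":  # a complete barcode ends here
                dist = min(prev_row)
                if dist < best["dist"]:
                    best["dist"], best["barcodes"] = dist, [child]
                elif dist == best["dist"] and dist <= max_error:
                    best["barcodes"].append(child)
                continue
            row = [prev_row[0] + 1]
            for j in range(1, len(segment) + 1):
                cost = 0 if segment[j - 1] == ch else 1
                row.append(min(prev_row[j - 1] + cost,  # match or substitution
                               prev_row[j] + 1,         # barcode base missing
                               row[j - 1] + 1))         # extra segment base
            if min(row) <= max_error:  # row minima never decrease, so prune
                visit(child, row)

    visit(trie, [0] * (len(segment) + 1))  # free start anywhere in the segment
    if not best["barcodes"]:
        return None, []
    return best["dist"], sorted(best["barcodes"])


# Toy 8-base barcodes for readability; the default barcode length is 16.
toy_trie = build_trie(["AAAACCCC", "AAAACCCG", "GGGGTTTT"])
print(match_segment(toy_trie, "TTAAAACCCATT", max_error=1))
```

The returned pair corresponds to the second and fifth output columns, and the length of the barcode list gives the third. Because barcodes with a common prefix share table rows, the search avoids recomputing the alignment for every barcode separately.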
scTagger was first accepted to RECOMB-seq 2022 and is now published in iScience:
Ghazal Ebrahimi, Baraa Orabi, Meghan Robinson, Cedric Chauve, Ryan Flannigan, and Faraz Hach. "Fast and accurate matching of cellular barcodes across short-and long-reads of single-cell RNA-seq experiments." iScience (2022). DOI:10.1016/j.isci.2022.104530
Please check the paper branch of this repository for the archived paper experiments and implementation.