We present TRAPeS (TCR Reconstruction Algorithm for Paired-End Single-cell), a software for reconstruction of T cell receptors (TCR) using short, paired-end single-cell RNA-sequencing.
TRAPeS reconstruct the TCR in 3 steps: For each chain, it first identify the V and J segments by searching for paired reads with one read mapping to the V segment and its mate mapping to the J segment. Then, a set of putative CDR3-originating reads are identified as the set of unmapped reads whose mates map to the V,J and C segments. Finally, an iterative dynamic programming algorithm is used to reconstruct the CDR3 region with the putative CDR3 reads.
For more information, see our paper in Nucleic Acids Research here or our bioRxiv preprint here
TRAPeS is written in Python and C++, and currently works on Linux. TRAPeS require the following python libraries:
- biopython
- pysam
In addition, TRAPeS requires the following software:
- bowtie2
- RSEM
- samtools
TRAPeS takes as input mapped and unmapped files of reads after genomic alignment (e.g. using TopHat).
TRAPeS assumes a certain folder structure: It assumes that each cell has its own folder, and all of those folders are under one path. Also, it assumes that each cell folder has identical subfolder structure.
To run TRAPeS, simply run:
python trapes.py [options]
To display help:
python trapes.py -h
To test that TRAPeS is installed correctly, run the following command:
python /path-to-TRAPeS/trapes.py -genome hg38 -path /path-to-TRAPeS/Example/proc_data/ -sumF /path-to-TRAPeS/Example/TRAPeS_out/TCR.out -output TRAPeS_out/TCR.out -score 15 -unmapped unmapped.bam -bam sorted.bam
This should produce the following files:
/path-to-TRAPeS/Example/TRAPeS_out/TCR.out.summary.txt
/path-to-TRAPeS/Example/TRAPeS_out/TCR.out.TCRs.txt
which should be similar to the files:
/path-to-TRAPeS/Example/TRAPeS_out/example.output.summary.txt
/path-to-TRAPeS/Example/TRAPeS_out/example.output.TCRs.txt
Please note: in order to download the large example bam files, use git lfs when cloning the repository.
###Options when running TRAPeS
Input files:
-path : The folder with all the single cell samples. Assumes every subfolder under this path is a folder of a single cell sequencing results.
-bam : The location of the sorted mapped bam file relative to the single cell folder. For example, if under each single cell folder there is a subfolder named “TopHat_Output” and the mapped file is under that folder and named mapped.bam, the command should read “–bam TopHat_Output/mapped.bam”.
-unmapped : the location of the bam file containing the unmapped reads. Similar to the –bam tag, it is relative to the single cell folder.
Output files:
-output : Output prefix for the files generated by TRAPeS in every single cell folder, relative to the single cell folder (e.g. TopHat_Output/TRAPeS/TCR.out). Can include a subdirectory (TRAPeS will create that subdirectory if it does not yet exists).
-sumF : Prefix for the output summary files of the entire path (summary of all single cells together).
Parameters for the TCR reconstruction
-score : The threshold score for the alignment of a read to the V or J segment. Any putative CDR3-originating read that its alignment to the V or J segments passes this score will be used for reconstruction. Default is 15, but should be changed based on sequencing quality and length. In the next version the default will be computed based on the length of the input reads
-overlap : Minimum number of bases that the extended V and J segments should overlap in order to stop the reconstruction. Default is 10.
-iterations : Maximum number of times to extend the V and J segments. If after that number of iterations the V and J segments still do not overlap the reconstruction is determined as unsuccessful. Default is 20.
-bases : Number of bases from the V and the J segments that will be used as the initial template for the reconstruction. Default is min(length(V), length(J)).
-lowQ : By default, the putative CDR3-originating reads are identified as unmapped reads whose mate is aligned to the V/J/C segments. However, those reads can by chance be mapped to other places in the genome (usually low quality mapping). By including the –lowQ tag the set of CDR3-origintating reads will also include the reads that map to other places in the genome. We recommend using this tag.
-top: Rank all the V-J pairs based on the number of mapped reads, and reconstruct only the top X number of V-J pairs. Default: reconstruct all V-J pairs. Recommended for very deep libraries or libraries when there are many possible V-J pairing.
-byExp: Must be used along with the top parameter. Rank all the V-J pairs based on the number of mapped reads, but instead of reconstructing only the top X number of V-J pairs, randomly choose 2 V-J pairs with the same rank (same number of mapped reads) and reconstruct them. Then TRAPeS will move on to the next rank until the number of V-J pairs specified with top has been reconstructed. Default: off.
-oneSide: Add this parameter to also search for productive reconstructions only from the extended V segment (in case of no overlap between the extended V and extended J segments). Default: off.
-readOverlap: Consider only reads with that number of bases overlapping V/J/C segments as mapped reads. Default: 1. Note: This parameter is still being tested
Paths to other software:
-bowtie2 : Path to bowtie2. If not used assumes bowtie2 is in the default PATH
-rsem: Path to RSEM. If not used assumes RSEM is in the default PATH
-samtools: Path to samtools. If not used assumes samtools is in the default PATH
Other parameters:
-singleCell : Add this tag if you are only running TRAPeS on a single cell (not a library of many single cells). Currently not active
-genome : The genome used for genomic alignment. By default, TRAPeS only supports hg19, hg38 and mm10, and for mm10 with NCBI chromosome naming use mm10_ncbi. However, TRAPeS can be used for any genome or any organism, as long as the genomic annotations for the V/J/C segments are available. In order to run TRAPeS on genomes besides mm10, hg19 or hg38, the user must create the relevant annotation files (see here for more information). In addition, this will require to add the '-Aminus'/'-Bminus' parameters if necessary.
-Aminus: Only used for user supplied genomes. Add this flag if the (majority) of V and J annotations of the alpha chain are on the minus strand.
-Bminus: Only used for user supplied genomes. Add this flag if the (majority) of V and J annotations of the beta chain are on the minus strand.
-strand : Strand orientation of the reads, options are [minus, plus, none]. For transcripts on the positive strand, to which strand does the rightmost (in genomic coordinates) mate of the read map to. Default is minus.
- sumF.summary.txt : Summary of the reconstruction status in each cell (successful/unsuccessful).
- sumF.TCRs.txt : A list of all reconstructions (productive, unproductive and partial) of all cells.
In addition, in each single cell folder you can find the following output files:
- output.\[alpha/beta\].junctions.txt : The set of V-J pairs found (before reconstruction)
- output.reconstructed.junctions.\[alpha/beta\].fa : the set of reconstructed junction. If reconstruction was unsuccessful, the partial V and J reconstructions will be separated by N’s.
- output.\[alpha/beta\].mapped.and.unammed.fa : the set of the putative CDR3-originating reads used for the reconstruction.
- output.\[alpha/beta\].\[R1/R2\].fa : Set of paired-end reads that are aligned to the reconstructed TCRs in order to quantify the expression of each TCR using RSEM.
- output.\[alpha/beta\].rsem.out* : The output files created by RSEM.
- output.\[alpha/beta\].full.TCRs.fa : Fasta file with the full TCR sequences.
- output.\[alpha/beta\].full.TCRs.bestIso.fa : Fasta file with the full sequences of the TCRs, after choosing only the highly expressed isoform in case of more than one isoform for the V/J/C segments.
- output.summary.txt : Summary of all the reconstructed chains in this cell.
For any questions or comments, please email Shaked Afik, safik@berkeley.edu.