-
Notifications
You must be signed in to change notification settings - Fork 29
Find Barcodes in CCS Reads
For officially supported barcoding analysis in SMRT Analysis 1.4, please refer to the instructions found at the PacificBioSciences barcoding wiki.
GOAL: Provide R+D code for identifying barcodes in PacBio® CCS sequencing reads along with a working example.
This code, along with proper flanking-insert barcoded sample products and PacBio CCS sequencing data, will label each read with the most-likely barcode, give the barcode score, and show where in the read the barcode occurred (so you might trim it out if desired in downstream analysis).
PacBioBarcodeCCSDist.zip: contains the barcode sequences, Python scripts and examples
This software is executed using the calling sequence:
PacBioBarcodeIDCCS.py "CCSReads.fasta" "barcodeIdentities.fasta" "outputDirectory"
where
- "CCSReads.fasta" is the file of CCS reads from the PacBio sequencer
- "barcodeIdentities.fasta" are the identities of the barcodes to look for
- "outputDirectory" is where all the results are stored
This software assumes flanking-insert barcodes. This means that the sample has been prepped such that every PacBio CCS read has the form (in either forward or reverse-complement): (BarcodeLEFT + sequenceOfInterest + BarcodeRIGHT) See our barcoding technical report for sample prep methods to achieve this. The barcode identites are given in a fasta file with a simple naming convention to show pairing: fasta ids "F+barcodeid" and "R+barcodeid" indicate the LEFT and RIGHT barcodes respectively. See below for a worked example.
The result of processing is a file: "outputDirectory/all.RESULT.reduceMax". This gives for all reads that had identifiable barcode pairs: the fasta id of the read, the id of the barcode pair, the score of the most-likely barcode hit, and the location in the reads where the paired barcodes were identified. See below for a worked example. NOTE: this software assumes barcoded-flanked inserts and CCS reads. Methods for barcode identification in raw PacBio reads (not CCS) or for non-flanking barcode topologies are not covered here explicity.
This software is distributed as a simple collection of Python scripts in the bin/ directory. The code depends on the HMMer sequencing modeling package, http://hmmer.janelia.org/ Version 2 of HMMer is need. The newer 3.0 version is NOT compatible. Debian maintains a stable hmmer package (2.3.2) and the source is available: http://hmmer.janelia.org/software/archive After HMMer installation, where hmmbuild, hmmsearch of the HMMer package are available on the command line, the code can be used.
Here we run through a toy example. I am using the barcodes: barcode96searchpair8.fa . This is a subset of barcode96searchpair.fa Here is the first barcode pair in that file:
>F0 gcgctctgtgtgcagc >R0 agagtactacatatga
This indicates that all properly barcoded CCS reads containing this pair should have the form: (gcgctctgtgtgcagc + sequenceOfInterest + agagtactacatatga) I constructed several test sequences that use this barcode and others as well as noisy indels: testseqs.fasta
>fwExact0 gcgctctgtgtgcagcACGTTGCAagagtactacatatga
>rcExact1 gcgatctatgcacacgTGCAACGTtagtgtcgactcatga
>fwMut2 tatctatcgtacacgcACGTTGCAatgtatctcgctgca
>rcMut3 gactctgctcgagtcTGCAACGTtcagatgtcagtgtgat
>noise aaccggttaaccggttACGTTGCAaaccggttaaccggtt
Exectute barcode identifcation: PacBioBarcodeIDCCS.py testseqs.fasta barcode96searchpair8.fa barcodeOutput This takes a bit less than 8 seconds on my machine. Full PacBio chips of CCS reads take a couple minutes of processing (depending on CCS chip yield). Now all of your results are in the barcodeOutput/ directory. I've also included a pre-computed version in barcodeOutputKEEP The main result is in barcodeOutput/all.RESULT.reduceMax:
fastaid barcode shortid scoreAll scoreRight leftExtent rightExtent fwExact0 F0 s0 38.800000 19.400000 1:16 25:40 rcExact1 [revcomp] F1 s1 38.800000 19.400000 1:16 25:40 fwMut2 F2 s2 29.600000 14.100000 1:16 25:39 rcMut3 [revcomp] F3 s3 32.300000 16.800000 1:17 26:40
The first line indicates that the first sequence, "fwExact0", contains the barcode pair F0 with a total barcode score of 38.8 bits with the LEFT barcode occupying bases 1:16 (one-based, inclusive indexing) and the RIGHT barcode occupying bases 25:40. We can dive into the results to see the alignments. Looking at the first result line, we see that the hit occured in the forward sense. The LEFT barcode alignment is contained in barcodeOutput/fw/bps.F0.LEFT.hmm.hmmsearch Searching for the "s0" shortid shows:
s0: domain 1 of 1, from 1 to 16: score 19.4, E = 7.3e-06 ->GCGCTCTGTGTGCAGC<- GCGCTCTGTGTGCAGC s0 1 GCGCTCTGTGTGCAGC 16
The RIGHT is in bps.R0.RIGHT.hmm.hmmsearch:
s0: domain 1 of 1, from 25 to 40: score 19.4, E = 7.2e-06 ->AGAGTACTACATATGA<- AGAGTACTACATATGA s0 25 AGAGTACTACATATGA 40
Both LEFT and RIGHT are perfect. The last result in the table occurs on the reverse-complement sense ("[revcomp]"). Note that all results are reported on the reverse-complemented sequence. The results are located in barcodeOutput/rc. bps.F3.LEFT.hmm.hmmsearch:
s3: domain 1 of 1, from 1 to 17: score 16.8, E = 4.3e-05 ->ATCACACTG.CATCTGA<- ATCACACTG CATCTGA s3 1 ATCACACTGaCATCTGA 17
bps.R3.RIGHT.hmm.hmmsearch:
s3: domain 1 of 1, from 26 to 40: score 15.5, E = 0.00011 ->GACTCGACGCAGAGTC<- GACTCGA GCAGAGTC s3 26 GACTCGA-GCAGAGTC 40
This shows a one-base insertion and a one-base deletion as designed.
We have provided python code to identify barcodes in PacBio CCS reads and have run through a complete example.
Visit the PacBio Developer's Network Website for the most up-to-date links to downloads, documentation and more. Terms of Use | Trademarks | Contact Us