-
Notifications
You must be signed in to change notification settings - Fork 26
TAMA GO: ORF and NMD predictions
This is the manual for running the ORF and NMD prediction pipelines.
This manual assumes that you have run TAMA Collapse or TAMA Merge and now have a bed file in their respective formats.
NOTE: If you are starting with a GTF/GFF file or a BED12 file that is formatted differently from the TAMA Collapse output, then please see the format convertor tools in TAMA-GO to convert into TAMA BED12 format before running this pipeline.
1. Bedtools
2. Blastp
3. Uniref database files
Use Bedtools to convert your bed file into a fasta file. (Bedtools 2.26.0 or newer)
Example command:
bedtools getfasta -name -split -s -fi ${fasta} -bed ${bed} -fo ${outfile}
fasta - this is the fasta file for the genome assembly
bed - this is the bed12 file for the annotation
outfile - this is the name of the output fasta file
The output fasta headers should look like this:
>G1;G1.1::1:29-2665(+) >G1;G1.2::1:219-3261(+) >G1;G1.3::1:1713-3246(+)
If you are using a newer version of bedtools it may look like this:
>G1;G1.1(+) >G1;G1.2(+) >G1;G1.3(+))
Use tama_orf_seeker.py to get ORF amino acid sequences.
Example command:
python tama_orf_seeker.py -f ${fasta} -o ${outfile}
fasta - this is the fasta file for your transcript sequences that you produced in the previous step
outfile - the name of the output fasta file which contains the ORF amino acid sequences
The output fasta file should have headers like this:
>G1;G1.1::1:29-2665(+):F1:0:719:1:239:239:I >G1;G1.1::1:29-2665(+):F1:132:719:45:239:195:M >G1;G1.1::1:29-2665(+):F3:2:64:1:20:20:F
If you used a newer version of bedtools in the first step it may look like this:
>G1;G1.1(+):F1:0:719:1:239:239:I >G1;G1.1(+):F1:132:719:45:239:195:M >G1;G1.1(+):F3:2:64:1:20:20:F
Note that the next step of Blastp can take a very long time to run. I recommend splitting the AA fasta file into smaller files to run in parallel. You can split using the tama_fasta_splitter.py tool. For more information: https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Split-Files
After running Blastp on the individual split fasta files you can then just concatenate all the Blastp result files before running the next step.
Use blastp and a local Uniref database to find protein hits. (BLAST 2.2.31+ or newer)
Example command:
blastp -evalue 1e-10 -num_threads 16 -db ${uniref} -query ${orf}
uniref - Uniref/Uniprot fasta file
orf - ORF fasta file from previous step
Note: This parameter "-num_threads 16" specifies using 16 threads. Please adjust for your system. Also make sure to use the default output.
The individual hit entries in the output file should look like this:
>E1C721 Uncharacterized protein OS=Gallus gallus GN=LOC425783 PE=4 SV=2 Length=265 Score = 382 bits (982), Expect = 4e-136, Method: Compositional matrix adjust. Identities = 203/235 (86%), Positives = 209/235 (89%), Gaps = 12/235 (5%) Query 17 KSVLAKKSAPPAPLCPQPGPSLPLSPHTMGAVPHLHGAKGERERRSPSPPREATAREGDE 76 + VLAKKSAPPAPLCPQPGPSLPLSPHTMGAVPHLHGAKGERERRSPSPPREATAREGDE Sbjct 31 REVLAKKSAPPAPLCPQPGPSLPLSPHTMGAVPHLHGAKGERERRSPSPPREATAREGDE 90 Query 77 ERQSQRGSGCSELRQNRHHVLCVA---VLCILVLALVAVIVLQRLSCLPPPPFSHVW--- 130 ERQSQRGSGCS+LRQNR VLCVA VLCILV +VAVIVLQR SC P PPFSHV+ Sbjct 91 ERQSQRGSGCSKLRQNRRCVLCVALCAVLCILVSVVVAVIVLQRPSCPPRPPFSHVYPNT 150 Query 131 ------KCYYFSYTKSDWNSSREHCHRLGASLATVDTEEEMEFMRGYRRPADRWIGLHRA 184 KCYYFS TKSDW SSRE CHRLGASLATVD EEEMEFMRGYRRPADRWIGLHRA Sbjct 151 WVGFHGKCYYFSDTKSDWKSSREQCHRLGASLATVDREEEMEFMRGYRRPADRWIGLHRA 210 Query 185 EGDEHWTWADGSAFSNWFELRGGGQCAYLHGDWINSTLCHSEKFWVCSRADSYVL 239 EGDEHWTWADGSAFSNWF+ +GGGQC YLHGDWINSTLCHSEKFWVCSRADSYVL Sbjct 211 EGDEHWTWADGSAFSNWFKPQGGGQCVYLHGDWINSTLCHSEKFWVCSRADSYVL 265
Use tama_orf_blastp_parser.py to parse the results from Blastp.
Example command:
python tama_orf_blastp_parser.py -b ${blastp} -o ${outfile}
blastp - output from Blastp in previous step
outfile - name of output file from parsing
If you did a BlastP on an Ensembl CDS peptide sequence database, you can add "-f ensembl" to the command line to allow for parsing of the Ensembl ID's.
Example:
python tama_orf_blastp_parser.py -b ${blastp} -o ${outfile} -f ensembl
The output should look like this:
G1;G1.1 F1 0 719 1 239 R4GL70 90_match 99 32 G1;G1.2 F1 369 596 124 198 R4GIW7 50_match 77 66 G1;G1.3 F1 0 386 1 128 H9KZN5 90_match 91 65
Use tama_cds_regions_bed_add.py to create the annotation bed file with CDS regions added.
Example command:
python tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outfile}
orf - The output file from the previous step
bed - The bed annotation file that you started with in this pipeline
fasta - The transcript input fasta file from the first step (the input fasta)
outfile - The output bed file with CDS regions
Note: If you want to include the stop codon in the CDS region add this to the command line:
-s include_stop
Including the stop codon in the CDS region is a convention that both UCSC and NCBI use. However, the default for TAMA is to exclude the stop codon since it does not convert into an amino acid and would thus trip up tools that do codon translation.
-d SJ_dist_threshold
Distance from last splice junction to call NMD (default 50bp).
The output of step 5 is a bed file with CDS regions defined by column 7 and 8. Column 7 is the start of the CDS region and column 8 is the end. The 4th column contains the gene ID and transcript ID along with additional information.
This is the format of the 4th column:
gene_id;trans_id;gene_name;trans_length_flag;blastp_match_flag;nmd_flag;frame
This is an example of a line:
1 219 3261 G1;G1.2;R4GIW7;full_length;50_match;prot_ok;F1 40 + 2154 2662 200,0,255 5 98,93,181,107,714 0,1457,1757,2132,2328
gene_id -> G1
trans_id -> G1.2
gene_name -> R4GIW7
This is the name of the gene in the protein database that had a hit with Blastp.
trans_length_flag -> full_length
Either "full_length" or "5prime_degrade". "5prime_degrade" means that the start codon was not found within the ORF and thus the transcript looks like it is incomplete on the 5' end. "full_length" means that the complete ORF was contained within the transcript.
blastp_match_flag -> 50_match
Possible flags include:
full_match - Complete match of ORF and protein.
90_match - >90% match.
50_match - >50% match.
bad_match - <50% match.
no_hit - No matches.
no_orf - No ORF predicted.
nmd_flag -> prot_ok
This flag is just to identify potential for NMD but does not give information on protein coding potential.
Possible flags include:
prot_ok - Looks like a protein coding transcript, not an NMD.
NMD# - So this could be NMD1, NMD2, NMD3, and so forth. NMD indicates that this is an NMD and the number indicates the number of splice junctions between the stop codon and the last exon.
missing - This indicates that there were missing nucleotides in the genome assembly which means ORF prediction could not be done.
frame -> F1
This indicates the frame of the ORF with respect to the start of the transcript.
If you used an Ensembl CDS peptide sequence database you can make the Ensembl ID's the primary ID's using "tama_format_id_filter.py". Instructions for use can be found here: https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Formatting