Skip to content

TAMA GO: ORF and NMD predictions

GenomeRIK edited this page Nov 3, 2021 · 29 revisions

This is the manual for running the ORF and NMD prediction pipelines.

This manual assumes that you have run TAMA Collapse or TAMA Merge and now have a bed file in their respective formats.

NOTE: If you are starting with a GTF/GFF file or a BED12 file that is formatted differently from the TAMA Collapse output, then please see the format convertor tools in TAMA-GO to convert into TAMA BED12 format before running this pipeline.

Table of Contents

Other software and data required for this pipeline:

1. Bedtools

2. Blastp

3. Uniref database files

Pipeline:

1. Converting the bed files into fasta files

Use Bedtools to convert your bed file into a fasta file. (Bedtools 2.26.0 or newer)

Example command:

bedtools getfasta -name -split -s -fi ${fasta} -bed ${bed} -fo ${outfile}

fasta - this is the fasta file for the genome assembly

bed - this is the bed12 file for the annotation

outfile - this is the name of the output fasta file

The output fasta headers should look like this:

  >G1;G1.1::1:29-2665(+)
  >G1;G1.2::1:219-3261(+)
  >G1;G1.3::1:1713-3246(+)

If you are using a newer version of bedtools it may look like this:

  >G1;G1.1(+)
  >G1;G1.2(+)
  >G1;G1.3(+))

2. Getting open read frames (ORF) from the transcript sequences

Use tama_orf_seeker.py to get ORF amino acid sequences.

Example command:

python tama_orf_seeker.py -f ${fasta} -o ${outfile}

fasta - this is the fasta file for your transcript sequences that you produced in the previous step

outfile - the name of the output fasta file which contains the ORF amino acid sequences

The output fasta file should have headers like this:

  >G1;G1.1::1:29-2665(+):F1:0:719:1:239:239:I
  >G1;G1.1::1:29-2665(+):F1:132:719:45:239:195:M
  >G1;G1.1::1:29-2665(+):F3:2:64:1:20:20:F

If you used a newer version of bedtools in the first step it may look like this:

  >G1;G1.1(+):F1:0:719:1:239:239:I
  >G1;G1.1(+):F1:132:719:45:239:195:M
  >G1;G1.1(+):F3:2:64:1:20:20:F

Note that the next step of Blastp can take a very long time to run. I recommend splitting the AA fasta file into smaller files to run in parallel. You can split using the tama_fasta_splitter.py tool. For more information: https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Split-Files

After running Blastp on the individual split fasta files you can then just concatenate all the Blastp result files before running the next step.

3. Blasting the amino acid sequences against the Uniprot/Uniref protein database

Use blastp and a local Uniref database to find protein hits. (BLAST 2.2.31+ or newer)

Example command:

blastp -evalue 1e-10 -num_threads 16 -db ${uniref} -query ${orf}

uniref - Uniref/Uniprot fasta file

orf - ORF fasta file from previous step

Note: This parameter "-num_threads 16" specifies using 16 threads. Please adjust for your system. Also make sure to use the default output.

The individual hit entries in the output file should look like this:

  >E1C721 Uncharacterized protein OS=Gallus gallus GN=LOC425783 PE=4 SV=2
  Length=265
  
   Score = 382 bits (982),  Expect = 4e-136, Method: Compositional matrix adjust.
   Identities = 203/235 (86%), Positives = 209/235 (89%), Gaps = 12/235 (5%)
    
  Query  17   KSVLAKKSAPPAPLCPQPGPSLPLSPHTMGAVPHLHGAKGERERRSPSPPREATAREGDE  76
              + VLAKKSAPPAPLCPQPGPSLPLSPHTMGAVPHLHGAKGERERRSPSPPREATAREGDE
  Sbjct  31   REVLAKKSAPPAPLCPQPGPSLPLSPHTMGAVPHLHGAKGERERRSPSPPREATAREGDE  90
  
  Query  77   ERQSQRGSGCSELRQNRHHVLCVA---VLCILVLALVAVIVLQRLSCLPPPPFSHVW---  130
              ERQSQRGSGCS+LRQNR  VLCVA   VLCILV  +VAVIVLQR SC P PPFSHV+
  Sbjct  91   ERQSQRGSGCSKLRQNRRCVLCVALCAVLCILVSVVVAVIVLQRPSCPPRPPFSHVYPNT  150
  
  Query  131  ------KCYYFSYTKSDWNSSREHCHRLGASLATVDTEEEMEFMRGYRRPADRWIGLHRA  184
                    KCYYFS TKSDW SSRE CHRLGASLATVD EEEMEFMRGYRRPADRWIGLHRA
  Sbjct  151  WVGFHGKCYYFSDTKSDWKSSREQCHRLGASLATVDREEEMEFMRGYRRPADRWIGLHRA  210
  
  Query  185  EGDEHWTWADGSAFSNWFELRGGGQCAYLHGDWINSTLCHSEKFWVCSRADSYVL  239
              EGDEHWTWADGSAFSNWF+ +GGGQC YLHGDWINSTLCHSEKFWVCSRADSYVL
  Sbjct  211  EGDEHWTWADGSAFSNWFKPQGGGQCVYLHGDWINSTLCHSEKFWVCSRADSYVL  265

4. Parsing the Blastp output file for top hits

Use tama_orf_blastp_parser.py to parse the results from Blastp.

Example command:

python tama_orf_blastp_parser.py -b ${blastp} -o ${outfile}

blastp - output from Blastp in previous step

outfile - name of output file from parsing

If you did a BlastP on an Ensembl CDS peptide sequence database, you can add "-f ensembl" to the command line to allow for parsing of the Ensembl ID's.

Example:

python tama_orf_blastp_parser.py -b ${blastp} -o ${outfile} -f ensembl

The output should look like this:

  G1;G1.1 F1      0       719     1       239     R4GL70  90_match        99      32
  G1;G1.2 F1      369     596     124     198     R4GIW7  50_match        77      66
  G1;G1.3 F1      0       386     1       128     H9KZN5  90_match        91      65

5. Create new bed file with CDS regions

Use tama_cds_regions_bed_add.py to create the annotation bed file with CDS regions added.

Example command:

python tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outfile}

orf - The output file from the previous step

bed - The bed annotation file that you started with in this pipeline

fasta - The transcript input fasta file from the first step (the input fasta)

outfile - The output bed file with CDS regions

Note: If you want to include the stop codon in the CDS region add this to the command line:

 -s include_stop

Including the stop codon in the CDS region is a convention that both UCSC and NCBI use. However, the default for TAMA is to exclude the stop codon since it does not convert into an amino acid and would thus trip up tools that do codon translation.

 -d SJ_dist_threshold

Distance from last splice junction to call NMD (default 50bp).

Explanation of Output:

The output of step 5 is a bed file with CDS regions defined by column 7 and 8. Column 7 is the start of the CDS region and column 8 is the end. The 4th column contains the gene ID and transcript ID along with additional information.

This is the format of the 4th column:

gene_id;trans_id;gene_name;trans_length_flag;blastp_match_flag;nmd_flag;frame

This is an example of a line:

  1       219     3261    G1;G1.2;R4GIW7;full_length;50_match;prot_ok;F1  40      +       2154    2662    200,0,255       5       98,93,181,107,714       0,1457,1757,2132,2328

gene_id -> G1

trans_id -> G1.2

gene_name -> R4GIW7

This is the name of the gene in the protein database that had a hit with Blastp.

trans_length_flag -> full_length

Either "full_length" or "5prime_degrade". "5prime_degrade" means that the start codon was not found within the ORF and thus the transcript looks like it is incomplete on the 5' end. "full_length" means that the complete ORF was contained within the transcript.

blastp_match_flag -> 50_match

Possible flags include:

full_match - Complete match of ORF and protein.

90_match - >90% match.

50_match - >50% match.

bad_match - <50% match.

no_hit - No matches.

no_orf - No ORF predicted.

nmd_flag -> prot_ok

This flag is just to identify potential for NMD but does not give information on protein coding potential.

Possible flags include:

prot_ok - Looks like a protein coding transcript, not an NMD.

NMD# - So this could be NMD1, NMD2, NMD3, and so forth. NMD indicates that this is an NMD and the number indicates the number of splice junctions between the stop codon and the last exon.

missing - This indicates that there were missing nucleotides in the genome assembly which means ORF prediction could not be done.

frame -> F1

This indicates the frame of the ORF with respect to the start of the transcript.

If you used an Ensembl CDS peptide sequence database you can make the Ensembl ID's the primary ID's using "tama_format_id_filter.py". Instructions for use can be found here: https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Formatting