quxiaojian/PGA

Plastid Genome Annotator
Copyright (C) 2019 Xiao-Jian Qu

Contact
quxiaojian@sdnu.edu.cn

Citation
If you use PGA in your scientific research, please cite:
Qu X-J, Moore MJ, Li D-Z, Yi T-S. 2019. PGA: a software package for rapid, accurate, and flexible batch annotation of plastomes. Plant Methods 15:50.
ResearchGate
Plant Methods

Notes in Chinese
https://www.jianshu.com/p/6ac8a9fad9c9

Prerequisites
BLAST+ 2.8.1 or later
Perl 5
Windows, Linux or Mac

General Introduction to PGA
PGA (Plastid Genome Annotator) is a standalone command-line tool for rapid, accurate, and flexible batch annotation of newly generated target plastomes based on well-annotated reference plastomes. In contrast to existing tools, PGA uses reference plastomes as the query and unannotated target plastomes as the subject to locate genes, which we refer to as the reverse query-subject BLAST search approach. PGA accurately identifies gene and intron boundaries as well as intron loss. The program outputs GenBank-formatted files as well as a log file to assist users in verifying annotations.
We thank Rong Zhang, Ying-Ying Yang and Jian-Jun Jin from the Kunming Institute of Botany, Chinese Academy of Sciences, and Pin Gong from the Institute of Botany, Chinese Academy of Sciences, for improving this tool.

PGA annotates plastomes in six steps: (1) preparation of GenBank-formatted reference plastomes; (2) preparation of FASTA-formatted target plastomes; (3) reference database generation; (4) BLAST search; (5) determination of feature boundaries; (6) generation of GenBank and log files.
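The reverse query-subject orientation used in step (4) can be illustrated with a plain BLAST+ call. This is only a sketch of the search direction, not PGA's internal code: the helper name `reverse_query_search` and the idea of a pre-extracted reference protein FASTA file are assumptions made for illustration, and the e-value cutoff is arbitrary.

```shell
# Hypothetical helper: reference proteins are the query, the target plastome is the subject.
# $1 = FASTA file of reference protein sequences (assumed to be prepared beforehand)
# $2 = FASTA file of the unannotated target plastome
reverse_query_search() {
    ref_proteins=$1
    target_fasta=$2
    if [ ! -f "$ref_proteins" ] || [ ! -f "$target_fasta" ]; then
        echo "input file missing" >&2
        return 2
    fi
    if ! command -v tblastn >/dev/null 2>&1; then
        echo "tblastn not found in PATH" >&2
        return 3
    fi
    # -subject searches a FASTA file directly, so no database has to be built;
    # -outfmt 6 prints one tab-separated hit per line.
    tblastn -query "$ref_proteins" -subject "$target_fasta" -outfmt 6 -evalue 1e-5
}
```

Each output line then describes where one reference gene product matches the target sequence, which is the information needed to place that gene on the target plastome.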

Preparations

(1) Download the latest BLAST+ software.
The latest version (as of 2018/11/27) is ncbi-blast-2.8.1+-win64.exe for Windows, ncbi-blast-2.8.1+-x64-linux.tar.gz for Linux, and ncbi-blast-2.8.1+-x64-macosx.tar.gz for Mac.
For Windows, install it following the installer instructions; it is added to PATH automatically.
For Linux or Mac, we suggest adding it to PATH with the steps below.

vim ~/.bashrc
export PATH=/home/xxx/blast-2.8.1+/bin:$PATH
source ~/.bashrc

Check that BLAST+ was installed successfully by entering "blastn -version" at the command line.

blastn -version
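The version check can also be scripted. The sketch below assumes that the first line of the `blastn -version` output has the form `blastn: 2.8.1+`; the function names `version_ge` and `parse_blast_version` are made up for this example.

```shell
# Return 0 if dotted version $1 is >= dotted version $2 (compared field by field as numbers).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0; yi = y[i] + 0   # missing fields count as 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Read "blastn -version" output on stdin and print the version number, e.g. "2.8.1".
parse_blast_version() {
    sed -n 's/^blastn: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

if command -v blastn >/dev/null 2>&1; then
    installed=$(blastn -version | parse_blast_version)
    if version_ge "$installed" 2.8.1; then
        echo "BLAST+ $installed is recent enough"
    else
        echo "BLAST+ $installed is older than 2.8.1" >&2
    fi
else
    echo "blastn not found in PATH" >&2
fi
```

Comparing the fields numerically rather than as strings keeps a version such as 2.10.0 from being treated as older than 2.8.1.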

(2) Download this repository to your local computer.
For Windows, just download, unzip and use it.
For Linux or Mac, we suggest adding it to PATH and making PGA.pl readable, writable and executable with the steps below.

git clone https://github.com/quxiaojian/PGA.git
vim ~/.bashrc
export PATH=/home/xxx/PGA:$PATH
source ~/.bashrc
chmod a+rwx PGA/PGA.pl

Then test PGA.pl by typing "perl PGA.pl", which prints the usage information.

Usage:
    PGA.pl -r -t [-i -p -q -o -f -l]
    Copyright (C) 2019 Xiao-Jian Qu
    Please contact <quxiaojian@sdnu.edu.cn>, if you have any bugs or questions.

    [-h -help]         help information.
    [-r -reference]    required: (default: reference) input directory name containing GenBank-formatted file(s) that from the same or close families.
    [-t -target]       required: (default: target) input directory name containing FASTA-formatted file(s) that will be annotated.
    [-i -ir]           optional: (default: 1000) minimum allowed inverted-repeat (IR) length.
    [-p -pidentity]    optional: (default: 40) any PCGs with a TBLASTN percent identity less than this value will be listed in the log file and
                       will not be annotated.
    [-q -qcoverage]    optional: (default: 0.5,2) any PCGs with a query coverage per annotated PCG less or greater than each of these two values (<1,>1)
                       will be listed in the log file.
    [-o -out]          optional: (default: gb) output directory name.
    [-f -form]         optional: (default: circular) circular or linear form for FASTA-formatted file.
    [-l -log]          optional: (default: warning) log file name containing warning information for annotated GenBank-formatted file(s).
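The effect of the -p and -q thresholds can be pictured with a small decision function. This is a hypothetical illustration of the semantics described in the usage text, not PGA's actual code; the function name `classify_pcg` and the labels `skip`, `warn` and `ok` are invented here, and the coverage value is assumed to be a ratio where 1 means the same length as the reference.

```shell
# Hypothetical illustration of the -p and -q thresholds; not PGA's actual code.
# $1 = TBLASTN percent identity of a PCG
# $2 = query coverage ratio of the same PCG
# $3 = -p value (default 40), $4 = -q value (default 0.5,2)
classify_pcg() {
    awk -v pid="$1" -v cov="$2" -v p="${3:-40}" -v q="${4:-0.5,2}" 'BEGIN {
        split(q, bounds, ",")
        lo = bounds[1] + 0
        hi = bounds[2] + 0
        if (pid + 0 < p + 0) {
            print "skip"      # below -p: listed in the log and not annotated
        } else if (cov + 0 < lo || cov + 0 > hi) {
            print "warn"      # outside the -q range: listed in the log
        } else {
            print "ok"
        }
    }'
}

classify_pcg 35 1.0
classify_pcg 80 2.5
classify_pcg 80 1.0
```

With the defaults, a PCG at 35% identity falls under the -p cutoff, while one at 80% identity but 2.5 times the reference length is only flagged by -q.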

Run Test

perl PGA.pl -r test/angiosperms/reference -t test/angiosperms/target

This is equivalent to specifying all defaults explicitly:

perl PGA.pl -r test/angiosperms/reference -t test/angiosperms/target -i 1000 -p 40 -q 0.5,2 -o gb -f circular -l warning

Input and Output
Annotation of the plastome of Rosa roxburghii with the plastome of Amborella trichopoda as reference. (a) "Amborella_trichopoda.gb" shows the partial GenBank-formatted reference plastome of Amborella trichopoda, as revised from AJ506156. (b) "Rosa_roxburghii.fasta" shows the partial FASTA-formatted target plastome of Rosa roxburghii, revised from NC_032038. (c) "Rosa_roxburghii.gb" shows the output GenBank-formatted file containing partial annotation information for the target plastome of Rosa roxburghii. (d) "warning.log" shows warning and statistical items during the annotation of the target plastome of Rosa roxburghii. The log file indicates the loss of the atpF intron in Rosa roxburghii. There are 113 total genes in the reference and target plastomes.

(a) Amborella_trichopoda.gb
LOCUS       Amborella_trichopoda      162686 bp    DNA     circular UNA 08-JUN-2015
DEFINITION  Amborella trichopoda chloroplast genomic DNA, complete sequence.
ACCESSION   AJ506156
VERSION     AJ506156.2  GI:34481608
KEYWORDS    complete genome.
SOURCE      chloroplast Amborella trichopoda
  ORGANISM  Amborella trichopoda
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; basal Magnoliophyta; Amborellales;
            Amborellaceae; Amborella.
FEATURES          Location/Qualifiers
     source          1..162686
                    /organism="Amborella trichopoda"
                    /mol_type="genomic DNA"
     repeat_region    90951..117611
                    /note="inverted repeat region B; IRB repeat region"
                    /rpt_type="inverted"
     rRNA          complement(139284..142097)
                    /gene="rrn23"
                    /product="23S ribosomal RNA"
     gene           complement(139284..142097)
                    /gene="rrn23"
     tRNA          join(complement(4472..4508), complement(1840..1874))
                    /gene="trnK-UUU"
                    /product="tRNA-Lys"
     gene           complement(1840..4508)
                    /gene="trnK-UUU"
     CDS           join(complement(16186..16330), complement(14506..14915))
                    /gene="atpF"
                    /codon_start=1
                    /transl_table=11
                    /product="ATPase I subunit"
                    /translation="MKNVTDSFVSLGHWPSAGSFGFNTDIFATNPINLSVVLGVLIFF
                    GKGVLSDLLDNRKQRILSTIRNSEELRGGAIEQLEKARARLRKVEIEADEFRVNGYSE
                    IEREKSNLINAAYENLERLENYKNESIHFEQQRAMNQVRQRVFQQALQGALETLNSYL
                    NSELHLRTISANIGMLGTMKNITD"
     gene           complement(14506..16330)
                    /gene="atpF"

(b) Rosa_roxburghii.fasta
>Rosa_roxburghii
ATGGGCGAACGACGGGAATTGAACCCGCGCGTGGTGGATTCACAATCCACTGCCTTGATC

(c) Rosa_roxburghii.gb
LOCUS       Rosa_roxburghii  156749 bp    DNA     circular PLN 25-DEC-2017
FEATURES             Location/Qualifiers
     source          1..156749
                     /organism="Rosa_roxburghii"
                     /mol_type="genomic DNA"
     gene            106222..109027
                     /gene="rrn23"
     rRNA            106222..109027
                     /gene="rrn23"
                     /product="23S ribosomal RNA"
     gene            complement(1704..4278)
                     /gene="trnK-UUU"
     tRNA            join(complement(4242..4278), complement(1704..1738))
                     /gene="trnK-UUU"
                     /product="tRNA-Lys"
     gene            complement(12213..12767)
                     /gene="atpF"
     CDS             complement(12213..12767)
                     /gene="atpF"
                     /codon_start=1
                     /transl_table=11
                     /product="ATP synthase CF0 subunit I"

(d) warning.log
Rosa_roxburghii
Warning: atpF (negative one-intron PCG) lost intron!
Total number of genes in the reference plastome(s): 113.
Total number of genes annotated in the target plastome: 113.
All gene names from the reference plastome(s) that were not annotated in the target plastome:
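A log in this format can be checked from the command line. The sketch below assumes the exact line formats shown in the warning.log example and a log that covers a single target plastome; the function name `summarize_log` is made up for illustration.

```shell
# Print every warning and the two gene totals from a PGA log file.
# Returns 0 only if both totals were found and are equal.
# Assumes the line formats of the warning.log example and one target per log.
summarize_log() {
    log=$1
    grep '^Warning:' "$log"
    ref=$(sed -n 's/^Total number of genes in the reference plastome(s): \([0-9]*\)\.$/\1/p' "$log")
    tgt=$(sed -n 's/^Total number of genes annotated in the target plastome: \([0-9]*\)\.$/\1/p' "$log")
    echo "reference=$ref target=$tgt"
    [ -n "$ref" ] && [ "$ref" = "$tgt" ]
}
```

For the example above, `summarize_log warning.log` would print the atpF intron-loss warning followed by a line with both totals, and its exit status would indicate that no reference gene is missing from the target.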

Recommendations for using PGA
(1) Users should carefully check the GenBank-formatted reference plastome. PGA is packaged with several properly annotated plastomes, and it is thus possible for users to use PGA to re-annotate a plastome that is intended to be used as a reference, in order to correct possible inaccuracies.
(2) It is important that users select a reference plastome that contains sufficient numbers of annotated genes for the target taxa. The number of genes in the reference plastome(s) should equal or exceed the number in the target plastome(s). If the number of genes in the target is uncertain, it may be best to use multiple reference plastomes. The Amborella trichopoda (AJ506156) and Zamia furfuracea (JX416857) plastomes included within PGA are examples of plastomes that contain the highest gene numbers among known angiosperms and gymnosperms, and as such it is recommended that they be included as references during PGA runs.
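To compare candidate references for recommendation (2), the number of distinct gene names in one or more GenBank files can be counted with standard tools. This is a rough sketch: it counts unique `/gene="..."` qualifier values, which may not match the way PGA itself counts genes, and the function name `count_gene_names` is invented for this example.

```shell
# Count distinct /gene="..." names across one or more GenBank-formatted files.
# A gene annotated several times (gene, CDS, tRNA features, or IR copies) is counted once.
count_gene_names() {
    sed -n 's/.*\/gene="\([^"]*\)".*/\1/p' "$@" | sort -u | wc -l | tr -d ' '
}
```

Running `count_gene_names` on each candidate reference, or on all references together, gives a quick estimate of how many genes a PGA run could transfer to the target.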
(3) We do not recommend annotating highly incomplete plastomes using a complete reference plastome, because BLAST may annotate some genes redundantly (i.e., BLAST may return hits for genes that were not sequenced or are otherwise absent in the incomplete plastome, resulting in spurious annotations). To annotate highly incomplete plastomes or plastome segments, we recommend using progressiveMauve (as implemented in Mauve 2.4.0; Darling et al., 2010) to align the incomplete plastome to the reference plastome, followed by the use of the corresponding homologous block of the reference plastome as the reference for annotation in PGA.
(4) We suggest that users carefully check highly divergent or otherwise unusual target plastomes for incorrect annotations. This is particularly important for plastomes with a high degree of gene loss, pseudogenization or sequence divergence.
