Plastid Genome Annotator
Copyright (C) 2019 Xiao-Jian Qu
Contact
quxiaojian@sdnu.edu.cn
Citation
If you use PGA in your scientific research, please cite:
Qu X-J, Moore MJ, Li D-Z, Yi T-S. 2019. PGA: a software package for rapid, accurate, and flexible batch annotation of plastomes. Plant Methods 15:50.
Notes in Chinese
https://www.jianshu.com/p/6ac8a9fad9c9
Prerequisites
BLAST+ 2.8.1 or later
Perl 5
Windows, Linux or Mac
General Introduction to PGA
PGA (Plastid Genome Annotator) is a standalone command-line tool for rapid, accurate, and flexible batch annotation of newly generated target plastomes based on well-annotated reference plastomes. In contrast to existing tools, PGA uses reference plastomes as the query and unannotated target plastomes as the subject to locate genes, which we refer to as the reverse query-subject BLAST search approach. PGA accurately identifies gene and intron boundaries as well as intron loss. The program outputs GenBank-formatted files as well as a log file to assist users in verifying annotations.
We thank Rong Zhang, Ying-Ying Yang, and Jian-Jun Jin from the Kunming Institute of Botany, Chinese Academy of Sciences, and Pin Gong from the Institute of Botany, Chinese Academy of Sciences, for improving this tool.
PGA annotates plastomes in six steps: (1) preparation of GenBank-formatted reference plastomes; (2) preparation of FASTA-formatted target plastomes; (3) reference database generation; (4) BLAST search; (5) determination of feature boundaries; (6) generation of GenBank and log files.
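To make the reverse query-subject idea concrete, the following is a minimal Python sketch that builds a tblastn command with reference proteins as the query and the unannotated target plastome as the subject. It is not PGA's actual invocation: the file name reference_proteins.fasta, the function name, and the e-value cutoff are assumptions chosen for illustration. Only the standard BLAST+ options -query, -subject, -outfmt, and -evalue are used.

```python
import shlex


def reverse_tblastn_command(reference_proteins, target_fasta, evalue="1e-10"):
    """Build a tblastn command: reference as query, target as subject."""
    return [
        "tblastn",
        "-query", reference_proteins,  # proteins taken from the reference (hypothetical file)
        "-subject", target_fasta,      # unannotated target plastome in FASTA format
        "-outfmt", "6",                # tabular output, one hit per line
        "-evalue", evalue,             # illustrative cutoff, not PGA's setting
    ]


cmd = reverse_tblastn_command("reference_proteins.fasta", "Rosa_roxburghii.fasta")
# Print the command instead of running it, so the sketch works without BLAST+ installed.
print(shlex.join(cmd))
```

Because the target is the subject, hit coordinates are reported on the target sequence, which is where the annotations need to be placed.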
Preparations
(1) Download the latest BLAST+ software.
As of 2018/11/27, the latest version is ncbi-blast-2.8.1+-win64.exe for Windows, ncbi-blast-2.8.1+-x64-linux.tar.gz for Linux, and ncbi-blast-2.8.1+-x64-macosx.tar.gz for Mac.
For Windows, run the installer and follow its instructions. BLAST+ will be added to PATH automatically.
For Linux or Mac, we suggest adding it to PATH with the following steps.
vim ~/.bashrc
export PATH=/home/xxx/blast-2.8.1+/bin:$PATH
source ~/.bashrc
Check that BLAST+ is installed correctly by entering "blastn -version" at the command line.
blastn -version
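The version check can also be scripted. The sketch below assumes that the first line of the "blastn -version" output contains a dotted version such as "2.8.1+"; the function names are hypothetical. Versions are compared as integer tuples, so that 2.10.0 is correctly treated as newer than 2.8.1.

```python
import re
import shutil
import subprocess

MINIMUM = (2, 8, 1)  # minimum BLAST+ version listed under Prerequisites


def parse_blast_version(text):
    """Extract (major, minor, patch) from text such as 'blastn: 2.8.1+'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return tuple(int(part) for part in match.groups())


def blast_is_recent_enough(text, minimum=MINIMUM):
    """Compare versions numerically rather than as strings."""
    return parse_blast_version(text) >= minimum


# Only call blastn if it is actually on PATH.
if shutil.which("blastn"):
    result = subprocess.run(["blastn", "-version"], capture_output=True, text=True)
    print("BLAST+ recent enough:", blast_is_recent_enough(result.stdout))
else:
    print("blastn was not found on PATH")
```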
(2) Download this repository to your local computer.
For Windows, just download, unzip, and use it.
For Linux or Mac, we suggest adding it to PATH and making PGA.pl readable, writable, and executable with the following steps.
git clone https://github.com/quxiaojian/PGA.git
vim ~/.bashrc
export PATH=/home/xxx/PGA:$PATH
source ~/.bashrc
chmod a+rwx PGA.pl
Then test PGA.pl by typing "perl PGA.pl", which shows the usage information.
Usage:
PGA.pl -r -t [-i -p -q -o -f -l]
Copyright (C) 2019 Xiao-Jian Qu
Please contact <quxiaojian@sdnu.edu.cn>, if you have any bugs or questions.
[-h -help] help information.
[-r -reference] required: (default: reference) input directory name containing GenBank-formatted file(s) that from the same or close families.
[-t -target] required: (default: target) input directory name containing FASTA-formatted file(s) that will be annotated.
[-i -ir] optional: (default: 1000) minimum allowed inverted-repeat (IR) length.
[-p -pidentity] optional: (default: 40) any PCGs with a TBLASTN percent identity less than this value will be listed in the log file and
will not be annotated.
[-q -qcoverage] optional: (default: 0.5,2) any PCGs with a query coverage per annotated PCG less or greater than each of these two values (<1,>1)
will be listed in the log file.
[-o -out] optional: (default: gb) output directory name.
[-f -form] optional: (default: circular) circular or linear form for FASTA-formatted file.
[-l -log] optional: (default: warning) log file name containing warning information for annotated GenBank-formatted file(s).
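The -p and -q options act as screening thresholds for protein-coding genes (PCGs). The following Python sketch illustrates how such thresholds could be applied, based only on the usage text above. It is not PGA's code: the function names and return labels are hypothetical, and query coverage is assumed to be a ratio where 1 means the hit spans the full query.

```python
def parse_qcoverage(option="0.5,2"):
    """Split a '-q' style value such as '0.5,2' into (low, high) bounds."""
    low, high = (float(value) for value in option.split(","))
    if not (low < 1 < high):
        raise ValueError("expected two values, one below 1 and one above 1")
    return low, high


def screen_pcg(pidentity, qcoverage, min_pidentity=40.0, qcov_option="0.5,2"):
    """Classify one PCG hit against the -p and -q thresholds."""
    low, high = parse_qcoverage(qcov_option)
    if pidentity < min_pidentity:
        # Per the usage text: listed in the log file and not annotated.
        return "skip_and_log"
    if qcoverage < low or qcoverage > high:
        # Per the usage text: listed in the log file. Whether the gene is
        # still annotated is not stated there, so this label only says "log".
        return "log"
    return "ok"


print(screen_pcg(35.0, 1.0))  # identity below the default of 40
print(screen_pcg(90.0, 0.4))  # coverage below the default lower bound of 0.5
print(screen_pcg(90.0, 1.0))  # passes both thresholds
```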
Run Test
perl PGA.pl -r test/angiosperms/reference -t test/angiosperms/target
This is equivalent to:
perl PGA.pl -r test/angiosperms/reference -t test/angiosperms/target -i 1000 -p 40 -q 0.5,2 -o gb -f circular -l warning
Input and Output
Annotation of the plastome of Rosa roxburghii with the plastome of Amborella trichopoda as reference. (a) "Amborella_trichopoda.gb" shows the partial GenBank-formatted reference plastome of Amborella trichopoda, as revised from AJ506156. (b) "Rosa_roxburghii.fasta" shows the partial FASTA-formatted target plastome of Rosa roxburghii, revised from NC_032038. (c) "Rosa_roxburghii.gb" shows the output GenBank-formatted file containing partial annotation information for the target plastome of Rosa roxburghii. (d) "warning.log" shows warning and statistical items during the annotation of the target plastome of Rosa roxburghii. The log file indicates the loss of the atpF intron in Rosa roxburghii. There are 113 total genes in the reference and target plastomes.
(a) Amborella_trichopoda.gb
LOCUS Amborella_trichopoda 162686 bp DNA circular UNA 08-JUN-2015
DEFINITION Amborella trichopoda chloroplast genomic DNA, complete sequence.
ACCESSION AJ506156
VERSION AJ506156.2 GI:34481608
KEYWORDS complete genome.
SOURCE chloroplast Amborella trichopoda
ORGANISM Amborella trichopoda
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; basal Magnoliophyta; Amborellales;
Amborellaceae; Amborella.
FEATURES Location/Qualifiers
source 1..162686
/organism="Amborella trichopoda"
/mol_type="genomic DNA"
repeat_region 90951..117611
/note="inverted repeat region B; IRB repeat region"
/rpt_type="inverted"
rRNA complement(139284..142097)
/gene="rrn23"
/product="23S ribosomal RNA"
gene complement(139284..142097)
/gene="rrn23"
tRNA join(complement(4472..4508), complement(1840..1874))
/gene="trnK-UUU"
/product="tRNA-Lys"
gene complement(1840..4508)
/gene="trnK-UUU"
CDS join(complement(16186..16330), complement(14506..14915))
/gene="atpF"
/codon_start=1
/transl_table=11
/product="ATPase I subunit"
/translation="MKNVTDSFVSLGHWPSAGSFGFNTDIFATNPINLSVVLGVLIFF
GKGVLSDLLDNRKQRILSTIRNSEELRGGAIEQLEKARARLRKVEIEADEFRVNGYSE
IEREKSNLINAAYENLERLENYKNESIHFEQQRAMNQVRQRVFQQALQGALETLNSYL
NSELHLRTISANIGMLGTMKNITD"
gene complement(14506..16330)
/gene="atpF"
(b) Rosa_roxburghii.fasta
>Rosa_roxburghii
ATGGGCGAACGACGGGAATTGAACCCGCGCGTGGTGGATTCACAATCCACTGCCTTGATC
(c) Rosa_roxburghii.gb
LOCUS Rosa_roxburghii 156749 bp DNA circular PLN 25-DEC-2017
FEATURES Location/Qualifiers
source 1..156749
/organism="Rosa_roxburghii"
/mol_type="genomic DNA"
gene 106222..109027
/gene="rrn23"
rRNA 106222..109027
/gene="rrn23"
/product="23S ribosomal RNA"
gene complement(1704..4278)
/gene="trnK-UUU"
tRNA join(complement(4242..4278), complement(1704..1738))
/gene="trnK-UUU"
/product="tRNA-Lys"
gene complement(12213..12767)
/gene="atpF"
CDS complement(12213..12767)
/gene="atpF"
/codon_start=1
/transl_table=11
/product="ATP synthase CF0 subunit I"
(d) warning.log
Rosa_roxburghii
Warning: atpF (negative one-intron PCG) lost intron!
Total number of genes in the reference plastome(s): 113.
Total number of genes annotated in the target plastome: 113.
All gene names from the reference plastome(s) that were not annotated in the target plastome:
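The atpF intron loss reported in the log can be cross-checked from the CDS locations shown in (a) and (c). The Python sketch below is a simplified location parser written for this example, not part of PGA. It only handles plain "start..end" spans, and ignores partial-location markers such as "<" and ">" as well as single-base locations.

```python
import re


def exon_spans(location):
    """Return (start, end) pairs for every 'start..end' span in a location string."""
    return [(int(start), int(end))
            for start, end in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(location):
    """Total length of all spans; coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in exon_spans(location))


# CDS locations copied from the excerpts in (a) and (c).
reference_atpF = "join(complement(16186..16330), complement(14506..14915))"
target_atpF = "complement(12213..12767)"

print(len(exon_spans(reference_atpF)), spliced_length(reference_atpF))  # 2 555
print(len(exon_spans(target_atpF)), spliced_length(target_atpF))        # 1 555
```

The reference CDS has two exons and the target CDS has one, yet both are 555 bp long after splicing, which is consistent with loss of the intron rather than loss of coding sequence.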
Recommendations for using PGA
(1) Users should carefully check the GenBank-formatted reference plastome. PGA is packaged with several properly annotated plastomes, and it is thus possible for users to use PGA to re-annotate a plastome that is intended to be used as a reference, in order to correct possible inaccuracies.
(2) It is important that users select a reference plastome that contains sufficient numbers of annotated genes for the target taxa. The number of genes in the reference plastome(s) should equal or exceed the number in the target plastome(s). If the number of genes in the target is uncertain, it may be best to use multiple reference plastomes. The Amborella trichopoda (AJ506156) and Zamia furfuracea (JX416857) plastomes included within PGA are examples of plastomes that contain the highest gene numbers among known angiosperms and gymnosperms, and as such it is recommended that they be included as references during PGA runs.
(3) We do not recommend annotating highly incomplete plastomes using a complete reference plastome, because BLAST may annotate some genes redundantly (i.e., BLAST may return hits for genes that were not sequenced or are otherwise absent in the incomplete plastome, resulting in spurious annotations). To annotate highly incomplete plastomes or plastome segments, we recommend using progressiveMauve (as implemented in Mauve 2.4.0; Darling et al., 2010) to align the incomplete plastome to the reference plastome, followed by the use of the corresponding homologous block of the reference plastome as the reference for annotation in PGA.
(4) We suggest that users carefully check highly divergent or otherwise unusual target plastomes for incorrect annotations. This is particularly important for plastomes with a high degree of gene loss, pseudogenization or sequence divergence.
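As a quick aid for recommendations (2) and (4), gene names in a reference file can be compared with those in an annotated output file. The Python sketch below is a hypothetical helper, not part of PGA. It collects /gene qualifiers with a regular expression and therefore counts each gene name once, even if the gene is duplicated in the inverted repeats. The target excerpt is deliberately truncated to show a missing gene and does not reflect the actual PGA output for Rosa roxburghii.

```python
import re


def gene_names(genbank_text):
    """Collect the unique /gene="..." qualifier values from GenBank-formatted text."""
    return set(re.findall(r'/gene="([^"]+)"', genbank_text))


def missing_in_target(reference_text, target_text):
    """Gene names present in the reference but absent from the target."""
    return sorted(gene_names(reference_text) - gene_names(target_text))


reference_excerpt = '''
     gene            complement(139284..142097)
                     /gene="rrn23"
     gene            complement(1840..4508)
                     /gene="trnK-UUU"
     gene            complement(14506..16330)
                     /gene="atpF"
'''

# Contrived excerpt: atpF is omitted on purpose for illustration.
target_excerpt = '''
     gene            106222..109027
                     /gene="rrn23"
     gene            complement(1704..4278)
                     /gene="trnK-UUU"
'''

print(missing_in_target(reference_excerpt, target_excerpt))  # ['atpF']
```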