The script can be used to:
- Download complete genomes in GenBank format from NCBI, given a list of NCBI taxonomy IDs.
- Build a BLAST database for use with Prokka.
./ --help
usage: [-h] [-t TAXID] [-n NAME] [-o OUTDIR]
[-s {refseq,genbank}]
[-l {all,complete,chromosome,scaffold,contig}]
[-g GROUP] [-p N] [-e EXT] [-b]
Download and Make a database for use with Prokka
optional arguments:
-h, --help show this help message and exit
-t TAXID, --taxid TAXID
Only download sequences of the provided NCBI taxonomy
ID. A comma-separated list of taxids is also possible.
For example: "9606,9685". (default: 93071)
-n NAME, --name NAME A name for the database (default: new_prokka_database)
-o OUTDIR, --outdir OUTDIR
A directory for storing intermediate outputs (default:
-s {refseq,genbank}, --section {refseq,genbank}
NCBI section to download (default: genbank)
-l {all,complete,chromosome,scaffold,contig}, --assembly-level {all,complete,chromosome,scaffold,contig}
Assembly level of genomes to download (default:
-g GROUP, --group GROUP
Taxonomic group, i.e bacteria, viral, etc (default:
-p N, --parallel N Run N downloads and converting gbk to faa in parallel
(default: 1)
-e EXT, --ext EXT File extension for scanning with sequence folder
(default:gz) (default: gz)
-b, --build Build database given from a folder of complete genbank
files? (default: False)
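The -t option accepts either a single taxonomy ID or a comma-separated list. How the script splits that value internally is not shown here; the following is a minimal sketch of one way to do it, where parse_taxids is a hypothetical helper name and not part of the script.

```python
def parse_taxids(value):
    """Split a comma-separated taxid string into a list of integers.

    Hypothetical helper for illustration; the script's own parsing may differ.
    """
    taxids = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing comma or doubled separators
        if not part.isdigit():
            raise ValueError(f"not a numeric taxonomy ID: {part!r}")
        taxids.append(int(part))
    return taxids


print(parse_taxids("9606,9685"))  # prints [9606, 9685]
```

Rejecting non-numeric input early gives a clearer error than a failed download later on.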
The script requires Python 3.7.
Before running the script, make sure the following dependencies are installed: cd-hit, blast, ncbi-genome-download, and biopython.
External dependencies can be installed from Bioconda:
conda install -c conda-forge -c bioconda cd-hit blast
Python dependencies can be installed via pip:
pip install ncbi-genome-download==0.2.8 biopython
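After installing, it can help to confirm that everything is reachable before starting a long download. The sketch below is not part of the script. It assumes the executables are named cd-hit, makeblastdb (from the blast package) and ncbi-genome-download, and that Biopython is importable as Bio; adjust the names if your installation differs.

```python
import importlib.util
import shutil

# Assumed executable names for the cd-hit, blast and ncbi-genome-download packages.
REQUIRED_TOOLS = ["cd-hit", "makeblastdb", "ncbi-genome-download"]
# Biopython is imported under the module name "Bio".
REQUIRED_MODULES = ["Bio"]


def find_missing(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES,
                 which=shutil.which, find_spec=importlib.util.find_spec):
    """Return the names of executables and modules that cannot be found."""
    missing = [tool for tool in tools if which(tool) is None]
    missing += [mod for mod in modules if find_spec(mod) is None]
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing dependencies: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

The which and find_spec parameters exist only so the check can be exercised without touching the real environment.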
Taxonomy ID: 90371
./ -t 90371 -n salmonella_90371 -p 4
Output of a successful run:
Start downloading 90371
Location /home/ubuntu/mydata/salmonella/prokka_db_maker/ncbi
Start building DB for: 90371
Database salmonella_90371 for use with prokka has been saved to /home/ubuntu/mydata/salmonella/prokka_db_maker/salmonella_90371
Output files:
├── salmonella_90371.faa
├── salmonella_90371.phr
└── salmonella_90371.psq
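A quick way to verify a run is to check that the three files listed above exist. This is a minimal sketch: missing_outputs is a hypothetical helper, and the directory argument is an assumption about where the files end up on your system.

```python
from pathlib import Path

# Suffixes taken from the output listing above.
EXPECTED_SUFFIXES = (".faa", ".phr", ".psq")


def missing_outputs(outdir, name):
    """Return the expected database files that are absent from outdir."""
    base = Path(outdir)
    expected = [base / f"{name}{suffix}" for suffix in EXPECTED_SUFFIXES]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Directory and database name here are placeholders for illustration.
    for path in missing_outputs(".", "salmonella_90371"):
        print(f"missing: {path}")
```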
Download complete sequences for Salmonella enterica subsp. enterica serovar Typhi and Salmonella enterica subsp. enterica serovar Typhimurium
Taxonomy IDs: 90370, 90371
./ -t 90370,90371 -n salmonella -p 4
To build a database from a folder of GenBank files that is already on disk (here ncbi), add -b:
./ -o ncbi -n salmonella -p 4 -b
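The script's internals are not shown here. Given that cd-hit and blast are listed as dependencies, the build step presumably clusters the extracted proteins and then formats them as a protein BLAST database. The sketch below only assembles the two command lines and runs nothing; the function names, file names, and the choice of options are assumptions, not the script's actual behaviour.

```python
import shlex


def cdhit_command(faa_in, faa_out, threads=1):
    """Command line to cluster proteins with cd-hit (-i input, -o output, -T threads)."""
    return ["cd-hit", "-i", str(faa_in), "-o", str(faa_out), "-T", str(threads)]


def makeblastdb_command(faa, db_name):
    """Command line to format a FASTA file as a protein BLAST database."""
    return ["makeblastdb", "-in", str(faa), "-dbtype", "prot", "-out", str(db_name)]


if __name__ == "__main__":
    # File names are placeholders chosen for illustration.
    commands = [
        cdhit_command("all_proteins.faa", "salmonella.faa", threads=4),
        makeblastdb_command("salmonella.faa", "salmonella"),
    ]
    for cmd in commands:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning argument lists rather than shell strings keeps the commands safe to pass to subprocess.run without shell quoting issues.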