The script should work in any Python 3 environment with the Pandas library installed. It was tested with the following versions:
- Python (3.11)
- Pandas (1.5.3)
The script downloads genomes from the NCBI GenBank FTP server based on the content of `assembly_summary_genbank.txt` or any TSV file that provides the required data on genome assemblies in the following columns: `assembly_accession`, `taxid`, `assembly_level`, `asm_name` and `ftp_path`. For more information, see `README.txt` at the NCBI GenBank FTP site.
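Before passing a custom TSV file to the script, it can help to confirm that the file contains the five columns listed above. The following is a minimal sketch using only the standard library; it is not taken from `fetch_genomes.py`. The function name `read_summary` is hypothetical, the sample accession and path are made up, and the handling of `#`-prefixed header lines is an assumption about how NCBI-style summaries are laid out.

```python
import io

# Columns named in this README as required by the script.
REQUIRED_COLUMNS = {"assembly_accession", "taxid", "assembly_level",
                    "asm_name", "ftp_path"}


def read_summary(handle):
    """Return data rows as dicts, after checking the required columns exist.

    Assumption: leading lines starting with '#' are either comments or the
    header itself (with the '#' prefix stripped before parsing).
    """
    header = None
    rows = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if header is None:
            fields = line.lstrip("# ").split("\t")
            if REQUIRED_COLUMNS <= set(fields):
                header = fields
            elif not line.startswith("#"):
                # A plain first line without the required columns is an error.
                missing = sorted(REQUIRED_COLUMNS - set(fields))
                raise ValueError(f"missing columns: {', '.join(missing)}")
            continue
        rows.append(dict(zip(header, line.split("\t"))))
    if header is None:
        raise ValueError("no header line with the required columns found")
    return rows


# Made-up sample data standing in for a real summary file.
sample = io.StringIO(
    "# free-text comment line\n"
    "#assembly_accession\ttaxid\tassembly_level\tasm_name\tftp_path\n"
    "GCA_000000000.1\t1279\tComplete Genome\tASM0v1\thttps://example.org/x\n"
)
rows = read_summary(sample)
```

With a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8")`.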
You can run it without any command line options or provide one or more `taxid` values to narrow down the set of genomes you want to retrieve. You can look up the IDs of the taxa you are interested in in the NCBI Taxonomy database. Genomes of the requested taxa and all their subordinate subtaxa will be retrieved.
- Download genomes for taxid 1279 (genus Staphylococcus) and 1350 (genus Enterococcus) and all subtaxa, i.e. all genomes assigned to the genera as well as all subordinate species, subspecies etc. Save the genomes to a default directory (`genomes`) in the current location:
./fetch_genomes.py -t 1279 1350
- If you have network connection problems, you can rerun the script until all genomes are successfully retrieved, i.e. until the final message reads `[INFO] All files have been successfully fetched`. Simply resume the previous download, or retry the skipped genomes, using the filtered assembly summary saved by a previous search (existing files will not be redownloaded):
./fetch_genomes.py -a assembly_summary_copy.tsv
- Retrieve the filtered assembly summary only, i.e. the assembly summary for genomes belonging to the requested taxa, without downloading anything else, so that you can examine or modify it before use:
./fetch_genomes.py -t 1279 1350 -s
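A saved summary can be examined or trimmed before it is handed back to the script with `-a`. The sketch below counts assemblies per `assembly_level` and keeps only selected levels. It assumes the saved copy is a TSV file with a plain header row, which this README does not confirm. All function names are hypothetical, the sample rows are made up, and the standard library `csv` module is used here in place of Pandas to keep the example dependency-free.

```python
import csv
import io
from collections import Counter


def read_tsv(handle):
    """Read a TSV with a header row; return (fieldnames, rows as dicts)."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    return reader.fieldnames, rows


def count_levels(rows):
    """Count assemblies per value of the assembly_level column."""
    return Counter(row["assembly_level"] for row in rows)


def keep_levels(rows, levels):
    """Keep rows whose assembly_level matches one of `levels` (case-insensitive)."""
    wanted = {level.lower() for level in levels}
    return [row for row in rows if row["assembly_level"].lower() in wanted]


def write_tsv(fieldnames, rows, handle):
    """Write rows back out as TSV with the original column order."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# In-memory stand-in for a saved summary; accessions and names are made up.
demo = io.StringIO(
    "assembly_accession\ttaxid\tassembly_level\tasm_name\tftp_path\n"
    "GCA_000000001.1\t1280\tComplete Genome\tASM1v1\tna\n"
    "GCA_000000002.1\t1280\tContig\tASM2v1\tna\n"
)
fieldnames, rows = read_tsv(demo)
complete = keep_levels(rows, ["Complete Genome"])
out = io.StringIO()
write_tsv(fieldnames, complete, out)
```

With real files, open `assembly_summary_copy.tsv` using `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` objects, and write the trimmed rows to a new file that you then pass via `-a`.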
| Option | Use |
|---|---|
| `-a`, `--assembly-summary` | Path to a custom local TSV file that contains information on the assemblies to be downloaded; default: the assembly summary is fetched from the NCBI GenBank FTP site |
| `-c`, `--summary-copy` | Path to the TSV file in which to save the filtered assembly summary for the chosen taxids; default: `assembly_summary_copy.tsv` |
| `-t`, `--taxids` | Space-separated IDs of taxa to retrieve genomic sequences for; default: all existing(!) |
| `-l`, `--assembly-levels` | Space-separated assembly levels to be taken into consideration: chromosome (`chr`), scaffold (`scff`), complete (`cmpl`), contig (`ctg`); default: all levels |
| `-o`, `--output-dir` | Path to the directory for downloaded genomes; default: `genomes` |
| `-f`, `--formats` | Formats of data to be downloaded: genomic sequences in nucleotide FASTA format (`fna`), genomic sequences in GenBank format (`gbff`), annotation table (`gff`), RNA sequences in nucleotide FASTA format (`rna`), coding sequences (CDS) in nucleotide FASTA format (`cds`), translations of CDS in protein FASTA format (`prot`); default: `fna` |
| `-n`, `--non-interactive` | Do not ask questions and overwrite existing data (be absolutely sure you know what you are doing) |
| `-s`, `--summary-only` | For the given taxids, or for all taxa, download only the assembly summary |
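To illustrate how an `ftp_path` value and a format key could translate into a file to download, here is a minimal sketch. It is not the script's actual implementation. The file name suffixes follow the naming used in NCBI assembly directories, but the mapping from the `-f` keys to those suffixes is an assumption, as are the function name `file_url` and the treatment of `na` as a missing path. The example path is a made-up placeholder.

```python
# Assumed mapping from this README's format keys to NCBI file name suffixes.
SUFFIXES = {
    "fna": "_genomic.fna.gz",
    "gbff": "_genomic.gbff.gz",
    "gff": "_genomic.gff.gz",
    "rna": "_rna_from_genomic.fna.gz",
    "cds": "_cds_from_genomic.fna.gz",
    "prot": "_protein.faa.gz",
}


def file_url(ftp_path, fmt):
    """Build the URL of one data file inside an assembly directory.

    Returns None when the summary holds no usable path (assumed to be "na").
    Raises KeyError for a format key that is not in SUFFIXES.
    """
    if not ftp_path or ftp_path == "na":
        return None
    base = ftp_path.rstrip("/")
    # Files in the directory are prefixed with the directory's own name.
    dir_name = base.rsplit("/", 1)[-1]
    return f"{base}/{dir_name}{SUFFIXES[fmt]}"


# Placeholder path with a made-up accession and assembly name.
example = file_url(
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/000/000/GCA_000000000.1_ASM0v1",
    "fna",
)
```

Not every assembly provides every file type, so a real downloader would also need to handle missing files.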