-
Notifications
You must be signed in to change notification settings - Fork 4
18. Measuring codon usage dissimilarity of BGCs to their background genomes
We provide a simple program, compareBGCtoGenomeCodonUsage.py
which can measure the dissimilarity of a BGC to its background genome as a comparable measure of HGT. Inputs are a BGC Genbank (or multiple Genbanks if the user believes it is split into multiple pieces due to assembly fragmentation) and a full-genome Genbank. Both are produced by antiSMASH:
# Download assembly for Staphylococcus warneri st. LK413 / p3-SID855
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/025/145/625/GCF_025145625.2_ASM2514562v2/GCF_025145625.2_ASM2514562v2_genomic.fna.gz
gunzip GCF_025145625.2_ASM2514562v2_genomic.fna.gz
# Run antiSMASH in environment with antiSMASH available (not part of lsaBGC conda environment!)
antismash --genefinding-tool prodigal --output-dir Sw_LK413 --output-basename LK413 --minimal GCF_025145625.2_ASM2514562v2_genomic.fna
# with lsaBGC conda env activated
# run compareBGCtoGenomeCodonUsage for crt (BGC encoding for staphyloxanthin) found in chromosome:
compareBGCtoGenomeCodonUsage.py -b Sw_LK413/c00001_NZ_JALX...region001.gbk -g Sw_LK413/LK413.gbk -p -o crt_in_chromosome.txt
# run compareBGCtoGenomeCodonUsage for crt (BGC encoding for staphyloxanthin) found in plasmid:
compareBGCtoGenomeCodonUsage.py -b Sw_LK413/c00002_NZ_JALX...region001.gbk -g Sw_LK413/LK413.gbk -p -o crt_in_plasmid.txt
The results of crt_in_chromosome.txt
should be:
Cosine_Distance 0.004000
Empirical_Pvalue 0.230000
Spearman_Rho 0.989000
Spearman_Pvalue 0.000000
GCF_Codons AAA, AAC, AAG, AAT, ACA, ACC, ACG, ACT, AGA, AGC, AGG, AGT, ATA, ATC, ATG, ATT, CAA, CAC, CAG, CAT, CCA, CCC, CCG, CCT, CGA, CGC, CGG, CGT, CTA, CTC, CTG, CTT, GAA, GAC, GAG, GAT, GCA, GCC, GCG, GCT, GGA, GGC, GGG, GGT, GTA, GTC, GTG, GTT, TAA, TAC, TAG, TAT, TCA, TCC, TCG, TCT, TGA, TGC, TGG, TGT, TTA, TTC, TTG, TTT
GCF_Codon_Frequencies 768, 178, 158, 507, 381, 42, 124, 200, 146, 55, 11, 195, 236, 238, 354, 654, 522, 59, 56, 229, 250, 24, 40, 148, 56, 33, 3, 133, 124, 42, 32, 140, 629, 171, 107, 581, 420, 81, 133, 298, 192, 131, 64, 503, 323, 135, 150, 331, 34, 108, 9, 402, 271, 29, 31, 159, 3, 13, 120, 74, 692, 192, 136, 405
Background_Codon_Frequencies 43824, 9285, 8580, 31551, 20161, 2457, 5549, 12786, 9313, 3054, 703, 12462, 13197, 11998, 18709, 35717, 28244, 3360, 3013, 13872, 11636, 935, 1883, 8646, 3218, 1693, 248, 9275, 6537, 1976, 1215, 7724, 40947, 9119, 6862, 33984, 20076, 3810, 5235, 15247, 10203, 6350, 2843, 24522, 17227, 5605, 6841, 17741, 1822, 5458, 343, 21929, 14672, 1550, 1929, 10604, 228, 808, 5314, 3430, 40801, 9128, 6797, 22231
And the results of crt_in_plasmid.txt
should be:
Cosine_Distance 0.021000
Empirical_Pvalue 0.042000
Spearman_Rho 0.967000
Spearman_Pvalue 0.000000
GCF_Codons AAA, AAC, AAG, AAT, ACA, ACC, ACG, ACT, AGA, AGC, AGG, AGT, ATA, ATC, ATG, ATT, CAA, CAC, CAG, CAT, CCA, CCC, CCG, CCT, CGA, CGC, CGG, CGT, CTA, CTC, CTG, CTT, GAA, GAC, GAG, GAT, GCA, GCC, GCG, GCT, GGA, GGC, GGG, GGT, GTA, GTC, GTG, GTT, TAA, TAC, TAG, TAT, TCA, TCC, TCG, TCT, TGA, TGC, TGG, TGT, TTA, TTC, TTG, TTT
GCF_Codon_Frequencies 425, 74, 102, 337, 138, 16, 47, 93, 74, 32, 17, 125, 172, 60, 167, 240, 213, 32, 34, 133, 80, 14, 17, 80, 25, 12, 1, 53, 48, 16, 18, 56, 240, 67, 75, 279, 130, 23, 51, 122, 94, 66, 14, 172, 126, 57, 68, 144, 11, 58, 5, 223, 103, 17, 14, 102, 6, 8, 60, 27, 274, 56, 75, 197
Background_Codon_Frequencies 44167, 9389, 8636, 31721, 20404, 2483, 5626, 12893, 9385, 3077, 697, 12532, 13261, 12176, 18896, 36131, 28553, 3387, 3035, 13968, 11806, 945, 1906, 8714, 3249, 1714, 250, 9355, 6613, 2002, 1229, 7808, 41336, 9223, 6894, 34286, 20366, 3868, 5317, 15423, 10301, 6415, 2893, 24853, 17424, 5683, 6923, 17928, 1845, 5508, 347, 22108, 14840, 1562, 1946, 10661, 225, 813, 5374, 3477, 41219, 9264, 6858, 22439
So we see the cosine distance between the BGC codon usage profile and the background genome is higher in the plasmid instance of crt (which we describe in our manuscript on lsaBGC to correspond to a more recently acquired version of GCF_6) than for the chromosome instance of crt. Further, through calculating an empirical P-value with the -p
argument, we see the chances of observing as large a disagreement in the codon distribution of the plasmid BGC to the expected genome-wide distribution is <0.05 which is not the case for the chromosomal BGC.
If of interest or if you have ideas for improvement, please let me know your thoughts in the Issues or feel free to email me or reach me on Twitter/Mastodon!
To calculate an empirical P-value, we gather codons for all codons for each gene across the genome. First the codon frequency distribution of the BGC genes is compared to the codon frequency distribution of the background genome (all other genes). After, we perform 10,000 simulations where in each simulation we shuffle the full genome-wide set of genes and go through the first N genes until the same number of codons as are present in the focal BGC/GCF are observed. The cosine distance between the observed codon frequency is compared to the remainder of the genome-wide codon distribution and checked for whether it is higher than what was actually observed for the BGC; if so, then an empirical P-value counter is appended a count of 1. The final empirical P-value produced is simply this count plus a pseudocount of 1 over 10,001.
usage: compareBGCtoGenomeCodonUsage.py [-h] -g FULL_GENOME_GENBANK -b BGC_GENBANKS [BGC_GENBANKS ...] -o OUTPUT [-p]
Program: compareBGCtoGenomeCodonUsage.py
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
This program compares the codon-usage distribution of a BGC (provided as a Genbank) to the codon usage of the
background genome (gathered from a full-genome Genbank file - also an output of antiSMASH). It will report
the cosine distance and Spearman correlation between the two profiles. Only ORFs which are of length divisible
by 3 will be considered. It can accept multiple BGC Genbanks in the case that the BGC is fragmented across
multiple scaffolds.
Only works for bacterial genomes currently.
optional arguments:
-h, --help show this help message and exit
-g FULL_GENOME_GENBANK, --full_genome_genbank FULL_GENOME_GENBANK
Path to annotated full-genome Genbank for isolate's genome.
-b BGC_GENBANKS [BGC_GENBANKS ...], --bgc_genbanks BGC_GENBANKS [BGC_GENBANKS ...]
Path to BGC Genbank(s) for isolate. Locus tags must match with tags in full-genome Genbank.
-o OUTPUT, --output OUTPUT
Path to output file.
-p, --compute_empirical_pval
Compute empirical P-value for observed cosine distance being as different than expected to the background codon usage distribution.