Skip to content

18. Measuring codon usage dissimilarity of BGCs to their background genomes

Rauf Salamzade edited this page Dec 1, 2022 · 8 revisions

We provide a simple program, compareBGCtoGenomeCodonUsage.py which can measure the dissimilarity of a BGC to its background genome as a comparable measure of HGT. Inputs are a BGC Genbank (or multiple Genbanks if the user believes it is split into multiple pieces due to assembly fragmentation) and a full-genome Genbank. Both are produced by antiSMASH:

Example

# Download assembly for Staphylococcus warneri st. LK413 / p3-SID855
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/025/145/625/GCF_025145625.2_ASM2514562v2/GCF_025145625.2_ASM2514562v2_genomic.fna.gz
gunzip GCF_025145625.2_ASM2514562v2_genomic.fna.gz

# Run antiSMASH in environment with antiSMASH available (not part of lsaBGC conda environment!)
antismash --genefinding-tool prodigal --output-dir Sw_LK413 --output-basename LK413 --minimal GCF_025145625.2_ASM2514562v2_genomic.fna

# with lsaBGC conda env activated
# run compareBGCtoGenomeCodonUsage for crt (BGC encoding for staphyloxanthin) found in chromosome:
compareBGCtoGenomeCodonUsage.py -b Sw_LK413/c00001_NZ_JALX...region001.gbk -g Sw_LK413/LK413.gbk -p -o crt_in_chromosome.txt

# run compareBGCtoGenomeCodonUsage for crt (BGC encoding for staphyloxanthin) found in plasmid:
compareBGCtoGenomeCodonUsage.py -b Sw_LK413/c00002_NZ_JALX...region001.gbk -g Sw_LK413/LK413.gbk -p -o crt_in_plasmid.txt

The results of crt_in_chromosome.txt should be:

Cosine_Distance	0.004000
Empirical_Pvalue	0.230000
Spearman_Rho	0.989000
Spearman_Pvalue	0.000000
GCF_Codons	AAA, AAC, AAG, AAT, ACA, ACC, ACG, ACT, AGA, AGC, AGG, AGT, ATA, ATC, ATG, ATT, CAA, CAC, CAG, CAT, CCA, CCC, CCG, CCT, CGA, CGC, CGG, CGT, CTA, CTC, CTG, CTT, GAA, GAC, GAG, GAT, GCA, GCC, GCG, GCT, GGA, GGC, GGG, GGT, GTA, GTC, GTG, GTT, TAA, TAC, TAG, TAT, TCA, TCC, TCG, TCT, TGA, TGC, TGG, TGT, TTA, TTC, TTG, TTT
GCF_Codon_Frequencies	768, 178, 158, 507, 381, 42, 124, 200, 146, 55, 11, 195, 236, 238, 354, 654, 522, 59, 56, 229, 250, 24, 40, 148, 56, 33, 3, 133, 124, 42, 32, 140, 629, 171, 107, 581, 420, 81, 133, 298, 192, 131, 64, 503, 323, 135, 150, 331, 34, 108, 9, 402, 271, 29, 31, 159, 3, 13, 120, 74, 692, 192, 136, 405
Background_Codon_Frequencies	43824, 9285, 8580, 31551, 20161, 2457, 5549, 12786, 9313, 3054, 703, 12462, 13197, 11998, 18709, 35717, 28244, 3360, 3013, 13872, 11636, 935, 1883, 8646, 3218, 1693, 248, 9275, 6537, 1976, 1215, 7724, 40947, 9119, 6862, 33984, 20076, 3810, 5235, 15247, 10203, 6350, 2843, 24522, 17227, 5605, 6841, 17741, 1822, 5458, 343, 21929, 14672, 1550, 1929, 10604, 228, 808, 5314, 3430, 40801, 9128, 6797, 22231

And the results of crt_in_plasmid.txt should be:

Cosine_Distance	0.021000
Empirical_Pvalue	0.042000
Spearman_Rho	0.967000
Spearman_Pvalue	0.000000
GCF_Codons	AAA, AAC, AAG, AAT, ACA, ACC, ACG, ACT, AGA, AGC, AGG, AGT, ATA, ATC, ATG, ATT, CAA, CAC, CAG, CAT, CCA, CCC, CCG, CCT, CGA, CGC, CGG, CGT, CTA, CTC, CTG, CTT, GAA, GAC, GAG, GAT, GCA, GCC, GCG, GCT, GGA, GGC, GGG, GGT, GTA, GTC, GTG, GTT, TAA, TAC, TAG, TAT, TCA, TCC, TCG, TCT, TGA, TGC, TGG, TGT, TTA, TTC, TTG, TTT
GCF_Codon_Frequencies	425, 74, 102, 337, 138, 16, 47, 93, 74, 32, 17, 125, 172, 60, 167, 240, 213, 32, 34, 133, 80, 14, 17, 80, 25, 12, 1, 53, 48, 16, 18, 56, 240, 67, 75, 279, 130, 23, 51, 122, 94, 66, 14, 172, 126, 57, 68, 144, 11, 58, 5, 223, 103, 17, 14, 102, 6, 8, 60, 27, 274, 56, 75, 197
Background_Codon_Frequencies	44167, 9389, 8636, 31721, 20404, 2483, 5626, 12893, 9385, 3077, 697, 12532, 13261, 12176, 18896, 36131, 28553, 3387, 3035, 13968, 11806, 945, 1906, 8714, 3249, 1714, 250, 9355, 6613, 2002, 1229, 7808, 41336, 9223, 6894, 34286, 20366, 3868, 5317, 15423, 10301, 6415, 2893, 24853, 17424, 5683, 6923, 17928, 1845, 5508, 347, 22108, 14840, 1562, 1946, 10661, 225, 813, 5374, 3477, 41219, 9264, 6858, 22439

So we see the cosine distance between the BGC codon usage profile and the background genome is higher in the plasmid instance of crt (which we describe in our manuscript on lsaBGC to correspond to a more recently acquired version of GCF_6) than for the chromosome instance of crt. Further, through calculating an empirical P-value with the -p argument, we see the chances of observing as large a disagreement in the codon distribution of the plasmid BGC to the expected genome-wide distribution is <0.05 which is not the case for the chromosomal BGC.

Algorithm for computing empirical p-value

If of interest or if you have ideas for improvement, please let me know your thoughts in the Issues or feel free to email me or reach me on Twitter/Mastodon!

To calculate an empirical P-value, we gather codons for all codons for each gene across the genome. First the codon frequency distribution of the BGC genes is compared to the codon frequency distribution of the background genome (all other genes). After, we perform 10,000 simulations where in each simulation we shuffle the full genome-wide set of genes and go through the first N genes until the same number of codons as are present in the focal BGC/GCF are observed. The cosine distance between the observed codon frequency is compared to the remainder of the genome-wide codon distribution and checked for whether it is higher than what was actually observed for the BGC; if so, then an empirical P-value counter is appended a count of 1. The final empirical P-value produced is simply this count plus a pseudocount of 1 over 10,001.

image

Usage

usage: compareBGCtoGenomeCodonUsage.py [-h] -g FULL_GENOME_GENBANK -b BGC_GENBANKS [BGC_GENBANKS ...] -o OUTPUT [-p]

	Program: compareBGCtoGenomeCodonUsage.py
	Author: Rauf Salamzade
	Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
		
	This program compares the codon-usage distribution of a BGC (provided as a Genbank) to the codon usage of the 
	background genome (gathered from a full-genome Genbank file - also an output of antiSMASH). It will report
	the cosine distance and Spearman correlation between the two profiles. Only ORFs which are of length divisible 
	by 3 will be considered. It can accept multiple BGC Genbanks in the case that the BGC is fragmented across 
	multiple scaffolds.
	
	Only works for bacterial genomes currently.
	

optional arguments:
  -h, --help            show this help message and exit
  -g FULL_GENOME_GENBANK, --full_genome_genbank FULL_GENOME_GENBANK
                        Path to annotated full-genome Genbank for isolate's genome.
  -b BGC_GENBANKS [BGC_GENBANKS ...], --bgc_genbanks BGC_GENBANKS [BGC_GENBANKS ...]
                        Path to BGC Genbank(s) for isolate. Locus tags must match with tags in full-genome Genbank.
  -o OUTPUT, --output OUTPUT
                        Path to output file.
  -p, --compute_empirical_pval
                        Compute empirical P-value for observed cosine distance being as different than expected to the background codon usage distribution.