-
Notifications
You must be signed in to change notification settings - Fork 9
Low coverage
This type of kmer analysis is based on the number of times different k-mers occur in a set of sequencing reads. But in order to accurately estimate features based off of the kmer counts, we need pretty good coverage. Exactly how good is pretty good depends on both the specific sample characteristics and the sequencing technology. For example, a higher sequencing error rate means we need higher coverage to compensate. Higher ploidy also requires higher coverage to distinguish the many different peaks.
As a general rule, about 15x coverage of the haploid genome is sufficient.
The second example is Mercurialis, a nice plant from the spurge family with the monoploid genome size expected to be around 640Mbp [Obbard et al. 2006]. When the genome was originally published[Veltsos et al. 2019], it contained the default GenomeScope run as follow
1. What do you think is wrong with this model?
What happened here is that the model struggled to converge, because the error, heterozygous and homozygous peaks are blended. If you see a model fit to this-looking k-mer spectra, you should keep in mind it there might be a convergence problem and the best guide is the genome size - which in this particular case is 1/2 of the expectation.
Now that we see GenomeScope probably missed the heterozygosity peak, we would like to help the model to converge. That we can do by specifying a prior for 1n coverage (GenomeScope calls this parameter lambda).
2. Download the mecurialis 31-mer spectra and fit there genomescope model with an appropriate coverage prior.
The coverage prior is specified by parameter -l
. So the genomescope run will be...
genomescope.R -i fastq_countsM1_31.histo -o mecurialis_genomescope_l9 -l 9
So, first, this model is a lot closer to the expected genome size, I do think we are getting close. But there is one issue here. By default, GenomeScope discards the several low-coverage bins for the purpose of better convergence. That however makes sometimes estimates problematic in cases when those coverages contain lots of true genomic k-mers, like in this case.
To overwrite GenomeScope behavior and get this model to estimate the hetrozygosity from the low coverages too, we need to add --num_rounds 1
flag.
3. Fit the model so it includes low coverages too.
genomescope.R -i fastq_countsM1_31.histo -o mecurialis_genomescope_l9_r1/ -l9 --num_rounds 1
Finally, this looks like a model I can get onboard with. As a matter of fact, this (kind of) model was published as a correction of the original analysis. For the record, the analysis back then used a bit different flags because back then the software was still under development.
Here we have an example from the Darwin Tree of Life database, of the Agriotes lineatus beetle. When we initially look at the kmer spectra it appears we have the recommended 15x coverage. However, the model is clearly not fitted properly and we cannot distinuish any peaks.
By examining the transformed plot, where the y-axis is frequency*coverage rather than just frequency, we can see that this sample actually has far less coverage than was initially visible.
The first defined peak occurs at ~7x coverage, which is likely not enough to clearly distinguish the heterozygous and homozygous kmer peaks.
Let's look into other examples, next a very homozygous diploid genome.
👆 Go back to Table of Content.
👉 ⚒ Adjust the model for a highly homozygous (diploid) genome here.
👉 📖 Read about Characterization of polyploid genomes using k mer spectra analysis.
Introduction
k-mer spectra analysis
- 📖 Introduction to K-mer spectra analysis
- 📖 Basics of genome modeling
- ⚒ manual model fitting (for better understanding of the underlying model)
- ⚒ simple diploid
- ⚒ demonstrating the effect of sequencing error rate on k-mer coverage
- 📖 Common difficulties in characterisation of diploid genomes using k mer spectra analysis
- ⚒ low coverage (pitfall) - to be merged
- ⚒ very homozygous diploid
- ⚒ highly heterozygous diploid
- ⚒ Genome size of a repetitive genome (pitfall)
- ⚒ Wrong ploidy (pitfall)
- 📖 Characterization of polyploid genomes using k mer spectra analysis
- ⚒ Autotetraploid
- ⚒ Allotetraploid
- ⚒ Estimating ploidy (smudgeplot)
- 📖 Genome modeling as a quality control
- ⚒ Contamination (pitfall)
- ⚒ k-mers in an assembly (Mercury/KAT)
- 📖 Analysing genome skimming data
Separation of chromosomes
- 📖Separate sub-genomes of an allopolyploid
- 📖Separating chromosomes by comparison of sequencing libraries
Species assignment using short k-mers
Others