Wrong ploidy
The final and most complex example is the sycamore maple tree (Acer pseudoplatanus), also sequenced as part of the Darwin Tree of Life project. You can take a look at the default GenomeScope model on the linked tolqc webpage:
- There are a few red flags in this spectrum that might indicate poor convergence. Can you spot them?
What's wrong with the model?
First of all, the sequencing was done with PacBio HiFi, which is known for its low error rates. However, the model for this spectrum estimates an extremely high error rate (0.447%). Additionally, we can visually assess the fit of the error model (highlighted in orange) and see that it does not align well with the actual data (the blue bars).
- Fit a better model
We can test whether there is a better fit by helping the model to converge on the first peak (at ~10x coverage) as the heterozygous (1n) peak. The k-mer spectrum is available here; the rest of this example assumes you have GenomeScope installed.
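If you want to double-check where the peaks are before choosing a value, a rough sketch like the one below lists local maxima in the histogram. It assumes a two-column, whitespace-separated file (coverage, then count) sorted by coverage. The function name `find_peaks` and the default cutoff of 5x for ignoring error k-mers are illustrative choices, not part of GenomeScope.

```shell
#!/usr/bin/env bash
# Sketch: list the strongest local maxima in a k-mer histogram.
# Assumed input: two whitespace-separated columns (coverage, count),
# sorted by increasing coverage. Names and defaults here are hypothetical.

find_peaks() {
    local hist="$1"           # histogram file
    local min_cov="${2:-5}"   # ignore coverages below this (mostly error k-mers)
    local top="${3:-5}"       # report at most this many peaks

    awk -v min_cov="$min_cov" '
        # Compare the previous row with its two neighbours.
        NR >= 3 && prev_cov >= min_cov && prev_n > prev2_n && prev_n >= $2 {
            print prev_cov "\t" prev_n
        }
        { prev2_n = prev_n; prev_cov = $1; prev_n = $2 }
    ' "$hist" | sort -k2,2nr | head -n "$top"   # tallest peaks first
}
```

Calling `find_peaks drAcePsed1.k31.hist.txt` would print coverage and count for the tallest local maxima. Real spectra are noisy, so treat the output as candidates to compare against the plot rather than as a definitive answer.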
Check which options of GenomeScope (`genomescope.R --help`) make sense to tweak and try to fit the model with different parameters.
The first thing to try is to specify a prior (`-l`) that is approximately what we would expect.
genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10 -l 10 -n drAcePsed1
In the majority of cases this works. It did not this time.
While the model did converge on the right coverage, by default it excludes some of the low-coverage k-mers, which caused the error rate to be overestimated at the expense of an underestimated heterozygosity.
TODO explain --num_rounds=1
genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10_round1 -l 10 --num_rounds=1 -n drAcePsed1
This model looks a lot better, but it is not really the genome size we expect. Furthermore, Acer species are quite often tetraploid. This is what the fit looks like if we specify tetraploidy.
genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10_round1_p4 -l 10 --num_rounds=1 -p 4 -n drAcePsed1
Ha, this is finally a model that predicts the genome size in the right ballpark and also fits the data nicely. TODO: finish conclusions here
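To see why the ploidy setting moves the genome size estimate so much, a back-of-the-envelope calculation can help. The sketch below assumes the haploid genome size is roughly the total number of non-error k-mer occurrences divided by (ploidy × 1n coverage). This is a simplification and not GenomeScope's fitted estimate; the function name `naive_genome_size` and the 5x error cutoff are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: naive haploid genome size from a k-mer histogram.
# Assumed input: two whitespace-separated columns (coverage, count).
# Assumed formula: size ~ sum(coverage * count) / (ploidy * 1n coverage),
# skipping low-coverage rows that are mostly sequencing errors.

naive_genome_size() {
    local hist="$1"           # histogram file
    local kcov="$2"           # 1n k-mer coverage, e.g. 10
    local ploidy="$3"         # e.g. 2 or 4
    local min_cov="${4:-5}"   # hypothetical cutoff for error k-mers

    awk -v kcov="$kcov" -v p="$ploidy" -v min_cov="$min_cov" '
        $1 >= min_cov { total += $1 * $2 }   # total k-mer occurrences kept
        END { printf "%.0f\n", total / (p * kcov) }
    ' "$hist"
}
```

With the same 1n coverage, `naive_genome_size drAcePsed1.k31.hist.txt 10 4` gives half the value of `naive_genome_size drAcePsed1.k31.hist.txt 10 2`, because this estimate scales inversely with the assumed ploidy.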
TODO: Katie, please check you can reproduce this.