Wrong ploidy

Acer pseudoplatanus

The final and most complex example is the sycamore maple (Acer pseudoplatanus), also sequenced as part of the Darwin Tree of Life project. You can take a look at the default GenomeScope model from the linked tolqc webpage:

drAcePsed1 k31_transformed_linear_plot

There are a few red flags in this spectrum that might indicate poor convergence. Can you spot them?

What's wrong with the model?

First of all, the sequencing was done with PacBio HiFi, which is known for its low error rates, yet this model estimates a very high error rate (0.447%). Additionally, we can visually assess the fit of the error model (highlighted in orange) and see that it does not align well with the actual data (the blue bars).

Fit a better model

We can test whether there is a better fit by helping the model to converge on the first peak (at ~10x coverage) as the heterozygous (1n) peak. The k-mer spectrum is available here; the commands below assume you have GenomeScope installed.
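If you would rather read the peak positions from the histogram than from the plot, a few lines of awk are enough. This is only a minimal sketch: it assumes the histogram has two whitespace-separated columns (coverage, count), and both the function name `find_peaks` and its window parameter are made up for this illustration; they are not part of GenomeScope.

```bash
#!/usr/bin/env bash
# find_peaks HIST [WINDOW]
# Hypothetical helper: prints "coverage count" for every local maximum found
# after the initial low-coverage error tail of a k-mer histogram.
find_peaks() {
    local hist="$1" window="${2:-3}"
    awk -v w="$window" '
        # keep only rows that start with an integer coverage value
        $1 ~ /^[0-9]+$/ { n++; cov[n] = $1 + 0; cnt[n] = $2 + 0 }
        END {
            # 1) skip the error tail: walk down until the counts start rising
            valley = 1
            while (valley < n && cnt[valley + 1] <= cnt[valley]) valley++
            # 2) report rows that are the maximum within +/- w rows;
            #    the last w rows are skipped (often a cumulative bin)
            for (i = valley + 1; i <= n - w; i++) {
                is_peak = 1
                for (j = i - w; j <= i + w; j++) {
                    if (j < valley || j == i) continue
                    if ((j < i && cnt[j] >= cnt[i]) || (j > i && cnt[j] > cnt[i])) {
                        is_peak = 0
                        break
                    }
                }
                if (is_peak) print cov[i], cnt[i]
            }
        }' "$hist"
}
```

Calling `find_peaks drAcePsed1.k31.hist.txt` lists candidate peaks; the coverage of the first one is a reasonable starting value for the coverage prior. Real spectra are noisy, so a larger window (for example `find_peaks drAcePsed1.k31.hist.txt 5`) may be needed to suppress spurious maxima.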

Check which options of GenomeScope (`genomescope.R --help`) make sense to tweak and try to fit the model with different parameters.

The first thing to try is to specify a prior for the k-mer coverage (-l) that is approximately what we would expect.

genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10 -l 10 -n drAcePsed1

In the majority of cases this works. This time it did not.

drAcePsed1_linear_plot

While the model did converge on the right coverage, by default GenomeScope excludes some of the low-coverage k-mers from the fit, which caused the error rate to be overestimated at the expense of an underestimated heterozygosity.

TODO explain --num_rounds=1

genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10_round1 -l 10 --num_rounds=1 -n drAcePsed1

drAcePsed1_linear_plot

This model looks a lot better, but that is not really the genome size we expect. Furthermore, Acer species are quite often tetraploid. This is what the fit looks like if we specify tetraploidy (-p 4).

genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10_round1_p4 -l 10 --num_rounds=1 -p 4 -n drAcePsed1

drAcePsed1_linear_plot

Ha, this is finally a model that predicts a genome size in the right ballpark and also fits the data nicely. TODO: finish conclusions here.
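To build some intuition for why the ploidy setting changes the size estimate so much: as far as we understand the model, the haploid genome length is roughly the number of non-error k-mers divided by (ploidy × 1n coverage). The sketch below does that arithmetic directly on the histogram. It is a crude approximation, not what GenomeScope actually runs: the function name `rough_genome_size` is made up for this illustration, the histogram is assumed to have two columns (coverage, count), and a hard coverage cutoff (`min_cov`, also hypothetical) stands in for the fitted error model.

```bash
#!/usr/bin/env bash
# rough_genome_size HIST LAMBDA PLOIDY [MIN_COV]
# Hypothetical helper: back-of-the-envelope haploid genome size,
#   sum(coverage * count) / (LAMBDA * PLOIDY),
# ignoring k-mers with coverage below MIN_COV as presumed errors.
rough_genome_size() {
    local hist="$1" lambda="$2" ploidy="$3" min_cov="${4:-5}"
    awk -v l="$lambda" -v p="$ploidy" -v m="$min_cov" '
        # add up the k-mers of every row at or above the cutoff
        $1 ~ /^[0-9]+$/ && ($1 + 0) >= (m + 0) { total += $1 * $2 }
        END { printf "%.0f\n", total / (l * p) }
    ' "$hist"
}
```

With the same 1n coverage, `rough_genome_size drAcePsed1.k31.hist.txt 10 2` and `rough_genome_size drAcePsed1.k31.hist.txt 10 4` differ by exactly a factor of two, so a wrong ploidy assumption alone can double or halve the estimated haploid genome size.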

TODO: Katie, please check you can reproduce this.

Table of contents

Introduction

Concept of k-mers

k-mer spectra analysis

📖 Introduction to K-mer spectra analysis
- ⚒ Generating k-mer spectra tutorial
📖 Basics of genome modeling
- ⚒ manual model fitting (for better understanding of the underlying model)
- ⚒ simple diploid
- ⚒ demonstrating the effect of sequencing error rate on k-mer coverage
📖 Common difficulties in characterisation of diploid genomes using k-mer spectra analysis
- ⚒ low coverage (pitfall) - to be merged
- ⚒ very homozygous diploid
- ⚒ highly heterozygous diploid
- ⚒ Genome size of a repetitive genome (pitfall)
- ⚒ Wrong ploidy (pitfall)
📖 Characterization of polyploid genomes using k-mer spectra analysis
- ⚒ Autotetraploid
- ⚒ Allotetraploid
- ⚒ Estimating ploidy (smudgeplot)
📖 Genome modeling as a quality control
- ⚒ Contamination (pitfall)
- ⚒ k-mers in an assembly (Merqury/KAT)
📖 Analysing genome skimming data

Separation of chromosomes

📖Separate sub-genomes of an allopolyploid
📖Separating chromosomes by comparison of sequencing libraries
- ⚒ Extracting sex chromosome k-mers from a male and female sample
- ⚒ Extract k-mers specific to germ-line restricted chromosomes
- ⚒ Matching k-mers to a reference (bwa-mem)
- ⚒ Matching k-mers to sequencing reads (cookiecutter)

Species assignment using short k-mers

📖Identifying haplotypes within targeted amplicon sequencing datasets
- ⚒ Performing species assignment from targeted amplicon sequencing data

Others

🖥️ Installation of the kmer_tools conda environment
📖 Other k-mer resources
