Wrong ploidy

Acer pseudoplatanus

The final and most complex example is the sycamore maple (Acer pseudoplatanus), also sequenced as part of the Darwin Tree of Life project. You can take a look at the default GenomeScope model from the linked tolqc webpage:

drAcePsed1 k31_transformed_linear_plot

There are a few red flags in this spectrum that might indicate poor convergence. Can you spot them?

What's wrong with the model?

First of all, the sequencing was done with PacBio HiFi, which is known for its low error rates. However, the model fitted to this spectrum reports an unusually high expected error rate (0.447%). Additionally, we can visually assess the fit of the error model (highlighted in orange) and see that it does not align well with the actual data (the blue bars).

Fit a better model

We can test whether there is a better fit by helping the model converge on the first peak (at ~10x coverage) as the heterozygous (1n) peak. The k-mer spectrum is available here; the commands in this section assume you have GenomeScope installed.
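If you would rather locate that first peak programmatically than read it off the plot, a minimal sketch could look like the following. It assumes the histogram is a plain two-column text file (coverage, count), which is the usual output of k-mer counters but is not confirmed here; `read_hist`, `first_peak` and the `toy` data are hypothetical names and values made up for illustration. Real spectra are noisy, so treat the result as a starting guess for the coverage prior rather than a definitive answer.

```python
def read_hist(path):
    """Read a two-column (coverage, count) histogram; the format is an assumption."""
    hist = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                hist.append((int(fields[0]), int(fields[1])))
    return hist


def first_peak(hist):
    """Return (coverage, count) of the first peak after the error tail.

    hist: list of (coverage, count) pairs sorted by increasing coverage.
    """
    counts = [count for _, count in hist]
    i = 0
    # Walk down the initial error slope to the first local minimum.
    while i + 1 < len(counts) and counts[i + 1] < counts[i]:
        i += 1
    # Climb from that minimum to the next local maximum.
    while i + 1 < len(counts) and counts[i + 1] >= counts[i]:
        i += 1
    return hist[i]


# Invented toy spectrum: an error tail, a 1n peak at 10x and a larger peak at 20x.
toy = [(1, 900), (2, 300), (3, 80), (4, 30), (5, 40), (6, 70), (7, 120),
       (8, 180), (9, 230), (10, 250), (11, 235), (12, 190), (13, 150),
       (14, 140), (15, 160), (16, 210), (17, 280), (18, 340), (19, 390),
       (20, 400), (21, 380), (22, 330)]

print(first_peak(toy))  # (10, 250)
```

On the real data you could call `first_peak(read_hist("drAcePsed1.k31.hist.txt"))` and compare the result with the peak you see in the plot.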

Check which GenomeScope options (`genomescope.R --help`) make sense to tweak, and try fitting the model with different parameters.

The first thing to try is to specify a prior for the 1n coverage (-l) that is close to what we expect, here about 10x.

genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10 -l 10 -n drAcePsed1

In the majority of cases this works. This time it did not.

drAcePsed1_linear_plot

While the model did converge on the right coverage, by default it excludes some of the low-coverage k-mers. As a result, the error rate is overestimated at the expense of an underestimated heterozygosity.

genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10 -l 10 -n drAcePsed1

drAcePsed1_linear_plot

This model looks a lot better, but it does not give the genome size we expect. Furthermore, Acer species are quite often tetraploid. This is what the fit looks like if we specify tetraploidy (-p 4).

genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10_p4 -l 10 -p 4 -n drAcePsed1

drAcePsed1_linear_plot

Ha, this is finally a model that predicts the genome size in the right ballpark and also fits the data nicely. This is why it is always important to gather all available information about our species of interest. In this case we could detect that it was a tetraploid from the k-mer spectrum and the genome model fit, but it is always good to know in advance whether our species could be a polyploid (or a hybrid)!
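To see why the assumed ploidy has such a strong effect on the predicted genome size, a back-of-the-envelope calculation may help. The sketch below assumes the common approximation that the haploid genome size is roughly the total number of genomic k-mers divided by (ploidy × 1n coverage). It is not GenomeScope's actual fitting procedure; `haploid_genome_size` is a hypothetical helper and the numbers are invented.

```python
def haploid_genome_size(hist, kcov, ploidy, min_cov=1):
    """Approximate haploid genome size from a k-mer histogram.

    Assumption: total genomic k-mers ~ haploid size * ploidy * 1n coverage.
    hist: list of (coverage, count) pairs.
    kcov: 1n k-mer coverage.
    min_cov: coverages below this are treated as errors and ignored.
    """
    total_kmers = sum(cov * count for cov, count in hist if cov >= min_cov)
    return total_kmers / (ploidy * kcov)


# Invented toy histogram with peaks at 10x, 20x and 40x.
toy_hist = [(10, 100), (20, 200), (40, 50)]

# Same data and same 1n coverage, different assumed ploidy.
print(haploid_genome_size(toy_hist, kcov=10, ploidy=2))  # 350.0
print(haploid_genome_size(toy_hist, kcov=10, ploidy=4))  # 175.0
```

Under this approximation, going from a diploid to a tetraploid model at the same 1n coverage halves the haploid genome size estimate, which is the kind of shift that can move a prediction into or out of the expected range.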

What's next

This was our last example of common problems with model fitting. Next, we will be looking into using k-mers to further characterise polyploids using smudgeplots.

👆 Go back to Table of Contents

👉 📖 Read about Characterization of polyploid genomes using k-mer spectra analysis.