Wrong ploidy

Kamil S. Jaron edited this page Feb 14, 2023 · 2 revisions

Acer pseudoplatanus

The final and most complex example is the sycamore maple (Acer pseudoplatanus), also sequenced as part of the Darwin Tree of Life project. You can take a look at the default GenomeScope model on the linked tolqc webpage:

drAcePsed1 k31_transformed_linear_plot

  1. There are a few red flags in this spectrum that might indicate poor convergence. Can you spot them?
What's wrong with the model?

First of all, the sequencing was done with PacBio HiFi, which is known for its low error rates. However, the model estimates an extremely high error rate (0.447%) for this spectrum. Additionally, we can visually assess the fit of the error model (highlighted in orange) and see that it does not align well with the actual data (the blue bars).

  2. Fit a better model

We can test whether there is a better fit by helping the model converge on the first peak (at ~10x coverage) as the heterozygous (1n) peak. The k-mer spectrum is available here; the commands in this section assume you have GenomeScope installed.
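Before choosing a prior, it can help to confirm where the first peak sits in the histogram. The following is a minimal Python sketch, not part of GenomeScope. It assumes the histogram is whitespace-separated two-column text (coverage, number of distinct k-mers). The function names `parse_hist` and `first_peak` and the toy numbers are made up for illustration.

```python
from typing import List, Tuple


def parse_hist(text: str) -> List[Tuple[int, int]]:
    """Parse two-column histogram text into (coverage, count) pairs."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            rows.append((int(parts[0]), int(parts[1])))
    return rows


def first_peak(hist: List[Tuple[int, int]]) -> int:
    """Return the coverage of the first local maximum after the error valley."""
    counts = [count for _, count in hist]
    i = 0
    # Walk down the low-coverage error tail until counts stop decreasing.
    while i + 1 < len(counts) and counts[i + 1] < counts[i]:
        i += 1
    # Climb out of the valley until counts stop increasing.
    while i + 1 < len(counts) and counts[i + 1] >= counts[i]:
        i += 1
    return hist[i][0]


# Hypothetical toy spectrum: an error tail, then a peak at coverage 10.
TOY_HIST = """\
1 900
2 300
3 80
4 60
5 90
6 150
7 260
8 400
9 520
10 580
11 540
12 430
13 300
"""

print(first_peak(parse_hist(TOY_HIST)))
```

On real data, small bumps in the counts can stop this walk early, so treat the result as a starting value to compare against the plot, not as a replacement for looking at it.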

Check which options of GenomeScope (`genomescope.R --help`) make sense to tweak and try to fit the model with different parameters.

The first thing to try is to specify a prior on the k-mer coverage (-l) that is close to what we expect, here about 10x.

genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10 -l 10 -n drAcePsed1

In the majority of cases this works. This time it did not.

drAcePsed1_linear_plot

While the model did converge on the right coverage, by default it excludes some of the low-coverage k-mers from the fit, which caused the error rate to be overestimated at the expense of an underestimated heterozygosity.

One way around this is to limit the fit to a single round with --num_rounds=1. TODO: explain --num_rounds=1

genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10_round1 -l 10 --num_rounds=1 -n drAcePsed1

drAcePsed1_linear_plot

This model looks a lot better, but the genome size is still not what we expect. Furthermore, Acer species are quite often tetraploid. This is what the fit looks like if we specify tetraploidy (-p 4).

genomescope.R -i drAcePsed1.k31.hist.txt -o drAcePsed1_genomescope_l10_round1_p4 -l 10 --num_rounds=1 -p 4 -n drAcePsed1

drAcePsed1_linear_plot

Ha, this is finally a model that predicts the genome size in the right ballpark and also fits the data nicely. TODO: finish conclusions here.
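To see why the assumed ploidy matters so much for the genome size estimate, here is a back-of-the-envelope Python sketch. It is an approximation for intuition only, not GenomeScope's actual fitting procedure. It assumes that the haploid genome size is roughly the total number of non-error k-mers divided by (ploidy × 1n coverage). The function name, the `error_cutoff` parameter and the toy histogram are hypothetical.

```python
from typing import List, Tuple


def haploid_genome_size(
    hist: List[Tuple[int, int]],
    kcov_1n: float,
    ploidy: int,
    error_cutoff: int,
) -> float:
    """Approximate haploid size: non-error k-mer total / (ploidy * 1n coverage)."""
    # Total k-mer occurrences, ignoring bins below the assumed error cutoff.
    total_kmers = sum(cov * count for cov, count in hist if cov >= error_cutoff)
    return total_kmers / (ploidy * kcov_1n)


# Hypothetical toy histogram of (coverage, number of distinct k-mers).
TOY = [(1, 1000), (10, 200), (20, 300), (40, 500)]

# Same data and same 1n coverage, different assumed ploidy.
diploid = haploid_genome_size(TOY, kcov_1n=10, ploidy=2, error_cutoff=5)
tetraploid = haploid_genome_size(TOY, kcov_1n=10, ploidy=4, error_cutoff=5)
print(diploid, tetraploid)
```

Under this approximation, keeping the 1n coverage fixed and doubling the ploidy halves the haploid size estimate, so a wrong ploidy can put the estimate far from the expected value even when the coverage is right.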

TODO: Katie, please check you can reproduce this.
