Low coverage

This type of kmer analysis is based on the number of times different k-mers occur in a set of sequencing reads. But in order to accurately estimate features based off of the kmer counts, we need pretty good coverage. Exactly how good is pretty good depends on both the specific sample characteristics and the sequencing technology. For example, a higher sequencing error rate means we need higher coverage to compensate. Higher ploidy also requires higher coverage to distinguish the many different peaks.

As a general rule, about 15x coverage of the haploid genome is sufficient.

Mercurialis

The second example is Mercurialis, a nice plant from the spurge family with the monoploid genome size expected to be around 640Mbp [Obbard et al. 2006]. When the genome was originally published[Veltsos et al. 2019], it contained the default GenomeScope run as follow

mecurialis_linear_plot

1. What do you think is wrong with this model?

What happened here is that the model struggled to converge, because the error, heterozygous and homozygous peaks are blended. If you see a model fit to this-looking k-mer spectra, you should keep in mind it there might be a convergence problem and the best guide is the genome size - which in this particular case is 1/2 of the expectation.

Now that we see GenomeScope probably missed the heterozygosity peak, we would like to help the model to converge. That we can do by specifying a prior for 1n coverage (GenomeScope calls this parameter lambda).

2. Download the mecurialis 31-mer spectra and fit there genomescope model with an appropriate coverage prior.

The coverage prior is specified by parameter -l. So the genomescope run will be...

genomescope.R -i fastq_countsM1_31.histo -o mecurialis_genomescope_l9 -l 9

linear_plot

So, first, this model is a lot closer to the expected genome size, I do think we are getting close. But there is one issue here. By default, GenomeScope discards the several low-coverage bins for the purpose of better convergence. That however makes sometimes estimates problematic in cases when those coverages contain lots of true genomic k-mers, like in this case.

To overwrite GenomeScope behavior and get this model to estimate the hetrozygosity from the low coverages too, we need to add --num_rounds 1 flag.

3. Fit the model so it includes low coverages too.

genomescope.R -i fastq_countsM1_31.histo -o mecurialis_genomescope_l9_r1/ -l9 --num_rounds 1

linear_plot

Finally, this looks like a model I can get onboard with. As a matter of fact, this (kind of) model was published as a correction of the original analysis. For the record, the analysis back then used a bit different flags because back then the software was still under development.

Beetle

Here we have an example from the Darwin Tree of Life database, of the Agriotes lineatus beetle. When we initially look at the kmer spectra it appears we have the recommended 15x coverage. However, the model is clearly not fitted properly and we cannot distinuish any peaks.

icAgrLine1 k31_linear_plot

By examining the transformed plot, where the y-axis is frequency*coverage rather than just frequency, we can see that this sample actually has far less coverage than was initially visible.

icAgrLine1 k31_transformed_linear_plot