Simple diploid

In the easiest case, the genome to be modeled is diploid, meaning it only contains two haplotypes. Humans fall into this category, as long as you ignore sex chromosomes in males. In the later sections we will consider more complicated cases, but for now, let's go through simple diploid cases we know a lot about.

Fruit fly

Now that we have our kmer histogram in this file SRR9969842.21.kmc.hist we can start modeling! If you don't want to download the raw reads, you can download a copy of the kmer histogram here.

Next we can use genomescope to plot the histogram and help us model the genome.

Rscript genomescope.R -i SRR9969842.21.kmc.hist -o SRR9969842_GS_OUT -k 21

Here, we use the default genomescope settings, and specify -k as 21 (since we used 21-mers for KMC as well).

The command line output should look like this: Screen Shot 2023-01-27 at 11 08 24

Indicating that the heterozygosity is roughly 0.6% with very little error and a genome length of 133,033,813bp. In the output directory we have more information and plots.

linear_plot

Later tutorials will go into more detail about how to interpret this plot and also how to build it by hand. But for now, you can observe that there is an estimated genome size, estimated heterozygosity, estimated error, etc.

There are a few parameters in the above example that I'll highlight as especially important. First of all, the -k parameter (in this case set to 21) is the length of k used by KMC. Picking a perfect value of k is not within the scope of this tutorial, but thankfully this theoretically shouldn't matter too much when modeling the genome. That's why we provide the model with the length of k when we run genomescope. On a practical note, the value of k does slightly change the results, but you can think about why that is on your own time. To show how changing k can slightly alter the estimates we can re-run the same test as above, but this time with k=16, and then again with k=31.

With k=16, we end up with a kmer profile like this: linear_plot

And with a k=31, we end up with a kmer profile like this: linear_plot

That's not too different, right? I agree. These all give relatively similar estimates. Not the same, but similar.

Another parameter that does impact the modeling is the maximal counter value (-cx with KMC). This is roughly equivalent to truncating the histogram at that value. Smaller -cs values can have a pretty dramatic impact on genome modeling. To see this, lets try re-running KMC with a much smaller value (-cs100).

SAMPLE=SRR9969842
mkdir -p tmp

ls "$SAMPLE".fasta > FILES
kmc -k21 -t24 -m96 -ci1 -cs100 -fa @FILES $SAMPLE.21.cs100.kmc tmp/

kmc_tools transform $SAMPLE.21.cs100.kmc histogram $SAMPLE.21.cs255.kmc.hist -cx255

histo=$SAMPLE.21.cs100.kmc.hist
output= $SAMPLE.cs100_gs_out
Rscript genomescope.R -i "$histo" -o "$output" -k 21

The output of the above looks like this: linear_plot

The differences weren't quite as dramatic as I'd hoped, but you can still see that the genome size estimate decreases by over 16mbp. This would be even more evident with a more repetitive genome.

Why would you ever want to truncate the .hist file?? Because it makes the file smaller and speeds up the modeling. All valid reasons, but I think it's worth a few extra seconds and a little bit more space for more accurate genome models.

Human

To see an example of this in a human cell line, Figure 8a contains a genomescope plot for CHM13 (from Miga et al., 2020). CHM13 is a haploid human cell line used to complete the first Telomere-to-Telomere human reference. The Miga et al. paper from 2020 used GenomeScope 1.0 on 10x Genomics reads from the CHM13 cell line. As expected in an entirely haploid cell line, there was virtually no estimated heterozygosity.

What's next

Now that you have got an idea of how model fitting works and have seen a few examples, you can either work on more solid understanding underlying principles of k-mer coverage and sequencing errors or look into trickier genomes in the next sections, where will provide examples of genomes with very different characteristics.

👆 Go back to Table of Content

👉 ⚒ For deeper understanding of how coverage works in genome profiling, check this tutorial demonstrating the effect of sequencing error rate on k mer coverage.

👉 📖 Read about some of the challenges faced when modeling diploid genomes Common difficulties in characterisation of diploid genomes using k mer spectra analysis