Simple diploid
In the easiest case, the genome to be modeled is diploid, meaning it only contains two haplotypes. Humans fall into this category, as long as you ignore sex chromosomes in males. In the later sections we will consider more complicated cases, but for now, let's go through simple diploid cases we know a lot about.
Now that we have our k-mer histogram in the file SRR9969842.21.kmc.hist, we can start modeling! If you don't want to download the raw reads, you can download a copy of the k-mer histogram here.
Next we can use genomescope to plot the histogram and help us model the genome.
Rscript genomescope.R -i SRR9969842.21.kmc.hist -o SRR9969842_GS_OUT -k 21
Here, we use the default genomescope settings, and specify -k as 21 (since we used 21-mers for KMC as well).
The command line output should look like this:
This indicates that the heterozygosity is roughly 0.6%, with very little error, and that the genome length is 133,033,813 bp. The output directory contains more information and plots.
Later tutorials will go into more detail about how to interpret this plot and also how to build it by hand. But for now, you can observe that there is an estimated genome size, estimated heterozygosity, estimated error, etc.
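As a rough cross-check of the genome length that genomescope reports, a back-of-the-envelope estimate divides the total number of k-mer occurrences by the coverage of the main peak. The sketch below is only that, a sketch: it assumes the histogram has two whitespace-separated columns (coverage, number of distinct k-mers), that bins below a user-chosen cutoff are sequencing errors, and that the tallest remaining bin is the homozygous peak. The function name naive_genome_size and the toy numbers are made up for illustration, and this is not how genomescope fits its model.

```shell
#!/bin/sh
# naive_genome_size HIST MIN_COV   (hypothetical helper, not part of genomescope)
#   HIST    : assumed layout "coverage <whitespace> number_of_distinct_kmers"
#   MIN_COV : bins below this coverage are treated as errors and ignored
# Prints: peak_coverage <TAB> estimated_genome_length
naive_genome_size() {
    awk -v min="$2" '
        $1 >= min {
            total += $1 * $2                         # k-mer occurrences in this bin
            if ($2 > best) { best = $2; peak = $1 }  # tallest bin = main peak
        }
        END {
            if (peak > 0) printf "%d\t%.0f\n", peak, int(total / peak)
        }' "$1"
}

# Toy histogram: an error tail at low coverage and one peak at coverage 10.
toy=$(mktemp)
printf '1\t1000\n2\t100\n3\t10\n9\t20\n10\t60\n11\t20\n' > "$toy"
naive_genome_size "$toy" 5    # prints: 10<TAB>100
rm -f "$toy"
```

On the real data this could be called as naive_genome_size SRR9969842.21.kmc.hist 5, where the cutoff of 5 is a guess that should be adjusted after looking at the histogram. Expect the result to differ somewhat from the genomescope estimate, since the fitted model treats errors, heterozygous k-mers, and repeats explicitly.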
A few parameters in the genomescope command above are especially important. First, the -k parameter (here set to 21) must match the k-mer length used by KMC. Picking the ideal value of k is beyond the scope of this tutorial, but in theory the choice shouldn't matter much for modeling, because the model is given k when we run genomescope. In practice, the value of k does slightly change the results; you can think about why that is on your own time. To show how changing k alters the estimates, we can re-run the same analysis as above, once with k=16 and once with k=31.
With k=16, we end up with a kmer profile like this:
And with a k=31, we end up with a kmer profile like this:
That's not too different, right? I agree. These all give relatively similar estimates. Not the same, but similar.
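Part of the reason the profiles shift with k can be seen from two standard approximations. For reads of length L sequenced at base coverage C, the expected k-mer coverage is C * (L - k + 1) / L, so larger k lowers the coverage peak. And if heterozygous sites occur independently at rate h, the fraction of k-mers that overlap at least one of them is 1 - (1 - h)^k, so larger k moves more k-mers into the heterozygous peak. The sketch below just evaluates these two formulas. The base coverage of 50 and read length of 150 are assumptions picked for illustration, not properties of SRR9969842; only h = 0.006 comes from the estimate above. The function names are hypothetical.

```shell
#!/bin/sh
# kmer_coverage BASE_COV READ_LEN K
#   Expected k-mer coverage: C * (L - k + 1) / L
kmer_coverage() {
    awk -v c="$1" -v l="$2" -v k="$3" \
        'BEGIN { printf "%.2f\n", c * (l - k + 1) / l }'
}

# het_kmer_fraction HET_RATE K
#   Fraction of k-mers overlapping at least one heterozygous site,
#   assuming sites are independent: 1 - (1 - h)^k
het_kmer_fraction() {
    awk -v h="$1" -v k="$2" \
        'BEGIN { printf "%.4f\n", 1 - (1 - h) ^ k }'
}

# Assumed values: 50x base coverage, 150 bp reads; h = 0.006 as estimated above.
for k in 16 21 31; do
    printf 'k=%s\tkmer_coverage=%s\thet_kmer_fraction=%s\n' \
        "$k" "$(kmer_coverage 50 150 "$k")" "$(het_kmer_fraction 0.006 "$k")"
done
```

Neither formula accounts for sequencing errors or repeats, both of which also interact with k, so this is only part of the story.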
Another parameter that does impact the modeling is the maximal counter value (-cs with KMC). This is roughly equivalent to truncating the histogram at that value. Smaller -cs values can have a pretty dramatic impact on genome modeling. To see this, let's try re-running KMC with a much smaller value (-cs100).
SAMPLE=SRR9969842
mkdir -p tmp
ls "$SAMPLE".fasta > FILES
# count 21-mers, capping every counter at 100 (-cs100)
kmc -k21 -t24 -m96 -ci1 -cs100 -fa @FILES "$SAMPLE".21.cs100.kmc tmp/
# write the histogram under the same name that genomescope reads below
kmc_tools transform "$SAMPLE".21.cs100.kmc histogram "$SAMPLE".21.cs100.kmc.hist -cx255
histo=$SAMPLE.21.cs100.kmc.hist
output=$SAMPLE.cs100_gs_out
Rscript genomescope.R -i "$histo" -o "$output" -k 21
The output of the above looks like this:
The differences weren't quite as dramatic as I'd hoped, but you can still see that the genome size estimate decreases by over 16 Mbp. This would be even more evident with a more repetitive genome.
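The direction of that change can be reproduced on a toy histogram. The sketch below assumes that a capped counter simply saturates, so every k-mer seen more often than the cap is recorded at the cap; that is my reading of the maximal counter value, so treat it as an assumption. High-copy (repetitive) k-mers then contribute fewer occurrences to the total, which pulls the genome size estimate down. The helper names cap_hist and kmer_total and the toy numbers are hypothetical.

```shell
#!/bin/sh
# cap_hist HIST MAX
#   Emulate a saturating counter: every bin above MAX is merged into bin MAX.
cap_hist() {
    awk -v max="$2" '
        { c = ($1 > max) ? max : $1; n[c] += $2 }
        END { for (c in n) printf "%d\t%d\n", c, n[c] }' "$1" | sort -n
}

# kmer_total HIST
#   Total k-mer occurrences: sum over bins of coverage * number_of_kmers.
kmer_total() {
    awk '{ t += $1 * $2 } END { printf "%.0f\n", t }' "$1"
}

# Toy histogram: 60 single-copy k-mers at coverage 10,
# plus 5 repeat k-mers at coverage 200.
full=$(mktemp)
capped=$(mktemp)
printf '10\t60\n200\t5\n' > "$full"
cap_hist "$full" 100 > "$capped"

echo "full:   $(kmer_total "$full")"     # full:   1600
echo "capped: $(kmer_total "$capped")"   # capped: 1100
rm -f "$full" "$capped"
```

With a peak at coverage 10, the naive length estimate drops from 160 to 110 in this toy case, and the loss grows with the share of k-mers sitting above the cap.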
Why would you ever want to truncate the .hist file? Because it makes the file smaller and speeds up the modeling. Those are valid reasons, but I think a few extra seconds and a little more disk space are worth it for a more accurate genome model.
To see an example of this in a human cell line, Figure 8a contains a genomescope plot for CHM13 (from Miga et al., 2020). CHM13 is a haploid human cell line used to complete the first Telomere-to-Telomere human reference. The Miga et al. paper from 2020 used GenomeScope 1.0 on 10x Genomics reads from the CHM13 cell line. As expected in an entirely haploid cell line, there was virtually no estimated heterozygosity.
Now that you have an idea of how model fitting works and have seen a few examples, you can either build a more solid understanding of the underlying principles of k-mer coverage and sequencing errors, or look into trickier genomes in the next sections, where we will provide examples of genomes with very different characteristics.
👆 Go back to Table of Contents
👉 ⚒ For a deeper understanding of how coverage works in genome profiling, check this tutorial demonstrating the effect of sequencing error rate on k-mer coverage.
👉 📖 Read about some of the challenges faced when modeling diploid genomes: Common difficulties in characterisation of diploid genomes using k-mer spectra analysis