Skip to content

Estimating ploidy

Kamil S. Jaron edited this page Feb 9, 2023 · 1 revision

Just like for the other k-mer analyses, the input is a set of whole genome sequencing reads, the more coverage the better. Give KMC all the files with reads to calculate kmer frequencies and generate a histogram of kmers:

cd $USERWORK

module purge
module load Miniconda3/4.9.2
conda init bash
source ~/.bashrc

conda activate /cluster/projects/nn9458k/oh_know/.conda/kmer_tools


# we will again use the same kmc scripts
cp /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/*sh .

Just like for the other k-mer analyses, the input is a set of whole genome sequencing reads, the more coverage the better. The usage is shown here using tbenavi1/KMC and GenomeScope. Give KMC all the files with sequencing reads to calculate kmer frequencies and generate a histogram of kmers:

mkdir -p kmer_dbs
ls /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/*fastq.gz > sach_files # create a file with both the raw read files

# build database - if you feel confortable, modify the scripts and try a different k
sbatch ./build_KMC_db.sh sach_files kmer_dbs/scer_SRR3265401

# as usual, we want the histogram as well
sbatch ./get_histogram.sh kmer_dbs/scer_SRR3265401 scer_SRR3265401_k21.hist

# and fit the default genomescope model
sbatch fit_genomescope.sh scer_SRR3265401_k21.hist genomescope_out SRR3265401 2

The next step is to extract genomic kmers using reasonable coverage thresholds. You can either inspect the kmer spectra and choose the L (lower) and U (upper) coverage thresholds via visual inspection, or you can estimate them using command smudgeplot.py cutoff <kmcdb_k21.hist> <L/U>. This job is so small you can run it on front end.

L=$(smudgeplot.py cutoff scer_SRR3265401_k21.hist L)
U=$(smudgeplot.py cutoff scer_SRR3265401_k21.hist U)
echo $L $U

These need to be sane values, look at the histogram, do they cut out the errors while preserving as many genomic kmers as possible? In this case it should be fine.

sbatch create_kmer_dump_kmc.sh kmer_dbs/scer_SRR3265401 $L $U SRR3265401_k21.dump

you can check the head SRR3265401_k21.dump to see how the dump file looks like (it's the same principle as with the jellyfish). Now, let's find find all the k-mer pairs using smudgeplot!

mkdir -p scer_smudgeplot
sbatch smudgeplot_pairs.sh SRR3265401_k21.dump scer_smudgeplot/scer_k21_L"$L"

Finally, we can plot a smudgeplot using.

sbatch smudgeplot_plot.sh scer_smudgeplot/scer_k21_L"$L"_coverages.tsv scer_smudgeplot/scer

Smudgeplot generates two plots, one with coloration on a log scale and the other on a linear scale. The legend indicates approximate kmer pairs per tile densities. In this case, it just does not look great, it seems that the error kmer pairs have messed up the 1n coverage estimate. You can inform smudgeplot just like genomescope about the coverage by specifying parameter -n. In this case the solution is to specify -n 17, but to do that, you would have to modify it in the script or start an interactive session and run the smudgeplot plotting script directly. This second step and solutions are in this tutorial.

other values of L

Note that smudgeplots are very sensitive to vlues of L, if chosen too low, erros will hinder the plot, if too high, important genomic signals are missed. Although we provide a way to pick a sensible L, it does not always work and you should always look at the kmer spectra and decide L for yourself! The point is to separate the error peak and the genomic k-mers.

Hints and notes

There are more Smudgeplot tutorials provided in the manual of the tool: strawberry genome analysis. If you are interested in running Jellyfish instead of KMC, look at the manual of smudgeplot with Jellyfish.

speed-up

The pythonic version for seatching for heterozygous kmers (smudgeplot.py hetkmers) is rather slow, there is a faster version written in C called smudge_pairs, available at tbenavi1/KMC. But to run it you need to compile it. After compiling this version of KMC, smudge_pairs will be in the bin directory. The input is then directly the kmer database.

kmc_tools transform kmcdb -ci"$L" -cx"$U" reduce kmcdb_L"$L"_U"$U"
smudge_pairs kmcdb_L"$L"_U"$U" kmcdb_L"$L"_U"$U"_coverages.tsv kmcdb_L"$L"_U"$U"_pairs.tsv > kmcdb_L"$L"_U"$U"_familysizes.tsv

Table of content

Introduction

k-mer spectra analysis

Separation of chromosomes

Species assignment using short k-mers

Others

Clone this wiki locally