-
Notifications
You must be signed in to change notification settings - Fork 9
Estimating ploidy
Just like for the other k-mer analyses, the input is a set of whole genome sequencing reads, the more coverage the better. Give KMC all the files with reads to calculate kmer frequencies and generate a histogram of kmers:
cd $USERWORK
module purge
module load Miniconda3/4.9.2
conda init bash
source ~/.bashrc
conda activate /cluster/projects/nn9458k/oh_know/.conda/kmer_tools
# we will again use the same kmc scripts
cp /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/*sh .
Just like for the other k-mer analyses, the input is a set of whole genome sequencing reads, the more coverage the better. The usage is shown here using tbenavi1/KMC and GenomeScope. Give KMC all the files with sequencing reads to calculate kmer frequencies and generate a histogram of kmers:
mkdir -p kmer_dbs
ls /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/*fastq.gz > sach_files # create a file with both the raw read files
# build database - if you feel confortable, modify the scripts and try a different k
sbatch ./build_KMC_db.sh sach_files kmer_dbs/scer_SRR3265401
# as usual, we want the histogram as well
sbatch ./get_histogram.sh kmer_dbs/scer_SRR3265401 scer_SRR3265401_k21.hist
# and fit the default genomescope model
sbatch fit_genomescope.sh scer_SRR3265401_k21.hist genomescope_out SRR3265401 2
The next step is to extract genomic kmers using reasonable coverage thresholds. You can either inspect the kmer spectra and choose the L (lower) and U (upper) coverage thresholds via visual inspection, or you can estimate them using command smudgeplot.py cutoff <kmcdb_k21.hist> <L/U>
. This job is so small you can run it on front end.
L=$(smudgeplot.py cutoff scer_SRR3265401_k21.hist L)
U=$(smudgeplot.py cutoff scer_SRR3265401_k21.hist U)
echo $L $U
These need to be sane values, look at the histogram, do they cut out the errors while preserving as many genomic kmers as possible? In this case it should be fine.
sbatch create_kmer_dump_kmc.sh kmer_dbs/scer_SRR3265401 $L $U SRR3265401_k21.dump
you can check the head SRR3265401_k21.dump
to see how the dump file looks like (it's the same principle as with the jellyfish).
Now, let's find find all the k-mer pairs using smudgeplot!
mkdir -p scer_smudgeplot
sbatch smudgeplot_pairs.sh SRR3265401_k21.dump scer_smudgeplot/scer_k21_L"$L"
Finally, we can plot a smudgeplot using.
sbatch smudgeplot_plot.sh scer_smudgeplot/scer_k21_L"$L"_coverages.tsv scer_smudgeplot/scer
Smudgeplot generates two plots, one with coloration on a log scale and the other on a linear scale. The legend indicates approximate kmer pairs per tile densities. In this case, it just does not look great, it seems that the error kmer pairs have messed up the 1n coverage estimate. You can inform smudgeplot
just like genomescope
about the coverage by specifying parameter -n
. In this case the solution is to specify -n 17
, but to do that, you would have to modify it in the script or start an interactive session and run the smudgeplot plotting script directly. This second step and solutions are in this tutorial.
other values of L
Note that smudgeplots are very sensitive to vlues of L, if chosen too low, erros will hinder the plot, if too high, important genomic signals are missed. Although we provide a way to pick a sensible L, it does not always work and you should always look at the kmer spectra and decide L for yourself! The point is to separate the error peak and the genomic k-mers.
There are more Smudgeplot tutorials provided in the manual of the tool: strawberry genome analysis. If you are interested in running Jellyfish instead of KMC, look at the manual of smudgeplot with Jellyfish.
speed-up
The pythonic version for seatching for heterozygous kmers (smudgeplot.py hetkmers
) is rather slow, there is a faster version written in C called smudge_pairs
, available at tbenavi1/KMC. But to run it you need to compile it. After compiling this version of KMC, smudge_pairs
will be in the bin directory. The input is then directly the kmer database.
kmc_tools transform kmcdb -ci"$L" -cx"$U" reduce kmcdb_L"$L"_U"$U"
smudge_pairs kmcdb_L"$L"_U"$U" kmcdb_L"$L"_U"$U"_coverages.tsv kmcdb_L"$L"_U"$U"_pairs.tsv > kmcdb_L"$L"_U"$U"_familysizes.tsv
Introduction
k-mer spectra analysis
- 📖 Introduction to K-mer spectra analysis
- 📖 Basics of genome modeling
- ⚒ manual model fitting (for better understanding of the underlying model)
- ⚒ simple diploid
- ⚒ demonstrating the effect of sequencing error rate on k-mer coverage
- 📖 Common difficulties in characterisation of diploid genomes using k mer spectra analysis
- ⚒ low coverage (pitfall) - to be merged
- ⚒ very homozygous diploid
- ⚒ highly heterozygous diploid
- ⚒ Genome size of a repetitive genome (pitfall)
- ⚒ Wrong ploidy (pitfall)
- 📖 Characterization of polyploid genomes using k mer spectra analysis
- ⚒ Autotetraploid
- ⚒ Allotetraploid
- ⚒ Estimating ploidy (smudgeplot)
- 📖 Genome modeling as a quality control
- ⚒ Contamination (pitfall)
- ⚒ k-mers in an assembly (Mercury/KAT)
- 📖 Analysing genome skimming data
Separation of chromosomes
- 📖Separate sub-genomes of an allopolyploid
- 📖Separating chromosomes by comparison of sequencing libraries
Species assignment using short k-mers
Others