Skip to content

manual of smudgeplot with jellyfish

Kamil S. Jaron edited this page Oct 8, 2018 · 8 revisions

jellyfish is an analogous program to KMC. KMC is bit faster, but if you are already familiar with Jellywish, there is no reason to learn a new tool, smudgeplot works with Jellyfish just fine.

  1. Give jellyfish all the sequencing read files to calculate kmer frequencies and then generate a histogram of kmers
jellyfish count -C -m 21 -s 1000000000 -t 8 *.fastq -o kmer_counts.jf
jellyfish histo kmer_counts.jf > kmer_k21.hist
  1. Extract genomic kmers using reasonable coverage thresholds. You can either inspect the kmer spectra and choose the L (lower) and U (upper) coverage thresholds via visual inspection, or you can estimate them using the script kmer_cov_cutoff.R <kmer.hist> <L/U>. Then, extract kmers in the coverage range from L to U using Jellyfish dump and pipe them directly to the python script hetkmers.py to compute the set of kmer pairs.
L=$(kmer_cov_cutoff.R kmer_k21.hist L)
U=$(kmer_cov_cutoff.R kmer_k21.hist U)
echo $L $U # these need to be sane values like 30 800 or so
jellyfish dump -c -L $L -U $U kmer_counts.jf | hetkmers.py -k 21 -t 8 -o kmer_pairs
  1. After generating the list of kmer pair coverages, generate the smudgeplot using the coverages of the kmer pairs (*_coverages_2.tsv file). You can either supply the haploid kmer coverage (reported by GenomeScope) or let it be estimated directly from the data and compare it afterwards. If GenomeScope correctly identifies the peak of haploid kmers, the expected positions of the haplotype structures will overlap with high density smudges on the smudgeplot. If the overlap is not great you might consider adjusting the GenomeScope model and redoing the plot with a better estimate of the haploid coverage. Something like
sudgeplot.R -i kmer_pairs_coverages_2.tsv -o my_genome -t "Title"

will generate a basic smugeplot, the full usage is described in the help of smudgeplot smudgeplot.R -h or README.md of this repository.