-
Notifications
You must be signed in to change notification settings - Fork 23
manual of smudgeplot with jellyfish
This is a manual using smudgeplot v0.2.1
Previous versions:
- version v0.1.3 - the corresponding version of this page.
- version v0.2.0 - the corresponding version of this page.
jellyfish is an analogous program to KMC.
KMC is faster and more versatile tool, but a basic smudgeplot of a small genome Jellyfish will do the job just fine.
There are two possible algorithms to extract kmer pairs (the task called hetkmers), all
and middle
.
The default algorithm all
extracts kmer pairs that differ by a single nucleotide on any position (we consider all positions, therefore all
).
The way faster alternative extracts only kmers that differ by the middle nucleotide (therefore middle
) and requires the dump file on the input to be sorted. You can sort the dump file using for instance bash sort, but I would discourage you to do that as this process is not optimized at all for nucleotide sequences, use KMC instead. Therefore here we show usage of smudgeplot
with all
algorithm.
- Give jellyfish all the sequencing read files to calculate kmer frequencies and then generate a histogram of kmers
jellyfish count -C -m 21 -s 1000000000 -t 8 *.fastq -o kmer_counts.jf
jellyfish histo kmer_counts.jf > kmer_k21.hist
- Extract genomic kmers using reasonable coverage thresholds. You can either inspect the kmer spectra and choose the L (lower) and U (upper) coverage thresholds via visual inspection, or you can estimate them using the script
smudgeplot.py cutoff <kmer.hist> <L/U>
. Then, extract kmers in the coverage range fromL
toU
usingJellyfish dump
and pipe them directly tosmudgeplot.py hetkmers
to compute the set of kmer pairs.
L=$(smudgeplot.py cutoff kmer_k21.hist L)
U=$(smudgeplot.py cutoff kmer_k21.hist U)
echo $L $U # these need to be sane values like 30 800 or so
jellyfish dump -c -L $L -U $U kmer_counts.jf | smudgeplot.py hetkmers -o kmer_pairs
# note that if you would like use --middle flag, you would have to sort the jellyfish dump first
- After generating the list of kmer pair coverages, generate the smudgeplot using the coverages of the kmer pairs (
*_coverages.tsv
file). You can either supply the haploid kmer coverage (reported by GenomeScope) or let it be estimated directly from the data and compare it afterwards. If GenomeScope correctly identifies the peak of haploid kmers, the expected positions of the haplotype structures will overlap with high-density smudges on the smudgeplot. If the overlap is not great you might consider adjusting the GenomeScope model and redoing the plot with a better estimate of the haploid coverage. Something like
smudgeplot.py plot kmer_pairs_coverages_2.tsv -o my_genome
will generate a basic smugdeplot, the full usage is described in the help of smudgeplot smudgeplot.py plot -h
.