-
Notifications
You must be signed in to change notification settings - Fork 23
tutorial saccharomyces
The default smudgeplot was misleading. There was a smudge on an unexpected place, and I thought it's because of too low cutoff (parameter L), but the problem was somewhere else. The sneaky part in this dataset was that there is no AB smudge and to figure that the "mysterious smudge" is just a problem of 1n coverage estimate and incorrect smudge annotation as a consequence. Once smudgeplot is told what is the expected monoploid coverage, everything starts to make sense.
There is an open issue asking about a strange smudgeplot of supposedly tetraploid saccharomyces [Knaus et al 2018].
Looking at the posted smudgeplot - yeah, it's really weird. Beside the AB smudge, there is also a one more ~ at the L * cov line (the error line). I am suspecting that the L was too low, but I got to investigate a tiny bit if I am right.
Get the data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_[12].fastq.gz
Use kmc
to get a kmer database (kmer_counts
)
mkdir tmp logs # create a directory for temporary files
ls SRR3265401_1.fastq.gz SRR3265401_2.fastq.gz > FILES # create a file with both the raw read files
kmc -k21 -t16 -m64 -ci1 -cs10000 @FILES kmer_counts tmp # run kmc
I suspected that the L is too low, hence we should look at the kmer histogram. I fit there a default diploid GenomeScope model, because I more care about the shape of the histogram than anything else at this point
kmc_tools transform kmer_counts histogram kmer_k21.hist -cx10000
genomescope.R -i kmer_k21.hist -k 21 -p 2 -o . -n Scer_genomescope
Alright. We see three peaks, ~15x, ~30x and ~60x. So the only question is if the smallest peak the 1n of the genome? I also expect that choosing L = 15 and high U (3000) would generate a smudgeplot similar to the one posted
kmc_dump -ci15 -cx3000 kmer_counts kmer_k21_L15.dump
smudgeplot.py hetkmers -o kmer_pairs_L15 < kmer_k21.dump
smudgeplot.py plot -o saccharomyces_L15 -t "yeast" kmer_pairs_L15_coverages.tsv
and indeed
Alright, what is going on? The 1n estimate is ~35x (but looking the smudge behind the annotation, it seems it's slightly less). Therefore the AA and AB smudges are (a bit less than) ~70x. Hence they probably correspond to pairing of 15x peak with 45x peak in the AA case and kmers from 30x peak with itself. Interesting enough, there is no smudge around 30x, where one would expect one made of kmer pairs within 1n peak. The absence of pairs within the 1n peak is actually the main reason why the smudgeplot was confused with the annotation of smudges. So perhaps the smudge is real, but misannotated and the ~15x is a real genomic peak. So if we fit a tetraploid model with a coverage prior = 16 (it will converge on the best fitting value)
genomescope.R -i kmer_k21.hist -k 21 -p 4 -o . -l 16 -n Scer_genomescope_p4
then we will get a decent fit. The genome size is approximately right (14Mbp vs expected 12Mbp) and at least the spacing of the individual peaks seems to be exactly matching the tetraploid model
The coverage has converged to ~17x haploid coverage. Lets then redo the smudgeplot, while enforcing what we think is a haploid coverage
smudgeplot.py plot -o saccharomyces_L15_n17 -n 17 -t "yeast" kmer_pairs_coverages.tsv
and there you go...
we get a tetraploid smudgeplot.