Skip to content

tutorial saccharomyces

Kamil S. Jaron edited this page May 20, 2020 · 5 revisions

with hindsight

The default smudgeplot was misleading. There was a smudge on an unexpected place, and I thought it's because of too low cutoff (parameter L), but the problem was somewhere else. The sneaky part in this dataset was that there is no AB smudge and to figure that the "mysterious smudge" is just a problem of 1n coverage estimate and incorrect smudge annotation as a consequence. Once smudgeplot is told what is the expected monoploid coverage, everything starts to make sense.

Intro to the funky yeast

There is an open issue asking about a strange smudgeplot of supposedly tetraploid saccharomyces [Knaus et al 2018].

strange_smudgeplot

Looking at the posted smudgeplot - yeah, it's really weird. Beside the AB smudge, there is also a one more ~ at the L * cov line (the error line). I am suspecting that the L was too low, but I got to investigate a tiny bit if I am right.

The analysis

Get the data

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_2.fastq.gz

Use kmc to get a kmer database (kmer_counts)

mkdir tmp logs # create a directory for temporary files
ls SRR3265401_1.fastq.gz SRR3265401_2.fastq.gz > FILES # create a file with both the raw read files
kmc -k21 -t16 -m64 -ci1 -cs10000 @FILES kmer_counts tmp # run kmc

I suspected that the L is too low, hence we should look at the kmer histogram. I fit there a default diploid GenomeScope model, because I more care about the shape of the histogram than anything else at this point

kmc_tools transform kmer_counts histogram kmer_k21.hist -cx10000
genomescope.R -i kmer_k21.hist -k 21 -p 2 -o . -n Scer_genomescope

strange_genomescope

Alright. We see three peaks, ~15x, ~30x and ~60x. So the only question is if the smallest peak the 1n of the genome? I also expect that choosing L = 15 and high U (3000) would generate a smudgeplot similar to the one posted

kmc_dump -ci15 -cx3000 kmer_counts kmer_k21_L15.dump
smudgeplot.py hetkmers -o kmer_pairs_L15 < kmer_k21.dump
smudgeplot.py plot -o saccharomyces_L15 -t "yeast" kmer_pairs_L15_coverages.tsv

and indeed

replicated_smudgeplot

Alright, what is going on? The 2n estimate is ~35x (but looking the smudge behind the annotation, it seems it's slightly less). Therefore the AA and AB smudges are (a bit less than) ~70x. Hence they probably correspond to pairing of 15x peak with 45x peak in the AA case and kmers from 30x peak with itself. Interesting enough, there is no smudge ~30x, where one would expect one made of kmer pairs between 1n peak, which is the main reason why the smudgeplot was confused with the annotation of smudges. So perhaps the smudge is real, but misannotated and the ~15x is a real genomic peak. So if we fit a tetraploid model with a coverage prior = 16 (it will converge on the best fitting value)

genomescope.R -i kmer_k21.hist -k 21 -p 4 -o . -l 16 -n Scer_genomescope_p4

then we will get a decent fit. The genome size is approximately right (14Mbp vs expected 12Mbp) and at least the spacing of the individual peaks seems to be exactly matching the tetraploid model

strange_genomescope2

The coverage has converged to ~17x haploid coverage. Lets then redo the smudgeplot, while enforcing what we think is a haploid coverage

smudgeplot.py plot -o saccharomyces_L15_n17 -n 17 -t "yeast" kmer_pairs_coverages.tsv

and there you go...

reannotated_smudgeplot

we get a tetraploid smudgeplot.