Choosing L and U
Parameters L and U determine the lower and upper coverage thresholds for k-mers that will be considered genomic k-mers. Approximate estimates can be made with the smudgeplot cutoff function, but there is nothing wrong with eyeballing them directly from the k-mer spectrum (and very often that gives a better estimate).
The most important bit is to choose L so that there are few (ideally no) error k-mers in the dataset. Pick a value as high as you can, but make sure you do not cut off your monoploid k-mers (the first bump should be at least partially included).
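To make the eyeballing a bit more concrete, here is a minimal sketch (not part of smudgeplot) that reads a k-mer histogram and reports the valley after the error peak and the top of the first bump. It assumes a two-column text histogram (coverage, number of distinct k-mers), which is an assumption about your k-mer counter's output; the function names are made up for illustration. There is no smoothing, so on a noisy spectrum the result can be off and you should still look at the plot.

```python
def parse_histogram(text):
    """Parse whitespace-separated 'coverage count' lines into (int, int) pairs."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            pairs.append((int(fields[0]), int(fields[1])))
    return sorted(pairs)


def first_valley_and_peak(hist):
    """Return (valley_coverage, peak_coverage) for a sorted k-mer histogram.

    The valley is where counts stop falling after the error peak;
    the peak is where they stop rising again (the first bump).
    Hypothetical helper: a naive scan without any smoothing.
    """
    covs = [cov for cov, _ in hist]
    counts = [count for _, count in hist]

    # Walk down the error peak until counts stop decreasing.
    i = 0
    while i + 1 < len(counts) and counts[i + 1] < counts[i]:
        i += 1

    # Climb from the valley until counts start decreasing again.
    j = i
    while j + 1 < len(counts) and counts[j + 1] >= counts[j]:
        j += 1

    return covs[i], covs[j]


# Toy spectrum: error k-mers at low coverage, first bump around coverage 9.
toy = parse_histogram(
    "1 1000\n2 400\n3 100\n4 30\n5 20\n6 35\n7 80\n8 150\n9 200\n10 180\n11 120\n12 60"
)
valley, peak = first_valley_and_peak(toy)
print(valley, peak)
```

Following the advice above, a reasonable L lies somewhere between the valley and the peak: above the valley to drop error k-mers, and below the peak so the first bump is still partially included.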
Perhaps less important than L, you might want to exclude super-repetitive k-mers (like mtDNA or k-mers from centromeres/telomeres) from your analysis. These k-mers usually have enormous coverage, so U can go up to several thousand without a big problem.
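A quick way to see how much a given pair of thresholds actually removes is to compute the fraction of distinct k-mers whose coverage falls within the window. The sketch below is illustrative only: the function name is hypothetical, the histogram is a toy list of (coverage, count) pairs, and treating both bounds as inclusive is an assumption rather than a statement about how smudgeplot applies L and U.

```python
def fraction_kept(hist, lower, upper):
    """Fraction of distinct k-mers with lower <= coverage <= upper.

    hist is a list of (coverage, count) pairs.
    Inclusive bounds are an assumption made for this sketch.
    """
    total = sum(count for _, count in hist)
    kept = sum(count for cov, count in hist if lower <= cov <= upper)
    return kept / total if total else 0.0


# Toy histogram: 50 error k-mers, 45 genomic k-mers, 5 ultra-repetitive k-mers.
toy_hist = [(1, 50), (10, 30), (20, 15), (5000, 5)]
print(fraction_kept(toy_hist, 5, 1000))
```

If raising U from, say, 1000 to several thousand barely changes this fraction, the exact value of U is unlikely to matter much for your dataset.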
I am actually considering removing this argument and exploring whether ultra-repetitive k-mers really are a problem (we thought they might be, so we kicked them out, but we never actually checked).
TODO add a couple of examples of kmer spectra with appropriate L and U