Kamil S Jaroň edited this page Aug 1, 2020 · 2 revisions

A script that reads a file with kmer pair sequences and a file with their corresponding coverages, and prints to standard output, in FASTA format, all the kmers that match the user's specification. It can be thought of as extracting the kmer pairs that fall within a user-defined rectangle in the smudgeplot.

For example, to extract the core kmer pairs of the AAB smudge in the smudgeplot shown in the README file, you could run this module with the parameters -minc 500 -maxc 700 -minr 0.3 -maxr 0.367, and it would sub-select the kmers falling in the following rectangle:

[Figure: smudge_extract_example]
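The rectangle selection can be sketched in a few lines of Python. This is only an illustration, not smudgeplot's own code: the function names `in_rectangle` and `select_pairs` are hypothetical, the coverages are made-up values, the minor ratio is assumed to be the smaller coverage divided by the summed coverage, and the bounds are assumed to be inclusive.

```python
def in_rectangle(cov_low, cov_high, minc, maxc, minr, maxr):
    """Check whether one kmer pair falls inside the user-defined rectangle.

    Assumptions (not confirmed by the tool's documentation):
    the ratio is cov_low / (cov_low + cov_high) and all bounds are inclusive.
    """
    total = cov_low + cov_high
    if total == 0:
        # A pair with no coverage cannot have a ratio; treat it as outside.
        return False
    ratio = cov_low / total
    return minc <= total <= maxc and minr <= ratio <= maxr


def select_pairs(coverages, minc, maxc, minr, maxr):
    """Return the 0-based indices of the pairs that fall inside the rectangle."""
    selected = []
    for index, (cov_a, cov_b) in enumerate(coverages):
        # Order the two coverages so the smaller one is the "minor" kmer.
        cov_low, cov_high = min(cov_a, cov_b), max(cov_a, cov_b)
        if in_rectangle(cov_low, cov_high, minc, maxc, minr, maxr):
            selected.append(index)
    return selected


# Made-up coverages for four kmer pairs.
coverages = [(200, 400), (300, 300), (100, 700), (210, 420)]
print(select_pairs(coverages, minc=500, maxc=700, minr=0.3, maxr=0.367))
# → [0, 3]
```

With the bounds from the example above, only the pairs whose summed coverage lies between 500 and 700 and whose minor ratio is close to 1/3 are kept, which is what would be expected for an AAB smudge.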

The header of each kmer has the following format: >kmer_<INDEX>_<1/2>_<COV>. <INDEX> is the 0-based position of the kmer's pair in the kmer pair file; <1/2> identifies the kmer within the pair (1 is the kmer with the smaller coverage, 2 is the one with the higher coverage); and <COV> is the frequency of the kmer in the original read set.
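As a rough illustration of the header format, the sketch below builds the two FASTA records for one kmer pair and parses a header back into its fields. The helper names `fasta_records` and `parse_header` are hypothetical and are not part of smudgeplot; the sequences and coverages are made-up values.

```python
def fasta_records(index, kmer_low, kmer_high, cov_low, cov_high):
    """Format one kmer pair as two FASTA records.

    Member 1 is the kmer with the smaller coverage, member 2 the one
    with the higher coverage, following the header description above.
    """
    return (
        f">kmer_{index}_1_{cov_low}\n{kmer_low}\n"
        f">kmer_{index}_2_{cov_high}\n{kmer_high}\n"
    )


def parse_header(header):
    """Split a header such as '>kmer_0_1_200' into (index, member, coverage)."""
    _, index, member, cov = header.lstrip(">").strip().split("_")
    return int(index), int(member), int(cov)


# Made-up example pair: two 4-mers differing in the last base.
print(fasta_records(0, "ACGT", "ACGA", 200, 400), end="")
print(parse_header(">kmer_0_1_200"))
# → (0, 1, 200)
```

Parsing the header this way makes it possible to group the two kmers of a pair by their shared index, for example after mapping them to a reference.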

See the wiki page on mapping these kmers using bwa.

Usage

usage: smudgeplot extract [-h] -cov COVERAGEFILE -seq SEQFILE -minc COUNTMIN
                          -maxc COUNTMAX -minr RATIOMIN -maxr RATIOMAX > extracted_kmer_pairs.fasta

Extract kmer pairs within specified coverage sum and minor coverage ratio
ranges.

optional arguments:
  -h, --help            show this help message and exit
  -cov COVERAGEFILE, --coverageFile COVERAGEFILE
                        coverage file for the kmer pairs
  -seq SEQFILE, --seqFile SEQFILE
                        sequences of the kmer pairs
  -minc COUNTMIN, --countMin COUNTMIN
                        lower bound of the summed coverage
  -maxc COUNTMAX, --countMax COUNTMAX
                        upper bound of the summed coverage
  -minr RATIOMIN, --ratioMin RATIOMIN
                        lower bound of minor allele ratio
  -maxr RATIOMAX, --ratioMax RATIOMAX
                        upper bound of minor allele ratio