
Matching k‐mers to a reference

Kamil S. Jaron edited this page Mar 22, 2024 · 1 revision

3. Matching k-mers to a genome

Here we want to find exact matches of k-mers in the genome assembly. We can tweak the fast and popular mapper bwa mem to do so (thanks to Daniel Standage for the tip) by setting both the minimum seed length -k (the initial exact match required to consider a mapping) and the minimum score to output -T to the value of k we used. With bwa's default match score of 1, only a full-length perfect match of a k-mer reaches a score of k. We also specify -a because we want all the matches (although we don't expect to see many of those). Finally, -c skips all seeds (k-mers) with more hits than the value specified (5000 in the example).

bwa mem -k <k> -T <k> -a -c 5000 <assembly.fasta> <kmers.fasta>

The latest Glossina assembly has been downloaded from NCBI and is located here: /cluster/projects/nn9458k/oh_know/teachers/kamil/data/tse-tse/ncbi-genomes-2021-09-12. The mapped bams can be summarised into a table of scaffolds with the number of k-mers from each category. To do that we will use another home-made script, bams2table.py:

python3 bams2table.py <A_kmers_mapped.bam> <X_kmers_mapped.bam> <Y_kmers_mapped.bam> <output_per_scf_table.tsv>
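To give an idea of what such a summary involves, below is a minimal sketch of the counting logic. It is not the actual bams2table.py; the function names are made up, and it works on SAM text lines (for example the output of samtools view -h) rather than on BAM files directly, to keep it free of dependencies. Every mapped record counts as one hit, so with -a a k-mer that maps to several places is counted once per place.

```python
from collections import Counter


def count_hits_per_scaffold(sam_lines):
    """Count mapped records per reference name in an iterable of SAM lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):   # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:        # malformed or empty line
            continue
        flag = int(fields[1])
        if flag & 0x4:             # 0x4 = read (here: k-mer) is unmapped
            continue
        counts[fields[2]] += 1     # field 3 (RNAME) is the scaffold name
    return counts


def per_scaffold_table(a_sam, x_sam, y_sam):
    """Return rows (scaffold, A_kmers, X_kmers, Y_kmers), sorted by scaffold name."""
    per_category = [count_hits_per_scaffold(s) for s in (a_sam, x_sam, y_sam)]
    scaffolds = sorted(set().union(*per_category))
    return [(scf, *(c[scf] for c in per_category)) for scf in scaffolds]


# Toy records, invented for illustration only
a_sam = ["@SQ\tSN:scf1\tLN:1000",
         "kmer1\t0\tscf1\t5\t60\t27M\t*\t0\t0\t*\t*",
         "kmer2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*"]
x_sam = ["kmer3\t16\tscf2\t9\t60\t27M\t*\t0\t0\t*\t*",
         "kmer4\t0\tscf1\t40\t0\t27M\t*\t0\t0\t*\t*"]
y_sam = []

for row in per_scaffold_table(a_sam, x_sam, y_sam):
    print("\t".join(str(value) for value in row))
```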

Now <output_per_scf_table.tsv> contains the names of all scaffolds and the number of chromosome-specific k-mers mapped to each of them. Just by looking at the table... Q: Does the assembly contain the Y? Q: What is the longest clearly X-linked scaffold? The signal is usually clear, but hardly ever perfect. Why do so many scaffolds have k-mers from all three chromosome categories mapped to them? Can we find any scaffold that is probably chimeric?

If there is enough time, we can plot the properties of the mapped k-mers, decide which scaffolds can be clearly assigned to chromosomes, and actually generate a table of assignments. Do you have an idea how to get the Y chromosome too?
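One possible way to turn the per-scaffold counts into assignments is a simple majority rule, sketched below. The function name and both thresholds (at least 10 mapped k-mers, at least 90% of them from one category) are arbitrary assumptions chosen for illustration; sensible values should come from looking at the plots first.

```python
def assign_scaffold(a, x, y, min_kmers=10, min_fraction=0.9):
    """Assign a scaffold to A, X or Y from its k-mer counts.

    Hypothetical rule: thresholds are placeholders, not recommended values.
    """
    counts = {"A": a, "X": x, "Y": y}
    total = a + x + y
    if total < min_kmers:
        return "unassigned"        # too few k-mers to say anything
    best = max(counts, key=counts.get)
    if counts[best] / total >= min_fraction:
        return best                # one category clearly dominates
    return "ambiguous"             # mixed signal, possibly chimeric


# Toy counts, invented for illustration only
print(assign_scaffold(950, 20, 0))
print(assign_scaffold(3, 1, 0))
print(assign_scaffold(400, 500, 5))
```

Scaffolds that come out as ambiguous despite having many mapped k-mers are the natural candidates to inspect when looking for chimeric scaffolds.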
