Matching k‐mers to a reference

3. Matching k-mers to a genome

Here we want to find exact matches of k-mers in the genome assembly. We can adapt the fast and popular mapper bwa mem for this (thanks to Daniel Standage for the tip) by setting both the minimum seed length -k (the initial exact match required before a mapping is considered) and the minimum alignment score to output -T to the k-mer length k we used. We also specify -a because we want all the matches to be reported (although we don't expect to see many k-mers with multiple matches). Finally, -c skips all seeds (k-mers) with more hits in the genome than the specified value (5000 in the example).

bwa mem -k <k> -T <k> -a -c 5000 <assembly.fasta> <kmers.fasta>
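Why does this give exact matches? With the default bwa mem scoring, each matching base contributes 1 to the alignment score, so a query of length k can only reach the score k required by -T if all k bases match. As a sanity check, the following minimal sketch tests whether a single SAM record describes such a full-length, mismatch-free hit. The function name `is_exact_kmer_hit` is hypothetical (it is not part of bwa or of the course scripts), and the sketch assumes the record carries the NM (edit distance) tag.

```python
def is_exact_kmer_hit(sam_line: str, k: int) -> bool:
    """Return True if a SAM record is a full-length, mismatch-free alignment.

    Hypothetical helper for illustration only; assumes the NM:i tag is present.
    """
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:          # a SAM record has 11 mandatory columns
        return False
    flag, cigar = int(fields[1]), fields[5]
    if flag & 4:                  # 0x4 = query is unmapped
        return False
    if cigar != f"{k}M":          # all k bases aligned, no indels or clipping
        return False
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:]) == 0   # edit distance of zero = no mismatches
    return False                  # no NM tag, so exactness cannot be confirmed


# Toy record for a 21-mer mapped perfectly to scaffold "scf1"
record = "kmer1\t0\tscf1\t101\t60\t21M\t*\t0\t0\t" + "A" * 21 + "\t*\tNM:i:0\tAS:i:21"
print(is_exact_kmer_hit(record, 21))
```

If the mapping was run with -k and -T both set to k, every mapped record is expected to pass this check, so any failures would hint at a mismatch between the k used for mapping and the k used for k-mer extraction.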

The latest Glossina assembly has been downloaded from NCBI and is available here: /cluster/projects/nn9458k/oh_know/teachers/kamil/data/tse-tse/ncbi-genomes-2021-09-12. The mapped BAM files can be summarised into a table of scaffolds with the number of k-mers from each category. To do that we will use another home-made script, bams2table.py:

python3 bams2table.py <A_kmers_mapped.bam> <X_kmers_mapped.bam> <Y_kmers_mapped.bam> <output_per_scf_table.tsv>
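To make the idea of the per-scaffold table concrete, here is a minimal sketch of the counting step. It is not the actual bams2table.py: it works on SAM text lines (for example, the output of `samtools view`) rather than on BAM files, and the function names and the column labels A, X and Y are assumptions made for illustration.

```python
from collections import Counter


def count_hits_per_scaffold(sam_lines):
    """Count mapped records per reference scaffold in an iterable of SAM lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):      # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, scaffold = int(fields[1]), fields[2]
        if flag & 4 or scaffold == "*":   # skip unmapped k-mers
            continue
        counts[scaffold] += 1         # every reported hit counts (we mapped with -a)
    return counts


def build_table(hits_by_category):
    """Build a TSV string with one row per scaffold and one column per category.

    hits_by_category maps a label (e.g. "A", "X", "Y") to an iterable of SAM lines.
    """
    categories = list(hits_by_category)
    per_cat = {c: count_hits_per_scaffold(lines) for c, lines in hits_by_category.items()}
    scaffolds = sorted(set().union(*per_cat.values()))
    rows = ["\t".join(["scaffold"] + categories)]
    for scf in scaffolds:
        rows.append("\t".join([scf] + [str(per_cat[c][scf]) for c in categories]))
    return "\n".join(rows) + "\n"


# Toy input: one autosomal k-mer on scf1, one X k-mer hitting both scf2 and scf1
a_hits = ["k1\t0\tscf1\t10\t60\t21M\t*\t0\t0\tACGT\t*"]
x_hits = ["k3\t0\tscf2\t5\t60\t21M\t*\t0\t0\tACGT\t*",
          "k3\t256\tscf1\t50\t0\t21M\t*\t0\t0\t*\t*"]
print(build_table({"A": a_hits, "X": x_hits, "Y": []}), end="")
```

Note that this sketch counts every reported hit, so a k-mer that maps to two scaffolds contributes to both rows; whether the real script does the same is not stated here.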

Now <output_per_scf_table.tsv> contains the names of all scaffolds and, for each of them, the number of chromosome-specific k-mers from each category. Just by looking at the table... Q: Does the assembly contain Y? Q: What is the longest clearly X-linked scaffold? The signal is usually clear, but hardly ever perfect. What causes k-mers assigned to all three chromosomes to map to so many scaffolds? Can we find any scaffolds that are probably chimeric?

If there is enough time, we can plot the properties of the mapped k-mers, decide which scaffolds can be clearly assigned to chromosomes, and actually generate a table of assignments. Do you have an idea how to get the Y chromosome too?
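One possible way to turn the per-scaffold counts into assignments is a simple majority rule, sketched below. Everything here is an assumption made for illustration: the function name `assign_scaffold`, the thresholds `min_kmers` and `min_fraction`, and the labels are arbitrary choices, and sensible values should be picked only after plotting the data.

```python
def assign_scaffold(a, x, y, min_kmers=100, min_fraction=0.9):
    """Assign a scaffold to A, X or Y from its k-mer counts (illustrative rule).

    min_kmers and min_fraction are arbitrary placeholder thresholds.
    """
    total = a + x + y
    if total < min_kmers:
        return "unassigned"       # too few k-mers to say anything
    counts = {"A": a, "X": x, "Y": y}
    best = max(counts, key=counts.get)
    if counts[best] / total >= min_fraction:
        return best               # one category clearly dominates
    return "ambiguous"            # mixed signal, e.g. a candidate chimeric scaffold


# Toy rows in the order: scaffold name, A count, X count, Y count
rows = [("scf1", 950, 30, 20), ("scf2", 10, 5, 0), ("scf3", 500, 480, 20)]
for name, a, x, y in rows:
    print(name, assign_scaffold(a, x, y), sep="\t")
```

Scaffolds labelled "ambiguous" are the ones worth inspecting by eye, since a mixed signal may come from a chimeric scaffold or simply from repeats shared between chromosomes.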

Table of contents

Introduction

Concept of k-mers

k-mer spectra analysis

📖 Introduction to K-mer spectra analysis
- ⚒ Generating k-mer spectra tutorial
📖 Basics of genome modeling
- ⚒ manual model fitting (for better understanding of the underlying model)
- ⚒ simple diploid
- ⚒ demonstrating the effect of sequencing error rate on k-mer coverage
📖 Common difficulties in characterisation of diploid genomes using k-mer spectra analysis
- ⚒ low coverage (pitfall) - to be merged
- ⚒ very homozygous diploid
- ⚒ highly heterozygous diploid
- ⚒ Genome size of a repetitive genome (pitfall)
- ⚒ Wrong ploidy (pitfall)
📖 Characterization of polyploid genomes using k-mer spectra analysis
- ⚒ Autotetraploid
- ⚒ Allotetraploid
- ⚒ Estimating ploidy (smudgeplot)
📖 Genome modeling as a quality control
- ⚒ Contamination (pitfall)
- ⚒ k-mers in an assembly (Merqury/KAT)
📖 Analysing genome skimming data

Separation of chromosomes

📖 Separate sub-genomes of an allopolyploid
📖 Separating chromosomes by comparison of sequencing libraries
- ⚒ Extracting sex chromosome k-mers from a male and female sample
- ⚒ Extract k-mers specific to germ-line restricted chromosomes
- ⚒ Matching k-mers to a reference (bwa-mem)
- ⚒ Matching k-mers to sequencing reads (cookiecutter)

Species assignment using short k-mers

📖 Identifying haplotypes within targeted amplicon sequencing datasets
- ⚒ Performing species assignment from targeted amplicon sequencing data

Others

🖥️ Installation of the kmer_tools conda environment
📖 Other k-mer resources
