Fast generation of hashed kmers from a python string
To install, wheels are available for Python >= 3.7 < 3.10 on linux/MacOSX:
pip install kmerhash
Or build from source:
git clone https://github.com/kcleal/kmerhash cd kmerhash pip install .
kmers are packed into an int64 using a rolling hash. The maximum kmer size is 32 bp and currently an alphabet of {A, T, C, G} is supported (uppercase only):
import numpy as np
from kmerhash import kmerhasher, hashes2seq
seq = "TTCGGACCGGATT"
k = 11
a = kmerhasher(seq, k) # numpy array returned
print(a)
# [4039016 3573155 1709711]
kmer integers can be converted back to string format:
print(hashes2seq(a, k))
# ['TTCGGACCGGA', 'TCGGACCGGAT', 'CGGACCGGATT']
kmers can be counted efficiently using e.g. Counter:
from collections import Counter
counter = Counter(a)
kmerhash was compared to a pure python implementation with k=21:
# pure python
[hash(seq[i:i+k]) for i in range(len(seq) - k + 1)]
# kmerhash
kmerhasher(seq, k)
Sequence length | Throughput python (Mbp / s) | Throughput kmerhash (Mbp / s) |
---|---|---|
21 | 9.8 | 1.1 |
100 | 5.6 | 9.1 |
1000 | 5.2 | 50.5 |
10,000 | 4.8 | 90.79 |
100,000 | 4.9 | 102.9 |
- Maximum kmer length is 32
- Only {A, T, C, G} bases are supported
- Uppercase bases only
- N bases will be intepreted as A's
Alexander Kearsey https://github.com/kearseya