A fast and accurate Thai tokenization library that uses supervised BPE, designed for full-text search applications.
pip3 install thai_tokenizer
The default set of pairs is optimized for short Thai-English product descriptions.
from thai_tokenizer import Tokenizer
tokenizer = Tokenizer()
tokenizer('iPad Mini 256GB เครื่องไทย') #> 'iPad Mini 256GB เครื่อง ไทย'
tokenizer.split('เครื่องไทย') #> ['เครื่อง', 'ไทย']
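Since the library targets full-text search, here is a minimal sketch of how tokenized output might feed a small in-memory inverted index. The `build_index` and `search` helpers are hypothetical and not part of `thai_tokenizer`; they accept any function that turns a string into a list of tokens.

```python
from collections import defaultdict


def build_index(docs, tokenize):
    """Map each lowercased token to the set of document ids containing it.

    `tokenize` is any callable that takes a string and returns a list of
    tokens (hypothetical parameter, supplied by the caller).
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token.lower()].add(doc_id)
    return index


def search(index, query, tokenize):
    """Return ids of documents that contain every token of the query."""
    tokens = [t.lower() for t in tokenize(query)]
    if not tokens:
        return set()
    # Copy the first posting set so intersecting does not mutate the index.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

To use it with this library, one option is to pass `lambda text: tokenizer(text).split()` as `tokenize`, which relies only on the space-separated string that calling the tokenizer returns in the example above.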
See Training for guidelines on training your own pairs.
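For intuition about what a set of pairs does, the following is a rough sketch of generic BPE merge application. It is not taken from this library's source: the `apply_merges` function and the sample merge list are assumptions for illustration, and the actual pair format and matching rules may differ.

```python
def apply_merges(text, merges):
    """Apply an ordered list of (left, right) merge pairs to a string.

    Starts from single characters and, for each pair in order, joins every
    adjacent occurrence of (left, right) into one symbol. This is textbook
    BPE application, shown only as an illustration.
    """
    symbols = list(text)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2  # skip both halves of the merged pair
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list: these pairs are made up for the example.
sample_merges = [("ไ", "ท"), ("ไท", "ย")]
tokens = apply_merges("ไทย", sample_merges)
```

With these two made-up pairs, the three characters of `ไทย` are joined into a single token, while any character not covered by a pair stays separate.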
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.