- The tokenizer uses transfer learning from a character-level language model trained on a large Thai hotel review corpus and on InterBEST2009.
- At the moment, the tokenizer supports Thai text only. Text that includes English characters or special symbols will not be tokenized correctly, since the model was trained exclusively on Thai text (without any spaces, special symbols, or digits).
- We will soon release a model that supports those characters as well.
- Try ThaiLMCut in Colab
- Paper: ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation
- BibTeX
- An example input from a hotel review
- Python 3.5+
- PyTorch 1.0+
- numpy
Download the pretrained weight file from
https://drive.google.com/file/d/1e39tNMfUFzYQ4MDHTMyNWfNUxu9RoaTA/view?usp=sharing
and place it in `lmcut/weight/`.
```
python setup.py bdist_wheel
pip install dist/lmcut*
```
After importing the package, you can tokenize any Thai text by using:
```python
from lmcut import tokenize

text = "โรงแรมดี สวยงามน่าอยู่มากๆ"
result = tokenize(text)
print(result)
```
The result is a list of tokens:

```
['โรง', 'แรม', 'ดี', 'สวยงาม', 'น่า', 'อยู่', 'มาก', 'ๆ']
```
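Because the current model handles Thai characters only, one workaround for mixed-script input is to extract the runs of Thai characters and tokenize each run separately. The helper below is our own sketch, not part of the lmcut package; it assumes the Thai Unicode block U+0E00–U+0E7F covers the characters of interest.

```python
import re

# Sketch of a workaround (not part of lmcut): extract runs of Thai characters
# (Unicode block U+0E00-U+0E7F) so each run can be passed to lmcut's tokenize()
# on its own, instead of feeding mixed-script text to the model.
def thai_chunks(text):
    return re.findall(r"[\u0E00-\u0E7F]+", text)

chunks = thai_chunks("Great hotel! โรงแรมดี 5 ดาว")
print(chunks)  # ['โรงแรมดี', 'ดาว']
# tokens = [t for chunk in chunks for t in tokenize(chunk)]  # then tokenize each run
```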
- Define the training and development datasets in `train/get_corpus_lm.py`.
- The input data can be any text. An example input file is `data/TEST_100K.txt`.
- If you use InterBEST2009, the boundary markers must be removed first (see `train/get_corpus.py`).

To train a new language model, you can run:

```
python train/LanguageModel.py --dataset [dataset name] --batchSize 60 --char_dropout_prob 0.01 --char_embedding_size 200 --hidden_dim 500 --layer_num 3 --learning_rate 0.0001 --sequence_length 100 --epoch 20 --len_lines_per_chunk 1000 --optim [adam or sgd] --lstm_num_direction [2 for bidirectional LSTM] --add_note "..add some note.."
```
To resume training a language model, you can run:

```
python train/LanguageModel.py --load_from [model name] --dataset [dataset name] --learning_rate 0.0001 --epoch 20 --optim [adam or sgd]
```
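The boundary-marker removal mentioned above amounts to dropping the `|` characters so the corpus becomes raw, unsegmented text. The snippet below is an illustration only; the project's actual preprocessing lives in `train/get_corpus.py`.

```python
# Illustration only: InterBEST2009 marks word boundaries with "|". For
# language-model training those markers are simply removed, leaving raw
# unsegmented Thai text as input.
def strip_boundary_markers(line):
    return line.replace("|", "")

print(strip_boundary_markers("โรงแรม|ดี|"))  # โรงแรมดี
```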
- The expected input is InterBEST2009, or any corpus that uses the boundary marker `|`.
- Define the train, dev, and test datasets in `train/get_corpus.py`.
- An example input file is `data/news_00001.txt`.
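For intuition, a `|`-delimited corpus encodes segmentation as per-character begin/inside labels, roughly as sketched below. The label scheme actually used by the project is defined in `train/get_corpus.py` and may differ from this illustration.

```python
# Illustration only: convert a "|"-segmented line into per-character labels,
# "B" for the first character of a word and "I" for the rest. This is the
# kind of supervision a boundary-marked corpus provides to the tokenizer.
def to_char_labels(segmented):
    chars, labels = [], []
    for word in segmented.split("|"):
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels

chars, labels = to_char_labels("โรง|ดี")
print(labels)  # ['B', 'I', 'I', 'B', 'I']
```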
To train a new tokenizer, you can run:

```
python Tokenizer.py --epoch 5 --lstm_num_direction 2 --batchSize 30 --sequence_length 80 --char_embedding_size 100 --hidden_dim 60 --layer_num 2 --optim [adam or sgd] --learning_rate 0.0001
```
To transfer the embedding layer and recurrent layer of a pretrained language model, you can run:

```
python Tokenizer.py --load_from [language model name] --epoch 5 --learning_rate 0.0001
```
To resume the training of a tokenizer, you can run:

```
python Tokenizer.py --load_from [tokenizer name] --epoch 5 --learning_rate 0.0001
```
- Use `--over_write 1` if you want to overwrite the weights of the resumed model.
- With `--over_write 0`, the trained model is saved as a new model.
- For more detail about the other arguments, see `train/Tokenizer.py` and `train/LanguageModel.py`.
- `data/news_00001.txt` and `data/TEST_100K.txt` are from the InterBEST2009 corpus, which is publicly available from NECTEC.
- Most of the code is adapted from Tabula Nearly Rasa: Probing the Linguistic Knowledge of Character-Level Neural Language Models Trained on Unsegmented Text.
- Some code is borrowed from DeepCut and AttaCut.
- We would like to thank all the contributors.

The project is funded by TrustYou. The author would like to sincerely thank TrustYou and the other contributors to this project.
- Suteera Seeha
- Ivan Bilan
- Liliana Mamani Sanchez
- Johannes Huber
- Michael Matuschek
All original code in this project is licensed under the MIT License. See the included LICENSE file.