This package provides utilities for Thai sentence segmentation, word tokenization and POS tagging. Because of how sentence segmentation is performed, prior tokenization and POS tagging are required and therefore also provided with this package.
Besides functions for sentence segmentation, tokenization, and tokenization with POS tagging of single-sentence strings, there are also functions for working with large amounts of data in a streaming fashion.
They are also accessible via the command-line script thai-segmenter, which accepts files or standard input/output.
Options allow working with meta-headers or tab-separated data files.
The main functionality for sentence segmentation was extracted, reformatted and slightly rewritten from another project, Question Generation Thai.
LongLexTo is used as a state-of-the-art word/lexeme tokenizer. An implementation was packaged in the above project, but there are also (original?) versions on GitHub and a project homepage. To make it better suited for bulk processing in Python, it has been rewritten from Java to pure Python.
For POS tagging, a Viterbi model trained on the annotated ORCHID corpus is used (see the paper).
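To illustrate the general technique (this is a minimal sketch of Viterbi decoding, not this package's actual implementation), the decoder below finds the most likely tag sequence given transition and emission probabilities. The toy English two-tag model is purely hypothetical, standing in for probabilities that would really be estimated from a tagged corpus such as ORCHID:

```python
# Illustrative Viterbi decoder for POS tagging (hedged sketch, not the
# package's implementation). Unknown words get a tiny smoothing probability.
def viterbi(tokens, tags, start_p, trans_p, emit_p):
    # V[i][t] = probability of the best tag path ending in tag t at position i
    V = [{t: start_p[t] * emit_p[t].get(tokens[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(tokens)):
        V.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (V[i - 1][p] * trans_p[p][t] * emit_p[t].get(tokens[i], 1e-6), p)
                for p in tags
            )
            V[i][t] = prob
            back[i][t] = prev
    # backtrack from the best final tag
    best = max(tags, key=lambda t: V[-1][t])
    path = [best]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# toy hand-set model with two tags (hypothetical numbers for illustration)
tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.4, "cats": 0.4},
          "VERB": {"chase": 0.5, "sleep": 0.3}}

print(viterbi(["dogs", "chase", "cats"], tags, start_p, trans_p, emit_p))
# -> ['NOUN', 'VERB', 'NOUN']
```

The same dynamic-programming idea scales to the full ORCHID tag set; only the probability tables change.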
- Free software: MIT license
pip install thai-segmenter
To use the project:
sentence = """foo bar 1234"""

# [A] Sentence Segmentation
from thai_segmenter.tasks import sentence_segment
# or even easier:
from thai_segmenter import sentence_segment

sentences = sentence_segment(sentence)
for sentence in sentences:
    print(str(sentence))

# [B] Lexeme Tokenization
from thai_segmenter import tokenize

tokens = tokenize(sentence)
for token in tokens:
    print(token, end=" ", flush=True)

# [C] POS Tagging
from thai_segmenter import tokenize_and_postag

sentence_info = tokenize_and_postag(sentence)
for token, pos in sentence_info.pos:
    print("{}|{}".format(token, pos), end=" ", flush=True)
See more possibilities in tasks.py or cli.py.
Streaming larger sequences can be achieved like this:
# Streaming
from thai_segmenter import line_sentence_segmenter

sentences = ["sent1\n", "sent2\n", "sent3\n"]  # or any iterable (like a file)
sentences_segmented = line_sentence_segmenter(sentences)
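The streaming functions follow the usual Python generator pattern: consume any iterable of lines and yield results lazily, so large inputs never have to fit into memory. A minimal sketch of that pattern, with a stand-in transform instead of the package's actual segmenter:

```python
# Sketch of the line-streaming pattern (stand-in transform, not the
# package's segmenter): accepts any iterable of lines -- a list, an open
# file, sys.stdin -- and yields processed lines one at a time.
def stream_lines(lines, transform):
    for line in lines:
        yield transform(line.rstrip("\n")) + "\n"

# works the same for a list or a file object
result = list(stream_lines(["sent1\n", "sent2\n"], str.upper))
print(result)  # -> ['SENT1\n', 'SENT2\n']
```

Because the output is itself an iterator, such stages can be chained (clean, then segment, then tag) without materializing intermediate lists.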
This project also provides a nifty command-line tool thai-segmenter that does most of the work for you:
usage: thai-segmenter [-h] {clean,sentseg,tokenize,tokpos} ...
Thai Segmentation utilities.
optional arguments:
-h, --help show this help message and exit
Tasks:
{clean,sentseg,tokenize,tokpos}
clean Clean input from non-thai and blank lines.
sentseg Sentence segmentize input lines.
tokenize Tokenize input lines.
tokpos Tokenize and POS-tag input lines.
You can run sentence segmentation like this:
thai-segmenter sentseg -i input.txt -o output.txt
or even pipe data:
cat input.txt | thai-segmenter sentseg > output.txt
Use -h/--help to get more information about the available options.
You can run it somewhat interactively with:
thai-segmenter tokpos --stats
Standard input and output are then used. Lines terminated with Enter are immediately processed and printed. Stop with the key combination Ctrl+D, and the --stats parameter will helpfully output some statistics.
The project also provides a demo WebApp (using Flask and gevent) that can be installed with:
pip install -e .[webapp]
and then simply run (in the foreground):
thai-segmenter-webapp
Consider running it in a screen session.
# create the screen detached and then attach
screen -dmS thai-senseg-webapp
screen -r thai-senseg-webapp
# in the screen:
thai-segmenter-webapp
# and detach with the keys [Ctrl]+[A], [D]
Please note that it is only a demo webapp to test and visualize how the sentence segmenter works.
To install the package for development:
git clone https://github.com/Querela/thai-segmenter.git
cd thai-segmenter/
pip install -e .[dev]
After changing the source, run auto code formatting with:
isort <file>.py
black <file>.py
And check it afterwards with:
flake8 <file>.py
The setup.py also contains the flake8 subcommand as well as an extended clean command.
To run all the tests, run:
tox
You can also optionally run pytest alone:
pytest
Or with:
python setup.py test
Note: to combine the coverage data from all the tox environments, run:

Windows:
    set PYTEST_ADDOPTS=--cov-append
    tox

Other:
    PYTEST_ADDOPTS=--cov-append tox