Workflow and scripts that help users easily create a Chinese Wikipedia corpus from scratch.
Clone or download this repo to your local filesystem.
Python 3.4+ is supported; Python 2 is not.
The script install_dependencies_on_ubunut.bash will install everything for you.
Install the Python requirements with:
pip install -r ./requirements.txt
OpenCC is required. Users must install it themselves.
For Ubuntu / Debian users, opencc can be installed with apt:
sudo apt-get install opencc
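Because OpenCC has to be installed separately, it can help to confirm that it is reachable before starting a long run. The following is a minimal sketch, not part of this repo: the helper name `opencc_available` is hypothetical, and it assumes the package exposes an executable named `opencc` on the `PATH`.

```python
import shutil


def opencc_available(command: str = "opencc") -> bool:
    """Return True if an executable with this name is found on PATH.

    Assumption: the OpenCC command-line tool is installed as `opencc`.
    """
    return shutil.which(command) is not None


if __name__ == "__main__":
    if opencc_available():
        print("opencc found")
    else:
        print("opencc not found; install it before running the workflow")
```

This only checks that the command exists; it does not verify the OpenCC version or its conversion configs.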
allinone_process.bash
See workflow.
Jieba's segmentation model performs poorly; replace it with LTP or THULAC. THULAC is preferred because it is open source software.
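One way to make the segmenter replaceable is to hide it behind a small callable interface. The sketch below is an illustration under assumptions, not this repo's actual code: the names `Segmenter`, `make_jieba_segmenter`, `make_thulac_segmenter`, and `segment_lines` are hypothetical, and it assumes the `jieba` and `thulac` Python packages are installed when their factories are called.

```python
from typing import Callable, Iterable, List

# A segmenter takes one line of text and returns a list of tokens.
Segmenter = Callable[[str], List[str]]


def make_jieba_segmenter() -> Segmenter:
    import jieba  # third-party; imported lazily so it stays optional

    return jieba.lcut


def make_thulac_segmenter() -> Segmenter:
    import thulac  # third-party; imported lazily so it stays optional

    model = thulac.thulac(seg_only=True)  # segmentation only, no POS tags
    # With text=True, cut() returns a space-separated string of tokens.
    return lambda line: model.cut(line, text=True).split()


def segment_lines(lines: Iterable[str], segmenter: Segmenter) -> List[str]:
    """Segment each non-empty line and join its tokens with single spaces."""
    result = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            result.append(" ".join(segmenter(stripped)))
    return result
```

With this shape, switching from Jieba to THULAC means passing `make_thulac_segmenter()` instead of `make_jieba_segmenter()` to `segment_lines`; the rest of the pipeline is unchanged.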