Punctuation restoration using TensorFlow
This repository was inspired by github.com/ottokart/punctuator2. I did not want to use Python for serving the model, so I ported the code from Theano to TensorFlow. The Attention and Late Fusion layers were the biggest challenges, and I am not sure they are 100% correct. I compared the results with punctuator2 and they looked very similar.
Support for Kaldi-like word features has been added. Word hashing was originally described in "Learning Deep Structured Semantic Models for Web Search using Clickthrough Data" by Po-Sen Huang et al., 2013. It allows the use of a nearly unlimited vocabulary instead of a shortlist for NN training. Samples for data preparation and training are included.
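To make the word hashing idea concrete, here is a minimal sketch of letter trigram hashing as described by Huang et al. It is an illustration only and is not taken from this repository; the function names `letter_ngrams` and `hash_word` and the toy word list are assumptions, and the actual feature extraction in ptf may differ.

```python
from collections import Counter


def letter_ngrams(word: str, n: int = 3) -> list[str]:
    """Split a word into letter n-grams after adding '#' boundary marks."""
    marked = f"#{word.lower()}#"
    if len(marked) <= n:
        return [marked]
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def hash_word(word: str, ngram_index: dict[str, int]) -> dict[int, int]:
    """Map a word to a sparse count vector {ngram_id: count}.

    N-grams missing from the index are skipped, so an unseen word still
    gets features from the n-grams it shares with known words.
    """
    counts = Counter(letter_ngrams(word))
    return {ngram_index[g]: c for g, c in counts.items() if g in ngram_index}


# Build a toy n-gram index from a tiny word list (illustrative data only).
train_words = ["good", "food", "goose"]
ngram_index: dict[str, int] = {}
for w in train_words:
    for g in letter_ngrams(w):
        ngram_index.setdefault(g, len(ngram_index))

print(letter_ngrams("good"))           # ['#go', 'goo', 'ood', 'od#']
print(hash_word("mood", ngram_index))  # {2: 1, 3: 1}
```

The word "mood" never appears in the toy list, yet it still maps to the features for "ood" and "od#". This is why the n-gram inventory, rather than the word list, bounds the input size.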
conda create --name punct python=3.10
The implementation was tested with Python 3.10. The training code uses the following modules:
pip install -r requirements
pip install tqdm
Hyperparameter optimization additionally requires:
pip install parameter-sherpa
pip install keras
For the initial data requirements, see github.com/ottokart/punctuator2. You can configure the punctuation set and the vocabulary size in ptf/data/data.py. To prepare the data for training:
python ptf/punctuator2/data.py <initialDataDir> <dataDir>
Alternatively, see egs for sample scripts.
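As a rough illustration of the initial data format, the sketch below turns raw text into space-separated words and punctuation tokens. It assumes the punctuator2 convention of tokens such as `,COMMA`, `.PERIOD` and `?QUESTIONMARK`; the `MARKS` table and the `to_tokens` helper are hypothetical and not part of this repository. Check the punctuator2 documentation for the authoritative requirements.

```python
import re

# Assumed mapping from punctuation symbols to punctuator2-style tokens.
MARKS = {",": ",COMMA", ".": ".PERIOD", "?": "?QUESTIONMARK"}


def to_tokens(text: str) -> str:
    """Lower-case the text and replace each supported mark with its token.

    Naive on purpose: it would also split decimals such as "3.5", so real
    data needs additional cleaning before this step.
    """
    spaced = re.sub(r"([,.?])", lambda m: f" {MARKS[m.group(1)]} ", text.lower())
    return " ".join(spaced.split())  # collapse repeated whitespace


print(to_tokens("To be, or not to be, that is the question."))
# to be ,COMMA or not to be ,COMMA that is the question .PERIOD
```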
To train a model, first configure the model parameters in ptf/train.py. Training can then be started with:
mkdir model1 && cd model1
python ../ptf/train.py <dataDir> <modelPrefix>
The trained model is saved in Keras HDF5 format in the working folder model1.
A Python script, optimize/optimize.py, optimizes the hyperparameters of a model using the sherpa tool. To start an optimization:
mkdir optim1 && cd optim1
python ../optimize/optimize.py <dataDir>
All models are saved in the working folder optim1.
To predict punctuation for a test text:
python ptf/predict.py <testTextFile> <vocabulary> <hd5ModelFile> <predictedOutputFile>
To evaluate error scores of the prediction:
python ptf/punctuator2/error_calculator.py <testTextFile> <predictedOutputFile>
During training, all models are saved in Keras format. To save a model in pure TensorFlow format, use the script:
python ptf/save_as_tf.py <hd5ModelFile> <tfModelOutputDir>
Sample Go code showing how to load the trained TensorFlow model: examples/goload/loadtf.go. To compile it, you need to install the TensorFlow library and configure LD_LIBRARY_PATH. See https://www.tensorflow.org/install/lang_go
Airenas Vaičiūnas
Copyright © 2020, Airenas Vaičiūnas. Released under the 3-Clause BSD License.
Please also see the license of Ottokar Tilk.