This repository contains the code for the experiments in the paper "Gated Word-Character Recurrent Language Model". It is based on the code at https://github.com/nyu-dl/dl4mt-tutorial.
The following packages are required:
- Theano
- numpy
- scipy
- sklearn
- pyyaml
$ pip install pyyaml
word_char_lm.py
- This model takes both word-level and character-level inputs. You can choose "gate" or "concat" in the config file.
char_only.py
- This model takes the character-level input only.
word_lm.py
- This model takes the word-level input only.
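The difference between the "gate" and "concat" options can be illustrated with a small pure-Python sketch. It assumes the gate is a single scalar computed from the word vector with a sigmoid and used to interpolate between the word and character representations, while "concat" simply joins the two vectors. The function names and the parameters `v_g` and `b_g` are illustrative; this is not the repository's Theano implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate_combine(x_word, x_char, v_g, b_g):
    """Mix two same-sized vectors with a scalar gate (assumed form).

    g = sigmoid(v_g . x_word + b_g)
    out = (1 - g) * x_word + g * x_char
    """
    g = sigmoid(sum(v * x for v, x in zip(v_g, x_word)) + b_g)
    return [(1.0 - g) * w + g * c for w, c in zip(x_word, x_char)]


def concat_combine(x_word, x_char):
    """Join the two vectors; the output size is the sum of both sizes."""
    return list(x_word) + list(x_char)


x_word = [1.0, 0.0]
x_char = [0.0, 1.0]

# With all-zero gate parameters the gate value is 0.5, so both
# representations contribute equally.
mixed = gate_combine(x_word, x_char, v_g=[0.0, 0.0], b_g=0.0)
joined = concat_combine(x_word, x_char)
```

Note that the gated variant keeps the input dimension unchanged, whereas concatenation doubles it when both vectors have the same size.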
Input data should be a text file. Each line contains one tokenized sentence. The Penn Treebank dataset preprocessed by Tomas Mikolov et al. (2010) is available as an example.
If you use your own dataset, tokenize the sentences and split the data into training, validation, and test sets. Then create word and character dictionaries using scripts like tools/build_dictionary_word.py and tools/build_dictionary_char.py. You can specify the paths to the data and dictionary files in the config file (.yaml).
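As a rough illustration of what building word and character dictionaries involves, here is a minimal sketch that maps tokens to integer ids by descending frequency. It is not the code in the tools/ directory; the function names, the reserved tokens `<eos>` and `<unk>`, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def to_ids(counts, specials=("<eos>", "<unk>")):
    """Assign ids: reserved tokens first (hypothetical), then by frequency."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    # Most frequent first; ties are broken alphabetically for determinism.
    for tok in sorted(counts, key=lambda t: (-counts[t], t)):
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def build_dictionaries(lines):
    """Build word and character dictionaries from tokenized sentences.

    `lines` is any iterable of strings, one tokenized sentence per line,
    so an open text file can be passed directly.
    """
    word_counts = Counter()
    char_counts = Counter()
    for line in lines:
        for word in line.split():
            word_counts[word] += 1
            char_counts.update(word)  # counts each character of the word
    return to_ids(word_counts), to_ids(char_counts)


sentences = ["the cat sat", "the dog sat down"]
word_vocab, char_vocab = build_dictionaries(sentences)
```

For a real dataset the same function could be called as `build_dictionaries(open(path))` on the training file only, so that validation and test words outside the dictionary map to the unknown token.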
First, clone the repository:
git clone https://github.com/Yasumasa/gated_word_char_rlm.git
To run the training pipeline, make sure the required packages are installed, then run the following commands from the root directory of this repository.
If you run on GPU (recommended):
cd gated_word_char_rlm
# gated word & char with pretraining
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python word_char_lm.py ./config_files/gate_word_char_pretrain.yaml
# gated word & char
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python word_char_lm.py ./config_files/gate_word_char.yaml
# concat word & char with pretraining
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python word_char_lm.py ./config_files/concat_word_char_pretrain.yaml
# concat word & char
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python word_char_lm.py ./config_files/concat_word_char.yaml
# char only
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python char_lm.py ./config_files/char_only.yaml
# word only
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python word_lm.py ./config_files/word_only.yaml
If you run on CPU:
cd gated_word_char_rlm
THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python -u [model name]_lm.py ./config_files/[model name].yaml
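If you prefer to launch runs from a Python script (for example, to loop over several config files), the command lines above can be assembled as in the sketch below. The helper name `build_launch` is hypothetical; it only reproduces the THEANO_FLAGS value and arguments already shown and does not start training by itself.

```python
import os


def build_launch(script, config, device="cpu"):
    """Return (env, argv) mirroring the command lines shown above."""
    flags = "mode=FAST_RUN,device={},floatX=float32".format(device)
    # Copy the current environment and add THEANO_FLAGS to it.
    env = dict(os.environ, THEANO_FLAGS=flags)
    argv = ["python", "-u", script, config]
    return env, argv


env, argv = build_launch("word_lm.py", "./config_files/word_only.yaml")

# To actually start training from the repository root:
#   import subprocess
#   subprocess.call(argv, env=env)
```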