Project 1: Language Modeling with Recurrent Neural Networks in TensorFlow and Continuation of Sentences
The goal of the project was to build a language model based on a recurrent neural network with LSTM cells, using only TensorFlow's cell implementation. This means that the RNN graph is unrolled manually and dynamically (a minimal sketch of this unrolling is given below). The model is evaluated with the perplexity metric. The second part of the project requires greedy continuation of sentences given their beginnings. The text of the project can be found here.
- Florian Chlan fchlan@student.ethz.ch
- Sam Kessler sakessle@student.ethz.ch
- Jovan Nikolic jovan.nikolic@gess.ethz.ch
- Jovan Andonov andonovj@student.ethz.ch
Data was provided by the NLU teaching staff and we cannot disclose it.
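As an illustration of the manual unrolling mentioned above, here is a minimal sketch, assuming TensorFlow 1.x; the batch size, sentence length, embedding and hidden dimensions are placeholders, not the values used in the actual scripts:

```python
import tensorflow as tf  # assumes TensorFlow 1.x

batch_size, max_len, embed_dim, hidden_dim = 64, 30, 100, 512  # hypothetical

# A batch of embedded sentences: one vector per word per time step.
inputs = tf.placeholder(tf.float32, [batch_size, max_len, embed_dim])

cell = tf.nn.rnn_cell.LSTMCell(num_units=hidden_dim)
state = cell.zero_state(batch_size, tf.float32)

# Manual unrolling: call the cell once per time step in a Python loop
# instead of using tf.nn.dynamic_rnn. The cell's weights are reused
# across all time steps.
outputs = []
for t in range(max_len):
    output, state = cell(inputs[:, t, :], state)
    outputs.append(output)
```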
Requirements:
- We expect all data files (`sentences.train`, `sentences.test`, `sentences.continuation`) to be in the `./data` folder, where the current folder is the one containing the *.py scripts
- We expect the `wordembeddings-dim100.word2vec` file in the same directory as the *.py scripts
Training:
To train the model for experiment X (A, B or C), run the following command:
python3 main.py -x X
This will:
- preprocess the training data (splitting sentences into words; adding `<bos>`, `<eos>`, `<unk>` and `<pad>` tokens; removing sentences longer than 28 words); a sketch of this preprocessing is given after this list
- serialize and write to disk the following files: `padded_sentences.pickle`, `vocabulary.pickle`, `word_2_index.pickle`, `index_2_word.pickle`. On the next training run, the script will reuse the existing pickle files. The files are saved in the same directory as the *.py scripts
- save the trained graph at a given frequency. By default, the graph is saved in the same directory as the *.py scripts. The name of the graph has the format `expX-epY-NUM.*` (for example, `expA-ep2-1500.*`), where X is substituted with A, B or C, Y is the epoch and NUM is the number of batches/steps the model has been trained on. Note that after one sweep over the data, the batch counter is not reset but keeps incrementing
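The following is a minimal sketch of that preprocessing step. The vocabulary construction and its size of 20000 are assumptions for illustration, not necessarily what main.py does:

```python
import collections
import pickle

BOS, EOS, UNK, PAD = "<bos>", "<eos>", "<unk>", "<pad>"
MAX_WORDS = 28        # sentences longer than this are dropped
VOCAB_SIZE = 20000    # hypothetical size, not necessarily the one in main.py

with open("./data/sentences.train") as f:
    lines = [line.strip().split() for line in f]

# Vocabulary: the most frequent words plus the four special tokens;
# every out-of-vocabulary word is replaced by <unk>.
counts = collections.Counter(w for words in lines for w in words)
vocabulary = {w for w, _ in counts.most_common(VOCAB_SIZE - 4)}
vocabulary.update([BOS, EOS, UNK, PAD])

padded_sentences = []
for words in lines:
    if len(words) > MAX_WORDS:
        continue  # remove sentences longer than 28 words
    words = [w if w in vocabulary else UNK for w in words]
    sentence = [BOS] + words + [EOS]
    sentence += [PAD] * (MAX_WORDS + 2 - len(sentence))  # pad to fixed length
    padded_sentences.append(sentence)

word_2_index = {w: i for i, w in enumerate(sorted(vocabulary))}
index_2_word = {i: w for w, i in word_2_index.items()}

# Serialize so the next training run can skip preprocessing.
for name, obj in [("padded_sentences", padded_sentences),
                  ("vocabulary", vocabulary),
                  ("word_2_index", word_2_index),
                  ("index_2_word", index_2_word)]:
    with open(name + ".pickle", "wb") as f:
        pickle.dump(obj, f)
```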
Perplexity calculations:
Perplexity calculations for experiment X are obtained by running:
python3 perplexity.py -x X -c <path_to_checkpoint>
where:
- `X` can be substituted with A, B or C for each experiment
- `<path_to_checkpoint>` is the path to the trained graph. By default, graphs are stored in the same directory as the *.py scripts.
Requirements:
- `word_2_index.pickle` and `index_2_word.pickle` in the same directory as the *.py scripts.
This will output a file named "group01.perplexityX", where X is substituted with A, B or C.
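For reference, per-sentence perplexity is the exponential of the average negative log-probability the model assigns to the ground-truth words. A minimal sketch of that computation follows; the function name and interface are illustrative, not taken from perplexity.py:

```python
import numpy as np

def sentence_perplexity(token_probs):
    """Perplexity of one sentence, given the probability the model
    assigned to each ground-truth token (<pad> positions excluded):
    perp = exp(-(1/n) * sum_i log p_i).
    """
    log_probs = np.log(np.asarray(token_probs))
    return float(np.exp(-np.mean(log_probs)))

# Example: three tokens with probabilities 0.5, 0.25, 0.125
# have a geometric mean of 0.25, so the perplexity is 4.0.
print(sentence_perplexity([0.5, 0.25, 0.125]))
```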
Continuation of Sentences:
Continuation of sentences is generated by running:
python3 continuation.py -x C -c <path_to_checkpoint>
where:
- `<path_to_checkpoint>` is the path to the trained graph. By default, graphs are stored in the same directory as the *.py scripts.
Requirements:
- `word_2_index.pickle` and `index_2_word.pickle` in the same directory as the *.py scripts.
This will output the file "group01.continuation". Note that the continuation uses the model trained in experiment C.
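A minimal sketch of greedy continuation follows. It assumes a hypothetical `step_fn(token_id, state) -> (probs, new_state)` wrapping one forward step of the trained graph (one session.run); none of these names are taken from continuation.py:

```python
import numpy as np

def greedy_continuation(prefix_ids, step_fn, eos_id, max_len=20):
    """Greedily extend a sentence given its beginning as word indices."""
    state = None
    tokens = list(prefix_ids)
    # Feed the given beginning through the network to build up the state.
    for tok in tokens:
        probs, state = step_fn(tok, state)
    # Then repeatedly append the most probable next word,
    # stopping at <eos> or at the maximum length.
    while len(tokens) < max_len:
        next_id = int(np.argmax(probs))
        tokens.append(next_id)
        if next_id == eos_id:
            break
        probs, state = step_fn(next_id, state)
    return tokens
```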