Project 1: Language Modeling with Recurrent Neural Networks in TensorFlow and Continuation of Sentences
The goal of the project was to build a language model based on a recurrent neural network with LSTM cells, using only TensorFlow's cell implementation. This means that the RNN graph is unrolled manually and dynamically (a minimal sketch of this unrolling is given below). The model is evaluated with the perplexity metric. The second part of the project requires greedy continuation of sentences given their beginnings. The text of the project can be found here.
- Florian Chlan fchlan@student.ethz.ch
- Sam Kessler sakessle@student.ethz.ch
- Jovan Nikolic jovan.nikolic@gess.ethz.ch
- Jovan Andonov andonovj@student.ethz.ch
Data was provided by the NLU teaching staff and we cannot disclose it.
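As an illustration of the manual unrolling mentioned above, here is a minimal sketch, assuming TensorFlow 1.x; the batch size, sentence length, embedding and hidden dimensions are placeholders, not the values used in the actual scripts:

```python
import tensorflow as tf  # assumes TensorFlow 1.x

batch_size, max_len, embed_dim, hidden_dim = 64, 30, 100, 512  # hypothetical

# A batch of embedded sentences: one vector per word per time step.
inputs = tf.placeholder(tf.float32, [batch_size, max_len, embed_dim])

cell = tf.nn.rnn_cell.LSTMCell(num_units=hidden_dim)
state = cell.zero_state(batch_size, tf.float32)

# Manual unrolling: call the cell once per time step in a Python loop
# instead of using tf.nn.dynamic_rnn. The cell's weights are reused
# across all time steps.
outputs = []
for t in range(max_len):
    output, state = cell(inputs[:, t, :], state)
    outputs.append(output)
```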
Requirements:
- We expect all data files (`sentences.train`, `sentences.test`, `sentences.continuation`) to be in the `./data` folder, where the current folder is the one containing the *.py scripts
- We expect the `wordembeddings-dim100.word2vec` file in the same directory as the *.py scripts
Training:
To train the model for experiment X (A, B or C), run the following command:
python3 main.py -x X
This will:
- preprocess the training data (splitting sentences into words; adding `<bos>`, `<eos>`, `<unk>` and `<pad>` tokens; removing sentences longer than 28 words); a sketch of this preprocessing is given after this list
- serialize and write to disk the following files: `padded_sentences.pickle`, `vocabulary.pickle`, `word_2_index.pickle`, `index_2_word.pickle`. On the next training run, the script will reuse the existing pickle files. The files are saved in the same directory as the *.py scripts
- save the trained graph at a given frequency. By default, the graph is saved in the same directory as the *.py scripts. The name of the graph has the format `expX-epY-NUM.*` (for example, `expA-ep2-1500.*`), where X is substituted with A, B or C, Y is the epoch and NUM is the number of batches/steps the model has been trained on. Note that after one sweep over the data, the batch counter is not reset but keeps incrementing
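The following is a minimal sketch of that preprocessing step. The vocabulary construction and its size of 20000 are assumptions for illustration, not necessarily what main.py does:

```python
import collections
import pickle

BOS, EOS, UNK, PAD = "<bos>", "<eos>", "<unk>", "<pad>"
MAX_WORDS = 28        # sentences longer than this are dropped
VOCAB_SIZE = 20000    # hypothetical size, not necessarily the one in main.py

with open("./data/sentences.train") as f:
    lines = [line.strip().split() for line in f]

# Vocabulary: the most frequent words plus the four special tokens;
# every out-of-vocabulary word is replaced by <unk>.
counts = collections.Counter(w for words in lines for w in words)
vocabulary = {w for w, _ in counts.most_common(VOCAB_SIZE - 4)}
vocabulary.update([BOS, EOS, UNK, PAD])

padded_sentences = []
for words in lines:
    if len(words) > MAX_WORDS:
        continue  # remove sentences longer than 28 words
    words = [w if w in vocabulary else UNK for w in words]
    sentence = [BOS] + words + [EOS]
    sentence += [PAD] * (MAX_WORDS + 2 - len(sentence))  # pad to fixed length
    padded_sentences.append(sentence)

word_2_index = {w: i for i, w in enumerate(sorted(vocabulary))}
index_2_word = {i: w for w, i in word_2_index.items()}

# Serialize so the next training run can skip preprocessing.
for name, obj in [("padded_sentences", padded_sentences),
                  ("vocabulary", vocabulary),
                  ("word_2_index", word_2_index),
                  ("index_2_word", index_2_word)]:
    with open(name + ".pickle", "wb") as f:
        pickle.dump(obj, f)
```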
Perplexity calculations:
Perplexity calculations for experiment X are obtained by running:
python3 perplexity.py -x X -c <path_to_checkpoint>
where:
- `X` can be substituted with A, B or C for each experiment
- `<path_to_checkpoint>` is the path to the trained graph. By default, graphs are stored in the same directory as the *.py scripts.
Requirements:
- `word_2_index.pickle` and `index_2_word.pickle` in the same directory as the *.py scripts.
This will output a file named "group01.perplexityX", where X is substituted with A, B or C.
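For reference, per-sentence perplexity is the exponential of the average negative log-probability the model assigns to the ground-truth words. A minimal sketch of that computation follows; the function name and interface are illustrative, not taken from perplexity.py:

```python
import numpy as np

def sentence_perplexity(token_probs):
    """Perplexity of one sentence, given the probability the model
    assigned to each ground-truth token (<pad> positions excluded):
    perp = exp(-(1/n) * sum_i log p_i).
    """
    log_probs = np.log(np.asarray(token_probs))
    return float(np.exp(-np.mean(log_probs)))

# Example: three tokens with probabilities 0.5, 0.25, 0.125
# have a geometric mean of 0.25, so the perplexity is 4.0.
print(sentence_perplexity([0.5, 0.25, 0.125]))
```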
Continuation of Sentences:
Continuation of sentences is generated by running:
python3 continuation.py -x C -c <path_to_checkpoint>
where:
- `<path_to_checkpoint>` is the path to the trained graph. By default, graphs are stored in the same directory as the *.py scripts.
Requirements:
- `word_2_index.pickle` and `index_2_word.pickle` in the same directory as the *.py scripts.
This will output the file "group01.continuation". Note that the continuation uses the model trained in experiment C.
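A minimal sketch of greedy continuation follows. It assumes a hypothetical `step_fn(token_id, state) -> (probs, new_state)` wrapping one forward step of the trained graph (one session.run); none of these names are taken from continuation.py:

```python
import numpy as np

def greedy_continuation(prefix_ids, step_fn, eos_id, max_len=20):
    """Greedily extend a sentence given its beginning as word indices."""
    state = None
    tokens = list(prefix_ids)
    # Feed the given beginning through the network to build up the state.
    for tok in tokens:
        probs, state = step_fn(tok, state)
    # Then repeatedly append the most probable next word,
    # stopping at <eos> or at the maximum length.
    while len(tokens) < max_len:
        next_id = int(np.argmax(probs))
        tokens.append(next_id)
        if next_id == eos_id:
            break
        probs, state = step_fn(next_id, state)
    return tokens
```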