This repository contains the code for the paper *Supervised Learning of Universal Sentence Representations from Natural Language Inference Data*, developed for the course Advanced Topics in Computational Semantics at the University of Amsterdam.
The repository is structured as follows:
- `data/` contains the scripts to download the data for the experiments and SentEval. After training, the vocabulary and embeddings are stored here as well.
- `logs/` contains the Lisa logs from training.
- `models/` contains the pre-trained models.
- `runs/` contains the TensorBoard logs, stored in a directory named after the model.
- `src/` contains the source code of the project.
- `results.ipynb` contains the prediction code, the results of the experiments, and the discussion.
- `requirements.txt` contains the requirements for the project.
- `README.md` contains the instructions for the project.
- `pyproject.toml` contains the project configuration.
The code is written in Python 3.10. The requirements can be installed with `pip install -r requirements.txt` or from the conda environment file with `conda env create -f environment.yml`.
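For a fresh setup, the two options side by side (a minimal sketch; the conda environment name is an assumption, check `environment.yml` for the real one):

```bash
# Option 1: pip, assuming Python 3.10 is the active interpreter
pip install -r requirements.txt

# Option 2: conda, using the provided environment file
conda env create -f environment.yml
conda activate atcs  # hypothetical environment name; see environment.yml
```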
The Stanford Natural Language Inference (SNLI) corpus will be downloaded automatically when running the training script. The SentEval datasets can be downloaded with the following command from the `data/downstream` directory:
```bash
bash ./get_transfer_data.bash
```
You can train a model using the following command:
```bash
python src/train.py --encoder <encoder>
```
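For example, to train a bidirectional LSTM encoder with max pooling, as in the paper (the encoder name used here is hypothetical; run `python src/train.py -h` to see the accepted values):

```bash
# Hypothetical encoder name; -h lists the real choices.
python src/train.py --encoder bilstm-max
```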
You can evaluate a model using the following command:
```bash
python src/eval.py --checkpoint <checkpoint> --encoder <encoder> --eval --senteval
```
The `--eval` flag evaluates the model on the SNLI dataset, and the `--senteval` flag evaluates it on the SentEval datasets. See `-h` for more options, such as changing the batch size or the number of epochs.
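A full evaluation run might then look as follows (the checkpoint path and encoder name are hypothetical; point `--checkpoint` at a checkpoint from your own training run or at a downloaded pre-trained model):

```bash
# Evaluate a (hypothetical) BiLSTM-max checkpoint on both SNLI and SentEval.
python src/eval.py --checkpoint models/bilstm-max.pt --encoder bilstm-max --eval --senteval
```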
The pre-trained models can be downloaded from here. The models should be placed in the `models/` directory.