This repo contains my contribution to the FEVER 2021 Shared Task.
More information about the FEVER 2021 Shared Task can be found here: https://fever.ai/task.html
Note: To be able to run this repo you will need at least 70 GB of free disk space and preferably 32 GB of RAM.
The code is tested on a machine with the following specs:
- OS: Ubuntu 20.04.2 LTS (64-bit)
- Graphics: NVIDIA GeForce GTX 970
- CPU: Intel® Core™ i7-6700 CPU @ 3.40GHz × 8
- RAM: 32 GB
Use git clone --recurse-submodules -j8 https://github.com/Martin36/FEVER2021_SharedTask.git
to clone the repo. This will make sure that the submodules are cloned with the repo.
Run the following command to create the conda environment, which includes the packages needed to run the code:
conda create --name fever2021 --file fever-conda.txt
Then activate the environment by running:
conda activate fever2021
Install the required pip packages:
pip install -r requirements.txt
Navigate to the FEVEROUS src folder and install the pip requirements for that submodule:
cd FEVEROUS/src
pip install -r requirements.txt
Tapas needs protobuf-compiler to run. Install it by running the following command (for Ubuntu/Debian):
sudo apt-get install protobuf-compiler
If you are using another OS, you can find the relevant download here.
Then you can navigate to the tapas folder and run the pip install:
cd tapas
pip install -e .
To be able to run this system, you should first download the data for the FEVEROUS task, which can be found here: https://fever.ai/dataset/feverous.html
The data that you will need is the following:
- The training dataset
- The development dataset
- The SQLite database
To start off, we need to retrieve the documents most likely to contain evidence that supports or refutes the given claim. In this system, this is done with a TF-IDF similarity score, computed for both the body text and the title of each document in the database.
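Conceptually, the retrieval step works like the sketch below, which uses scikit-learn to rank a toy corpus against a claim by cosine similarity of TF-IDF vectors. This is only an illustration; the documents, claim and top-k value are placeholders, and the actual implementation lives in src/doc_retrieval.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the FEVEROUS document bodies (or titles)
documents = [
    "Berlin is the capital and largest city of Germany.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Python is a high-level programming language.",
]
claim = "The Eiffel Tower is located in Paris."

# Fit a TF-IDF vectorizer on the corpus and project the claim into the same space
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)
claim_vec = vectorizer.transform([claim])

# Rank documents by cosine similarity between the claim and each document
scores = cosine_similarity(claim_vec, doc_matrix)[0]
top_k = scores.argsort()[::-1][:2]
print([(documents[i], round(scores[i], 3)) for i in top_k])
```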
Run the following code to create the text corpus:
python src/data_processing/create_corpus.py \
--db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
--out_path=data/corpus/
This will create a number of JSON files containing all the documents in the database. Each file holds 100,000 documents (except for the last one).
Once we have created the corpus documents we can create the TF-IDF matrix for all the documents. To do that, run the following code:
python src/doc_retrieval/create_tfidf.py \
--use_stemming \
--corpus_path=data/corpus/ \
--out_path=tfidf/
The --use_stemming argument reduces each word in the corpus to its stem before the TF-IDF matrix is calculated. If the argument is omitted, no stemming is used and the matrix is created significantly faster.
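To see what stemming changes, the sketch below plugs an NLTK stemmer into a scikit-learn TF-IDF vectorizer. The exact tokenizer and stemmer used by create_tfidf.py may differ, so treat this only as an illustration of the idea.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
base_analyzer = TfidfVectorizer().build_analyzer()

def stemmed_analyzer(text):
    # Reduce every token to its stem before it enters the vocabulary,
    # so inflected forms like "running" and "runs" share one feature
    return [stemmer.stem(token) for token in base_analyzer(text)]

vectorizer = TfidfVectorizer(analyzer=stemmed_analyzer)
matrix = vectorizer.fit_transform(["running runners run quickly", "the runner runs"])
print(sorted(vectorizer.get_feature_names_out()))
```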
To create the TF-IDF matrix for the titles, run the following code:
python src/doc_retrieval/create_title_tfidf.py \
--corpus_path=data/corpus/ \
--vectorizer_out_file=tfidf/title_vectorizer-32bit.pickle \
--wm_out_file=tfidf/title_tfidf_wm-32bit.pickle \
--n_gram_min=2 \
--n_gram_max=2
The --n_gram_min and --n_gram_max parameters tell the program which n-grams to use for the title TF-IDF matrix; in this case a bigram TF-IDF matrix is created. This second matrix takes significantly less time to create than the first one.
Now we have all the prerequisites in place to retrieve the top documents. To retrieve the top k documents for the training data, run the following code:
Note: Remove "stemmed" from the file names if you did not use stemming when creating the body text TF-IDF. This applies to both the vectorizer and the word model.
python src/doc_retrieval/document_retrieval.py \
--doc_id_map_path=tfidf/doc_id_map.json \
--data_path=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
--vectorizer_path=tfidf/vectorizer-stemmed-32bit.pickle \
--wm_path=tfidf/tfidf_wm-stemmed-32bit.pickle \
--title_vectorizer_path=tfidf/title_vectorizer-32bit.pickle \
--title_wm_path=tfidf/title_tfidf_wm-32bit.pickle \
--out_path=data/document_retrieval
By default, the document retrieval script returns the top 5 documents for the body text match and the top 5 for the title match. This means that the script returns a total of 10 documents per claim, unless some of the body text and title matches overlap. If you want to change the number of returned documents, set the --nr_of_docs argument to the desired amount, e.g. --nr_of_docs=3 if you only want to retrieve 6 documents for each claim.
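The combination of the two result lists is effectively an order-preserving, deduplicated union; a hypothetical sketch (the document titles are made up and the variable names are illustrative, not taken from the script):

```python
# Hypothetical top-5 results from the body text and title TF-IDF matrices
body_docs = ["Paris", "Eiffel Tower", "France", "Gustave Eiffel", "Seine"]
title_docs = ["Eiffel Tower", "Eiffel Tower (replica)", "Paris", "Tower", "Eiffel"]

# Order-preserving union that drops duplicates, so overlapping matches
# are only counted once (8 documents here instead of 10)
top_docs = list(dict.fromkeys(body_docs + title_docs))
print(len(top_docs), top_docs)
```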
If you would like to see how well the document retrieval model performed, run the code below:
python src/doc_retrieval/calculate_doc_retrieval_accuracy.py \
--data_path=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
--top_k_docs_path=data/document_retrieval/top_docs.jsonl
The next step is to retrieve the most relevant sentences from the documents that we just retrieved:
python src/sent_retrieval/sentence_retrieval.py \
--db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
--top_docs_path=data/document_retrieval/top_docs.jsonl \
--out_path=data/document_retrieval \
--n_gram_min=1 \
--n_gram_max=3
Here we use a TF-IDF matrix with unigrams, bigrams and trigrams to retrieve the most relevant sentences from the documents. Feel free to change the --n_gram_min and --n_gram_max parameters to see if you get better accuracy.
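The two n-gram arguments map directly onto the ngram_range parameter of a TF-IDF vectorizer. Below is a simplified sketch of ranking sentences with unigrams through trigrams; the sentences and claim are placeholders, not the exact code in sentence_retrieval.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The Eiffel Tower was completed in 1889.",
    "It is located on the Champ de Mars in Paris.",
    "Gustave Eiffel's company designed and built the tower.",
]
claim = "The Eiffel Tower was finished in 1889."

# ngram_range=(1, 3) builds features from unigrams, bigrams and trigrams,
# mirroring --n_gram_min=1 and --n_gram_max=3
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
sent_matrix = vectorizer.fit_transform(sentences)
scores = cosine_similarity(vectorizer.transform([claim]), sent_matrix)[0]
print(sentences[scores.argmax()])
```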
If you want to see how well the sentence retrieval performed, run the following code (remember to set k accordingly):
python src/sent_retrieval/calculate_sentence_retrieval_accuracy.py \
--data_path=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
--top_k_sents_path=data/document_retrieval/top_5_sents.jsonl \
--k=5
To extract tables from the documents we will use the TaPaS repository. First, the FEVEROUS data needs to be converted into a format suitable as TaPaS input. This can be done by running the following script:
python src/data_processing/create_tapas_data.py \
--db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
--data_path=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
--out_file=data/tapas/train/tapas_train.jsonl
And for the dev set:
python src/data_processing/create_tapas_data.py \
--db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
--data_path=<REPLACE_WITH_PATH_TO_YOUR_dev.jsonl_FILE> \
--out_file=data/tapas/train/tapas_dev.jsonl
Now for the part where we use the tapas repo, which is used to train a model for retrieving the most relevant tables. Make sure that you have installed Tapas; if not, see the installation instructions above.
You will also need to download a pretrained Tapas model from here. The one I used was tapas_dual_encoder_proj_256_tiny, due to a shortage of time and compute resources. If you have the capacity and want better results, you can choose a larger model. These instructions assume that you used the tiny pretrained model; if you choose another one, make sure to replace all instances of tapas_dual_encoder_proj_256_tiny with the name of your model in the following commands.
To produce the input Tapas data, run the following command:
python tapas/tapas/scripts/preprocess_nq.py \
--input_path=data/tapas \
--output_path=data/tapas \
--runner_type=direct \
--fever
Then we create the retrieval data:
python tapas/tapas/retrieval/create_retrieval_data_main.py \
--input_interactions_dir=data/tapas/tf_records/interactions \
--input_tables_dir=data/tapas/tf_records/tables \
--output_dir=data/tapas/tf_records/tf_examples \
--vocab_file=tapas_dual_encoder_proj_256_tiny/vocab.txt \
--max_seq_length=512 \
--max_column_id=512 \
--max_row_id=512 \
--use_document_title
Now it is time to fine-tune the Tapas model:
python tapas/tapas/experiments/table_retriever_experiment.py \
--do_train \
--keep_checkpoint_max=40 \
--model_dir=tapas_models \
--input_file_train=data/tapas/tf_records/tf_examples/train.tfrecord \
--bert_config_file=tapas_dual_encoder_proj_256_tiny/bert_config.json \
--init_checkpoint=tapas_dual_encoder_proj_256_tiny/model.ckpt \
--init_from_single_encoder=false \
--down_projection_dim=256 \
--num_train_examples=65000 \
--learning_rate=1.25e-5 \
--train_batch_size=256 \
--warmup_ratio=0.01 \
--max_seq_length=512
Evaluate the model:
python tapas/tapas/experiments/table_retriever_experiment.py \
--do_predict \
--model_dir=tapas_models \
--input_file_eval=data/tapas/tf_records/tf_examples/dev.tfrecord \
--bert_config_file=tapas_dual_encoder_proj_256_tiny/bert_config.json \
--init_from_single_encoder=false \
--down_projection_dim=256 \
--eval_batch_size=32 \
--num_train_examples=65000 \
--max_seq_length=512
Now we generate results for each checkpoint:
python tapas/tapas/experiments/table_retriever_experiment.py \
--do_predict \
--model_dir=tapas_models \
--prediction_output_dir=tapas_models/train \
--evaluated_checkpoint_metric=precision_at_1 \
--input_file_predict=data/tapas/tf_records/tf_examples/train.tfrecord \
--bert_config_file=tapas_dual_encoder_proj_256_tiny/bert_config.json \
--init_from_single_encoder=false \
--down_projection_dim=256 \
--eval_batch_size=32 \
--max_seq_length=512
python tapas/tapas/experiments/table_retriever_experiment.py \
--do_predict \
--model_dir=tapas_models \
--prediction_output_dir=tapas_models/tables \
--evaluated_checkpoint_metric=precision_at_1 \
--input_file_predict=data/tapas/tf_records/tf_examples/tables.tfrecord \
--bert_config_file=tapas_dual_encoder_proj_256_tiny/bert_config.json \
--init_from_single_encoder=false \
--down_projection_dim=256 \
--eval_batch_size=32 \
--max_seq_length=512
python tapas/tapas/experiments/table_retriever_experiment.py \
--do_predict \
--model_dir=tapas_models \
--prediction_output_dir=tapas_models/test \
--evaluated_checkpoint_metric=precision_at_1 \
--input_file_predict=data/tapas/tf_records/tf_examples/test.tfrecord \
--bert_config_file=tapas_dual_encoder_proj_256_tiny/bert_config.json \
--init_from_single_encoder=false \
--down_projection_dim=256 \
--eval_batch_size=32 \
--max_seq_length=512
Next we generate the retrieval results for each of the checkpoints:
python tapas/tapas/scripts/eval_table_retriever.py \
--prediction_files_local=tapas_models/train/predict_results_253.tsv \
--prediction_files_global=tapas_models/tables/predict_results_253.tsv \
--retrieval_results_file_path=tapas_models/train_knn.jsonl
python tapas/tapas/scripts/eval_table_retriever.py \
--prediction_files_local=tapas_models/test/predict_results_253.tsv \
--prediction_files_global=tapas_models/tables/predict_results_253.tsv \
--retrieval_results_file_path=tapas_models/test_knn.jsonl
python tapas/tapas/scripts/eval_table_retriever.py \
--prediction_files_local=tapas_models/eval_results_253.tsv \
--prediction_files_global=tapas_models/tables/predict_results_253.tsv \
--retrieval_results_file_path=tapas_models/dev_knn.jsonl
For the table cell extraction we will fine-tune a pretrained Tapas model from the Hugging Face library.
First the dataset needs to be transformed into the correct format. This is done by running the following script:
python src/data_processing/create_tapas_tables.py \
--tapas_train_path=data/tapas/tapas_train.jsonl \
--out_path=data/tapas/train_data/ \
--table_out_path=data/tapas/tables/
For the dev set:
python src/data_processing/create_tapas_tables.py \
--tapas_train_path=data/tapas/dev/tapas_dev.jsonl \
--out_path=data/tapas/dev/ \
--table_out_path=data/tapas/dev/tables/
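The training and evaluation scripts below build on the Tapas implementation in the Hugging Face transformers library. As a rough sketch of what loading google/tapas-tiny involves, see the snippet below; the sequence-classification head and the toy table are assumptions made for illustration, not necessarily what entailment_with_tapas.py does.

```python
import pandas as pd
import torch
from transformers import TapasTokenizer, TapasForSequenceClassification

# Load the tiny Tapas checkpoint and its tokenizer from the Hugging Face hub.
# The classification head is freshly initialized and untrained until fine-tuned.
tokenizer = TapasTokenizer.from_pretrained("google/tapas-tiny")
model = TapasForSequenceClassification.from_pretrained("google/tapas-tiny", num_labels=2)

# Tables are passed as pandas DataFrames with string-valued cells
table = pd.DataFrame(
    {"Tower": ["Eiffel Tower", "Tokyo Tower"], "Height (m)": ["330", "333"]}
)
claim = "The Eiffel Tower is 330 metres tall."

inputs = tokenizer(table=table, queries=[claim], padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)
```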
Then we can train the model:
python src/entailment/entailment_with_tapas.py \
--train_csv_path=data/tapas/train_data/tapas_data.csv \
--tapas_model_name=google/tapas-tiny \
--model_path=models/ \
--batch_size=32
And evaluate it:
python src/entailment/entailment_with_tapas_predict.py \
--table_csv_path=data/tapas/dev/tables/ \
--eval_csv_path=data/tapas/dev/tapas_data.csv \
--tapas_model_name=google/tapas-tiny \
--model_path=models/tapas_model.pth \
--batch_size=1
The last part is to predict the label for the claim. This is done by training a deep neural network whose input representations come from the last hidden layer of a pretrained Tapas model (for the most relevant table) and of a pretrained RoBERTa model (for the retrieved sentences).
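Below is a minimal sketch of what such a classifier head could look like in PyTorch. The hidden sizes, pooling and layer layout are assumptions made for illustration, not necessarily the architecture used in train_veracity_prediction_model.py.

```python
import torch
import torch.nn as nn

class VeracityClassifier(nn.Module):
    """Feed-forward head over concatenated table and sentence representations."""

    def __init__(self, tapas_dim=128, roberta_dim=768, hidden_dim=256, num_labels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(tapas_dim + roberta_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_labels),  # SUPPORTS / REFUTES / NOT ENOUGH INFO
        )

    def forward(self, tapas_repr, roberta_repr):
        # tapas_repr: pooled last hidden state of the Tapas model for the top table
        # roberta_repr: pooled last hidden state of RoBERTa for the retrieved sentences
        return self.net(torch.cat([tapas_repr, roberta_repr], dim=-1))

# Example with random feature vectors standing in for the real model outputs
model = VeracityClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 3])
```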
First we need to create a map between the claim ids and their labels. The following script will create this map for both the train and dev datasets:
python src/data_processing/create_id_label_map.py \
--train_data_file=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
--dev_data_file=<REPLACE_WITH_PATH_TO_YOUR_dev.jsonl_FILE> \
--train_out_file=data/id_label_map_train.json \
--dev_out_file=data/dev/id_label_map_eval.json
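Conceptually, the map is just a JSON object from claim id to gold label. A hypothetical sketch of building it from the FEVEROUS jsonl is shown below; the "id" and "label" field names follow the FEVEROUS data format, but the script's exact output schema may differ.

```python
import json

# Hypothetical sketch: build a claim-id -> label map from a FEVEROUS jsonl file
id_label_map = {}
with open("train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if "id" in record and "label" in record:
            id_label_map[record["id"]] = record["label"]

with open("data/id_label_map_train.json", "w") as f:
    json.dump(id_label_map, f)
```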
Then we need to create the entailment data for the sentences:
python src/data_processing/create_sentence_entailment_data.py \
--db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
--input_data_file=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
--output_data_file=data/sentence_entailment_train_data.jsonl
And for the dev set:
python src/data_processing/create_sentence_entailment_data.py \
--db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
--input_data_file=<REPLACE_WITH_PATH_TO_YOUR_dev.jsonl_FILE> \
--output_data_file=data/sentence_entailment_eval_data.jsonl
Then we merge the sentence data with the previously created dataset for the tabular data (as used in step 4):
python src/data_processing/merge_table_and_sentence_data.py \
--tapas_csv_file=data/tapas/train_data/tapas_data.csv \
--sentence_data_file=data/sentence_entailment_train_data.jsonl \
--id_label_map_file=data/id_label_map_train.json \
--out_file=data/entailment_train_data.csv
For the dev set:
python src/data_processing/merge_table_and_sentence_data.py \
--tapas_csv_file=data/tapas/dev/tapas_data.csv \
--sentence_data_file=data/sentence_entailment_eval_data.jsonl \
--id_label_map_file=data/id_label_map_eval.json \
--out_file=data/entailment_eval_data.csv
Now we can train the model:
python src/entailment/train_veracity_prediction_model.py \
--train_csv_path=data/entailment_train_data.csv \
--batch_size=32 \
--out_path=models/
Once it has finished training, we can run the evaluation script to evaluate the model:
python src/entailment/eval_veracity_prediction_model.py \
--csv_file=data/entailment_eval_data.csv \
--model_file=models/veracity_prediction_model.pth \
--batch_size=16 \
--out_file=data/dev/veracity_prediction_accuracy.json