FEVER 2021 Shared Task

This repo contains my contribution to the FEVER 2021 Shared Task.

More information about the FEVER 2021 Shared Task can be found here: https://fever.ai/task.html

Note: To be able to run this repo you will need at least 70 GB of free disk space and preferably 32 GB of RAM.

The code is tested on a machine with the following specs:

  • OS: Ubuntu 20.04.2 LTS (64-bit)
  • Graphics: NVIDIA GeForce GTX 970
  • CPU: Intel® Core™ i7-6700 CPU @ 3.40GHz × 8
  • RAM: 32 GB

Cloning the repo

Use git clone --recurse-submodules -j8 https://github.com/Martin36/FEVER2021_SharedTask.git to clone the repo. This will make sure that the submodules are cloned with the repo.

Creating Anaconda environment and installing requirements

Run the following command to create the conda environment, which includes the packages needed to run the code:

conda create --name fever2021 --file fever-conda.txt

Then activate the environment by running:

conda activate fever2021

Install the required pip packages:

pip install -r requirements.txt

Navigate to the FEVEROUS src folder and install the pip requirements for that submodule:

cd FEVEROUS/src
pip install -r requirements.txt

Installing Tapas

Tapas needs protobuf-compiler to be able to run. Install it by running the following command (for Ubuntu/Debian):

sudo apt-get install protobuf-compiler

If you are using another OS you can find the relevant download here.

Then you can navigate to the tapas folder and run the pip install:

cd tapas
pip install -e .

How to run the system

Prerequisites

To be able to run this system, you should first download the data for the FEVEROUS task, which can be found here: https://fever.ai/dataset/feverous.html

The data that you will need is the following:

  • The training dataset
  • The development dataset
  • The SQLite database

Step 1: Retrieving the most relevant documents

To start off, we need to collect the documents most likely to contain the evidence that supports or refutes the given claim. In this system, this is done with a TF-IDF similarity score, computed for both the body text and the title of each document in the database.
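
To give an idea of the approach, here is a minimal sketch using scikit-learn (the documents and claim are made up, and this is not the repo's actual implementation): the claim is vectorized with the same TF-IDF vectorizer as the corpus, and documents are ranked by cosine similarity. In the actual system this is done twice, once for the body text and once for the titles.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up mini corpus and claim, purely for illustration
documents = [
    "Anarchism is a political philosophy and movement.",
    "Algeria is a country in North Africa.",
]
claim = "Algeria is located in Africa."

vectorizer = TfidfVectorizer()
doc_wm = vectorizer.fit_transform(documents)   # (n_docs, vocab) TF-IDF matrix
claim_vec = vectorizer.transform([claim])      # (1, vocab)

# TF-IDF rows are L2-normalized, so the linear kernel equals cosine similarity
scores = linear_kernel(claim_vec, doc_wm).flatten()
top_k = scores.argsort()[::-1][:5]
print(top_k, scores[top_k])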

Creating the corpus

Run the following code to create the text corpus:

Note: Create a folder named data directly in the repo root to store all the generated files in.
python src/data_processing/create_corpus.py \
    --db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
    --out_path=data/corpus/

This will create a number of JSON files containing all the documents in the database. Each file contains 100,000 documents (except for the last one).
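
As a rough, hypothetical sketch of what the corpus creation does, the snippet below iterates over the SQLite database and writes the documents in chunks of 100,000. The wiki table name, its id and data columns, and the corpora_<n>.json file naming are assumptions, not necessarily what create_corpus.py actually produces.

import json
import sqlite3

def create_corpus(db_path: str, out_path: str, shard_size: int = 100_000):
    conn = sqlite3.connect(db_path)
    cursor = conn.execute("SELECT id, data FROM wiki")  # assumed schema
    shard, shard_nr = {}, 0
    for doc_id, data in cursor:
        shard[doc_id] = data
        if len(shard) == shard_size:
            with open(f"{out_path}/corpora_{shard_nr}.json", "w") as f:
                json.dump(shard, f)
            shard, shard_nr = {}, shard_nr + 1
    if shard:  # the last, smaller shard
        with open(f"{out_path}/corpora_{shard_nr}.json", "w") as f:
            json.dump(shard, f)
    conn.close()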

Creating the TF-IDF matrices

Once we have created the corpus documents we can create the TF-IDF matrix for all the documents. To do that, run the following code:

Note: This might use a lot of RAM
python src/doc_retrieval/create_tfidf.py \
    --use_stemming \
    --corpus_path=data/corpus/ \
    --out_path=tfidf/

The --use_stemming argument will reduce each word in the corpus to its stemmed version before calculating the TF-IDF matrix. If that argument is removed, stemming will not be used and the matrix will be created significantly faster.
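
For reference, this is roughly what stemming before TF-IDF looks like (a minimal sketch with NLTK's SnowballStemmer and a custom scikit-learn analyzer, not the repo's exact code):

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")
base_analyzer = TfidfVectorizer().build_analyzer()

def stemmed_analyzer(text):
    # Tokenize with the default analyzer, then reduce each token to its stem
    return [stemmer.stem(token) for token in base_analyzer(text)]

vectorizer = TfidfVectorizer(analyzer=stemmed_analyzer)
tfidf_wm = vectorizer.fit_transform(["Running runners run quickly", "A runner was running"])
print(vectorizer.get_feature_names_out())  # stems such as 'run', 'runner', 'quick'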

To create the TF-IDF matrix for the titles, run the following code:

python src/doc_retrieval/create_title_tfidf.py \
    --corpus_path=data/corpus/ \
    --vectorizer_out_file=tfidf/title_vectorizer-32bit.pickle \
    --wm_out_file=tfidf/title_tfidf_wm-32bit.pickle \
    --n_gram_min=2 \
    --n_gram_max=2

The --n_gram_min and --n_gram_max parameters tell the program which type of n-grams to use for the title TF-IDF matrix. In this case, a bigram TF-IDF matrix is created.

This second matrix will take significantly less time to create than the first one.

Retrieving the top documents

Now we have completed all the prerequisites for retrieving the top documents. To retrieve the top k documents for the training data, run the following code:

Note: Remove "stemmed" if you did not use stemming when creating the body text TF-IDF. This goes for both the vectorizer and word model.
python src/doc_retrieval/document_retrieval.py \
    --doc_id_map_path=tfidf/doc_id_map.json \
    --data_path=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
    --vectorizer_path=tfidf/vectorizer-stemmed-32bit.pickle \
    --wm_path=tfidf/tfidf_wm-stemmed-32bit.pickle \
    --title_vectorizer_path=tfidf/title_vectorizer-32bit.pickle \
    --title_wm_path=tfidf/title_tfidf_wm-32bit.pickle \
    --out_path=data/document_retrieval

By default, the document retrieval script returns the top 5 documents each for the body text match and the title match. This means that the script returns a total of 10 documents per claim, unless some of the body text and title matches overlap. If you want to change the number of returned documents, set the --nr_of_docs argument to the desired amount, e.g. --nr_of_docs=3 to retrieve at most 6 documents for each claim.
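
Conceptually, the two rankings are combined roughly as in the sketch below (hypothetical, not necessarily how document_retrieval.py implements it): take the top k from each ranking and keep the union, so overlapping matches are only counted once.

def combine_top_docs(body_matches, title_matches, k=5):
    combined = []
    for doc_id in body_matches[:k] + title_matches[:k]:
        if doc_id not in combined:  # drop overlapping matches
            combined.append(doc_id)
    return combined

print(combine_top_docs(["A", "B", "C"], ["B", "D", "E"], k=3))
# ['A', 'B', 'C', 'D', 'E'] -- 5 documents instead of 6, since 'B' overlaps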

Optional: Calculate document retrieval accuracy

If you would like to see how well the document retrieval model performed, run the code below:

python src/doc_retrieval/calculate_doc_retrieval_accuracy.py \
    --data_path=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
    --top_k_docs_path=data/document_retrieval/top_docs.jsonl

Step 2: Retrieve top sentences

The next step is to retrieve the most relevant sentences from the documents that we just retrieved.

python src/sent_retrieval/sentence_retrieval.py \
    --db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
    --top_docs_path=data/document_retrieval/top_docs.jsonl \
    --out_path=data/document_retrieval \
    --n_gram_min=1 \
    --n_gram_max=3

Here we use a TF-IDF matrix with unigrams, bigrams and trigrams to retrieve the most relevant sentences from the documents. Feel free to change the --n_gram_min and --n_gram_max parameters to see if you get better accuracy.

Optional: Calculate sentence retrieval accuracy

If you want to see how well the sentence retrieval performed, run the following code (remember to set k accordingly):

python src/sent_retrieval/calculate_sentence_retrieval_accuracy.py \
    --data_path=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
    --top_k_sents_path=data/document_retrieval/top_5_sents.jsonl \
    --k=5

Step 3: Extract tables from documents

To extract tables from the documents we will use the TaPaS repository. First, the FEVEROUS data needs to be converted into a format suitable for the TaPaS input. This can be done by running the following script:

python src/data_processing/create_tapas_data.py \
    --db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
    --data_path=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
    --out_file=data/tapas/train/tapas_train.jsonl

And for the dev set:

python src/data_processing/create_tapas_data.py \
    --db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
    --data_path=<REPLACE_WITH_PATH_TO_YOUR_dev.jsonl_FILE> \
    --out_file=data/tapas/dev/tapas_dev.jsonl

Now for the part where we need to use the tapas repo, which is used to train a model for retrieving the most relevant tables. Make sure that you have installed Tapas; if not, take a look at the Installing Tapas instructions above.

You will also need to download a pretrained Tapas model from here. The one I used was tapas_dual_encoder_proj_256_tiny, due to a shortage of time and compute resources, but if you have the capability and want better results, you might choose a larger model. These instructions assume that you used the tiny pretrained model; if you choose another one, make sure to replace all instances of tapas_dual_encoder_proj_256_tiny with the name of your model in the following commands.

To produce the input Tapas data, run the following command:

python tapas/tapas/scripts/preprocess_nq.py \
    --input_path=data/tapas \
    --output_path=data/tapas \
    --runner_type=direct \
    --fever

Then we create the retrieval data:

python tapas/tapas/retrieval/create_retrieval_data_main.py \
    --input_interactions_dir=data/tapas/tf_records/interactions \
    --input_tables_dir=data/tapas/tf_records/tables \
    --output_dir=data/tapas/tf_records/tf_examples \
    --vocab_file=tapas_dual_encoder_proj_256_tiny/vocab.txt \
    --max_seq_length=512 \
    --max_column_id=512 \
    --max_row_id=512 \
    --use_document_title

Now it is time to fine-tune the Tapas model:

python tapas/tapas/experiments/table_retriever_experiment.py \
    --do_train \
    --keep_checkpoint_max=40 \
    --model_dir=tapas_models \
    --input_file_train=data/tapas/tf_records/tf_examples/train.tfrecord \
    --bert_config_file=tapas_dual_encoder_proj_256_tiny/bert_config.json \
    --init_checkpoint=tapas_dual_encoder_proj_256_tiny/model.ckpt \
    --init_from_single_encoder=false \
    --down_projection_dim=256 \
    --num_train_examples=65000 \
    --learning_rate=1.25e-5 \
    --train_batch_size=256 \
    --warmup_ratio=0.01 \
    --max_seq_length=512

Evaluate the model:

python tapas/tapas/experiments/table_retriever_experiment.py \
    --do_predict \
    --model_dir=tapas_models \
    --input_file_eval=data/tapas/tf_records/tf_examples/dev.tfrecord \
    --bert_config_file=tapas_dual_encoder_proj_256_tiny/bert_config.json \
    --init_from_single_encoder=false \
    --down_projection_dim=256 \
    --eval_batch_size=32 \
    --num_train_examples=65000 \
    --max_seq_length=512

Now we generate results for each checkpoint:

python tapas/tapas/experiments/table_retriever_experiment.py \
    --do_predict \
    --model_dir=tapas_models \
    --prediction_output_dir=tapas_models/train \
    --evaluated_checkpoint_metric=precision_at_1 \
    --input_file_predict=data/tapas/tf_records/tf_examples/train.tfrecord \
    --bert_config_file=tapas_dual_encoder_proj_256_tiny/bert_config.json \
    --init_from_single_encoder=false \
    --down_projection_dim=256 \
    --eval_batch_size=32 \
    --max_seq_length=512
python tapas/tapas/experiments/table_retriever_experiment.py \
    --do_predict \
    --model_dir=tapas_models \
    --prediction_output_dir=tapas_models/tables \
    --evaluated_checkpoint_metric=precision_at_1 \
    --input_file_predict=data/tapas/tf_records/tf_examples/tables.tfrecord \
    --bert_config_file=tapas_dual_encoder_proj_256_tiny/bert_config.json \
    --init_from_single_encoder=false \
    --down_projection_dim=256 \
    --eval_batch_size=32 \
    --max_seq_length=512
python tapas/tapas/experiments/table_retriever_experiment.py \
    --do_predict \
    --model_dir=tapas_models \
    --prediction_output_dir=tapas_models/test \
    --evaluated_checkpoint_metric=precision_at_1 \
    --input_file_predict=data/tapas/tf_records/tf_examples/test.tfrecord \
    --bert_config_file=tapas_dual_encoder_proj_256_tiny/bert_config.json \
    --init_from_single_encoder=false \
    --down_projection_dim=256 \
    --eval_batch_size=32 \
    --max_seq_length=512

Next we generate the retrieval results for each of the checkpoints:

python tapas/tapas/scripts/eval_table_retriever.py \
    --prediction_files_local=tapas_models/train/predict_results_253.tsv \
    --prediction_files_global=tapas_models/tables/predict_results_253.tsv \
    --retrieval_results_file_path=tapas_models/train_knn.jsonl
python tapas/tapas/scripts/eval_table_retriever.py \
    --prediction_files_local=tapas_models/test/predict_results_253.tsv \
    --prediction_files_global=tapas_models/tables/predict_results_253.tsv \
    --retrieval_results_file_path=tapas_models/test_knn.jsonl
python tapas/tapas/scripts/eval_table_retriever.py \
    --prediction_files_local=tapas_models/eval_results_253.tsv \
    --prediction_files_global=tapas_models/tables/predict_results_253.tsv \
    --retrieval_results_file_path=tapas_models/dev_knn.jsonl

Step 4: Table cell extraction

For the table cell extraction, we will fine-tune a pretrained Tapas model from the Hugging Face Transformers library.
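
For orientation, the Hugging Face fine-tuning pattern for TAPAS cell selection looks roughly like the sketch below. The toy table, query and answer coordinates are made up, and the repo's own scripts may use a different head or input format; this is only meant to show the general API.

import pandas as pd
from transformers import TapasTokenizer, TapasForQuestionAnswering

model_name = "google/tapas-tiny"
tokenizer = TapasTokenizer.from_pretrained(model_name)
model = TapasForQuestionAnswering.from_pretrained(model_name)

# Toy table and query; all table values must be strings
table = pd.DataFrame({"Player": ["Messi", "Ronaldo"], "Goals": ["700", "750"]})
queries = ["How many goals did Messi score?"]
answer_coordinates = [[(0, 1)]]   # (row, column) of the answer cell(s) per query
answer_text = [["700"]]

inputs = tokenizer(
    table=table,
    queries=queries,
    answer_coordinates=answer_coordinates,
    answer_text=answer_text,
    padding="max_length",
    return_tensors="pt",
)
outputs = model(**inputs)  # the tokenizer adds the labels needed to compute a loss
outputs.loss.backward()    # an optimizer step would follow in a real training loop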

First the dataset needs to be transformed into the correct format. This is done by running the following script:

python src/data_processing/create_tapas_tables.py \
    --tapas_train_path=data/tapas/train/tapas_train.jsonl \
    --out_path=data/tapas/train_data/ \
    --table_out_path=data/tapas/tables/

For the dev set:

python src/data_processing/create_tapas_tables.py \
    --tapas_train_path=data/tapas/dev/tapas_dev.jsonl \
    --out_path=data/tapas/dev/ \
    --table_out_path=data/tapas/dev/tables/

Then we can train the model:

python src/entailment/entailment_with_tapas.py \
    --train_csv_path=data/tapas/train_data/tapas_data.csv \
    --tapas_model_name=google/tapas-tiny \
    --model_path=models/ \
    --batch_size=32

And evaluate it:

python src/entailment/entailment_with_tapas_predict.py \
    --table_csv_path=data/tapas/dev/tables/ \
    --eval_csv_path=data/tapas/dev/tapas_data.csv \
    --tapas_model_name=google/tapas-tiny \
    --model_path=models/tapas_model.pth \
    --batch_size=1

Step 5: Claim verification

The last part is to predict the label for the claim. This is done by training a deep neural network whose input representations are taken from the last hidden layer of a pretrained Tapas model, for the most relevant table, and of a pretrained RoBERTa model, for the retrieved sentences.
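
A hypothetical sketch of this kind of fusion model is shown below (not the repo's exact architecture): the [CLS] representations from Tapas and RoBERTa are concatenated and passed through a small MLP with three output classes (SUPPORTS, REFUTES, NOT ENOUGH INFO). The hidden sizes (128 for google/tapas-tiny, 768 for roberta-base) and the MLP shape are assumptions.

import torch
import torch.nn as nn
from transformers import RobertaModel, TapasModel

class VeracityClassifier(nn.Module):
    def __init__(self, tapas_dim=128, roberta_dim=768, hidden_dim=256, num_labels=3):
        super().__init__()
        self.tapas = TapasModel.from_pretrained("google/tapas-tiny")
        self.roberta = RobertaModel.from_pretrained("roberta-base")
        self.classifier = nn.Sequential(
            nn.Linear(tapas_dim + roberta_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, tapas_inputs, roberta_inputs):
        # Use the [CLS] token of the last hidden layer as a fixed-size representation
        table_repr = self.tapas(**tapas_inputs).last_hidden_state[:, 0]
        sent_repr = self.roberta(**roberta_inputs).last_hidden_state[:, 0]
        return self.classifier(torch.cat([table_repr, sent_repr], dim=-1))

A training loop over the merged entailment data would then feed tokenized claim + table and claim + sentence pairs through this module and minimize cross-entropy against the gold labels.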

First we need to create the map between the claim id and the label. The following script will create this map for both the train and dev datasets:

python src/data_processing/create_id_label_map.py \
    --train_data_file=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
    --dev_data_file=<REPLACE_WITH_PATH_TO_YOUR_dev.jsonl_FILE> \
    --train_out_file=data/id_label_map_train.json \
    --dev_out_file=data/id_label_map_eval.json
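
The map itself is just a claim-id to label dictionary; a hypothetical sketch of what the script produces is shown below (the id and label field names follow the FEVEROUS annotation format, but verify them against your data):

import json

def create_id_label_map(data_file: str, out_file: str):
    id_label_map = {}
    with open(data_file) as f:
        for line in f:
            entry = json.loads(line)
            if "id" in entry and "label" in entry:  # skip entries without a label
                id_label_map[str(entry["id"])] = entry["label"]
    with open(out_file, "w") as f:
        json.dump(id_label_map, f)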

Then we need to create the entailment data for the sentences:

python src/data_processing/create_sentence_entailment_data.py \
    --db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
    --input_data_file=<REPLACE_WITH_PATH_TO_YOUR_train.jsonl_FILE> \
    --output_data_file=data/sentence_entailment_train_data.jsonl

And for the dev set:

python src/data_processing/create_sentence_entailment_data.py \
    --db_path=<REPLACE_WITH_PATH_TO_YOUR_DB_FILE> \
    --input_data_file=<REPLACE_WITH_PATH_TO_YOUR_dev.jsonl_FILE> \
    --output_data_file=data/sentence_entailment_eval_data.jsonl

Then we merge the sentence data with the previously created dataset for the tabular data (as used in step 4):

python src/data_processing/merge_table_and_sentence_data.py \
    --tapas_csv_file=data/tapas/train_data/tapas_data.csv \
    --sentence_data_file=data/sentence_entailment_train_data.jsonl \
    --id_label_map_file=data/id_label_map_train.json \
    --out_file=data/entailment_train_data.csv

For the dev set:

python src/data_processing/merge_table_and_sentence_data.py \
    --tapas_csv_file=data/tapas/dev/tapas_data.csv \
    --sentence_data_file=data/sentence_entailment_eval_data.jsonl \
    --id_label_map_file=data/id_label_map_eval.json \
    --out_file=data/entailment_eval_data.csv

Now we can train the model:

python src/entailment/train_veracity_prediction_model.py \
    --train_csv_path=data/entailment_train_data.csv \
    --batch_size=32 \
    --out_path=models/

Once it has finished training, we can run the evaluation script to evaluate the model:

python src/entailment/eval_veracity_prediction_model.py \
    --csv_file=data/entailment_eval_data.csv \
    --model_file=models/veracity_prediction_model.pth \
    --batch_size=16 \
    --out_file=data/dev/veracity_prediction_accuracy.json
