This is the repository for the paper: Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps (COLING 2020).
- We release the file `para_with_hyperlink.zip`, which contains all articles/paragraphs with hyperlink information (except for some error paragraphs). Each paragraph has the following fields (a loading sketch follows this list):
  - `id`: the Wikipedia id of the article
  - `title`: the title of the article
  - `sentences`: a list of sentences
  - `mentions`: a list of hyperlink mentions; each element has the fields `id`, `start`, `end`, `ref_url`, `ref_ids`, and `sent_idx`.
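Below is a minimal sketch of how a paragraph entry could be inspected once the archive is extracted. The file name `paragraphs.jsonl`, the JSON-Lines layout, and the interpretation of `start`/`end` as character offsets are assumptions; adjust them to the actual contents of the archive.

```python
import json

# Assumption: the extracted archive contains JSON Lines, one paragraph per line.
# Adjust the path/format to match the actual files inside para_with_hyperlink.zip.
with open("para_with_hyperlink/paragraphs.jsonl", encoding="utf-8") as f:
    for line in f:
        para = json.loads(line)
        print(para["id"], para["title"], len(para["sentences"]))
        for mention in para["mentions"]:
            # Each mention links a span in the sentence at `sent_idx` to the
            # referenced article via `ref_url` / `ref_ids`.
            # Assumption: `start`/`end` are character offsets within that sentence.
            sent = para["sentences"][mention["sent_idx"]]
            print("  ->", sent[mention["start"]:mention["end"]], mention["ref_url"])
        break  # inspect only the first paragraph
```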
- Here is the link to the version of the dataset that fixes the inconsistent sentence segmentation in paragraphs.
- Because an entity can have multiple names in Wikidata, we add `evidences_id` and `answer_id` to our dataset (a hypothetical illustration follows this list). Here are the details:
  - For inference and compositional questions: we add them to all questions.
  - For comparison and bridge-comparison questions: we add them to questions that use the relations `country`, `country of origin`, and `country of citizenship`.
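As a purely hypothetical illustration (only the field names come from the dataset; the values below are made up), `answer_id` pins the answer to a single Wikidata entity so that different surface names of the same entity are scored consistently:

```python
# Made-up values for illustration only; see the released files for real samples.
sample_with_ids = {
    "answer": "United States of America",
    # "Q30" is the Wikidata id of the United States; with `answer_id`,
    # aliases such as "USA" or "United States" resolve to the same entity.
    "answer_id": "Q30",
}
```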
- We release an updated evaluation script, `2wikimultihop_evaluate_v1.1.py`, which can be used to evaluate the dataset with `evidences_id` and `answer_id`.
- We also update the results of the baseline model by using the new evaluation script and the dataset with `evidences_id` and `answer_id`. The updated results for Tables 5, 6, and 7 in the paper are in the folder `update_results`.
- Here is the link to the dataset with `evidences_id` and `answer_id`. The file `id_aliases.json` is used for evaluation.
Date | Model | Ans EM | Ans F1 | Sup EM | Sup F1 | Evi EM | Evi F1 | Joint EM | Joint F1 |
---|---|---|---|---|---|---|---|---|---|
Oct 29, 2021 | NA-Reviewer | 76.73 | 81.91 | 89.61 | 94.31 | 53.66 | 70.83 | 52.75 | 65.23 |
Oct 26, 2021 | CRERC | 69.58 | 72.33 | 82.86 | 90.68 | 54.86 | 68.83 | 49.80 | 58.99 |
June, 2022 | BigBird-base model | 74.05 | 79.68 | 77.14 | 92.13 | 45.75 | 76.64 | 39.30 | 63.24 |
Jan 12, 2022 | BigBird-base model - Weighted (Anonymous) | 73.04 | 78.90 | 76.92 | 91.95 | 45.05 | 76.13 | 38.72 | 62.33 |
Jan 12, 2022 | BigBird-base model - Unweighted (Anonymous) | 72.38 | 77.98 | 75.68 | 91.56 | 35.07 | 71.09 | 29.86 | 57.74 |
June 14, 2021 | BigBird-base model (Anonymous) | 71.42 | 77.64 | 73.84 | 90.68 | 24.64 | 63.69 | 21.37 | 51.44 |
Dec 11, 2021 | RoBERTa-base (Anonymous) | 32.24 | 40.90 | 40.91 | 71.85 | 13.80 | 41.37 | 6.92 | 20.54 |
Oct 25, 2020 | Baseline model | 36.53 | 43.93 | 24.99 | 65.26 | 1.07 | 14.94 | 0.35 | 5.41 |
Aug 2, 2023 | Beam Retrieval | 88.47 | 90.87 | 95.87 | 98.15 | x | x | x | x |
July 30, 2021 | HGN-revise model (Anonymous) | 71.20 | 75.69 | 69.35 | 89.07 | x | x | x | x |
To evaluate your model on the test data, please contact us. Please prepare the following information:
- Your prediction file (following the format in the file `prediction_format.json`; a hedged sketch is shown after this list)
- The name of your model
- Public repository of your model (optional)
- Reference to your publication (optional)
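The authoritative layout is defined by `prediction_format.json` in this repository. As a rough sketch only, assuming a HotpotQA-style layout keyed by question `_id` (the key names and structure below are assumptions, not the official specification):

```python
# Assumed prediction layout, loosely following the HotpotQA baseline output.
# Check prediction_format.json for the authoritative format.
prediction = {
    # question _id -> predicted answer string
    "answer": {"<question _id>": "<answer string>"},
    # question _id -> list of supporting facts as [title, sent_id]
    "sp": {"<question _id>": [["<paragraph title>", 0]]},
    # question _id -> list of evidence triples [subject, relation, object]
    "evidence": {"<question _id>": [["<subject>", "<relation>", "<object>"]]},
}
```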
The full dataset is available here.
Our dataset follows the format of HotpotQA. Each sample has the following keys (a loading sketch follows this list):
- `_id`: a unique id for each sample
- `question`: a string
- `answer`: an answer to the question. The test data does not have this information.
- `supporting_facts`: a list; each element is a list `[title, sent_id]`, where `title` is the title of the paragraph and `sent_id` is the index (starting from 0) of the sentence that the model uses. The test data does not have this information.
- `context`: a list; each element is a list `[title, sentences]`, where `sentences` is a list of sentences.
- `evidences`: a list; each element is a triple `[subject entity, relation, object entity]`. The test data does not have this information.
- `type`: a string; there are four types of questions in our dataset: comparison, inference, compositional, and bridge-comparison.
- `entity_ids`: a string that contains the two Wikidata ids (four for bridge-comparison questions) of the gold paragraphs, e.g., 'Q7320430_Q51759'.
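A minimal sketch of reading the training split and accessing these fields, assuming the file is a JSON array of samples as in HotpotQA (the path `wikimultihop/train.json` mirrors the preprocessing commands below; adjust it to where you store the data):

```python
import json

# Path assumption: the same location used by the preprocessing commands below.
with open("wikimultihop/train.json", encoding="utf-8") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["_id"], sample["type"])
print("Q:", sample["question"])
print("A:", sample["answer"])

# Supporting facts point into the context: [title, sent_id].
titles_to_sents = {title: sents for title, sents in sample["context"]}
for title, sent_id in sample["supporting_facts"]:
    print("Support:", title, "->", titles_to_sents[title][sent_id])

# Evidence triples and the Wikidata ids of the gold paragraphs.
print("Evidences:", sample["evidences"])
print("Gold paragraph ids:", sample["entity_ids"].split("_"))
```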
Our baseline model is based on the baseline model of HotpotQA, and the process to train and test it is quite similar.
- Process the train data:
  `python3 main.py --mode prepro --data_file wikimultihop/train.json --para_limit 2250 --data_split train`
- Process the dev data:
  `python3 main.py --mode prepro --data_file wikimultihop/dev.json --para_limit 2250 --data_split dev`
- Train a model:
  `python3 -u main.py --mode train --para_limit 2250 --batch_size 24 --init_lr 0.1 --keep_prob 1.0`
- Evaluate on dev (local evaluation):
  `python3 main.py --mode test --data_split dev --para_limit 2250 --batch_size 24 --init_lr 0.1 --keep_prob 1.0 --save WikiMultiHop-20201024-023745 --prediction_file predictions/wikimultihop_dev_pred.json`
  `python3 2wikimultihop_evaluate.py predictions/wikimultihop_dev_pred.json data/dev.json`
- Use the new evaluation script:
  `python3 2wikimultihop_evaluate_v1.1.py predictions/wikimultihop_dev_pred.json data_ids/dev.json id_aliases.json`
If you plan to use the dataset, please cite our paper:
    @inproceedings{xanh2020_2wikimultihop,
        title = "Constructing A Multi-hop {QA} Dataset for Comprehensive Evaluation of Reasoning Steps",
        author = "Ho, Xanh and
          Duong Nguyen, Anh-Khoa and
          Sugawara, Saku and
          Aizawa, Akiko",
        booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
        month = dec,
        year = "2020",
        address = "Barcelona, Spain (Online)",
        publisher = "International Committee on Computational Linguistics",
        url = "https://www.aclweb.org/anthology/2020.coling-main.580",
        pages = "6609--6625",
    }
The baseline model and the evaluation script are adapted from https://github.com/hotpotqa/hotpot