Please use the following citation:
@misc{https://doi.org/10.48550/arxiv.2212.09577,
doi = {10.48550/ARXIV.2212.09577},
url = {https://arxiv.org/abs/2212.09577},
author = {Funkquist, Martin and Kuznetsov, Ilia and Hou, Yufang and Gurevych, Iryna},
title = {CiteBench: A benchmark for Scientific Citation Text Generation},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution Share Alike 4.0 International}
}
Abstract: Science progresses by incrementally building upon the prior body of knowledge documented in scientific publications. The acceleration of research across many fields makes it hard to stay up-to-date with the recent developments and to summarize the ever-growing body of prior work. To target this issue, the task of citation text generation aims to produce accurate textual summaries given a set of papers-to-cite and the citing paper context. Existing studies in citation text generation are based upon widely diverging task definitions, which makes it hard to study this task systematically. To address this challenge, we propose CiteBench: a benchmark for citation text generation that unifies multiple diverse datasets and enables standardized evaluation of citation text generation models across task designs and domains. Using the new benchmark, we investigate the performance of multiple strong baselines, test their transferability between the datasets, and deliver new insights into the task definition and evaluation to guide future research in citation text generation.
Contact persons: Martin Funkquist (martin.funkquist@liu.se), Ilia Kuznetsov (kuznetsov@ukp.informatik.tu-darmstadt.de)
https://www.ukp.tu-darmstadt.de/
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
The src folder contains the code for this project. The data can be downloaded using the provided script.
The code has been developed and tested on Ubuntu 20.04 and is not guaranteed to work on other operating systems.
The easiest way to run the code in this repo is to use Anaconda. If you haven't installed it, you can find the installation guidelines here: https://docs.anaconda.com/anaconda/install/
Start by creating a new conda environment:
conda create --name citebench python=3.9
And activate it:
conda activate citebench
Install requirements:
pip install -r requirements.txt
Download the raw datasets:
sh get_raw_data.sh
When the download is finished, the processed benchmark dataset can be created by running the following Python script:
PYTHONPATH=src python src/data_processing/related_work_benchmark_construction.py
Alternatively, you can download the processed data directly:
sh get_processed_data.sh
You can also download the raw data from here: https://drive.google.com/file/d/1rvfB1s6GpVxxSwjnhlT1hl-x5eWXXz61/view?usp=sharing or the processed data from here: https://drive.google.com/file/d/1opDbbnQ74DTnwtUo8CCzTuQ9sF_rceYF/view?usp=sharing
The CHEN dataset (see paper for details) is not included in the data that we provide due to incomplete licensing information on the Delve data (see paper). Please contact the authors to obtain the data: https://github.com/iriscxy/relatedworkgeneration
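With the data in place, a quick sanity check that the processed datasets landed where the scripts expect them by default; the folder name lu_et_al is one of the benchmark datasets, and the path is the default base_data_path used by the test script:
import os

# List the per-dataset folders under the default base data path.
base_data_path = "data/related_work/benchmark/"
print(sorted(os.listdir(base_data_path)))  # expect folders such as 'lu_et_al'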
If your model is created using Huggingface, you can run the provided test script in this repo:
PYTHONPATH=src python src/rel_work/test.py \
--model=<PATH_TO_YOUR_MODEL> \
--output_folder=<PATH_TO_OUTPUT_FOLDER> \
--evaluation_metrics=rouge,bert-score
The test script will produce two output files for each dataset: [DATASET]_predictions.json, which contains a list of dictionaries with two keys, target (the labels) and prediction (the model output); and [DATASET].json, which contains the results for the specified evaluation metrics (only ROUGE by default).
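For reference, a minimal sketch of inspecting such a predictions file; the output folder and dataset name below are placeholders:
import json

# Placeholder path: one of the per-dataset files written by the test script.
with open("outputs/my_model/lu_et_al_predictions.json") as f:
    records = json.load(f)

print(len(records))             # number of examples in the dataset
print(records[0]["target"])     # the labels (reference text)
print(records[0]["prediction"]) # the model output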
The test script accepts the following arguments:
--model - Path to a pretrained model or shortcut name.
--decoding_strategy - Strategy to use for decoding, e.g. beam_search, top_k or top_p.
--top_k - K for top-k decoding.
--top_p - P for top-p decoding.
--use_sep_token - If True, the input documents will be separated by the sep token in the input. Default False.
--datasets - Comma-separated list of datasets to use, e.g. 'lu_et_al,xing_et_al'. Default uses all datasets.
--batch_size - Batch size for running predictions. Default 8.
--base_data_path - Path to the base data folder, i.e. the folder where the individual dataset folders, e.g. 'lu_et_al', are located. Default 'data/related_work/benchmark/'.
--output_csv_file - Path to the output csv file. If this argument is present, the results will also be stored in this file in addition to the other output files.
--ignore_cache - If True, ignores the cache and always re-runs the predictions, even for datasets where they have already been calculated.
--evaluation_metrics - Comma-separated list of evaluation metrics to use, e.g. "rouge,bert-score". Available metrics are "rouge" and "bert-score". Default uses only the "rouge" metric.
--use_doc_idx - If True, separates the documents with [idx] tokens, e.g. [0] for the first document (see the sketch after this list). Default False.
--no_tags - If True, removes the special tags, e.g. '<abs>', from the inputs. Default False.
--manual_seed - Manual seed for the random number generator. Default 15.
--output_folder - Path to the folder where the results will be saved. If not provided, results are saved in a folder named after the model.
--use_cpu - If True, uses the CPU instead of the GPU even if a GPU is available.
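The input-formatting flags above (--use_sep_token, --use_doc_idx, --no_tags) describe how the cited documents are concatenated into a single model input. Purely as an illustration of what --use_doc_idx means, and not the repo's actual preprocessing code, a sketch could look like this:
# Illustration only: prefix each cited document with an [idx] marker, as
# described for --use_doc_idx. The real formatting is done inside the repo's
# data-processing and test code and may differ in details.
def format_input(documents, use_doc_idx=False):
    parts = []
    for i, doc in enumerate(documents):
        parts.append(f"[{i}] {doc}" if use_doc_idx else doc)
    return " ".join(parts)

docs = ["Abstract of the first cited paper.", "Abstract of the second cited paper."]
print(format_input(docs, use_doc_idx=True))
# [0] Abstract of the first cited paper. [1] Abstract of the second cited paper.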
If your model is not created with Huggingface, you will have to create an output json file for each dataset following the naming convention [DATASET_NAME]_predictions.json. This file should contain a list of dictionaries with the keys target and prediction (a minimal sketch of producing such a file is given after the argument list below). When you have this, you can run the evaluation script:
PYTHONPATH=src python src/rel_work/evaluation.py \
--results_folder=<PATH_TO_THE_OUTPUTS_OF_YOUR_MODEL> \
--output_file=<PATH_TO_FILE_TO_STORE_RESULTS> \
--evaluation_metrics=rouge,bert-score
The output is a csv file with the calculated scores. Each line contains the scores for one dataset.
The evaluation script accepts the following arguments:
--results_folder - Path to the results folder where the outputs of the model are stored. The script will match the files that end with _predictions.json; these files should contain a list of objects with the keys 'prediction' and 'target'.
--output_file - Path to the output file. This file will contain the results of the evaluation.
--evaluation_metrics - Comma-separated list of evaluation metrics to use, e.g. "rouge,bert-score". Available metrics are "rouge" and "bert-score". Default uses all metrics.
--use_stemmer - If True, sets the 'use_stemmer' argument to True in the calculation of the ROUGE score. Defaults to False.
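As referenced above, here is a minimal sketch of producing a [DATASET_NAME]_predictions.json file for a model that is not built on Huggingface; the model call, the example data and the output folder are placeholders, not part of this repo:
import json
import os

def my_model_generate(source_text):
    # Placeholder for your own model's inference.
    return "Generated related-work text for: " + source_text[:40]

# Placeholder examples; in practice these come from the benchmark data.
examples = [
    {"source": "Input documents for example 1 ...", "target": "Reference text 1."},
    {"source": "Input documents for example 2 ...", "target": "Reference text 2."},
]

records = [
    {"target": ex["target"], "prediction": my_model_generate(ex["source"])}
    for ex in examples
]

os.makedirs("my_outputs", exist_ok=True)
with open("my_outputs/lu_et_al_predictions.json", "w") as f:
    json.dump(records, f)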
To run the citation intent evaluation, first convert your model outputs to the SciCite format by running the following script:
PYTHONPATH=src python src/data_processing/convert_to_scicite.py \
--model_predictions_folders=<PATH_TO_THE_OUTPUTS_OF_YOUR_MODEL>
The outputs will be stored in files of the form <PATH_TO_THE_OUTPUTS_OF_YOUR_MODEL>/acl_arc/inputs/[DATASET_NAME].jsonl, one for each dataset.
Then follow the instructions here: https://github.com/allenai/scicite to get the citation intent outputs. Use the ACL-ARC pretrained model.
As with the citation intent evaluation, convert the model outputs to the CORWA format:
PYTHONPATH=src python src/data_processing/convert_to_corwa.py \
--model_predictions_folders=<PATH_TO_THE_OUTPUTS_OF_YOUR_MODEL>
Outputs are stored in <PATH_TO_THE_OUTPUTS_OF_YOUR_MODEL>/corwa/inputs/[DATASET_NAME].jsonl
Then follow the instructions here: https://github.com/jacklxc/corwa to get the discourse tagging outputs.
We want the community to be able to update this benchmark with new datasets and evaluation metrics. Instructions on how to do this are given below.
To add a new dataset, you either need to provide a processed version matching the structure of the other datasets in the benchmark, or you need to add conversion code to the related_work_benchmark_construction.py script.
To add new evaluation metrics, you need to extend the evaluation.py script with code that calculates the results for these metrics. See how the existing scores, e.g. ROUGE and BERTScore, are implemented as a reference.
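As a rough illustration only (the actual integration point is the metric handling inside evaluation.py, which you should follow), a new metric ultimately comes down to a function that maps lists of predictions and targets to a score:
# Toy example of a "new metric": average unigram overlap between each
# prediction and its target. Not part of the repo; shown only to indicate
# the kind of function a new metric needs to provide.
def unigram_overlap(predictions, targets):
    scores = []
    for pred, tgt in zip(predictions, targets):
        pred_tokens = set(pred.lower().split())
        tgt_tokens = set(tgt.lower().split())
        scores.append(len(pred_tokens & tgt_tokens) / max(len(tgt_tokens), 1))
    return sum(scores) / max(len(scores), 1)

print(unigram_overlap(["prior work is cited here"], ["this cites prior work"]))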
There are three extractive baselines for the related work benchmark included in this repo. The following is a guide on how to run them.
The Lead baseline simply takes the first sentences from the input(s) as predictions. To run Lead with default settings, run the following:
PYTHONPATH=src python src/rel_work/baselines/lead.py \
--results_path=outputs/predictions/lead/ \
--csv_results_path=outputs/lead.csv
This will run the Lead baseline on all the datasets in the benchmark and output the results in the results_path folder. A summary of the results can be found in csv_results_path.
This baseline has a few parameters, which you will find among the arguments in the file. Feel free to experiment with them.
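For intuition, a minimal sketch of the Lead idea itself (not the repo's lead.py, which handles the benchmark I/O and exposes its own parameters):
import re

# Take the first k sentences of the input as the prediction, using a naive
# sentence split; lead.py implements the actual baseline over the benchmark.
def lead(input_text, num_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", input_text.strip())
    return " ".join(sentences[:num_sentences])

print(lead("First sentence. Second sentence. Third sentence."))
# First sentence. Second sentence.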
Example of how to run the TextRank baseline with default settings:
PYTHONPATH=src python src/rel_work/baselines/text_rank.py \
--results_path=outputs/predictions/textrank/ \
--csv_results_path=outputs/textrank.csv
Example of how to run the LexRank baseline with default settings:
PYTHONPATH=src python src/rel_work/baselines/lex_rank.py \
--results_path=outputs/predictions/lexrank/ \
--csv_results_path=outputs/lexrank.csv
Example of how to start training an LED base model on all datasets included in the benchmark:
PYTHONPATH=src python src/rel_work/train.py \
--model=allenai/led-base-16384 \
--output_dir=models/led-base/
Note: this script has only been tested with Huggingface models.
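Assuming train.py saves a standard Huggingface checkpoint to output_dir (the path and input text below are placeholders taken from the example above), the result can be loaded back with the usual transformers API for a quick generation check:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint path from the training example above; adjust to your run.
model_dir = "models/led-base/"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

inputs = tokenizer("Abstracts of the papers to cite ...", return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))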