This README describes how to use Doc-Prism, an extension of the original Prism metric that can be used for document-level evaluation.
Unlike the original implementation, which used a multilingual MT model, we use mBART-50, a multilingual language model pre-trained at the document level, to score the MT outputs.
This codebase is an implementation of the Prism metric using the Hugging Face Transformers library. For a detailed presentation of the Prism metric, including usage examples and instructions, see the original documentation.
sacrebleu -t wmt21 -l en-de --echo src | head -n 20 > src.en
sacrebleu -t wmt21 -l en-de --echo ref | head -n 20 > ref.de
sacrebleu -t wmt21 -l en-de --echo ref | head -n 20 > hyp.de # put your system output here
To evaluate at the document level we need to know where the document boundaries are in the test set, so that we only use valid context. This is passed in as a file where each line contains a document ID.
For WMT test sets this can be obtained via sacreBLEU:
sacrebleu -t wmt21 -l en-de --echo docid | head -n 20 > docids.ende
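The docids file is aligned line-by-line with the source, reference, and hypothesis files: line i holds the document ID of segment i. As an optional sanity check (this snippet is only an illustration, not part of the codebase), you can verify the alignment and see how many segments each document contributes:

from collections import Counter

doc_ids = [x.strip() for x in open('docids.ende', 'rt').readlines()]
src = [x.strip() for x in open('src.en', 'rt').readlines()]
ref = [x.strip() for x in open('ref.de', 'rt').readlines()]
hyp = [x.strip() for x in open('hyp.de', 'rt').readlines()]

# all files must have exactly one line per segment, in the same order
assert len(doc_ids) == len(src) == len(ref) == len(hyp)

# number of segments per document
print(Counter(doc_ids))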
To use Doc-Prism with Python, simply add doc=True when calling the score function.
from prism import MBARTPrism
from add_context import add_context
# load data files
doc_ids = [x.strip() for x in open('docids.ende', 'rt').readlines()]
hyp = [x.strip() for x in open('hyp.de', 'rt').readlines()]
ref = [x.strip() for x in open('ref.de', 'rt').readlines()]
# load prism model
model_path = "facebook/mbart-large-50"
prism = MBARTPrism(checkpoint=model_path, src_lang="en", tgt_lang="de")
# add contexts to reference and hypothesis texts
hyp = add_context(orig_txt=hyp, context=ref, doc_ids=doc_ids, sep_token=prism.encoder.tokenizer.sep_token)
ref = add_context(orig_txt=ref, context=ref, doc_ids=doc_ids, sep_token=prism.encoder.tokenizer.sep_token)
# score the contextualized hypotheses against the contextualized references
seg_score = prism.score(cand=hyp, ref=ref, doc=True)
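For intuition, add_context prepends each segment with preceding sentences from the same document, joined by the model's separator token, so that only valid in-document context is used; note that in the example above the hypothesis is augmented with reference context (context=ref). A minimal sketch of that idea, assuming a fixed window of two preceding sentences (the actual add_context.py implementation may differ in its details):

def add_context_sketch(orig_txt, context, doc_ids, sep_token, window=2):
    # prepend up to `window` preceding sentences of the same document as context
    augmented = []
    for i, sent in enumerate(orig_txt):
        ctx = [context[j] for j in range(max(0, i - window), i) if doc_ids[j] == doc_ids[i]]
        augmented.append(" ".join(ctx) + " " + sep_token + " " + sent if ctx else sent)
    return augmented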
To reproduce the Doc-Prism results from the paper, run the score_doc-metrics.py script with the flags --model prism and --doc. This requires the mt-metrics-eval toolkit, which can be installed as follows:
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
alias mtme='python3 -m mt_metrics_eval.mtme'
mtme --download # Puts ~1G of data into $HOME/.mt-metrics-eval.
To obtain system-level scores of Doc-Prism (mBART-50) for the WMT21 test set, run:
python score_doc-metrics.py --campaign wmt21.news --model prism --lp en-de --doc --level sys
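For Prism-style metrics, the system-level score is typically just the average of the segment-level scores over the test set; the script above handles this aggregation (and the correlation with human judgments) for you. A rough sketch of the aggregation, assuming seg_score is the list returned by prism.score in the Python example above (illustrative only):

# average segment-level scores to obtain a single system-level score
sys_score = sum(seg_score) / len(seg_score)
print(round(sys_score, 4))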
If you use the code in your work, please cite Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric:
@inproceedings{easy_doc_mt,
title = {Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric},
author = {Vernikos, Giorgos and Thompson, Brian and Mathur, Prashant and Federico, Marcello},
booktitle = "Proceedings of the Seventh Conference on Machine Translation",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://statmt.org/wmt22/pdf/2022.wmt-1.6.pdf",
}