This repository contains the data we collected to assess the impact of document-level context on human perception of machine translation quality.
We briefly outline the contents of each file below. Please see our paper for more detailed information:
    @inproceedings{laeubli2018parity,
        author    = "L{\"a}ubli, Samuel and Sennrich, Rico and Volk, Martin",
        title     = "Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation",
        booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
        year      = "2018",
        address   = "Brussels, Belgium",
        publisher = "Association for Computational Linguistics",
        url       = "https://arxiv.org/abs/1808.07048"
    }
`participants.csv`: Meta information on all participants, professional translators recruited from ProZ. They produced the ratings available in `ratings.csv`.
`documents.csv`: 55 full articles randomly sampled from the WMT 2017 Chinese–English test set, considering only the 123 articles originally written in Chinese. We used only the Chinese source texts from WMT; the human translations (Reference-HT; the `human` column) and the machine translations (Combo-6; the `mt` column) were obtained from data released by Microsoft. Documents E-1 to E-55 and I-1 to I-55 are the same articles; in each set, a different random subset of 5 articles was converted to control items (spam).
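To make the schema concrete, here is a minimal sketch of loading `documents.csv` with pandas. Only the `human` and `mt` columns are documented above; anything else (e.g. a document ID column) is an assumption, so inspect the header first.

```python
# Minimal sketch: load documents.csv and view translation pairs side by side.
# Only the `human` and `mt` columns are documented in this README; print the
# header to discover the remaining schema before relying on other columns.
import pandas as pd

docs = pd.read_csv("documents.csv")
print(docs.columns.tolist())          # discover the actual column names
print(docs[["human", "mt"]].head(3))  # human vs. machine translation
```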
`sentences.csv`: 2 × 120 sentences randomly sampled from the WMT 2017 Chinese–English test set, again considering only the 123 articles originally written in Chinese. As in `documents.csv`, the human translations (Reference-HT; the `human` column) and the machine translations (Combo-6; the `mt` column) were obtained from data released by Microsoft. Sentences U-1 to U-120 overlap with the full documents in `documents.csv`; in each set, 16 random sentences were converted to control items (spam).
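The counts above can be sanity-checked with a short sketch; the `id` column used here for the sentence identifiers is an assumption, so check the file's header.

```python
# Minimal sketch: sanity-check the sentence sets described above.
# The `id` column name is an assumption; check the file's header.
import pandas as pd

sents = pd.read_csv("sentences.csv")
print(len(sents))                              # expected: 2 x 120 = 240 rows
print(sents["id"].str.startswith("U-").sum())  # items overlapping documents.csv
```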
`ratings.csv`: Ratings produced by the participants (see `participants.csv`). In our paper, we excluded the ratings for sentences U-1 to U-120 from the analysis because of their overlap with the full documents (see above).
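That exclusion can be reproduced along these lines; the `item` and `participant` column names are assumptions, not documented in this README.

```python
# Minimal sketch: drop ratings for the overlapping sentences U-1 to U-120,
# then attach participant metadata. The `item` and `participant` column
# names are assumptions; check the headers of both files.
import pandas as pd

ratings = pd.read_csv("ratings.csv")
participants = pd.read_csv("participants.csv")

analysis = ratings[~ratings["item"].str.match(r"U-\d+$")]
analysis = analysis.merge(participants, on="participant", how="left")
print(len(ratings), len(analysis))  # rows before and after exclusion
```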
A further file contains the same ratings as `ratings.csv`, but with the control items (spam) included.
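To recover a spam-free analysis set from that file, a sketch like the following would work; both the filename `ratings_raw.csv` and the `is_spam` flag are hypothetical placeholders for whatever the actual file and column are called.

```python
# Minimal sketch: remove control items (spam) from the unfiltered ratings.
# `ratings_raw.csv` and `is_spam` are hypothetical names; substitute the
# actual filename and spam-flag column from this repository.
import pandas as pd

raw = pd.read_csv("ratings_raw.csv")
clean = raw[~raw["is_spam"].astype(bool)]
print(len(raw) - len(clean), "control items removed")
```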