We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
To appear in Findings of ACL 2021.
Note that this is not an officially supported Tencent product.
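Concretely, the proposed FBD metric embeds each response with a pretrained encoder and compares the feature distributions of generated and human conversations. Below is a minimal, hypothetical sketch of such a feature-extraction step with `huggingface/transformers`; the helper `encode_responses` and the mask-aware mean pooling are our illustration and may differ from what `eval_metric.py` actually does.

```python
# Hypothetical sketch: embed responses with roberta-base and mean-pool
# token states into one vector per response. The actual eval_metric.py
# pipeline may pool or batch differently.
import torch
from transformers import AutoModel, AutoTokenizer

def encode_responses(texts, model_type="roberta-base", batch_size=32):
    tokenizer = AutoTokenizer.from_pretrained(model_type)
    model = AutoModel.from_pretrained(model_type).eval()
    features = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            enc = tokenizer(texts[i:i + batch_size], padding=True,
                            truncation=True, return_tensors="pt")
            hidden = model(**enc).last_hidden_state       # (B, T, H)
            mask = enc["attention_mask"].unsqueeze(-1)    # (B, T, 1)
            pooled = (hidden * mask).sum(1) / mask.sum(1) # mask-aware mean
            features.append(pooled)
    return torch.cat(features)                            # (N, H)
```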
This repository requires the following packages (e.g., `pip install torch transformers`):
- pytorch
- huggingface/transformers
To evaluate the system-level correlation of a metric with human judgments, run:
```bash
python eval_metric.py \
    --data_path ./datasets/convai2_annotation.json \
    --metric fbd \
    --sample_num 10 \
    --model_type roberta-base \
    --batch_size 32
```
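"System-level" here means each dialogue system receives one aggregate metric score and one aggregate human score, and the correlation is computed across systems. A rough sketch of that final step, assuming you already have per-system score lists (all names below are ours, not the script's API):

```python
# Hypothetical sketch of system-level correlation: average each system's
# metric scores and human ratings, then correlate across systems.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def system_level_correlation(metric_scores, human_scores):
    # metric_scores / human_scores: {system_name: [per-sample scores]}
    systems = sorted(metric_scores)
    m = np.array([np.mean(metric_scores[s]) for s in systems])
    h = np.array([np.mean(human_scores[s]) for s in systems])
    return pearsonr(m, h)[0], spearmanr(m, h)[0]
```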
Currently, our repo supports the common metrics used in the text generation field, including `bleu`, `meteor`, `rouge`, `greedy`, `average`, `extrema`, `bert_score`, `fbd`, and `prd`.
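Among these, `fbd` follows the Fréchet distance popularized by FID [1]: fit a Gaussian to each set of features and use the closed-form distance between the two Gaussians. A minimal numpy/scipy sketch of that generic computation (not necessarily byte-identical to our implementation):

```python
# Generic Fréchet distance between two Gaussian-fitted feature sets,
# as used by FID [1]; FBD applies the same form to encoder features.
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    # feats_*: (N, H) arrays of response embeddings
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # clip tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean)
```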
Here are some details of the six corpora compared in the main paper:
| File Name | Dataset Name | Num. of Samples | Reference |
|---|---|---|---|
| personam_annotation.json | Persona(M) | 60 | Shikib/usr |
| dailyh_annotation.json | Daily(H) | 150 | li3cmz/GRADE |
| convai2_annotation.json | Convai2 | 150 | li3cmz/GRADE |
| empathetic_annotation.json | Empathetic | 150 | li3cmz/GRADE |
| dailyz_annotation.json | Daily(Z) | 100 | ZHAOTING/dialog-processing |
| personaz_annotation.json | Persona(Z) | 150 | ZHAOTING/dialog-processing |
If you use this research/codebase/dataset, please cite our paper:
```bibtex
@article{xiang2021assessing,
  title={Assessing Dialogue Systems with Distribution Distances},
  author={Xiang, Jiannan and Liu, Yahui and Cai, Deng and Li, Huayang and Lian, Defu and Liu, Lemao},
  journal={arXiv preprint arXiv:2105.02573},
  year={2021}
}
```
Other related papers:
- [1] FID, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS 2017
- [2] PRD, Assessing Generative Models via Precision and Recall, NeurIPS 2018
- [3] BERTScore, BERTScore: Evaluating Text Generation with BERT, ICLR 2020