About the fatal weakness of the Embedding-based metric #23
Hi @g32M7fT6b8Y,
We believe this is mostly an issue of usage, not a weakness in the method itself. Indeed, we have found that BERTScore computed with deep contextual embedding models can sometimes have a small numerical range (also pointed out in #20). However, this does not mean that BERTScore cannot distinguish bad candidates (bad responses in your case) from good ones: if we rank the candidates, the good candidates score higher than the bad candidates. On this note, we also refer you to the correlation studies in our paper.
Still, we don't want to simply ignore this "numerical range" problem, because it hurts the readability of our method. After some consideration, here is what we propose: we take a large monolingual corpus and randomly pair sentences up as candidate-reference pairs. When we evaluate these pairs with BERTScore, the (averaged) output score should serve as a lower bound, because the candidate and reference are irrelevant to each other. We propose to use this lower bound to rescale BERTScore: we subtract the lower bound from a raw BERTScore and divide the difference by one minus the lower bound, so that an exact match still rescales to 1.
Note that this modification only changes the range of BERTScore and won't affect its correlation with human judgment. Currently, we are adding software support for this in the repo. Stay tuned and we'll push this change into the new version soon.
I am closing this issue, but feel free to continue the thread here.
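For concreteness, here is a minimal sketch of the rescaling described above. The baseline and raw scores are made-up numbers for illustration, and `rescale_bertscore` is a hypothetical helper, not this repo's API:

```python
import numpy as np

def rescale_bertscore(scores, baseline):
    """Rescale raw BERTScore values with a baseline lower bound.

    `baseline` is the average BERTScore of random (candidate, reference)
    pairs drawn from a large monolingual corpus; scores near the baseline
    map to roughly 0, and a perfect score of 1.0 still maps to 1.0.
    """
    scores = np.asarray(scores, dtype=float)
    return (scores - baseline) / (1.0 - baseline)

# Hypothetical raw F1 scores clustered in a narrow high band,
# plus a hypothetical baseline estimated from random sentence pairs.
raw = [0.95, 0.88, 0.84]
baseline = 0.83
print(rescale_bertscore(raw, baseline))  # -> [0.7059, 0.2941, 0.0588]
```

Dividing by one minus the baseline keeps a perfect match at 1.0, so the rescaled scores spread the informative high end of the raw range over roughly [0, 1] and the gap between good and bad candidates becomes easier to read.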
Thank you for your response.
Hi, thank you for your wonderful repo.
In my view, BERTScore is a kind of embedding-based metric for measuring response quality, similar to Embedding-Average and Greedy Matching.
After trying Embedding-Average, Greedy Matching, Vector Extrema, and BERTScore, I found that the average scores of these embedding-based metrics are very high (around 0.817 on the Dailydialog and Cornell datasets). In this case, any response, even a very bad one, can achieve a "good" score, and the difference between "good" and "bad" responses is very small.
I attribute this problem to the "fuzzy" representations of word embeddings, so I think embedding-based metrics are not very appropriate for measuring the performance of generative models such as dialog systems and NMT.
What do you think about this issue, and how could it be alleviated?
I hope to hear your thoughts. Thanks.