About the fatal weakness of the Embedding-based metric #23
Hi @g32M7fT6b8Y,
We believe this is mostly an issue of usage, not a weakness in the method itself. Indeed, we have found that BERTScore computed with deep contextual embedding models can sometimes have a small numerical range (also pointed out in #20). However, this does not mean that BERTScore cannot distinguish bad candidates (bad responses in your case) from good ones: if we rank the candidates, the good candidates score higher than the bad candidates. On this note, we also refer you to the correlation studies in our paper.
Still, we don't want to simply ignore this "numerical range" problem, because it hurts the readability of our method. After some consideration, here is what we propose: we take a large monolingual corpus and randomly pair sentences up as candidate-reference pairs. When we evaluate these pairs with BERTScore, the (averaged) output score should serve as a lower bound, because the candidate and reference are irrelevant to each other. We propose to use this lower bound to rescale BERTScore: we subtract the lower bound from a raw BERTScore and divide the difference by one minus the lower bound, so that an exact match still rescales to 1.
Note that this modification only changes the range of BERTScore and won't affect its correlation with human judgment. Currently, we are adding software support for this in the repo. Stay tuned and we'll push this change into the new version soon.
I am closing this issue, but feel free to continue the thread here.
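For concreteness, here is a minimal sketch of the rescaling described above. The baseline and raw scores are made-up numbers for illustration, and `rescale_bertscore` is a hypothetical helper, not this repo's API:

```python
import numpy as np

def rescale_bertscore(scores, baseline):
    """Rescale raw BERTScore values with a baseline lower bound.

    `baseline` is the average BERTScore of random (candidate, reference)
    pairs drawn from a large monolingual corpus; scores near the baseline
    map to roughly 0, and a perfect score of 1.0 still maps to 1.0.
    """
    scores = np.asarray(scores, dtype=float)
    return (scores - baseline) / (1.0 - baseline)

# Hypothetical raw F1 scores clustered in a narrow high band,
# plus a hypothetical baseline estimated from random sentence pairs.
raw = [0.95, 0.88, 0.84]
baseline = 0.83
print(rescale_bertscore(raw, baseline))  # -> [0.7059, 0.2941, 0.0588]
```

Dividing by one minus the baseline keeps a perfect match at 1.0, so the rescaled scores spread the informative high end of the raw range over roughly [0, 1] and the gap between good and bad candidates becomes easier to read.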
Thank you for your response.
Hi, thank you for your wonderful repo.
In my view, BERTScore is a kind of embedding-based metric for measuring response quality, similar to Embedding-Average and Greedy Matching.
After trying Embedding-Average, Greedy Matching, Vector Extrema, and BERTScore, I found that the average scores of these embedding-based metrics are very high (around 0.817 on the Dailydialog and Cornell datasets). In this case, any response, even a very bad one, can achieve a "good" score, and the difference between "good" and "bad" responses is very small.
I attribute this problem to the "fuzzy" representations of word embeddings, so I think embedding-based metrics are not very appropriate for measuring the performance of generative models such as dialog systems and NMT.
What do you think about this issue, and how could it be alleviated?
I hope to hear your thoughts. Thanks.