Fix BLEURT evaluation errors #316
Conversation
…val into generative_metric_error
…d types. Detailed issue description: huggingface#315
Thanks! Only a few nits to fix and we will be good.
src/lighteval/metrics/metrics.py
Outdated
@@ -123,12 +123,14 @@ class Metrics(Enum):
         corpus_level_fn=np.mean,
         higher_is_better=True,
     )
+    def compute_mean(x):
Use a lambda function inside the task def
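For reference, a standalone version of the kind of inline lambda this suggestion points to (the same flatten-then-average expression that appears in a later diff in this thread; the variable name is hypothetical) might look roughly like:

```python
import numpy as np

# Sketch only: an inline aggregator that flattens the collected sample scores
# and averages them, instead of defining a named compute_mean helper.
aggregate = lambda x: np.mean(np.asarray(x).flatten())

print(aggregate([[0.7, 0.4], [0.9, 0.2]]))  # 0.55
```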
Thanks for the review! Would you mind taking a look at the errors described in items 3 and 4 of #315? I was having trouble making the lambda function work and would appreciate some advice on how to do it!
Sure! Can you tell me what task you are running so that I can replicate? :)
Thanks so much! https://github.com/chuandudx/lighteval/blob/replicate_bleurt/community_tasks/replicate_bleurt_error_eval.py
Could you try on this example task?
When I tried to run this from lighteval, I ran into the formatted_doc error again, and it's also happening for bert_score, where it hadn't occurred before.
TypeError: BLEURT.compute() got an unexpected keyword argument 'formatted_doc'
I thought this was fixed by adding **kwargs to the compute function, but I wasn't able to identify what had changed. Do you have any insight into whether some other backend update or incompatibility could cause this? I was unfortunately unable to get to the bottom of it yet.
When I ran it from my repo, it did work, so please let me know if this works for you. Thanks!
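To illustrate the **kwargs point in plain Python (this is not the lighteval source, just a minimal reproduction of the error above): when the caller passes formatted_doc=..., a compute function without **kwargs raises exactly this TypeError, while one that declares **kwargs absorbs the extra argument.

```python
# Minimal sketch: the function names are hypothetical stand-ins for a metric's
# compute() method, with and without a **kwargs catch-all.
def compute_strict(golds, predictions):
    return 0.0

def compute_lenient(golds, predictions, **kwargs):
    return 0.0

compute_lenient(["ref"], ["hyp"], formatted_doc=None)   # OK: extra kwarg is absorbed
# compute_strict(["ref"], ["hyp"], formatted_doc=None)  # TypeError: unexpected keyword argument
```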
I just took a look this morning and your fork should work. You only need to remove this function and just call np.mean.
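Put together, a sketch of the end state this suggests, assembled from the diff hunks shown in this thread (imports and module paths from lighteval are omitted, and the surrounding context is an assumption, not a verbatim copy of metrics.py):

```python
# Sketch only: the bleurt metric wired to an instance method and a plain
# np.mean aggregator, assuming compute() returns a Python float per sample.
bleurt = SampleLevelMetric(
    metric_name="bleurt",
    sample_level_fn=BLEURT().compute,      # instance method, not the unbound class attribute
    category=MetricCategory.GENERATIVE,
    use_case=MetricUseCase.TRANSLATION,
    corpus_level_fn=np.mean,               # sufficient once compute() returns a float
    higher_is_better=True,
)
```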
Thank you Nathan! I just implemented this change :)
Hi @NathanHB :) Just wanted to follow up on this PR and welcome any feedback on the concerns we previously discussed. Thank you!
Merged the latest changes from main and fixed the style error. Thank you!
Thanks a lot! FYI, we're working on some other important features atm, but we'll come back to this PR as soon as we can.
src/lighteval/metrics/metrics.py
Outdated
         category=MetricCategory.GENERATIVE,
         use_case=MetricUseCase.TRANSLATION,
-        corpus_level_fn=lambda x: np.mean(x.flatten()),  # flatten, then average
+        corpus_level_fn=compute_mean,
If you return scores.item() you can simply use np.mean here.
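A quick NumPy-only illustration of that point (not lighteval code): once each sample-level call returns a plain float via .item(), the corpus-level step is just np.mean over a flat list of floats, with no flattening helper needed.

```python
import numpy as np

# The values here are made up; they only stand in for per-sample BLEURT scores.
single_score = np.array(0.73)          # what a one-example score squeezes down to
sample_score = single_score.item()     # -> 0.73 as a plain Python float
corpus_score = np.mean([0.73, 0.41, 0.88])
print(sample_score, corpus_score)      # 0.73 0.6733333333333333
```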
     bleurt = SampleLevelMetric(
         metric_name="bleurt",
-        sample_level_fn=BLEURT.compute,
+        sample_level_fn=BLEURT().compute,
Good!
@@ -715,8 +715,7 @@ def compute(self, golds: list[str], predictions: list[str]) -> float:
         if len(predictions) == 1:
             predictions = predictions * len(golds)
         scores = self.model(**self.tokenizer(golds, predictions, return_tensors="pt"))[0].squeeze()
-
-        return scores
+        return scores.item()
This works! Thanks for finding this bug.
These changes address the issues described in #315.
I made the code changes so that they build on the BERTScore changes (#311), which haven't been merged yet, so those changes also appear here. Please let me know if there is a preference for removing them from this PR. Thank you!