
Fix BLEURT evaluation errors #316

Merged · 24 commits · Oct 16, 2024
Conversation

chuandudx (Contributor):

These changes address the issues described in #315.

I built these changes on top of the BERTScore changes (#311), which haven't been merged yet, so those changes also appear here. Please let me know if you'd prefer that I remove them from this PR. Thank you!

NathanHB (Member) left a comment:

Thanks! Only a few nits to fix and we'll be good.

@@ -123,12 +123,14 @@ class Metrics(Enum):
corpus_level_fn=np.mean,
higher_is_better=True,
)
def compute_mean(x):
NathanHB (Member):

Use a lambda function inside the task def
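
For reference, a sketch of what this suggestion amounts to in the metric definition. The field values are taken from the BLEURT entry shown further down in this diff; SampleLevelMetric, MetricCategory, MetricUseCase and BLEURT are assumed to already be in scope in metrics.py, so only numpy is imported here.

import numpy as np

bleurt = SampleLevelMetric(
    metric_name="bleurt",
    sample_level_fn=BLEURT().compute,
    category=MetricCategory.GENERATIVE,
    use_case=MetricUseCase.TRANSLATION,
    # inline lambda instead of a module-level compute_mean helper
    corpus_level_fn=lambda scores: np.mean(scores),
    higher_is_better=True,
)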

chuandudx (Contributor, author):

Thanks for the review! Would you mind taking a look at the errors described in items 3 and 4 of #315? I was having trouble making the lambda function work and would appreciate some advice on how to do it!

NathanHB (Member):

Sure! Can you tell me which task you are running so that I can replicate? :)

chuandudx (Contributor, author), Sep 24, 2024:

Thanks so much! Could you try this example task: https://github.com/chuandudx/lighteval/blob/replicate_bleurt/community_tasks/replicate_bleurt_error_eval.py

When I tried to run it from lighteval, I ran into the formatted_doc error again, and it's also happening for bert_score, where it hadn't occurred before.

TypeError: BLEURT.compute() got an unexpected keyword argument 'formatted_doc'

I thought this was fixed by adding **kwargs to the compute function, but I wasn't able to identify what had changed. Do you have any insight into why this may be, e.g. whether there were other backend updates or incompatibilities? I was unfortunately unable to get to the bottom of it yet.

When I ran it from my repo, it did work, so please let me know if this works for you. Thanks!
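
For context, this is the kind of signature being discussed: a sketch of the BLEURT class's sample-level compute absorbing extra keyword arguments such as formatted_doc via **kwargs. The body mirrors the diff further down; whether the final signature actually includes **kwargs is not shown in that hunk, and self.model / self.tokenizer are assumed to be set up in __init__.

def compute(self, golds: list[str], predictions: list[str], **kwargs) -> float:
    # **kwargs swallows extra keyword arguments (e.g. formatted_doc)
    # that the evaluation pipeline may pass to every sample-level metric.
    if len(predictions) == 1:
        predictions = predictions * len(golds)
    scores = self.model(**self.tokenizer(golds, predictions, return_tensors="pt"))[0].squeeze()
    return scores.item()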

NathanHB (Member):

I just took a look this morning and your fork should work. You only need to remove this function and call np.mean directly.

chuandudx (Contributor, author):

Thank you Nathan! I just implemented this change :)

src/lighteval/metrics/imports/bert_scorer.py (outdated thread, resolved)
chuandudx (Contributor, author):

Hi @NathanHB :) Just wanted to follow up on this PR and welcome any feedback on the concerns we previously discussed. Thank you!

chuandudx (Contributor, author):

Merged the latest changes from main and fixed the style error. Thank you!

clefourrier (Member):

Thanks a lot! FYI, we're working on some other important features at the moment, but we'll come back to this PR as soon as we can.

category=MetricCategory.GENERATIVE,
use_case=MetricUseCase.TRANSLATION,
corpus_level_fn=lambda x: np.mean(x.flatten()), # flatten, then average
corpus_level_fn=compute_mean,
NathanHB (Member):

If you return scores.item(), you can simply use np.mean here.
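
In other words, once the sample-level function returns a plain Python float (via .item()), the corpus-level aggregation sees a flat list of floats and np.mean works without any flatten(). A self-contained toy to illustrate the shape issue (the numeric values are made up):

import numpy as np
import torch

# model(...)[0].squeeze() on a single gold/prediction pair yields a 0-dim tensor
score_tensor = torch.tensor([[0.73]]).squeeze()
score = score_tensor.item()  # plain Python float

per_sample_scores = [score, 0.61, 0.58]  # what the corpus-level fn receives
print(np.mean(per_sample_scores))        # plain mean, no flatten() needed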

bleurt = SampleLevelMetric(
metric_name="bleurt",
sample_level_fn=BLEURT.compute,
sample_level_fn=BLEURT().compute,
NathanHB (Member):

Good!
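
The change from BLEURT.compute to BLEURT().compute is the usual bound-versus-unbound-method distinction in Python; a self-contained toy (unrelated to BLEURT internals) to illustrate why the instance is needed:

class Scorer:
    def __init__(self):
        self.offset = 1.0  # stands in for loading a model/tokenizer once

    def compute(self, value: float) -> float:
        return value + self.offset

bound_fn = Scorer().compute   # bound method: self is already supplied
print(bound_fn(2.0))          # 3.0
# Scorer.compute(2.0) would fail: it's a plain function that expects an instance as self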

@@ -715,8 +715,7 @@ def compute(self, golds: list[str], predictions: list[str]) -> float:
if len(predictions) == 1:
predictions = predictions * len(golds)
scores = self.model(**self.tokenizer(golds, predictions, return_tensors="pt"))[0].squeeze()

return scores
return scores.item()
NathanHB (Member):

This works! Thanks for finding this bug.

NathanHB merged commit d84c378 into huggingface:main on Oct 16, 2024. 2 checks passed.