Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: override existing results #1617

Merged
merged 9 commits into from
Dec 22, 2024
Merged

fix: override existing results #1617

merged 9 commits into from
Dec 22, 2024

Conversation

Samoed
Copy link
Collaborator

@Samoed Samoed commented Dec 20, 2024

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Currently, if overwrite is passed with existing results, it won't overwrite them, and everything will be skipped.

This was the reason of #1584 (comment)

@Samoed Samoed requested a review from isaac-chung December 20, 2024 18:33
Comment on lines 535 to 538
subsets_to_run = (
info["missing_subsets"]
if not overwrite_results
else task_subsets
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Judging from the test failures, maybe this condition needs to be reverted?

Copy link
Collaborator Author

@Samoed Samoed Dec 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems, that test were incorrect a bit, because they use overwrite_results=True and not all splits were evaluated

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overwrite_results shouldn't affect how splits are determined, so those tests should pass whether overwrite_results is True or False, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but in the test, last_evaluated_splits was being checked, and with overwrite_results, all selected splits and subsets should be reevaluated. This caused the test to fail. For example, instead of just val in last_evaluated_splits, it should include both val and test. However, the splits were selected correctly.

Copy link
Collaborator

@isaac-chung isaac-chung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest adding a test case in missing splits/langs where overwrite_results=True, as that should not affect how splits and langs are selected.

@@ -245,7 +238,6 @@ def test_multilingual_one_missing_split_no_missing_lang(
output_folder=str(tmp_path / "partial_langs_partial_splits"),
verbosity=2,
eval_subsets=["eng", "fra"],
overwrite_results=True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not affect how missing splits or langs are selected.

@Samoed
Copy link
Collaborator Author

Samoed commented Dec 21, 2024

I've updated the test as you suggested. Previously, the problem was that when tasks were run a second time with the overwrite option, it only checked if one of the selected splits had been run. However, this behavior was incorrect, because all splits should be run with overwrite.

Copy link
Collaborator

@isaac-chung isaac-chung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating! Got a few comments and questions.

Comment on lines +474 to +479
missing_evaluations = self._get_missing_evaluations(
existing_results,
task_eval_splits,
task_subsets,
eval_subsets,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if existing_results=None here, then won't missing_evaluations contain all specified splits? Maybe this call can be omitted or simplified to an assignment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If both task_eval_splits and task_subsets are None, all splits will be selected, but filtering for splits and subsets is still necessary. I think calling the same function is easier to understand than adding more complex logic.

mteb/evaluation/MTEB.py Outdated Show resolved Hide resolved
mteb/evaluation/MTEB.py Show resolved Hide resolved
tests/test_evaluation/test_split_evaluation.py Outdated Show resolved Hide resolved
@@ -305,3 +295,70 @@ def test_multilingual_one_missing_lang_in_one_split(
# output merged result with previous results
assert results[0].scores.keys() == {"test", "val"}
assert len(results[0].scores["test"]) == 2


def test_all_splits_evaluated_with_overwrite(model, tasks, tmp_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "all splits" mean here? There are test and val but only val is run here.

because all splits should be run with overwrite

Based on this comment, is it correct that "all" means all specified splits in the run method, and not all splits available in the task metadata's eval_splits?

Perhaps this can be spelled out, or preferably shown in additional test cases, where the 1st and 2nd runs have different eval_splits

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

means all specified splits in the run method,

Yes, all splits specified in evaluation.run

Copy link
Collaborator

@isaac-chung isaac-chung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time for my questions. I feel we're in a good place and can leave the rest for future improvement. Let's merge :)

@isaac-chung isaac-chung merged commit 272adb1 into main Dec 22, 2024
10 checks passed
@isaac-chung isaac-chung deleted the fix_override_results branch December 22, 2024 13:23
@Samoed
Copy link
Collaborator Author

Samoed commented Dec 22, 2024

Thank you for review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants