fix: override existing results #1617
Conversation
mteb/evaluation/MTEB.py (outdated)
subsets_to_run = (
    info["missing_subsets"]
    if not overwrite_results
    else task_subsets
)
Judging from the test failures, maybe this condition needs to be reverted?
It seems the tests were a bit incorrect, because they use overwrite_results=True and not all splits were evaluated.
overwrite_results shouldn't affect how splits are determined, so those tests should pass whether overwrite_results is True or False, right?
Yes, but in the test, last_evaluated_splits was being checked, and with overwrite_results, all selected splits and subsets should be reevaluated. This caused the test to fail. For example, instead of just val in last_evaluated_splits, it should include both val and test. However, the splits were selected correctly.
I'd suggest adding a test case in missing splits/langs where overwrite_results=True, as that should not affect how splits and langs are selected.
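A minimal sketch of such a test case, assuming the model, tasks, and tmp_path fixtures used in this test module and an MTEB.run signature that accepts eval_splits and overwrite_results (as the diffs in this conversation suggest); the test name and assertions are hypothetical:

import mteb


def test_missing_split_selection_ignores_overwrite(model, tasks, tmp_path):
    # First run evaluates only the "val" split and writes results to disk.
    evaluation = mteb.MTEB(tasks=tasks)
    evaluation.run(
        model,
        output_folder=str(tmp_path / "ignore_overwrite"),
        eval_splits=["val"],
    )

    # Second run with overwrite_results=True: the flag should only force
    # re-evaluation, not change which splits/langs are selected, so the
    # result still covers exactly the requested "val" split.
    results = evaluation.run(
        model,
        output_folder=str(tmp_path / "ignore_overwrite"),
        eval_splits=["val"],
        overwrite_results=True,
    )
    assert results[0].scores.keys() == {"val"}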
@@ -245,7 +238,6 @@ def test_multilingual_one_missing_split_no_missing_lang(
         output_folder=str(tmp_path / "partial_langs_partial_splits"),
         verbosity=2,
         eval_subsets=["eng", "fra"],
-        overwrite_results=True,
This should not affect how missing splits or langs are selected.
I've updated the test as you suggested. Previously, the problem was that when tasks were run a second time with the overwrite flag set, the existing results were not overwritten.
Thanks for iterating! Got a few comments and questions.
missing_evaluations = self._get_missing_evaluations(
    existing_results,
    task_eval_splits,
    task_subsets,
    eval_subsets,
)
If existing_results is None here, then won't missing_evaluations contain all specified splits? Maybe this call can be omitted or simplified to an assignment?
If both task_eval_splits and task_subsets are None, all splits will be selected, but filtering for splits and subsets is still necessary. I think calling the same function is easier to understand than adding more complex logic.
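A rough, self-contained sketch of the behavior being described, not the actual _get_missing_evaluations implementation (the names and return shape are assumed for illustration): even with no existing results, the requested splits and subsets still have to be filtered.

def get_missing_evaluations_sketch(existing_results, task_eval_splits, task_subsets, eval_subsets):
    # Restrict subsets to those explicitly requested, if any were given.
    subsets = [s for s in task_subsets if eval_subsets is None or s in eval_subsets]
    missing = {}
    for split in task_eval_splits:
        already_done = set(existing_results.get(split, [])) if existing_results else set()
        missing_subsets = [s for s in subsets if s not in already_done]
        if missing_subsets:
            missing[split] = {"missing_subsets": missing_subsets}
    return missing


# With existing_results=None everything requested is missing, but the
# split/subset filtering above is still applied:
print(get_missing_evaluations_sketch(None, ["val"], ["eng", "fra"], ["eng"]))
# {'val': {'missing_subsets': ['eng']}}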
@@ -305,3 +295,70 @@ def test_multilingual_one_missing_lang_in_one_split(
     # output merged result with previous results
     assert results[0].scores.keys() == {"test", "val"}
     assert len(results[0].scores["test"]) == 2
+
+
+def test_all_splits_evaluated_with_overwrite(model, tasks, tmp_path):
What does "all splits" mean here? There are test and val, but only val is run here. Based on the comment "because all splits should be run with overwrite", is it correct that "all" means all splits specified in the run method, and not all splits available in the task metadata's eval_splits? Perhaps this can be spelled out, or preferably shown in additional test cases where the 1st and 2nd runs have different eval_splits.
Yes, "all" means all splits specified in evaluation.run.
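A hedged sketch of the additional test case suggested above, where the two runs request different eval_splits; the final assertion mirrors the merge behavior checked in the existing test and is assumed here rather than verified:

import mteb


def test_overwrite_with_different_eval_splits(model, tasks, tmp_path):
    evaluation = mteb.MTEB(tasks=tasks)
    output = str(tmp_path / "different_splits")

    # 1st run: only the "test" split.
    evaluation.run(model, output_folder=output, eval_splits=["test"])

    # 2nd run: only the "val" split, with overwrite_results=True. "All
    # splits" means the splits passed to run(), so only "val" is
    # (re)evaluated here, and the earlier "test" scores are merged back in.
    results = evaluation.run(
        model,
        output_folder=output,
        eval_splits=["val"],
        overwrite_results=True,
    )
    assert results[0].scores.keys() == {"test", "val"}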
Thanks for taking the time to answer my questions. I feel we're in a good place and can leave the rest for future improvement. Let's merge :)
Thank you for the review!
Checklist
- make test
- make lint

Currently, if overwrite is passed with existing results, it won't overwrite them, and everything will be skipped. This was the reason for #1584 (comment).
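For context, a hedged end-to-end example of the behavior this PR fixes, using the public mteb API; the model and task names here are only examples:

import mteb

model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)

# First run evaluates the task and writes results to the output folder.
evaluation.run(model, output_folder="results")

# Before this fix, a second run with overwrite_results=True still found the
# existing results on disk and skipped everything; with the fix, the requested
# splits are re-evaluated and the stored results are overwritten.
evaluation.run(model, output_folder="results", overwrite_results=True)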