Implements check on existing and new datasets #1049
Comments
Agreed that such tests would be great!
I think calculate_metadata_metrics mostly addresses this issue, but I'm not sure
Also cc @jhyuklee & Xiaoqi Ren here, who mentioned the EmotionClassification dataset being very problematic in that regard, which I think we already know; I think it won't be in the new MTEB eng mix that will come with the new leaderboard (mteb/mteb/benchmarks/benchmarks.py, line 71 at 8ae095a).
I wrote a script to check all tasks against three main criteria: duplicated documents, empty documents, and leaks between the train and evaluation splits.
I ran this script on all the tasks and found a lot of problems, some of them minor and some needing serious attention. I'll post the results in the next comments.
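For illustration, a minimal standalone sketch of such checks, counting duplicated documents, empty documents, and train/test leaks for a single-text dataset (this is not the author's actual script; the dataset path and the `text` column name are assumptions):

```python
from collections import Counter

from datasets import load_dataset


def split_stats(texts: list[str]) -> dict[str, int]:
    """Count total, duplicated and empty documents in one split."""
    counts = Counter(texts)
    return {
        "n_documents": len(texts),
        "n_duplicates": sum(c - 1 for c in counts.values() if c > 1),
        "n_empty": sum(1 for t in texts if not t.strip()),
    }


def n_leaks(train_texts: list[str], test_texts: list[str]) -> int:
    """Count test documents that also appear verbatim in the train split."""
    train_set = set(train_texts)
    return sum(1 for t in test_texts if t in train_set)


if __name__ == "__main__":
    # Hypothetical dataset and column name, for illustration only.
    ds = load_dataset("mteb/emotion")
    print(split_stats(ds["test"]["text"]))
    print("train/test leaks:", n_leaks(ds["train"]["text"], ds["test"]["text"]))
```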
BitextMining
Duplicated documents
Example from
I counted duplicates for each column, so each task has to be dealt with separately. In some tasks, like DiaBlaBitextMining, there are completely identical rows, but there are also different translations - this is normal for this task.
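As a rough sketch of the per-column vs. per-pair counting described above (the `sentence1`/`sentence2` column names are assumptions; actual bitext tasks may use different ones):

```python
from collections import Counter

from datasets import Dataset


def duplicate_report(ds: Dataset, col1: str = "sentence1", col2: str = "sentence2") -> dict[str, int]:
    """Count duplicates per column and for full (col1, col2) pairs.

    A high per-column count with a low pair count usually means several
    translations of the same source sentence (often legitimate), while
    duplicated pairs are plain repeated rows.
    """
    pairs = list(zip(ds[col1], ds[col2]))
    return {
        f"duplicate_{col1}": sum(c - 1 for c in Counter(ds[col1]).values() if c > 1),
        f"duplicate_{col2}": sum(c - 1 for c in Counter(ds[col2]).values() if c > 1),
        "duplicate_pairs": sum(c - 1 for c in Counter(pairs).values() if c > 1),
    }


# Tiny toy example (hypothetical data):
toy = Dataset.from_dict({
    "sentence1": ["hello", "hello", "hi"],
    "sentence2": ["bonjour", "salut", "salut"],
})
print(duplicate_report(toy))
# {'duplicate_sentence1': 1, 'duplicate_sentence2': 1, 'duplicate_pairs': 0}
```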
Great work! Oh wow! Also, it seems like there might be a few trivial 1-word / short documents that might be worth removing to make the datasets more challenging. WDYT?
Awesome! @AlexeyVatolin you can run calculate_metadata_metrics for this.
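A possible invocation, assuming the method name mentioned in this thread; the exact mteb API and signatures may differ between versions:

```python
import mteb

# Assumed API: get_tasks() and calculate_metadata_metrics() as referenced in this thread;
# exact names and signatures may differ between mteb versions.
task = mteb.get_tasks(tasks=["EmotionClassification"])[0]
task.load_data()
print(task.calculate_metadata_metrics())
```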
Adding this to the metadata calculation is a great approach. I am not sure how we should count duplicates (by column or across columns).
I can add the number of unique sentences to the tasks. Maybe we can add more information to the tasks as well.
@KennethEnevoldsen I think it would be a great idea to issue a warning if a dataset contains duplicates.
Additionally, a text length check could be implemented, with a warning message displayed when the text is particularly short.
Def. worth adding. We might also implement limits in the test suite on duplicates, test set leakage, and minimum document length (requiring datasets that break any of the requirements to be added to an exception list). If we have this, I am not sure that warnings will be required.
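A rough sketch of what such a test helper could look like (pytest-style; the thresholds, the exception list, and the way documents are obtained are all assumptions, and a train/test leakage assertion could be added in the same way):

```python
from collections import Counter

import pytest

# Hypothetical thresholds and exception list; real values would live in the test suite.
MAX_DUPLICATE_FRACTION = 0.05
MIN_DOCUMENT_LENGTH = 5  # characters
KNOWN_EXCEPTIONS = {"SomeLegacyTask"}  # tasks allowed to break the requirements


def assert_document_quality(task_name: str, documents: list[str]) -> None:
    """Fail (or skip known exceptions) when a split has too many duplicates or too-short texts."""
    if task_name in KNOWN_EXCEPTIONS:
        pytest.skip(f"{task_name} is on the exception list")
    counts = Counter(documents)
    n_duplicates = sum(c - 1 for c in counts.values() if c > 1)
    assert n_duplicates <= MAX_DUPLICATE_FRACTION * len(documents), (
        f"{task_name}: {n_duplicates} duplicated documents"
    )
    too_short = sum(1 for d in documents if len(d.strip()) < MIN_DOCUMENT_LENGTH)
    assert too_short == 0, f"{task_name}: {too_short} documents shorter than {MIN_DOCUMENT_LENGTH} characters"
```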
@gentaiscool just letting you know about this issue as well (notably, see LinceMTBitextMining in the table above).
In the table below, I counted how many more complete duplicates there are in the BitextMining tasks (by text pairs across both columns).
Classification
Duplicated documents
Example from
Empty documents
Leaks
All datasets with over 1000 leaks are completely broken because they use the same data for training and for evaluation. I think we should disallow the use of the same split for both training and evaluation.
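A sketch of how such a rule could be flagged automatically, assuming the task metadata exposes the evaluation splits (attribute names may differ between mteb versions):

```python
import mteb

# Flag tasks whose evaluation splits include the training split.
# `get_tasks`, `metadata.name` and `metadata.eval_splits` are assumed attribute names.
for task in mteb.get_tasks():
    splits = task.metadata.eval_splits
    if "train" in splits:
        print(f"{task.metadata.name}: evaluates on the train split {splits}")
```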
Clustering
Duplicated documents
Empty documents
InstructionRetrieval
Duplicated documents
MultilabelClassification
Duplicated documents
Leaked documents
@AlexeyVatolin Thank you! I've implemented your ideas in
PairClassification
Duplicated documents
It's funny that the most duplicates are in the duplicate search dataset :)
Oh wow! It might be worth splitting these up into independent issues per task type. It also seems like there is enough work here to warrant broader work on dataset quality (I could imagine quite a few additional elements: removing too-easy examples from training data, testing for data leakage between test and train).
We currently find a lot of inconsistencies in added datasets (e.g. #1043, #407, #1036)
We can naturally fix these as they arise, but it would be ideal to have a test which, for each dataset, checks whether it is "high-quality". These checks could e.g. include: no duplicated documents, no empty documents, no leakage between the train and evaluation splits, and reasonable document lengths.
We can then write a file for a specific dataset / revision with these computed metrics.
Other tests, such as checking whether the language matches, could also be added in the future.
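For example, the computed metrics could be written to one small JSON file per dataset / revision; the layout and values below are purely illustrative:

```python
import json
from pathlib import Path

# Hypothetical layout: one JSON file per dataset/revision, with illustrative values only.
metrics = {
    "dataset": "mteb/emotion",
    "revision": "abc1234",
    "n_documents": 2000,
    "n_duplicates": 12,
    "n_empty": 0,
    "train_test_leaks": 0,
}
out = Path("dataset_metrics") / f"{metrics['dataset'].replace('/', '__')}__{metrics['revision']}.json"
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(metrics, indent=2))
```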