Implements check on existing and new datasets #1049

Open
KennethEnevoldsen opened this issue Jul 5, 2024 · 20 comments
Labels
enhancement, help wanted

Comments

@KennethEnevoldsen
Contributor

We currently find a lot of inconsistencies in added datasets (e.g. #1043, #407, #1036)

We can naturally fix these as they arise, but it would be ideal to have a test that checks whether each dataset is of "high quality". These checks could include, for example:

  1. Checking that there are no empty documents
  2. Checking that the task contains no duplicates
  3. Checking leakage between train and test sets
  4. Optionally, we could add the existing computed metrics here as well (e.g. avg. length)

We can then write a file for a specific dataset / revision to compute these metrics.

Other tests, such as checking whether the language matches, could also be added in the future.
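The first three checks in the list can be sketched in plain Python. This is only a minimal illustration, not the mteb implementation: the function names are hypothetical, documents are assumed to be plain strings, and "duplicate" and "leak" are assumed to mean exact string matches.

```python
from collections import Counter


def count_empty(texts):
    """Number of documents that are empty or whitespace-only."""
    return sum(1 for t in texts if not t.strip())


def count_duplicates(texts):
    """Number of documents that exactly repeat an earlier document."""
    counts = Counter(texts)
    return sum(n - 1 for n in counts.values() if n > 1)


def count_leaked(train_texts, test_texts):
    """Number of test documents that also appear verbatim in train."""
    train_set = set(train_texts)
    return sum(1 for t in test_texts if t in train_set)


def check_quality(train_texts, test_texts):
    """Collect the three checks into one report (hypothetical layout)."""
    return {
        "empty": count_empty(train_texts) + count_empty(test_texts),
        "duplicates": count_duplicates(test_texts),
        "leaked": count_leaked(train_texts, test_texts),
    }


report = check_quality(
    train_texts=["a cat", "a dog", "a bird"],
    test_texts=["a dog", "a fish", "a fish", " "],
)
print(report)  # {'empty': 1, 'duplicates': 1, 'leaked': 1}
```

Exact matching is the cheapest option; near-duplicate detection (e.g. after lowercasing or whitespace normalization) would catch more cases but needs a decision on what counts as "the same" document.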

@Muennighoff
Contributor

Agreed that such tests would be great!

@isaac-chung added the "help wanted" label Sep 26, 2024
@Samoed
Collaborator

Samoed commented Oct 2, 2024

I think calculate_metadata_metrics mostly addresses this issue, but I'm not sure

@Muennighoff
Contributor

Also cc @jhyuklee & Xiaoqi Ren here, who mentioned that the EmotionClassification dataset is very problematic in that regard, which I think we already know. I think it won't be in the new MTEB eng mix that will come with the new leaderboard (`MTEB_EN = Benchmark(`), so maybe that solves it.

@AlexeyVatolin
Contributor

I wrote a script to check all tasks against three main criteria:

  1. whether there are empty rows
  2. whether there are duplicates
  3. whether there are leaks between training and test data

I ran this script on all the tasks that are in mteb! It took a long time and used a lot of hard drive space.

I found a lot of problems; some are minor and some need serious attention. I'll post the results in the next comments.
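For readers who want to reproduce this kind of per-column report, here is a rough sketch of how duplicate counts could be tallied into rows of (dataset name, subset, split, column, count). It is not the script used above: the function name and the nested `subset -> split -> column -> texts` layout are assumptions made for illustration, and the count here is the number of extra copies of each repeated text, which may differ from how the posted numbers were computed.

```python
from collections import Counter


def duplicate_rows(dataset_name, data):
    """Build report rows from a hypothetical nested mapping:
    subset -> split -> column -> list of texts."""
    rows = []
    for subset, splits in data.items():
        for split, columns in splits.items():
            for column, texts in columns.items():
                counts = Counter(texts)
                # Every occurrence beyond the first counts as a duplicate.
                num_dup = sum(n - 1 for n in counts.values())
                if num_dup:
                    rows.append((dataset_name, subset, split, column, num_dup))
    # Largest problems first.
    rows.sort(key=lambda r: r[-1], reverse=True)
    return rows


toy_data = {
    "en-fr": {
        "test": {
            "sentence1": ["hi", "hi", "yo"],
            "sentence2": ["salut", "salut", "salut"],
        }
    }
}
rows = duplicate_rows("ToyBitextMining", toy_data)
for row in rows:
    print(*row)
```

Columns without any repeated text are omitted, so a clean dataset yields an empty report.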

@AlexeyVatolin
Contributor

BitextMining

Duplicated documents

Dataset Name HF Subset Split Column Name Num Documents
LinceMTBitextMining eng-eng_hin train sentence1 2039
LinceMTBitextMining eng-eng_hin train sentence2 670
DiaBlaBitextMining fr-en test sentence2 242
DiaBlaBitextMining en-fr test sentence2 242
DiaBlaBitextMining fr-en test sentence1 176
DiaBlaBitextMining en-fr test sentence1 176
PhincBitextMining eng-eng_hin train sentence1 225
PhincBitextMining eng-eng_hin train sentence2 2
NusaTranslationBitextMining ind-btk train sentence1 88
NusaTranslationBitextMining ind-bew train sentence1 88
NusaTranslationBitextMining ind-jav train sentence1 88
NusaTranslationBitextMining ind-mak train sentence1 88
NusaTranslationBitextMining ind-min train sentence1 88
NusaTranslationBitextMining ind-sun train sentence1 88
NusaTranslationBitextMining ind-mad train sentence1 87
NusaTranslationBitextMining ind-jav train sentence2 15
NusaTranslationBitextMining ind-mak train sentence2 15
NusaTranslationBitextMining ind-sun train sentence2 13
NusaTranslationBitextMining ind-bew train sentence2 10
NusaTranslationBitextMining ind-min train sentence2 9
NusaTranslationBitextMining ind-mad train sentence2 8
NusaTranslationBitextMining ind-btk train sentence2 7
NusaTranslationBitextMining ind-mui train sentence1 3
NusaTranslationBitextMining ind-abs train sentence1 2
NusaTranslationBitextMining ind-abs train sentence2 2
NusaTranslationBitextMining ind-rej train sentence1 2
NusaTranslationBitextMining ind-bhp train sentence1 1
NusaTranslationBitextMining ind-bhp train sentence2 1
BibleNLPBitextMining eng_Latn-mgh_Latn train sentence1 28
BibleNLPBitextMining mgh_Latn-eng_Latn train sentence2 28
BibleNLPBitextMining eng_Latn-mgh_Latn train sentence2 21
BibleNLPBitextMining mgh_Latn-eng_Latn train sentence1 21
BibleNLPBitextMining eng_Latn-wed_Latn train sentence1 21
BibleNLPBitextMining wed_Latn-eng_Latn train sentence2 21
BibleNLPBitextMining eng_Latn-dhg_Latn train sentence1 17
BibleNLPBitextMining dhg_Latn-eng_Latn train sentence2 17
BibleNLPBitextMining eng_Latn-wed_Latn train sentence2 14
BibleNLPBitextMining wed_Latn-eng_Latn train sentence1 14
BibleNLPBitextMining eng_Latn-xbi_Latn train sentence1 14
BibleNLPBitextMining xbi_Latn-eng_Latn train sentence2 14
BibleNLPBitextMining eng_Latn-tbg_Latn train sentence1 13
BibleNLPBitextMining tbg_Latn-eng_Latn train sentence2 13
BibleNLPBitextMining eng_Latn-ind_Latn train sentence1 11
BibleNLPBitextMining ind_Latn-eng_Latn train sentence2 11
BibleNLPBitextMining eng_Latn-rmy_Latn train sentence1 11
BibleNLPBitextMining rmy_Latn-eng_Latn train sentence2 11
BibleNLPBitextMining eng_Latn-tzo_Latn train sentence1 9
BibleNLPBitextMining tzo_Latn-eng_Latn train sentence2 9
BibleNLPBitextMining eng_Latn-xbi_Latn train sentence2 8
BibleNLPBitextMining xbi_Latn-eng_Latn train sentence1 8
BibleNLPBitextMining eng_Latn-dhg_Latn train sentence2 7
BibleNLPBitextMining dhg_Latn-eng_Latn train sentence1 7
BibleNLPBitextMining eng_Latn-pol_Latn train sentence1 7
BibleNLPBitextMining pol_Latn-eng_Latn train sentence2 7
BibleNLPBitextMining eng_Latn-byr_Latn train sentence1 6
BibleNLPBitextMining byr_Latn-eng_Latn train sentence2 6
BibleNLPBitextMining eng_Latn-omw_Latn train sentence1 6
BibleNLPBitextMining omw_Latn-eng_Latn train sentence2 6
BibleNLPBitextMining eng_Latn-beo_Latn train sentence2 5
BibleNLPBitextMining beo_Latn-eng_Latn train sentence1 5
BibleNLPBitextMining eng_Latn-ind_Latn train sentence2 5
BibleNLPBitextMining ind_Latn-eng_Latn train sentence1 5
BibleNLPBitextMining eng_Latn-mam_Latn train sentence1 4
BibleNLPBitextMining mam_Latn-eng_Latn train sentence2 4
BibleNLPBitextMining eng_Latn-omw_Latn train sentence2 4
BibleNLPBitextMining omw_Latn-eng_Latn train sentence1 4
BibleNLPBitextMining eng_Latn-pol_Latn train sentence2 4
BibleNLPBitextMining pol_Latn-eng_Latn train sentence1 4
BibleNLPBitextMining eng_Latn-rmy_Latn train sentence2 4
BibleNLPBitextMining rmy_Latn-eng_Latn train sentence1 4
BibleNLPBitextMining eng_Latn-tbg_Latn train sentence2 4
BibleNLPBitextMining tbg_Latn-eng_Latn train sentence1 4
BibleNLPBitextMining eng_Latn-acr_Latn train sentence1 3
BibleNLPBitextMining acr_Latn-eng_Latn train sentence2 3
BibleNLPBitextMining eng_Latn-byr_Latn train sentence2 3
BibleNLPBitextMining byr_Latn-eng_Latn train sentence1 3
BibleNLPBitextMining eng_Latn-cak_Latn train sentence1 3
BibleNLPBitextMining cak_Latn-eng_Latn train sentence2 3
BibleNLPBitextMining eng_Latn-cha_Latn train sentence1 3
BibleNLPBitextMining cha_Latn-eng_Latn train sentence2 3
BibleNLPBitextMining eng_Latn-ixl_Latn train sentence1 3
BibleNLPBitextMining ixl_Latn-eng_Latn train sentence2 3
BibleNLPBitextMining eng_Latn-quc_Latn train sentence1 3
BibleNLPBitextMining eng_Latn-quc_Latn train sentence2 3
BibleNLPBitextMining quc_Latn-eng_Latn train sentence1 3
BibleNLPBitextMining quc_Latn-eng_Latn train sentence2 3
BibleNLPBitextMining eng_Latn-ubu_Latn train sentence1 3
BibleNLPBitextMining ubu_Latn-eng_Latn train sentence2 3
BibleNLPBitextMining eng_Latn-agt_Latn train sentence1 2
BibleNLPBitextMining agt_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-amn_Latn train sentence1 2
BibleNLPBitextMining amn_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-aoj_Latn train sentence1 2
BibleNLPBitextMining aoj_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-ape_Latn train sentence1 2
BibleNLPBitextMining ape_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-arb_Arab train sentence1 2
BibleNLPBitextMining arb_Arab-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-bel_Cyrl train sentence1 2
BibleNLPBitextMining bel_Cyrl-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-beu_Latn train sentence1 2
BibleNLPBitextMining beu_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-bhg_Latn train sentence1 2
BibleNLPBitextMining bhg_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-boj_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-boj_Latn train sentence2 2
BibleNLPBitextMining boj_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining boj_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-bre_Latn train sentence1 2
BibleNLPBitextMining bre_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-bsn_Latn train sentence1 2
BibleNLPBitextMining bsn_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-bxh_Latn train sentence1 2
BibleNLPBitextMining bxh_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-ces_Latn train sentence1 2
BibleNLPBitextMining ces_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-cmn_Hans train sentence1 2
BibleNLPBitextMining cmn_Hans-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-cop_Copt train sentence1 2
BibleNLPBitextMining cop_Copt-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-ctu_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-ctu_Latn train sentence2 2
BibleNLPBitextMining ctu_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining ctu_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-cuk_Latn train sentence1 2
BibleNLPBitextMining cuk_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-deu_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-deu_Latn train sentence2 2
BibleNLPBitextMining deu_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining deu_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-dwr_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-dwr_Latn train sentence2 2
BibleNLPBitextMining dwr_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining dwr_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-eko_Latn train sentence1 2
BibleNLPBitextMining eko_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-fra_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-fra_Latn train sentence2 2
BibleNLPBitextMining fra_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining fra_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-gfk_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-gfk_Latn train sentence2 2
BibleNLPBitextMining gfk_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining gfk_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-gmv_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-gmv_Latn train sentence2 2
BibleNLPBitextMining gmv_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining gmv_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-gof_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-gof_Latn train sentence2 2
BibleNLPBitextMining gof_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining gof_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-grc_Grek train sentence1 2
BibleNLPBitextMining eng_Latn-grc_Grek train sentence2 2
BibleNLPBitextMining grc_Grek-eng_Latn train sentence1 2
BibleNLPBitextMining grc_Grek-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-gux_Latn train sentence1 2
BibleNLPBitextMining gux_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-hat_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-hat_Latn train sentence2 2
BibleNLPBitextMining hat_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining hat_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-hau_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-hau_Latn train sentence2 2
BibleNLPBitextMining hau_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining hau_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-hlt_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-hlt_Latn train sentence2 2
BibleNLPBitextMining hlt_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining hlt_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-hus_Latn train sentence1 2
BibleNLPBitextMining hus_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-ita_Latn train sentence1 2
BibleNLPBitextMining ita_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-ixl_Latn train sentence2 2
BibleNLPBitextMining ixl_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-kkc_Latn train sentence1 2
BibleNLPBitextMining kkc_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-kkl_Latn train sentence1 2
BibleNLPBitextMining kkl_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-knv_Latn train sentence1 2
BibleNLPBitextMining knv_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-ksj_Latn train sentence1 2
BibleNLPBitextMining ksj_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-lif_Deva train sentence1 2
BibleNLPBitextMining eng_Latn-lif_Deva train sentence2 2
BibleNLPBitextMining lif_Deva-eng_Latn train sentence1 2
BibleNLPBitextMining lif_Deva-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-mal_Mlym train sentence1 2
BibleNLPBitextMining mal_Mlym-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-mkj_Latn train sentence1 2
BibleNLPBitextMining mkj_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-msy_Latn train sentence1 2
BibleNLPBitextMining msy_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-mya_Latn train sentence1 2
BibleNLPBitextMining mya_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-nhe_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-nhe_Latn train sentence2 2
BibleNLPBitextMining nhe_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining nhe_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-noa_Latn train sentence1 2
BibleNLPBitextMining noa_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-okv_Latn train sentence1 2
BibleNLPBitextMining okv_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-pao_Latn train sentence1 2
BibleNLPBitextMining pao_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-pon_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-pon_Latn train sentence2 2
BibleNLPBitextMining pon_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining pon_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-por_Latn train sentence1 2
BibleNLPBitextMining por_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-ron_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-ron_Latn train sentence2 2
BibleNLPBitextMining ron_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining ron_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-san_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-san_Latn train sentence2 2
BibleNLPBitextMining san_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining san_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-seh_Latn train sentence1 2
BibleNLPBitextMining seh_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-snp_Latn train sentence1 2
BibleNLPBitextMining snp_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-spa_Latn train sentence1 2
BibleNLPBitextMining spa_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-spl_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-spl_Latn train sentence2 2
BibleNLPBitextMining spl_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining spl_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-sus_Arab train sentence1 2
BibleNLPBitextMining sus_Arab-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-swh_Latn train sentence1 2
BibleNLPBitextMining swh_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-tnc_Latn train sentence1 2
BibleNLPBitextMining tnc_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-tod_Latn train sentence1 2
BibleNLPBitextMining tod_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-tpi_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-tpi_Latn train sentence2 2
BibleNLPBitextMining tpi_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining tpi_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-tuc_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-tuc_Latn train sentence2 2
BibleNLPBitextMining tuc_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining tuc_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-tzj_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-tzj_Latn train sentence2 2
BibleNLPBitextMining tzj_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining tzj_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-tzo_Latn train sentence2 2
BibleNLPBitextMining tzo_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-uig_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-uig_Latn train sentence2 2
BibleNLPBitextMining uig_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining uig_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-urd_Arab train sentence1 2
BibleNLPBitextMining eng_Latn-urd_Arab train sentence2 2
BibleNLPBitextMining urd_Arab-eng_Latn train sentence1 2
BibleNLPBitextMining urd_Arab-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-vie_Latn train sentence1 2
BibleNLPBitextMining vie_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-wol_Latn train sentence1 2
BibleNLPBitextMining wol_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-wro_Latn train sentence1 2
BibleNLPBitextMining wro_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-yle_Latn train sentence2 2
BibleNLPBitextMining yle_Latn-eng_Latn train sentence1 2
BibleNLPBitextMining eng_Latn-yss_Latn train sentence1 2
BibleNLPBitextMining yss_Latn-eng_Latn train sentence2 2
BibleNLPBitextMining eng_Latn-aaz_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-aaz_Latn train sentence2 1
BibleNLPBitextMining aaz_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining aaz_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-abt_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-abt_Latn train sentence2 1
BibleNLPBitextMining abt_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining abt_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-aka_Latn train sentence2 1
BibleNLPBitextMining aka_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-amn_Latn train sentence2 1
BibleNLPBitextMining amn_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-anh_Latn train sentence1 1
BibleNLPBitextMining anh_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-arb_Arab train sentence2 1
BibleNLPBitextMining arb_Arab-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-arp_Latn train sentence1 1
BibleNLPBitextMining arp_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-aso_Latn train sentence2 1
BibleNLPBitextMining aso_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-aui_Latn train sentence1 1
BibleNLPBitextMining aui_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-awk_Latn train sentence1 1
BibleNLPBitextMining awk_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-azb_Arab train sentence2 1
BibleNLPBitextMining azb_Arab-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-bco_Latn train sentence1 1
BibleNLPBitextMining bco_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-bea_Latn train sentence1 1
BibleNLPBitextMining bea_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-bel_Cyrl train sentence2 1
BibleNLPBitextMining bel_Cyrl-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-bkd_Latn train sentence2 1
BibleNLPBitextMining bkd_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-bkx_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-bkx_Latn train sentence2 1
BibleNLPBitextMining bkx_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining bkx_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-bmk_Latn train sentence1 1
BibleNLPBitextMining bmk_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-ces_Latn train sentence2 1
BibleNLPBitextMining ces_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-cme_Latn train sentence2 1
BibleNLPBitextMining cme_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-cmn_Hans train sentence2 1
BibleNLPBitextMining cmn_Hans-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-cpu_Latn train sentence2 1
BibleNLPBitextMining cpu_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-cuk_Latn train sentence2 1
BibleNLPBitextMining cuk_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-dww_Latn train sentence2 1
BibleNLPBitextMining dww_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-dwy_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-dwy_Latn train sentence2 1
BibleNLPBitextMining dwy_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining dwy_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-enq_Latn train sentence2 1
BibleNLPBitextMining enq_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-hbo_Hebr train sentence2 1
BibleNLPBitextMining hbo_Hebr-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-hix_Latn train sentence2 1
BibleNLPBitextMining hix_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-hus_Latn train sentence2 1
BibleNLPBitextMining hus_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-hvn_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-hvn_Latn train sentence2 1
BibleNLPBitextMining hvn_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining hvn_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-ipi_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-ipi_Latn train sentence2 1
BibleNLPBitextMining ipi_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining ipi_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-ita_Latn train sentence2 1
BibleNLPBitextMining ita_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-kde_Latn train sentence2 1
BibleNLPBitextMining kde_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-kmh_Latn train sentence1 1
BibleNLPBitextMining kmh_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-kqc_Latn train sentence2 1
BibleNLPBitextMining kqc_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-kql_Latn train sentence1 1
BibleNLPBitextMining kql_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-ktm_Latn train sentence1 1
BibleNLPBitextMining ktm_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-kup_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-kup_Latn train sentence2 1
BibleNLPBitextMining kup_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining kup_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-kyq_Latn train sentence2 1
BibleNLPBitextMining kyq_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-llg_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-llg_Latn train sentence2 1
BibleNLPBitextMining llg_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining llg_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-mal_Mlym train sentence2 1
BibleNLPBitextMining mal_Mlym-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-mam_Latn train sentence2 1
BibleNLPBitextMining mam_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-mkj_Latn train sentence2 1
BibleNLPBitextMining mkj_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-mkn_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-mkn_Latn train sentence2 1
BibleNLPBitextMining mkn_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining mkn_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-mph_Latn train sentence1 1
BibleNLPBitextMining mph_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-msy_Latn train sentence2 1
BibleNLPBitextMining msy_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-mwc_Latn train sentence1 1
BibleNLPBitextMining mwc_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-mya_Latn train sentence2 1
BibleNLPBitextMining mya_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-mzz_Latn train sentence1 1
BibleNLPBitextMining mzz_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-nbq_Latn train sentence1 1
BibleNLPBitextMining nbq_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-nfa_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-nfa_Latn train sentence2 1
BibleNLPBitextMining nfa_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining nfa_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-nif_Latn train sentence1 1
BibleNLPBitextMining nif_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-nss_Latn train sentence1 1
BibleNLPBitextMining nss_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-nys_Latn train sentence1 1
BibleNLPBitextMining nys_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-okv_Latn train sentence2 1
BibleNLPBitextMining okv_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-por_Latn train sentence2 1
BibleNLPBitextMining por_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-ptp_Latn train sentence2 1
BibleNLPBitextMining ptp_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-qxh_Latn train sentence1 1
BibleNLPBitextMining qxh_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-rgu_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-rgu_Latn train sentence2 1
BibleNLPBitextMining rgu_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining rgu_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-row_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-row_Latn train sentence2 1
BibleNLPBitextMining row_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining row_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-rwo_Latn train sentence1 1
BibleNLPBitextMining rwo_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-snx_Latn train sentence1 1
BibleNLPBitextMining snx_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-spa_Latn train sentence2 1
BibleNLPBitextMining spa_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-sri_Latn train sentence2 1
BibleNLPBitextMining sri_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-srq_Latn train sentence2 1
BibleNLPBitextMining srq_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-sua_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-sua_Latn train sentence2 1
BibleNLPBitextMining sua_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining sua_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-swe_Latn train sentence2 1
BibleNLPBitextMining swe_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-swh_Latn train sentence2 1
BibleNLPBitextMining swh_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-tiw_Latn train sentence1 1
BibleNLPBitextMining tiw_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-toj_Latn train sentence1 1
BibleNLPBitextMining toj_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-tpa_Latn train sentence1 1
BibleNLPBitextMining tpa_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-twi_Latn train sentence2 1
BibleNLPBitextMining twi_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-txq_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-txq_Latn train sentence2 1
BibleNLPBitextMining txq_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining txq_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-ubu_Latn train sentence2 1
BibleNLPBitextMining ubu_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-ulk_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-ulk_Latn train sentence2 1
BibleNLPBitextMining ulk_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining ulk_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-urb_Latn train sentence1 1
BibleNLPBitextMining urb_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-vie_Latn train sentence2 1
BibleNLPBitextMining vie_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-viv_Latn train sentence2 1
BibleNLPBitextMining viv_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-wat_Latn train sentence1 1
BibleNLPBitextMining wat_Latn-eng_Latn train sentence2 1
BibleNLPBitextMining eng_Latn-wol_Latn train sentence2 1
BibleNLPBitextMining wol_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-wro_Latn train sentence2 1
BibleNLPBitextMining wro_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-xla_Latn train sentence2 1
BibleNLPBitextMining xla_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-xsi_Latn train sentence2 1
BibleNLPBitextMining xsi_Latn-eng_Latn train sentence1 1
BibleNLPBitextMining eng_Latn-zaj_Latn train sentence2 1
BibleNLPBitextMining zaj_Latn-eng_Latn train sentence1 1
NTREXBitextMining default test orm_Ethi 13
NTREXBitextMining default test fij_Latn 9
NTREXBitextMining default test wol_Latn 7
NTREXBitextMining default test dzo_Tibt 5
NTREXBitextMining default test ven_Latn 4
NTREXBitextMining default test nya_Latn 4
NTREXBitextMining default test mey_Arab 4
NTREXBitextMining default test bod_Tibt 4
NTREXBitextMining default test ewe_Latn 3
NTREXBitextMining default test amh_Ethi 3
NTREXBitextMining default test jpn_Jpan 3
NTREXBitextMining default test lav_Latn 3
NTREXBitextMining default test mlg_Latn 3
NTREXBitextMining default test arb_Arab 2
NTREXBitextMining default test lit_Latn 2
NTREXBitextMining default test sna_Latn 2
NTREXBitextMining default test dan_Latn 2
NTREXBitextMining default test kat_Geor 2
NTREXBitextMining default test mar_Deva 2
NTREXBitextMining default test tgk_Cyrl 2
NTREXBitextMining default test bak_Cyrl 2
NTREXBitextMining default test kor_Hang 2
NTREXBitextMining default test fas_Arab 2
NTREXBitextMining default test ckb_Arab 2
NTREXBitextMining default test srp_Cyrl 2
NTREXBitextMining default test mal_Mlym 1
NTREXBitextMining default test pus_Arab 1
NTREXBitextMining default test zho_Hant 1
NTREXBitextMining default test uig_Arab 1
NTREXBitextMining default test shi_Arab 1
NTREXBitextMining default test ukr_Cyrl 1
NTREXBitextMining default test rus_Cyrl 1
NTREXBitextMining default test bul_Cyrl 1
NTREXBitextMining default test smo_Latn 1
NTREXBitextMining default test ssw_Latn 1
NTREXBitextMining default test ell_Grek 1
NTREXBitextMining default test tuk_Latn 1
NTREXBitextMining default test khm_Khmr 1
NTREXBitextMining default test nso_Latn 1
NTREXBitextMining default test tel_Telu 1
NTREXBitextMining default test swe_Latn 1
NTREXBitextMining default test sqi_Latn 1
NTREXBitextMining default test fin_Latn 1
NTREXBitextMining default test kan_Knda 1
NTREXBitextMining default test deu_Latn 1
NTREXBitextMining default test nno_Latn 1
NTREXBitextMining default test por_Latn 1
NTREXBitextMining default test tir_Ethi 1
NTREXBitextMining default test bos_Latn 1
NTREXBitextMining default test srp_Latn 1
NTREXBitextMining default test kin_Latn 1
NTREXBitextMining default test div_Thaa 1
NTREXBitextMining default test kaz_Cyrl 1
NTREXBitextMining default test urd_Arab 1
NTREXBitextMining default test yue_Hant 1
NTREXBitextMining default test slk_Latn 1
NTREXBitextMining default test hye_Armn 1
NTREXBitextMining default test kmr_Latn 1
NTREXBitextMining default test ita_Latn 1
NTREXBitextMining default test nld_Latn 1
NTREXBitextMining default test zul_Latn 1
NTREXBitextMining default test kir_Cyrl 1
NTREXBitextMining default test nep_Deva 1
NTREXBitextMining default test slv_Latn 1
NTREXBitextMining default test hin_Deva 1
NTREXBitextMining default test pol_Latn 1
NTREXBitextMining default test isl_Latn 1
NTREXBitextMining default test uzb_Latn 1
NTREXBitextMining default test bel_Cyrl 1
NTREXBitextMining default test yor_Latn 1
NTREXBitextMining default test sin_Sinh 1
NTREXBitextMining default test tha_Thai 1
NTREXBitextMining default test ltz_Latn 1
NTREXBitextMining default test heb_Hebr 1
NTREXBitextMining default test glg_Latn 1
NTREXBitextMining default test nob_Latn 1
NTREXBitextMining default test fuc_Latn 1
NTREXBitextMining default test vie_Latn 1
NTREXBitextMining default test mlt_Latn 1
NTREXBitextMining default test fra_Latn 1
NTREXBitextMining default test afr_Latn 1
NTREXBitextMining default test tat_Cyrl 1
NTREXBitextMining default test spa_Latn 1
NTREXBitextMining default test snd_Arab 1
NTREXBitextMining default test pan_Guru 1
IN22ConvBitextMining default test snd_Deva 13
IN22ConvBitextMining default test tam_Taml 11
IN22ConvBitextMining default test mal_Mlym 8
IN22ConvBitextMining default test tel_Telu 8
IN22ConvBitextMining default test pan_Guru 8
IN22ConvBitextMining default test ben_Beng 6
IN22ConvBitextMining default test asm_Beng 6
IN22ConvBitextMining default test hin_Deva 6
IN22ConvBitextMining default test eng_Latn 6
IN22ConvBitextMining default test npi_Deva 6
IN22ConvBitextMining default test urd_Arab 5
IN22ConvBitextMining default test mni_Mtei 5
IN22ConvBitextMining default test brx_Deva 5
IN22ConvBitextMining default test mai_Deva 4
IN22ConvBitextMining default test doi_Deva 4
IN22ConvBitextMining default test kan_Knda 4
IN22ConvBitextMining default test guj_Gujr 3
IN22ConvBitextMining default test gom_Deva 3
IN22ConvBitextMining default test sat_Olck 3
IN22ConvBitextMining default test san_Deva 3
IN22ConvBitextMining default test ory_Orya 3
IN22ConvBitextMining default test mar_Deva 2
IN22ConvBitextMining default test kas_Arab 1
IWSLT2017BitextMining en-zh validation sentence2 12
IWSLT2017BitextMining zh-en validation sentence1 12
IWSLT2017BitextMining en-it validation sentence2 11
IWSLT2017BitextMining it-en validation sentence1 11
IWSLT2017BitextMining en-fr validation sentence2 9
IWSLT2017BitextMining fr-en validation sentence1 9
IWSLT2017BitextMining ar-en validation sentence2 7
IWSLT2017BitextMining de-en validation sentence1 7
IWSLT2017BitextMining de-en validation sentence2 7
IWSLT2017BitextMining en-ar validation sentence1 7
IWSLT2017BitextMining en-de validation sentence1 7
IWSLT2017BitextMining en-de validation sentence2 7
IWSLT2017BitextMining en-fr validation sentence1 7
IWSLT2017BitextMining en-it validation sentence1 7
IWSLT2017BitextMining en-ja validation sentence1 7
IWSLT2017BitextMining en-ko validation sentence1 7
IWSLT2017BitextMining en-ko validation sentence2 7
IWSLT2017BitextMining en-nl validation sentence1 7
IWSLT2017BitextMining en-ro validation sentence1 7
IWSLT2017BitextMining en-zh validation sentence1 7
IWSLT2017BitextMining fr-en validation sentence2 7
IWSLT2017BitextMining it-en validation sentence2 7
IWSLT2017BitextMining it-nl validation sentence1 7
IWSLT2017BitextMining it-ro validation sentence1 7
IWSLT2017BitextMining ja-en validation sentence2 7
IWSLT2017BitextMining ko-en validation sentence1 7
IWSLT2017BitextMining ko-en validation sentence2 7
IWSLT2017BitextMining nl-en validation sentence2 7
IWSLT2017BitextMining nl-it validation sentence2 7
IWSLT2017BitextMining ro-en validation sentence2 7
IWSLT2017BitextMining ro-it validation sentence2 7
IWSLT2017BitextMining zh-en validation sentence2 7
IWSLT2017BitextMining en-ja validation sentence2 5
IWSLT2017BitextMining ja-en validation sentence1 5
IWSLT2017BitextMining en-ro validation sentence2 4
IWSLT2017BitextMining it-ro validation sentence2 4
IWSLT2017BitextMining nl-ro validation sentence1 4
IWSLT2017BitextMining nl-ro validation sentence2 4
IWSLT2017BitextMining ro-en validation sentence1 4
IWSLT2017BitextMining ro-it validation sentence1 4
IWSLT2017BitextMining ro-nl validation sentence1 4
IWSLT2017BitextMining ro-nl validation sentence2 4
IWSLT2017BitextMining en-nl validation sentence2 3
IWSLT2017BitextMining it-nl validation sentence2 3
IWSLT2017BitextMining nl-en validation sentence1 3
IWSLT2017BitextMining nl-it validation sentence1 3
IWSLT2017BitextMining ar-en validation sentence1 1
IWSLT2017BitextMining en-ar validation sentence2 1
BUCC.v2 ru-en test sentence2 11
BUCC.v2 de-en test sentence2 10
BUCC.v2 fr-en test sentence2 10
BUCC.v2 ru-en test sentence1 10
BUCC.v2 de-en test sentence1 7
BUCC.v2 fr-en test sentence1 5
BornholmBitextMining default test sentence2 9
BornholmBitextMining default test sentence1 3
NollySentiBitextMining en-ha train sentence1 5
NollySentiBitextMining en-ig train sentence1 5
NollySentiBitextMining en-pcm train sentence1 5
NollySentiBitextMining en-yo train sentence1 5
NollySentiBitextMining en-ha train sentence2 3
NollySentiBitextMining en-ig train sentence2 2
NollySentiBitextMining en-pcm train sentence2 2
NollySentiBitextMining en-yo train sentence2 1
TbilisiCityHallBitextMining kat_Geor-eng_Latn test sentence2 4
TbilisiCityHallBitextMining eng_Latn-kat_Geor test sentence1 4
IndicGenBenchFloresBitextMining san-eng validation sentence1 3
IndicGenBenchFloresBitextMining eng-san validation sentence2 3
IndicGenBenchFloresBitextMining bod-eng test sentence1 3
IndicGenBenchFloresBitextMining eng-bod test sentence2 3
IndicGenBenchFloresBitextMining sat-eng validation sentence1 2
IndicGenBenchFloresBitextMining eng-sat validation sentence2 2
IndicGenBenchFloresBitextMining urd-eng validation sentence1 1
IndicGenBenchFloresBitextMining eng-urd validation sentence2 1
IndicGenBenchFloresBitextMining san-eng test sentence1 1
IndicGenBenchFloresBitextMining eng-san test sentence2 1
IndicGenBenchFloresBitextMining bod-eng validation sentence1 1
IndicGenBenchFloresBitextMining eng-bod validation sentence2 1
IndicGenBenchFloresBitextMining boy-eng test sentence1 1
IndicGenBenchFloresBitextMining eng-boy test sentence2 1
IndicGenBenchFloresBitextMining hne-eng test sentence1 1
IndicGenBenchFloresBitextMining eng-hne test sentence2 1
FloresBitextMining default devtest kik_Latn 3
FloresBitextMining default devtest bam_Latn 3
FloresBitextMining default devtest bod_Tibt 3
FloresBitextMining default devtest lua_Latn 2
FloresBitextMining default devtest dyu_Latn 2
FloresBitextMining default devtest kam_Latn 2
FloresBitextMining default devtest hat_Latn 2
FloresBitextMining default devtest grn_Latn 2
FloresBitextMining default devtest min_Latn 1
FloresBitextMining default devtest uig_Arab 1
FloresBitextMining default devtest smo_Latn 1
FloresBitextMining default devtest ary_Arab 1
FloresBitextMining default devtest szl_Latn 1
FloresBitextMining default devtest san_Deva 1
FloresBitextMining default devtest min_Arab 1
FloresBitextMining default devtest hne_Deva 1
FloresBitextMining default devtest tgk_Cyrl 1
FloresBitextMining default devtest gla_Latn 1
FloresBitextMining default devtest run_Latn 1
NorwegianCourtsBitextMining default test sentence2 2
NorwegianCourtsBitextMining default test sentence1 1
VieMedEVBitextMining default test sentence2 1
IN22GenBitextMining default test tam_Taml 1

Example from LinceMTBitextMining task:

[{'sentence1': ')',
  'sentence2': '@mamtan14 didi abhi apni me time hai',
  'lang': 'eng-eng_hin'},
 {'sentence1': ')',
  'sentence2': '@iamGunjanGrunge Tanmay jail ko hi kha jata ',
  'lang': 'eng-eng_hin'},
 {'sentence1': ')',
  'sentence2': '@theamarjeet Haan bhai Ayesha ke sab fan hai ',
  'lang': 'eng-eng_hin'},
 {'sentence1': ')',
  'sentence2': '@Sand_In_Deed sabko samaj me aaya hi nahi ',
  'lang': 'eng-eng_hin'},
 {'sentence1': ')',
  'sentence2': 'ladka gaming mai khoya hua hai.... ',
  'lang': 'eng-eng_hin'}]

The character ) is repeated 149 times in the first column, and it has 149 different translations.

I counted duplicates for each column, so each task has to be dealt with separately. In some tasks, like DiaBlaBitextMining, there are completely identical rows, but there are also different translations - this is normal for this task.

| orig_lang   | sentence1     | sentence2       |
|:------------|:--------------|:----------------|
| french      | Absolument.   | Absolutely.     |
| french      | Absolument.   | Exactly.        |
| english     | Ah OK.        | Ah ok.          |
| english     | Ah OK.        | Ah, okay.       |
| french      | Ah très bien. | Oh that's good. |
| french      | Ah très bien. | Ah, nice.       |
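The per-column counting described above can be sketched as follows. This is a minimal illustration, not the script used for the tables (which is not shown in the thread). It assumes rows are plain dicts and that a "duplicate" is any occurrence of a value beyond its first appearance; `count_column_duplicates` is a hypothetical helper name.

```python
from collections import Counter


def count_column_duplicates(rows, column):
    """Count surplus rows: occurrences of a value beyond its first appearance."""
    counts = Counter(row[column] for row in rows)
    return sum(n - 1 for n in counts.values())


# Rows taken from the DiaBlaBitextMining excerpt above.
rows = [
    {"sentence1": "Absolument.", "sentence2": "Absolutely."},
    {"sentence1": "Absolument.", "sentence2": "Exactly."},
    {"sentence1": "Ah OK.", "sentence2": "Ah ok."},
    {"sentence1": "Ah OK.", "sentence2": "Ah, okay."},
]

for column in ("sentence1", "sentence2"):
    print(column, count_column_duplicates(rows, column))
```

On these four rows, sentence1 has two surplus rows and sentence2 has none, which is why a per-column count alone cannot tell legitimate translation variants from real duplicates.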

@KennethEnevoldsen
Contributor Author

KennethEnevoldsen commented Nov 7, 2024

Great work

Oh wow!

Also seems like there might be a few trivial 1-word / short documents that might be worth removing to make the datasets more challenging. WDYT?

@Samoed
Collaborator

Samoed commented Nov 7, 2024

Awesome! @AlexeyVatolin you can run calculate_metadata_metrics() on a task to calculate these more automatically (at the very least, it will count the number of documents).

@KennethEnevoldsen
Contributor Author

Adding this to calculate_metadata_metrics is a great approach. I am not sure how we should count duplicates (per column or across columns).

@Samoed
Collaborator

Samoed commented Nov 8, 2024

I can add the number of unique sentences to tasks. Maybe we can add more information to the tasks as well.

@AlexeyVatolin
Contributor

AlexeyVatolin commented Nov 8, 2024

@KennethEnevoldsen I think it would be a great idea for the calculate_metadata_metrics function to issue a warning if a dataset contains duplicates.
Also, duplicate checks should depend on the task. For the BitextMining task I propose these criteria:

  1. a few (2-3) translation variants (duplicates in sentence1, different sentence2, same lang) - ok
  2. complete duplicates (sentence1, sentence2 and lang all coincide) - not ok
  3. many translations of one sentence (duplicates in sentence1, many different sentence2, same lang), as with the 149 translations of ) in LinceMTBitextMining - not ok

Additionally, a text length check could be implemented, with a warning message displayed when the text is particularly short.
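The three BitextMining criteria above could be checked roughly like this. It is only a sketch under stated assumptions: rows are dicts with `sentence1`, `sentence2` and `lang` keys, the cut-off between "a few variants" and "too many translations" is 3 (taken from the "2-3" in the proposal), and `classify_bitext_duplicates` plus its labels are hypothetical names, not part of mteb.

```python
from collections import defaultdict


def classify_bitext_duplicates(rows, max_variants=3):
    """Label every repeated (lang, sentence1) group according to the criteria."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["lang"], row["sentence1"])].append(row["sentence2"])

    report = {}
    for key, targets in groups.items():
        if len(targets) == 1:
            continue  # sentence1 is unique within this lang, nothing to report
        distinct = set(targets)
        if len(distinct) < len(targets):
            report[key] = "complete_duplicate"     # criterion 2: not ok
        elif len(distinct) <= max_variants:
            report[key] = "translation_variants"   # criterion 1: ok
        else:
            report[key] = "too_many_translations"  # criterion 3: not ok
    return report


rows = [
    {"lang": "fr-en", "sentence1": "Absolument.", "sentence2": "Absolutely."},
    {"lang": "fr-en", "sentence1": "Absolument.", "sentence2": "Exactly."},
    {"lang": "fr-en", "sentence1": "Ah OK.", "sentence2": "Ah ok."},
    {"lang": "fr-en", "sentence1": "Ah OK.", "sentence2": "Ah ok."},
]
# Four different "translations" of the same one-character source sentence.
rows += [
    {"lang": "eng-eng_hin", "sentence1": ")", "sentence2": f"text {i}"}
    for i in range(4)
]

report = classify_bitext_duplicates(rows)
```

A group that contains both exact repeats and extra variants is reported as a complete duplicate here; whether that should take priority is a design choice the thread leaves open.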

@KennethEnevoldsen
Contributor Author

Definitely worth adding to calculate_metadata_metrics; then it will show up in review as well, and we can have a look at it before merging.

In the test suite we might also implement limits on duplicates, test set leakage, and minimum document length (requiring datasets to be added to an exception list if they break any of the requirements).

If we have this I am not sure that warnings will be required.
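A test-suite check with limits and an exception list, as suggested above, might look like the following sketch. Everything here is hypothetical: the `QualityLimits` fields, the shape of the `stats` dict, and the task names are illustrative assumptions, not existing mteb structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityLimits:
    max_duplicates: int = 0
    max_leaked: int = 0
    min_document_length: int = 1


def find_violations(stats, limits, exceptions):
    """Return {task: [failed checks]} for tasks that are not on the exception list."""
    violations = {}
    for task, s in stats.items():
        failed = []
        if s["num_duplicates"] > limits.max_duplicates:
            failed.append("duplicates")
        if s["num_leaked"] > limits.max_leaked:
            failed.append("leakage")
        if s["min_length"] < limits.min_document_length:
            failed.append("min_length")
        if failed and task not in exceptions:
            violations[task] = failed
    return violations


# Made-up statistics for three placeholder tasks.
stats = {
    "TaskA": {"num_duplicates": 5, "num_leaked": 0, "min_length": 3},
    "TaskB": {"num_duplicates": 0, "num_leaked": 10, "min_length": 0},
    "TaskC": {"num_duplicates": 0, "num_leaked": 0, "min_length": 12},
}
exceptions = {"TaskA"}  # known issue, explicitly accepted

violations = find_violations(stats, QualityLimits(), exceptions)
```

A test would then simply assert that `violations` is empty, so a new dataset either passes the limits or has to be added to the exception list on purpose.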

@KennethEnevoldsen
Contributor Author

@gentaiscool just letting you know about this issue as well (notably see LinceMTBitextMining in table above)

@AlexeyVatolin
Contributor

In the table below, I counted how many complete duplicates there are in the BitextMining tasks (identical text pairs across the sentence1 and sentence2 columns).

Dataset Name HF Subset Split Num Documents
LinceMTBitextMining eng-eng_hin train 513
DiaBlaBitextMining fr-en test 104
DiaBlaBitextMining en-fr test 104
BornholmBitextMining default train 24
BibleNLPBitextMining eng_Latn-mgh_Latn train 21
BibleNLPBitextMining mgh_Latn-eng_Latn train 21
BibleNLPBitextMining eng_Latn-wed_Latn train 14
BibleNLPBitextMining wed_Latn-eng_Latn train 14
BibleNLPBitextMining eng_Latn-xbi_Latn train 8
BibleNLPBitextMining xbi_Latn-eng_Latn train 8
BibleNLPBitextMining eng_Latn-dhg_Latn train 7
BibleNLPBitextMining dhg_Latn-eng_Latn train 7
BibleNLPBitextMining eng_Latn-ind_Latn train 5
BibleNLPBitextMining ind_Latn-eng_Latn train 5
BibleNLPBitextMining eng_Latn-omw_Latn train 4
BibleNLPBitextMining omw_Latn-eng_Latn train 4
BibleNLPBitextMining eng_Latn-pol_Latn train 4
BibleNLPBitextMining pol_Latn-eng_Latn train 4
BibleNLPBitextMining eng_Latn-rmy_Latn train 4
BibleNLPBitextMining rmy_Latn-eng_Latn train 4
BibleNLPBitextMining eng_Latn-tbg_Latn train 4
BibleNLPBitextMining tbg_Latn-eng_Latn train 4
BibleNLPBitextMining eng_Latn-byr_Latn train 3
BibleNLPBitextMining byr_Latn-eng_Latn train 3
BibleNLPBitextMining eng_Latn-quc_Latn train 3
BibleNLPBitextMining quc_Latn-eng_Latn train 3
BibleNLPBitextMining eng_Latn-boj_Latn train 2
BibleNLPBitextMining boj_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-ctu_Latn train 2
BibleNLPBitextMining ctu_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-deu_Latn train 2
BibleNLPBitextMining deu_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-dwr_Latn train 2
BibleNLPBitextMining dwr_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-fra_Latn train 2
BibleNLPBitextMining fra_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-gfk_Latn train 2
BibleNLPBitextMining gfk_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-gmv_Latn train 2
BibleNLPBitextMining gmv_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-gof_Latn train 2
BibleNLPBitextMining gof_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-grc_Grek train 2
BibleNLPBitextMining grc_Grek-eng_Latn train 2
BibleNLPBitextMining eng_Latn-hat_Latn train 2
BibleNLPBitextMining hat_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-hau_Latn train 2
BibleNLPBitextMining hau_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-hlt_Latn train 2
BibleNLPBitextMining hlt_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-ixl_Latn train 2
BibleNLPBitextMining ixl_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-lif_Deva train 2
BibleNLPBitextMining lif_Deva-eng_Latn train 2
BibleNLPBitextMining eng_Latn-nhe_Latn train 2
BibleNLPBitextMining nhe_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-pon_Latn train 2
BibleNLPBitextMining pon_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-ron_Latn train 2
BibleNLPBitextMining ron_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-san_Latn train 2
BibleNLPBitextMining san_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-spl_Latn train 2
BibleNLPBitextMining spl_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-tpi_Latn train 2
BibleNLPBitextMining tpi_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-tuc_Latn train 2
BibleNLPBitextMining tuc_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-tzj_Latn train 2
BibleNLPBitextMining tzj_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-tzo_Latn train 2
BibleNLPBitextMining tzo_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-uig_Latn train 2
BibleNLPBitextMining uig_Latn-eng_Latn train 2
BibleNLPBitextMining eng_Latn-urd_Arab train 2
BibleNLPBitextMining urd_Arab-eng_Latn train 2
BibleNLPBitextMining eng_Latn-aaz_Latn train 1
BibleNLPBitextMining aaz_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-abt_Latn train 1
BibleNLPBitextMining abt_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-amn_Latn train 1
BibleNLPBitextMining amn_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-arb_Arab train 1
BibleNLPBitextMining arb_Arab-eng_Latn train 1
BibleNLPBitextMining eng_Latn-bel_Cyrl train 1
BibleNLPBitextMining bel_Cyrl-eng_Latn train 1
BibleNLPBitextMining eng_Latn-bkx_Latn train 1
BibleNLPBitextMining bkx_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-ces_Latn train 1
BibleNLPBitextMining ces_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-cmn_Hans train 1
BibleNLPBitextMining cmn_Hans-eng_Latn train 1
BibleNLPBitextMining eng_Latn-cuk_Latn train 1
BibleNLPBitextMining cuk_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-hus_Latn train 1
BibleNLPBitextMining hus_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-hvn_Latn train 1
BibleNLPBitextMining hvn_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-ipi_Latn train 1
BibleNLPBitextMining ipi_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-ita_Latn train 1
BibleNLPBitextMining ita_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-kup_Latn train 1
BibleNLPBitextMining kup_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-llg_Latn train 1
BibleNLPBitextMining llg_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-mal_Mlym train 1
BibleNLPBitextMining mal_Mlym-eng_Latn train 1
BibleNLPBitextMining eng_Latn-mam_Latn train 1
BibleNLPBitextMining mam_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-mkj_Latn train 1
BibleNLPBitextMining mkj_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-mkn_Latn train 1
BibleNLPBitextMining mkn_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-msy_Latn train 1
BibleNLPBitextMining msy_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-mya_Latn train 1
BibleNLPBitextMining mya_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-nfa_Latn train 1
BibleNLPBitextMining nfa_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-okv_Latn train 1
BibleNLPBitextMining okv_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-por_Latn train 1
BibleNLPBitextMining por_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-rgu_Latn train 1
BibleNLPBitextMining rgu_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-row_Latn train 1
BibleNLPBitextMining row_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-spa_Latn train 1
BibleNLPBitextMining spa_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-sua_Latn train 1
BibleNLPBitextMining sua_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-swh_Latn train 1
BibleNLPBitextMining swh_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-txq_Latn train 1
BibleNLPBitextMining txq_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-ubu_Latn train 1
BibleNLPBitextMining ubu_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-ulk_Latn train 1
BibleNLPBitextMining ulk_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-vie_Latn train 1
BibleNLPBitextMining vie_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-wol_Latn train 1
BibleNLPBitextMining wol_Latn-eng_Latn train 1
BibleNLPBitextMining eng_Latn-wro_Latn train 1
BibleNLPBitextMining wro_Latn-eng_Latn train 1
BUCC.v2 ru-en test 10
BUCC.v2 de-en test 7
BUCC.v2 fr-en test 5
NusaTranslationBitextMining ind-sun train 9
NusaTranslationBitextMining ind-jav train 6
NusaTranslationBitextMining ind-mak train 6
NusaTranslationBitextMining ind-bew train 5
NusaTranslationBitextMining ind-min train 5
NusaTranslationBitextMining ind-btk train 3
NusaTranslationBitextMining ind-mad train 2
NusaTranslationBitextMining ind-abs train 1
IWSLT2017BitextMining en-fr validation 7
IWSLT2017BitextMining fr-en validation 7
IWSLT2017BitextMining de-en validation 5
IWSLT2017BitextMining en-de validation 5
IWSLT2017BitextMining en-it validation 5
IWSLT2017BitextMining en-ko validation 5
IWSLT2017BitextMining it-en validation 5
IWSLT2017BitextMining ko-en validation 5
IWSLT2017BitextMining en-ja validation 4
IWSLT2017BitextMining en-ro validation 4
IWSLT2017BitextMining ja-en validation 4
IWSLT2017BitextMining ro-en validation 4
IWSLT2017BitextMining en-nl validation 3
IWSLT2017BitextMining it-nl validation 3
IWSLT2017BitextMining it-ro validation 3
IWSLT2017BitextMining nl-en validation 3
IWSLT2017BitextMining nl-it validation 3
IWSLT2017BitextMining ro-it validation 3
IWSLT2017BitextMining en-zh validation 2
IWSLT2017BitextMining nl-ro validation 2
IWSLT2017BitextMining ro-nl validation 2
IWSLT2017BitextMining zh-en validation 2
IWSLT2017BitextMining ar-en validation 1
IWSLT2017BitextMining en-ar validation 1
NollySentiBitextMining en-ha train 3
NollySentiBitextMining en-pcm train 2
NollySentiBitextMining en-ig train 1
NollySentiBitextMining en-yo train 1
PhincBitextMining eng-eng_hin train 1

@AlexeyVatolin
Contributor

Classification

Duplicated documents

| Dataset Name | Split | Column Name | Num Documents |
|:---|:---|:---|:---|
| CzechProductReviewSentimentClassification | train | text | 3591 |
| CzechProductReviewSentimentClassification | test | text | 45 |
| BengaliDocumentClassification | train | text | 280 |
| TweetSarcasmClassification | train | text | 259 |
| TweetSarcasmClassification | test | text | 16 |
| OnlineStoreReviewSentimentClassification | train | text | 176 |
| OnlineStoreReviewSentimentClassification | train | text | 176 |
| BengaliSentimentAnalysis | train | text | 113 |
| BengaliSentimentAnalysis | train | text | 113 |
| BengaliHateSpeechClassification | train | text | 94 |
| BengaliHateSpeechClassification | train | text | 94 |
| CSFDCZMovieReviewSentimentClassification | train | text | 10 |
| RestaurantReviewSentimentClassification | train | text | 5 |
| RestaurantReviewSentimentClassification | train | text | 5 |
| HotelReviewSentimentClassification | train | text | 4 |
| HotelReviewSentimentClassification | train | text | 4 |
| TweetEmotionClassification | train | text | 1 |
| TweetEmotionClassification | train | text | 1 |

Example from CzechProductReviewSentimentClassification:
The train set of this dataset contains 3481 full duplicates (identical text-label pairs) and also 110 contradictory examples (same text, different labels) like these:

| text | label |
|:---|:---|
| vše OK | 2 |
| vše OK | 1 |
| vše ok | 1 |
| vše ok | 2 |
| zatím naprostá spokojenost | 2 |
| zatím naprostá spokojenost | 1 |
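The split between full duplicates and contradictory examples can be sketched as below. The exact counting rule behind the 3481 and 110 figures is not given in the thread, so this assumes one plausible convention: a row is a full duplicate if its (text, label) pair was seen before, and contradictory if only its text was seen before with a different label. `split_duplicates` is a hypothetical helper.

```python
def split_duplicates(examples):
    """Return (full_duplicates, contradictory) for a list of (text, label) pairs."""
    seen_texts = set()
    seen_pairs = set()
    full_duplicates = 0
    contradictory = 0
    for text, label in examples:
        if (text, label) in seen_pairs:
            full_duplicates += 1   # same text and same label as an earlier row
        elif text in seen_texts:
            contradictory += 1     # same text, but a label not seen for it yet
        seen_texts.add(text)
        seen_pairs.add((text, label))
    return full_duplicates, contradictory


examples = [
    ("vše OK", 2),
    ("vše OK", 1),
    ("vše OK", 2),
    ("zatím naprostá spokojenost", 2),
]
full_duplicates, contradictory = split_duplicates(examples)
```

Under this convention the two numbers always add up to the per-column duplicate count for the text column, which is consistent with 3481 + 110 = 3591 in the table above.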

Empty documents

| Dataset Name | Split | Column Name | Num Documents |
|:---|:---|:---|:---|
| BengaliDocumentClassification | train | text | 21 |
| CzechProductReviewSentimentClassification | train | text | 7 |
| CzechProductReviewSentimentClassification | test | text | 2 |
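A check for empty documents, combined with the short-text check proposed earlier in the thread, could be as small as the sketch below. It assumes that whitespace-only strings count as empty and uses an arbitrary threshold of 3 characters for "short"; both are illustrative assumptions, and `count_empty_and_short` is a hypothetical name.

```python
def count_empty_and_short(texts, min_chars=3):
    """Return (empty, short): blank documents and non-blank ones under min_chars."""
    stripped = [text.strip() for text in texts]
    empty = sum(1 for text in stripped if not text)
    short = sum(1 for text in stripped if 0 < len(text) < min_chars)
    return empty, short


texts = ["", "   ", ")", "vše OK", "ok"]
empty, short = count_empty_and_short(texts)
```

Character counts are a crude proxy for length; a word-based or tokenizer-based threshold may suit some languages better.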

Leaks

| Dataset Name | Split | Num Documents |
|:---|:---|:---|
| TweetEmotionClassification | train | 2047 |
| HotelReviewSentimentClassification | train | 2044 |
| RestaurantReviewSentimentClassification | train | 2043 |
| BengaliHateSpeechClassification | train | 2003 |
| BengaliSentimentAnalysis | train | 1936 |
| OnlineStoreReviewSentimentClassification | train | 1910 |
| AJGT | train | 1800 |
| CzechProductReviewSentimentClassification | test | 504 |
| TweetSarcasmClassification | test | 118 |
| CSFDCZMovieReviewSentimentClassification | test | 2 |
| BengaliDocumentClassification | test | 2 |

All datasets with over 1000 leaks are completely broken because they use the same data for training and for evaluation. I think we should disallow the use of eval_splits=["train"] for all supervised tasks. As for the other datasets, their train and test splits contain some of the same examples (from 2 up to 504).
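Both points above, counting leaked documents and rejecting evaluation on the train split, can be sketched as follows. This assumes leakage means an exact text match between splits; `count_leaked` and `check_eval_splits` are hypothetical helpers, and only the `eval_splits` name comes from the thread.

```python
def count_leaked(train_texts, eval_texts):
    """Number of evaluation documents whose exact text also occurs in train."""
    train_set = set(train_texts)
    return sum(1 for text in eval_texts if text in train_set)


def check_eval_splits(eval_splits):
    """Reject configurations that evaluate a supervised task on its train split."""
    if "train" in eval_splits:
        raise ValueError("supervised tasks must not evaluate on the train split")


train_texts = ["a", "b", "c"]
test_texts = ["b", "d", "b"]
leaked = count_leaked(train_texts, test_texts)

check_eval_splits(["test"])  # passes silently
```

Exact matching misses near-duplicates (for example, differences in casing or whitespace), so this gives a lower bound on leakage.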

@AlexeyVatolin
Contributor

Clustering

Duplicated documents

TwentyNewsgroupsClustering.v2 has only 5242 unique texts out of 59545, which is a pretty sad situation.
TwentyNewsgroupsClustering.v2, StackExchangeClusteringP2P.v2 and MedrxivClusteringP2P.v2 are tasks from the MTEB(eng, beta) benchmark. I believe it would be beneficial to remove the duplicates before this benchmark leaves beta status.
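Removing duplicates from a clustering task could be done along these lines. This is a sketch, assuming sentences and labels are parallel lists and that keeping the first occurrence of each sentence is acceptable; `deduplicate` is a hypothetical helper, not the fix that was actually applied.

```python
def deduplicate(sentences, labels):
    """Keep the first occurrence of each sentence together with its label."""
    seen = set()
    kept_sentences, kept_labels = [], []
    for sentence, label in zip(sentences, labels):
        if sentence in seen:
            continue
        seen.add(sentence)
        kept_sentences.append(sentence)
        kept_labels.append(label)
    return kept_sentences, kept_labels


sentences = ["a", "b", "a", "c", "b"]
labels = [0, 1, 0, 2, 1]
unique_sentences, unique_labels = deduplicate(sentences, labels)
removed = len(sentences) - len(unique_sentences)
```

The number removed is the total minus the unique count, the same arithmetic as 59545 - 5242 = 54303 for TwentyNewsgroupsClustering.v2. If a repeated sentence carries different cluster labels, keeping the first one is an arbitrary choice that would need a per-task decision.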

Dataset Name HF Subset Split Column Name Num Documents
TwentyNewsgroupsClustering.v2 default test sentences 54303
StackExchangeClusteringP2P.v2 default test sentences 36269
RedditClusteringP2P.v2 default test sentences 27884
MedrxivClusteringS2S.v2 default test sentences 21236
MedrxivClusteringP2P.v2 default test sentences 21235
SwednClusteringS2S default headlines sentences 870
WikiClusteringP2P.v2 mt test sentences 823
WikiClusteringP2P.v2 ilo test sentences 766
WikiClusteringP2P.v2 wa test sentences 681
WikiClusteringP2P.v2 sco test sentences 610
WikiClusteringP2P.v2 gv test sentences 557
WikiClusteringP2P.v2 ku test sentences 450
WikiClusteringP2P.v2 sq test sentences 447
WikiClusteringP2P.v2 bs test sentences 333
WikiClusteringP2P.v2 eu test sentences 238
WikiClusteringP2P.v2 min test sentences 237
WikiClusteringP2P.v2 ca test sentences 124
WikiClusteringP2P.v2 lv test sentences 106
WikiClusteringP2P.v2 cs test sentences 89
WikiClusteringP2P.v2 da test sentences 85
SwednClusteringP2P default summaries sentences 183
SwednClusteringP2P default articles sentences 84
BlurbsClusteringS2S.v2 default test sentences 103
SpanishNewsClusteringP2P default test sentences 79
PlscClusteringS2S.v2 default test sentences 55
PlscClusteringP2P.v2 default test sentences 55
MLSUMClusteringS2S.v2 es validation sentences 44
MLSUMClusteringS2S.v2 es test sentences 34
MLSUMClusteringS2S.v2 fr validation sentences 14
MLSUMClusteringS2S.v2 fr test sentences 3
RedditClustering.v2 default test sentences 30
MLSUMClusteringP2P.v2 es test sentences 17
MLSUMClusteringP2P.v2 fr test sentences 1
BiorxivClusteringS2S.v2 default test sentences 14
AlloProfClusteringS2S.v2 default test sentences 8
HALClusteringS2S.v2 default test sentences 6
RomaniBibleClustering default test sentences 5
StackExchangeClustering.v2 default test sentences 3
MasakhaNEWSClusteringS2S amh test sentences 3
MasakhaNEWSClusteringS2S ibo test sentences 3
MasakhaNEWSClusteringP2P amh test sentences 3
MasakhaNEWSClusteringP2P ibo test sentences 3
IndicReviewsClusteringP2P as test sentences 3
IndicReviewsClusteringP2P gu test sentences 3
IndicReviewsClusteringP2P bd test sentences 2
IndicReviewsClusteringP2P bn test sentences 2
IndicReviewsClusteringP2P hi test sentences 2
IndicReviewsClusteringP2P kn test sentences 2
IndicReviewsClusteringP2P ml test sentences 2
IndicReviewsClusteringP2P mr test sentences 2
IndicReviewsClusteringP2P or test sentences 2
IndicReviewsClusteringP2P pa test sentences 2
IndicReviewsClusteringP2P ta test sentences 2
IndicReviewsClusteringP2P te test sentences 2
IndicReviewsClusteringP2P ur test sentences 2
BiorxivClusteringP2P.v2 default test sentences 3
VGHierarchicalClusteringS2S default test sentences 2
SNLHierarchicalClusteringS2S default test sentences 2
SIB200ClusteringS2S ary_Arab test sentences 2
SIB200ClusteringS2S dyu_Latn test sentences 2
SIB200ClusteringS2S lua_Latn test sentences 2
SIB200ClusteringS2S san_Deva test sentences 2
SIB200ClusteringS2S azb_Arab test sentences 1
SIB200ClusteringS2S bam_Latn test sentences 1
SIB200ClusteringS2S bod_Tibt test sentences 1
SIB200ClusteringS2S bug_Latn test sentences 1
SIB200ClusteringS2S grn_Latn test sentences 1
SIB200ClusteringS2S hat_Latn test sentences 1
SIB200ClusteringS2S kbp_Latn test sentences 1
SIB200ClusteringS2S sat_Olck test sentences 1
SIB200ClusteringS2S sun_Latn test sentences 1
SIB200ClusteringS2S tgk_Cyrl test sentences 1
SIB200ClusteringS2S urd_Arab test sentences 1
WikiCitiesClustering default test sentences 1
MewsC16JaClustering default test sentences 1

Empty documents

| Dataset Name | Split | Column Name | Num Documents |
|:---|:---|:---|:---|
| SpanishNewsClusteringP2P | test | sentences | 4 |

@AlexeyVatolin
Contributor

InstructionRetrieval

Duplicated documents

| Dataset Name | Split | Column Name | Num Documents |
|:---|:---|:---|:---|
| Core17InstructionRetrieval | test | corpus | 756 |
| Robust04InstructionRetrieval | test | corpus | 266 |
| News21InstructionRetrieval | test | corpus | 23 |

@AlexeyVatolin
Contributor

MultilabelClassification

Duplicated documents

| Dataset Name | HF Subset | Split | Column Name | Num Documents |
|:---|:---|:---|:---|:---|
| SensitiveTopicsClassification | default | train | text | 36 |
| CEDRClassification | default | train | text | 28 |
| CEDRClassification | default | test | text | 7 |
| BrazilianToxicTweetsClassification | default | train | text | 20 |
| BrazilianToxicTweetsClassification | default | test | text | 2 |
| MultiEURLEXMultilabelClassification | en | train | text | 14 |
| MultiEURLEXMultilabelClassification | pt | train | text | 14 |
| MultiEURLEXMultilabelClassification | nl | train | text | 13 |
| MultiEURLEXMultilabelClassification | el | train | text | 12 |
| MultiEURLEXMultilabelClassification | fi | train | text | 11 |
| MultiEURLEXMultilabelClassification | es | train | text | 10 |
| MultiEURLEXMultilabelClassification | fr | train | text | 9 |
| MultiEURLEXMultilabelClassification | de | train | text | 8 |
| MultiEURLEXMultilabelClassification | it | train | text | 8 |
| MultiEURLEXMultilabelClassification | sv | train | text | 8 |
| MultiEURLEXMultilabelClassification | da | train | text | 5 |
| MultiEURLEXMultilabelClassification | lt | train | text | 2 |
| MultiEURLEXMultilabelClassification | pl | train | text | 1 |
| MultiEURLEXMultilabelClassification | ro | train | text | 1 |
| MultiEURLEXMultilabelClassification | hu | train | text | 1 |
| MultiEURLEXMultilabelClassification | cs | train | text | 1 |
| MultiEURLEXMultilabelClassification | bg | train | text | 1 |
| MultiEURLEXMultilabelClassification | sk | train | text | 1 |
| MultiEURLEXMultilabelClassification | sl | train | text | 1 |
| MultiEURLEXMultilabelClassification | et | train | text | 1 |
| MultiEURLEXMultilabelClassification | lv | train | text | 1 |
| MultiEURLEXMultilabelClassification | mt | train | text | 1 |
| KorHateSpeechMLClassification | default | train | text | 8 |

Leaked documents

| Dataset Name | Split | Num Documents |
|:---|:---|:---|
| BrazilianToxicTweetsClassification | test | 11 |
| CEDRClassification | test | 5 |
| KorHateSpeechMLClassification | test | 2 |

@Samoed
Collaborator

Samoed commented Nov 11, 2024

@AlexeyVatolin Thank you! I've implemented your ideas in calculate_metadata_metrics. Could you take a look at PR #1438?

@AlexeyVatolin
Contributor

PairClassification

Duplicated documents

It's funny that the most duplicates are in the duplicate search dataset :)

Dataset Name HF Subset Split Column Name Num Documents
SprintDuplicateQuestions default validation sentence1 100000
SprintDuplicateQuestions default test sentence1 100000
SprintDuplicateQuestions default validation sentence2 93068
SprintDuplicateQuestions default test sentence2 93068
TwitterURLCorpus default test sentence1 47205
TwitterURLCorpus default test sentence2 10444
TwitterSemEval2015 default test sentence1 14704
TwitterSemEval2015 default test sentence2 1022
Cmnli default validation sentence1 4183
Cmnli default validation sentence2 10
SICK-E-PL default test sentence2 1628
SICK-E-PL default test sentence1 1574
LegalBenchPC default test sentence2 1575
LegalBenchPC default test sentence1 752
indonli default test_expert sentence1 1526
indonli default test_expert sentence2 5
FalseFriendsGermanEnglish default test sentence1 1035
FalseFriendsGermanEnglish default test sentence2 538
KLUE-NLI default validation sentence1 1000
PSC default test sentence2 672
PSC default test sentence1 571
XNLI ar validation sentence1 567
XNLI bg validation sentence1 567
XNLI de validation sentence1 567
XNLI el validation sentence1 567
XNLI en validation sentence1 567
XNLI es validation sentence1 567
XNLI fr validation sentence1 567
XNLI hi validation sentence1 567
XNLI ru validation sentence1 567
XNLI sw validation sentence1 567
XNLI th validation sentence1 567
XNLI tr validation sentence1 567
XNLI vi validation sentence1 567
XNLI zh validation sentence1 567
XNLI es test sentence1 271
XNLI sw test sentence1 271
XNLI ar test sentence1 270
XNLI bg test sentence1 270
XNLI de test sentence1 270
XNLI el test sentence1 270
XNLI en test sentence1 270
XNLI fr test sentence1 270
XNLI hi test sentence1 270
XNLI ru test sentence1 270
XNLI th test sentence1 270
XNLI tr test sentence1 270
XNLI vi test sentence1 270
XNLI zh test sentence1 270
FarsTail default test sentence1 499
FarsTail default test sentence2 3
ArEntail default test sentence1 493
ArEntail default test sentence2 2
CDSC-E default test sentence2 491
CDSC-E default test sentence1 490
Ocnli default validation sentence1 325
Ocnli default validation sentence2 1
Assin2RTE default test sentence1 272
Assin2RTE default test sentence2 272
XNLIV2 odiya test sentence1 271
XNLIV2 punjabi test sentence1 270
XNLIV2 gujrati test sentence1 270
XNLIV2 kannada test sentence1 270
XNLIV2 assamese test sentence1 270
XNLIV2 bengali test sentence1 270
XNLIV2 marathi test sentence1 270
XNLIV2 bhojpuri test sentence1 270
XNLIV2 sanskrit test sentence1 270
XNLIV2 tamil test sentence1 270
XNLIV2 turkish test sentence1 270
XNLIV2 greek test sentence1 270
XNLIV2 russian test sentence1 270
CTKFactsNLI default test sentence1 252
CTKFactsNLI default validation sentence1 190
CTKFactsNLI default test sentence2 124
CTKFactsNLI default validation sentence2 81
PawsXPairClassification en validation sentence1 242
PawsXPairClassification en test sentence1 239
PawsXPairClassification en validation sentence2 229
PawsXPairClassification en test sentence2 200
PawsXPairClassification zh validation sentence2 105
PawsXPairClassification zh validation sentence1 101
PawsXPairClassification zh test sentence1 91
PawsXPairClassification zh test sentence2 91
PawsXPairClassification de validation sentence1 86
PawsXPairClassification de validation sentence2 80
PawsXPairClassification fr validation sentence1 67
PawsXPairClassification de test sentence1 66
PawsXPairClassification de test sentence2 62
PawsXPairClassification es validation sentence1 62
PawsXPairClassification fr validation sentence2 61
PawsXPairClassification es validation sentence2 59
PawsXPairClassification ja test sentence2 59
PawsXPairClassification ja test sentence1 56
PawsXPairClassification fr test sentence2 47
PawsXPairClassification fr test sentence1 46
PawsXPairClassification ko test sentence1 46
PawsXPairClassification es test sentence1 45
PawsXPairClassification ja validation sentence1 43
PawsXPairClassification es test sentence2 41
PawsXPairClassification ko validation sentence1 37
PawsXPairClassification ko validation sentence2 32
PawsXPairClassification ja validation sentence2 31
PawsXPairClassification ko test sentence2 31
SICK-BR-PC default test sentence1 177
SICK-BR-PC default test sentence2 170
RTE3 en test sentence2 74
RTE3 de test sentence2 72
RTE3 it test sentence2 72
RTE3 fr test sentence2 54
RTE3 de test sentence1 27
RTE3 en test sentence1 25
RTE3 it test sentence1 24
RTE3 fr test sentence1 15
TERRa default dev sentence1 25
PpcPC default test sentence1 16
PpcPC default test sentence2 4
ArmenianParaphrasePC default test sentence1 5
ArmenianParaphrasePC default test sentence2 1

@KennethEnevoldsen
Contributor Author

Oh wow! It might be worth splitting these up into independent issues per task type.

It also seems like there is enough work here to warrant a broader effort on dataset quality (I could imagine quite a few additional elements: removing too-easy examples from training data, testing for data leakage between test and train).
