Add Korean Text Search Tasks to MTEB #210

taeminlee · 2024-01-19T11:11:24Z

Hello MTEB maintainers,

I am currently working on a project that involves implementing an embedding model for Korean text search. Through this work, I realized the need for a benchmark, and that is how I came across MTEB. However, I noticed that MTEB does not currently include tasks specifically for the Korean language. To address this gap, I have added tasks for Korean text search.

Limitations

At the moment, the implementation does not cover as wide a range of tasks as PL-MTEB or C-MTEB.

To-Do

I plan to add various tasks using Korean corpora in the near future. For example, I'm considering the addition of tasks like klue-sts.

I believe this enhancement will significantly benefit researchers and developers working with Korean language text search and analysis. I am looking forward to your feedback and suggestions on this addition.

Thank you for considering my contribution.

Muennighoff

Amazing! Adding Korean would be huge :)

From my side, we can pretty much already merge this and then you can add more whenever you want. But we can also leave the PR open if you prefer!

Muennighoff · 2024-01-19T11:15:13Z

mteb/abstasks/AbsTaskRetrieval.py

@@ -130,4 +139,6 @@ def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int, **kwargs)
                (doc["title"] + self.sep + doc["text"]).strip() if "title" in doc else doc["text"].strip()
                for doc in corpus
            ]
+        if prefix != '':
+            sentences = [prefix + sentence for sentence in sentences]


I think we can remove this. This should be done in the encode method of the model. If the model has an encode_corpus function, then it will automatically use that.

I agree with your suggestion to remove this section. Thanks for pointing this out!
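For readers following the review thread, here is a minimal sketch of what moving the prefix into the model could look like. It is not code from this PR: `ToyEncoder`, `PrefixedRetrievalModel`, and the prefix strings are hypothetical names chosen for illustration, and the sketch only assumes what the review comment states, namely that a model exposing `encode_corpus` will have that method used for documents. The title/text joining mirrors the expression in the diff above.

```python
from typing import Dict, List


class ToyEncoder:
    """Hypothetical stand-in for a real embedding model (one vector per sentence)."""

    def encode(self, sentences: List[str], batch_size: int = 32, **kwargs) -> List[List[float]]:
        # Remember the inputs so the effect of the wrapper can be inspected.
        self.last_inputs = list(sentences)
        return [[float(len(s))] for s in sentences]


class PrefixedRetrievalModel:
    """Hypothetical wrapper: prefixes are added by the model, not by the task."""

    def __init__(self, model, query_prefix: str = "", corpus_prefix: str = "", sep: str = " "):
        self.model = model
        self.query_prefix = query_prefix
        self.corpus_prefix = corpus_prefix
        self.sep = sep

    def encode_queries(self, queries: List[str], batch_size: int = 32, **kwargs):
        # Prepend the query prefix before delegating to the underlying encoder.
        prefixed = [self.query_prefix + q for q in queries]
        return self.model.encode(prefixed, batch_size=batch_size, **kwargs)

    def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int = 32, **kwargs):
        # Join title and text the same way as the diff above, then add the prefix.
        sentences = [
            (doc["title"] + self.sep + doc["text"]).strip() if "title" in doc else doc["text"].strip()
            for doc in corpus
        ]
        prefixed = [self.corpus_prefix + s for s in sentences]
        return self.model.encode(prefixed, batch_size=batch_size, **kwargs)


if __name__ == "__main__":
    encoder = ToyEncoder()
    # The prefix strings are assumptions; use whatever the real model expects.
    model = PrefixedRetrievalModel(encoder, query_prefix="query: ", corpus_prefix="passage: ")
    model.encode_corpus([{"title": "Seoul", "text": "Capital of South Korea."}])
    print(encoder.last_inputs)
```

Keeping the prefix in the wrapper means the task code stays model-agnostic, and models that need no prefix pay no cost.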

taeminlee · 2024-02-06T03:59:43Z

Thank you for your feedback. I concur with proceeding with the current merge. :)

Pang-dachu · 2024-02-16T05:42:03Z

@taeminlee
How should I run the MTEB evaluation on Korean data?

Currently, when I run the [run_mteb_korean.py] file, the following error occurs for the evaluation tasks:

INFO:main:Running task: Ko-miracl
WARNING:mteb.evaluation.MTEB:WARNING: Unknown tasks: Ko-miracl.   

Known tasks: 8TagsClustering,AFQMC,ATEC,AllegroReviews,AmazonCounterfactualClassification,AmazonPolarityClassification,AmazonReviewsClassification,AngryTweetsClassification,ArguAna,ArguAna-PL,ArxivClusteringP2P,ArxivClusteringS2S,AskUbuntuDupQuestions,BIOSSES,BQ,BUCC,Banking77Classification,BiorxivClusteringP2P,BiorxivClusteringS2S,BlurbsClusteringP2P,BlurbsClusteringS2S,BornholmBitextMining,CBD,CDSC-E,CDSC-R,CLSClusteringP2P,CLSClusteringS2S,CMedQAv1,CMedQAv2,CQADupstackAndroidRetrieval,CQADupstackEnglishRetrieval,CQADupstackGamingRetrieval,CQADupstackGisRetrieval,CQADupstackMathematicaRetrieval,CQADupstackPhysicsRetrieval,CQADupstackProgrammersRetrieval,CQADupstackStatsRetrieval,CQADupstackTexRetrieval,CQADupstackUnixRetrieval,CQADupstackWebmastersRetrieval,CQADupstackWordpressRetrieval,ClimateFEVER,CmedqaRetrieval,Cmnli,CovidRetrieval,DBPedia,DBPedia-PL,DKHateClassification,DalajClassification,DanishPoliticalCommentsClassification,DuRetrieval,EcomRetrieval,EmotionClassification,FEVER,FiQA-PL,FiQA2018,HotpotQA,HotpotQA-PL,IFlyTek,ImdbClassification,JDReview,LCQMC,LccSentimentClassification,MMarcoReranking,MMarcoRetrieval,MSMARCO,MSMARCO-PL,MSMARCOv2,MTOPDomainClassification,MTOPIntentClassification,MassiveIntentClassification,MassiveScenarioClassification,MedicalRetrieval,MedrxivClusteringP2P,MedrxivClusteringS2S,MindSmallReranking,MultilingualSentiment,NFCorpus,NFCorpus-PL,NQ,NQ-PL,NoRecClassification,NordicLangClassification,NorwegianParliament,Ocnli,OnlineShopping,PAC,PAWSX,PPC,PSC,PolEmo2.0-IN,PolEmo2.0-OUT,QBQTC,Quora-PL,QuoraRetrieval,RedditClustering,RedditClusteringP2P,SCIDOCS,SCIDOCS-PL,SICK-E-PL,SICK-R,SICK-R-PL,STS12,STS13,STS14,STS15,STS16,STS17,STS22,STSB,STSBenchmark,ScalaDaClassification,ScalaNbClassification,ScalaSvClassification,SciDocsRR,SciFact,SciFact-PL,SprintDuplicateQuestions,StackExchangeClustering,StackExchangeClusteringP2P,StackOverflowDupQuestions,SweRecClassification,T2Reranking,T2Retrieval,TNews,TRECCOVID,TRECCOVID-PL,Tatoeba,TenKGnadClusteringP2P,TenKGnadClusteringS2S,ThuNewsClusteringP2P,ThuNewsClusteringS2S,Touche2020,ToxicConversationsClassification,TweetSentimentExtractionClassification,TwentyNewsgroupsClustering,TwitterSemEval2015,TwitterURLCorpus,VideoRetrieval,Waimai.

Muennighoff · 2024-02-16T07:59:05Z

Maybe it was because you weren't installing from source? I released a new MTEB version, so if you upgrade to 1.1.2 it should be in there.
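A quick way to confirm which release is installed before re-running the script could look like the sketch below. It only uses the standard library's `importlib.metadata`; the helper names `parse_version`, `installed_version`, and `meets_minimum` are hypothetical, and the version parsing is deliberately simplistic (it ignores pre-release suffixes). The 1.1.2 threshold is taken from the comment above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.1.2' into (1, 1, 2); stop at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def meets_minimum(installed: Optional[str], required: str = "1.1.2") -> bool:
    """True if the installed version is at least the required one."""
    if installed is None:
        return False
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    found = installed_version("mteb")
    if meets_minimum(found):
        print(f"mteb {found} is installed; the Korean tasks should be available.")
    else:
        print(f"mteb {found or '(not installed)'} is older than 1.1.2; upgrade or install from source.")
```

If the check fails, upgrading the package (or installing from the repository's main branch) is the fix suggested in the comment above.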

add Ko-miracl, Ko-StrategyQA, Ko-mrtydi tasks

434846b

Muennighoff reviewed Jan 19, 2024

taeminlee and others added 3 commits February 6, 2024 12:47

Update mteb/abstasks/AbsTaskRetrieval.py

53a9895

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

Update AbsTaskRetrieval.py

81e53be

Merge branch 'main' into main

f9978fd

Muennighoff approved these changes Feb 6, 2024

taeminlee and others added 3 commits February 6, 2024 18:25

Update mteb/abstasks/AbsTaskRetrieval.py

76e7d83

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

Update scripts/run_mteb_korean.py

be04d5f

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

Merge branch 'main' into main

f4abc8c

Muennighoff merged commit dadf2da into embeddings-benchmark:main Feb 6, 2024
3 checks passed

Muennighoff mentioned this pull request Apr 4, 2024

Adding French team contribution points #302

Merged

KennethEnevoldsen added a commit that referenced this pull request Apr 11, 2024

docs: Added points for #210 (korean)

e4c0352

This includes 3 datasets (6 points) across 1 new task (+2 bonus) for Korean. Also added 1 point for reviewers.
