Add law datasets #311

ShuangLI59 · 2024-04-04T00:14:15Z

Checklist for adding MMTEB dataset

I have tested that the dataset runs with the mteb package.
I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
I have added points for my submission to the POINTS.md file.
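As a rough sketch of the "run the following models" step above, the snippet below only assembles the `mteb run -m {model_name} -t {task_name}` command lines quoted in the checklist, one per model. The helper name `build_commands` is hypothetical, and `LegalQuAD` is simply one of the task names from this PR; nothing here invokes mteb itself.

```python
import shlex

# Models named in the checklist above.
MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]


def build_commands(task_name, models=MODELS):
    """Return one `mteb run` command line per model for the given task.

    Hypothetical helper: it only formats the command string from the
    checklist and does not execute anything.
    """
    return [
        shlex.join(["mteb", "run", "-m", model, "-t", task_name])
        for model in models
    ]


if __name__ == "__main__":
    # "LegalQuAD" is one of the task names added in this PR.
    for command in build_commands("LegalQuAD"):
        print(command)
```

Each printed line can be pasted into a shell, and the resulting scores for both models can then be compared to check that they are neither near-perfect nor near-random.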

Muennighoff

Looks great, amazing job! Tagging @KennethEnevoldsen to make sure you are okay with having a law folder?
This will be a new leaderboard tab detailed here: https://huggingface.co/spaces/mteb/leaderboard/discussions/90

I think later on we will separate languages and domains and maybe allow people to select combinations of them via dropdown boxes or similar 🧐

@KennethEnevoldsen do you know why the tests are failing? It seems unrelated to the PR 🧐

KennethEnevoldsen · 2024-04-04T11:36:01Z

mteb/tasks/Retrieval/law/AILAStatutesRetrieval.py

+        date=None,
+        form=None,
+        domains=None,
+        task_subtypes=None,
+        license=None,
+        socioeconomic_status=None,
+        annotations_creators=None,
+        dialect=None,
+        text_creation=None,
+        bibtex_citation=None,
+        n_samples=None,
+        avg_character_length=None,


These need to be filled out (for all datasets). Previous datasets don't have them but future datasets should.

How do we fill these out? Will your team do this for us?

The guide describes how to fill these out. Generally, we expect it when new datasets are added to MTEB as the ones adding the dataset are more knowledgeable about the dataset.

Generally though here is a short guess:

date=,  # your best guess of when the data was generated (from, to)
form="written",  # assumed
domains=["Legal", "Non-fiction"],  # assumed
task_subtypes=[],
license=,  # needs to be specified
socioeconomic_status=["high"],  # assumed, but since it is law it is probably correct
annotations_creators=,  # required
dialect=[],  # assuming there are no dialects
text_creation="found",  # assumed
bibtex_citation=,  # if there is none just leave it as None, otherwise please specify it
n_samples=,  # check using python
avg_character_length=,  # check using python
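For the two fields marked "check using python", a minimal sketch might look like the following. It assumes the documents of one split are already available as a plain list of strings; the helper name `describe_texts` and the toy sample are hypothetical, and how the values are keyed in the task metadata (for example, per split) is left to the guide.

```python
from statistics import mean


def describe_texts(texts):
    """Return n_samples and avg_character_length for a list of strings.

    Hypothetical helper, not part of mteb.
    """
    if not texts:
        raise ValueError("texts must not be empty")
    return {
        "n_samples": len(texts),
        "avg_character_length": mean(len(text) for text in texts),
    }


if __name__ == "__main__":
    # Toy documents standing in for a real corpus split (assumption).
    sample = ["Art. 1 GG", "Section 2", "Clause"]
    print(describe_texts(sample))
```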

KennethEnevoldsen · 2024-04-04T11:38:22Z

@ShuangLI59 thanks for the submission, I believe this is a great addition to MTEB. We still need a few things before we can merge this, but it should be quick enough to add.

Please make sure to also run the models as specified on the checklist. This is important as we do not test all datasets, so to make sure it is integrated correctly please report those results as well.

KennethEnevoldsen · 2024-04-04T11:50:58Z

> Looks great, amazing job! Tagging @KennethEnevoldsen to make sure you are okay with having a law folder?
> This will be a new leaderboard tab detailed here: https://huggingface.co/spaces/mteb/leaderboard/discussions/90

I would move it to the "en" folder. Legal should be tagged as domain in the metadata.

> I think later on we will separate languages and domains and maybe allow people to select combinations of them via dropdown boxes or similar 🧐

We should also add into the interface of MTEB to filter based on domain as well (that should be doable with the current metadata).
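To illustrate why domain filtering is feasible once `domains` is filled in, here is a minimal sketch. `TaskInfo` is a hypothetical stand-in for a task's metadata, not the mteb class, and `SomeNewsTask` is an invented non-legal entry used only for contrast.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInfo:
    """Hypothetical stand-in for a task's metadata (not the mteb class)."""

    name: str
    language: str
    domains: list = field(default_factory=list)


def filter_by_domain(tasks, domain):
    """Keep the tasks whose metadata lists the given domain."""
    return [task for task in tasks if domain in task.domains]


if __name__ == "__main__":
    tasks = [
        TaskInfo("LegalQuAD", "de", ["Legal"]),
        TaskInfo("LeCaRDv2", "zh", ["Legal"]),
        TaskInfo("SomeNewsTask", "en", ["News"]),  # invented, for contrast
    ]
    # The legal tasks are selected regardless of which language they are in.
    print([task.name for task in filter_by_domain(tasks, "Legal")])
```

Because the selection relies only on metadata, the same tasks can live in per-language folders and still be grouped into one benchmark.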

Muennighoff · 2024-04-04T12:13:11Z

The thing is that it's a mix of English, German & Chinese datasets so they would be all separate then even though they form one benchmark

KennethEnevoldsen · 2024-04-04T12:49:40Z

> The thing is that it's a mix of English, German & Chinese datasets so they would be all separate then even though they form one benchmark

Ah didn't see that aspect. I think grouping them in languages is what I would do for now though. Potentially we could do:

tasks/{type}/{lang}/{major domain}/my_task.py

But since many datasets can have multiple domains I would not do that. Instead I would specify it (as you do) in a task list.

PS: You might consider adding the "Norwegian Courts" dataset (also legal, but not sure it is relevant). PR #315 updated its metadata

KennethEnevoldsen · 2024-04-04T12:56:56Z

> @KennethEnevoldsen do you know why the tests are failing? It seems unrelated to the PR 🧐

I think it is a brief network error (sadly this happens when your tests rely on external resources). Rerunning the test should solve the issue; otherwise I will have a look at it.

Muennighoff · 2024-04-06T08:53:42Z

@ShuangLI59 can you still move the datasets to the language folders? Then I think we can merge!

ShuangLI59 · 2024-04-06T18:21:36Z

@Muennighoff Yes the datasets have been moved to the language folders

ShuangLI59 · 2024-04-07T05:14:13Z

Hi @Muennighoff , I noticed that "GerDaLIR" should be changed to "GerDaLIRSmall" in scripts/run_mteb_law.py
Could you help update this?

TASK_LIST_RETRIEVAL_LAW = [
    "LegalSummarization",
    "LegalBenchConsumerContractsQA",
    "LegalBenchCorporateLobbying",
    "AILACasedocs",
    "AILAStatutes",
    "LeCaRDv2",
    "LegalQuAD",
    "GerDaLIRSmall",
]
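A stale name like the one reported above can be caught with a small check that compares a task list against the names that are actually registered. The sketch below is an assumption-laden illustration: `find_unknown_tasks` is a hypothetical helper, and the `registered` set is written out by hand rather than read from mteb.

```python
def find_unknown_tasks(task_list, registered_names):
    """Return the entries of task_list that are not registered task names."""
    registered = set(registered_names)
    return [name for name in task_list if name not in registered]


if __name__ == "__main__":
    # Task list as it stood before the fix, still using the old name.
    old_task_list = ["LegalQuAD", "LeCaRDv2", "GerDaLIR"]
    # Hand-written set of names after the rename (assumption, for illustration).
    registered = {"LegalQuAD", "LeCaRDv2", "GerDaLIRSmall"}
    print(find_unknown_tasks(old_task_list, registered))
```

An empty result means every name in the list resolves to a registered task.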

* add command
* add datasets
* reformat dataset
* Rephrase description
* Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py
* Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py
* Update mteb/__init__.py
* Update scripts/run_mteb_law.py
* Update scripts/run_mteb_law.py
* Update mteb/__init__.py
* Update mteb/tasks/Retrieval/__init__.py
* Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py
* Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py
* Update mteb/tasks/Retrieval/law/LegalQuADRetrieval.py
* Update mteb/tasks/Retrieval/law/LegalQuADRetrieval.py
* Update scripts/run_mteb_law.py
* Update mteb/tasks/Retrieval/law/LegalSummarizationRetrieval.py
* Update mteb/tasks/Retrieval/law/LegalSummarizationRetrieval.py
* Update mteb/tasks/Retrieval/law/LeCaRDv2Retrieval.py
* Update mteb/tasks/Retrieval/law/LeCaRDv2Retrieval.py
* Rename GerDaLIRRetrieval.py to GerDaLIRSmallRetrieval.py
* Update mteb/tasks/Retrieval/__init__.py
* Update GerDaLIRSmallRetrieval.py Add metadata
* Update GerDaLIRSmallRetrieval.py Update metadata
* Update AILACasedocsRetrieval.py Update AILACasedocsRetrieval metadata
* Update AILAStatutesRetrieval.py Update AILAStatutesRetrieval metadata
* Update LeCaRDv2Retrieval.py Update LeCaRDv2Retrieval metadata
* Update LegalBenchConsumerContractsQARetrieval.py Update LegalBenchConsumerContractsQARetrieval metadata
* Update LegalBenchCorporateLobbyingRetrieval.py Update LegalBenchCorporateLobbyingRetrieval metadata
* Update LegalQuADRetrieval.py Update LegalQuADRetrieval metadata
* Update LegalSummarizationRetrieval.py Update LegalSummarizationRetrieval metadata
* Update AILACasedocsRetrieval.py Update AILACasedocsRetrieval
* Update AILACasedocsRetrieval.py Update AILACasedocsRetrieval metadata
* Update AILAStatutesRetrieval.py Update AILAStatutesRetrieval metadata
* Update GerDaLIRSmallRetrieval.py Update GerDaLIRSmallRetrieval metadata
* Update LeCaRDv2Retrieval.py Update LeCaRDv2Retrieval metadata
* Update LegalBenchConsumerContractsQARetrieval.py
* Update LegalBenchCorporateLobbyingRetrieval.py
* Update LegalQuADRetrieval.py
* Update LegalSummarizationRetrieval.py
* Update AILACasedocsRetrieval.py
* Update AILAStatutesRetrieval.py
* Update GerDaLIRSmallRetrieval.py
* Update LeCaRDv2Retrieval.py
* move dataset language folder
* update order

---------

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

ShuangLI59 added 3 commits April 3, 2024 23:02

add command

50693ee

add datasets

7a0cfa1

reformat dataset

691c8f1

ShuangLI59 changed the title ~~Law~~ Add law datasets Apr 4, 2024

ShuangLI59 requested a review from Muennighoff April 4, 2024 00:59

Muennighoff reviewed Apr 4, 2024

KennethEnevoldsen reviewed Apr 4, 2024

Muennighoff reviewed Apr 5, 2024

mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py (outdated)

Muennighoff added 6 commits April 5, 2024 09:08

Rephrase description

73df58e

Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py

e678e05

Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py

50ad155

Update mteb/__init__.py

86f2c14

Update scripts/run_mteb_law.py

1057a5c

Update scripts/run_mteb_law.py

0f34694

Muennighoff reviewed Apr 5, 2024

mteb/__init__.py

Muennighoff added 3 commits April 5, 2024 20:09

Update mteb/__init__.py

0a4b1c4

Update mteb/tasks/Retrieval/__init__.py

e248c3c

Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py

1ab2fd2

Muennighoff reviewed Apr 5, 2024

mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py (outdated)

Muennighoff added 2 commits April 5, 2024 20:10

Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py

8a32bc4

Update mteb/tasks/Retrieval/law/LegalQuADRetrieval.py

7d23295

Muennighoff reviewed Apr 5, 2024

mteb/tasks/Retrieval/law/LegalQuADRetrieval.py (outdated)

Muennighoff added 2 commits April 5, 2024 20:10

Update mteb/tasks/Retrieval/law/LegalQuADRetrieval.py

21040e8

Update scripts/run_mteb_law.py

ca482dd

ShuangLI59 added 18 commits April 5, 2024 12:43

Update LeCaRDv2Retrieval.py

64cd1e5

Update LeCaRDv2Retrieval metadata

Update LegalBenchConsumerContractsQARetrieval.py

4ed2450

Update LegalBenchConsumerContractsQARetrieval metadata

Update LegalBenchCorporateLobbyingRetrieval.py

07424d2

Update LegalBenchCorporateLobbyingRetrieval metadata

Update LegalQuADRetrieval.py

a2f492e

Update LegalQuADRetrieval metadata

Update LegalSummarizationRetrieval.py

e961324

Update LegalSummarizationRetrieval metadata

Update AILACasedocsRetrieval.py

0ba28eb

Update AILACasedocsRetrieval

Update AILACasedocsRetrieval.py

12d9353

Update AILACasedocsRetrieval metadata

Update AILAStatutesRetrieval.py

c67fffa

Update AILAStatutesRetrieval metadata

Update GerDaLIRSmallRetrieval.py

230b597

Update GerDaLIRSmallRetrieval metadata

Update LeCaRDv2Retrieval.py

ef9b785

Update LeCaRDv2Retrieval metadata

Update LegalBenchConsumerContractsQARetrieval.py

edda5cf

Update LegalBenchCorporateLobbyingRetrieval.py

f080a18

Update LegalQuADRetrieval.py

0c0ea85

Update LegalSummarizationRetrieval.py

c4bef47

Update AILACasedocsRetrieval.py

37fbf83

Update AILAStatutesRetrieval.py

0cabe7f

Update GerDaLIRSmallRetrieval.py

fe6c4ee

Update LeCaRDv2Retrieval.py

28ce3c3

ShuangLI59 added 2 commits April 6, 2024 17:43

move dataset language folder

976bffe

update order

eff2e80

Muennighoff approved these changes Apr 6, 2024

Muennighoff merged commit 6e3f419 into main Apr 6, 2024
5 checks passed

Muennighoff mentioned this pull request Apr 8, 2024

Integrating FollowIR #321

Closed

Muennighoff mentioned this pull request May 20, 2024

Integrate with MTEB? kaistAI/InstructIR#3

Open

Muennighoff mentioned this pull request May 31, 2024

Integrate with MTEB? gowitheflow-1998/RAR-b#4

Closed

Muennighoff mentioned this pull request Jul 10, 2024

Integrate with MTEB? CoIR-team/coir#4

Closed
