Add law datasets #311

Merged
merged 46 commits into from
Apr 6, 2024

@ShuangLI59 ShuangLI59 commented Apr 4, 2024

Checklist for adding MMTEB dataset

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the POINTS.md file.
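
The checklist asks for both models to be run on every new task with the `mteb run -m {model_name} -t {task_name}` command. A minimal sketch of how those commands could be generated is shown here; it only builds the command strings and does not execute them. The helper name `build_commands` is hypothetical, and the two task names are placeholders taken from the task list posted later in this thread.

```python
import shlex

# Models named in the checklist above.
MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]


def build_commands(models, tasks):
    """Return one `mteb run` command string per (model, task) pair."""
    commands = []
    for model in models:
        for task in tasks:
            # shlex.join quotes each argument safely for a POSIX shell.
            commands.append(shlex.join(["mteb", "run", "-m", model, "-t", task]))
    return commands


if __name__ == "__main__":
    # Placeholder task names; substitute the tasks you are adding.
    for command in build_commands(MODELS, ["LegalQuAD", "LeCaRDv2"]):
        print(command)
```

Printing the commands first makes it easy to review them before pasting them into a shell or a job script.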

@ShuangLI59 ShuangLI59 changed the title from "Law" to "Add law datasets" Apr 4, 2024
@ShuangLI59 ShuangLI59 requested a review from Muennighoff April 4, 2024 00:59
Contributor

@Muennighoff Muennighoff left a comment

Looks great, amazing job! Tagging @KennethEnevoldsen to make sure you are okay with having a law folder?
This will be a new leaderboard tab detailed here: https://huggingface.co/spaces/mteb/leaderboard/discussions/90

I think later on we will separate languages and domains and maybe allow people to select combinations of them via dropdown boxes or similar 🧐

@KennethEnevoldsen do you know why the tests are failing? It seems unrelated to the PR 🧐

Comment on lines 22 to 33
date=None,
form=None,
domains=None,
task_subtypes=None,
license=None,
socioeconomic_status=None,
annotations_creators=None,
dialect=None,
text_creation=None,
bibtex_citation=None,
n_samples=None,
avg_character_length=None,
Contributor

These need to be filled out (for all datasets). Previous datasets don't have them, but future datasets should.

Contributor Author

How should these be filled out? Will your team do this for us?

Contributor

The guide describes how to fill these out. Generally, we expect it when new datasets are added to MTEB as the ones adding the dataset are more knowledgeable about the dataset.

Contributor

@KennethEnevoldsen KennethEnevoldsen Apr 5, 2024

Generally, though, here is a rough guess:

        date=None,  # TODO: your best guess at when the data was generated, as (from, to)
        form="written",  # assumed
        domains=["Legal", "Non-fiction"],  # assumed
        task_subtypes=[],
        license=None,  # TODO: needs to be specified
        socioeconomic_status=["high"],  # assumed, but since it is law it is probably correct
        annotations_creators=None,  # TODO: required
        dialect=[],  # assuming there are no dialects
        text_creation="found",  # assumed
        bibtex_citation=None,  # specify it if there is one, otherwise leave it as None
        n_samples=None,  # TODO: check using Python
        avg_character_length=None,  # TODO: check using Python
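
For the two fields marked "check using python", a minimal sketch of one way to compute them is given here. It assumes the texts of a split are already available as a plain list of strings. The helper name `describe_texts` is hypothetical, and which texts should be counted (queries, documents, or both) is an assumption to confirm against the guide.

```python
from statistics import mean


def describe_texts(texts):
    """Return the number of texts and their mean length in characters."""
    texts = list(texts)
    if not texts:
        raise ValueError("expected at least one text")
    return {
        "n_samples": len(texts),
        "avg_character_length": mean(len(text) for text in texts),
    }


if __name__ == "__main__":
    # Toy stand-in for the real texts of a task split.
    print(describe_texts(["ab", "abcd"]))
```

`len` counts Unicode code points, so German and Chinese texts are measured in characters rather than bytes.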

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Apr 4, 2024

@ShuangLI59 thanks for the submission, I believe this is a great addition to MTEB. We still need a few things before we can merge this in, but they should be quick to add.

Please make sure to also run the models as specified on the checklist and add the results. This is important because we do not test every dataset ourselves, so it is how we confirm the datasets are integrated correctly.

@KennethEnevoldsen
Contributor

> Looks great, amazing job! Tagging @KennethEnevoldsen to make sure you are okay with having a law folder?
> This will be a new leaderboard tab detailed here: https://huggingface.co/spaces/mteb/leaderboard/discussions/90

I would move it to the "en" folder. Legal should be tagged as a domain in the metadata.

> I think later on we will separate languages and domains and maybe allow people to select combinations of them via dropdown boxes or similar 🧐

We should also add a way to filter by domain to the MTEB interface (that should be doable with the current metadata).

@Muennighoff
Contributor

The thing is that it's a mix of English, German & Chinese datasets, so they would all be separate even though they form one benchmark.

@KennethEnevoldsen
Contributor

> The thing is that it's a mix of English, German & Chinese datasets, so they would all be separate even though they form one benchmark.

Ah, I didn't see that aspect. I think grouping them by language is what I would do for now, though. Potentially we could do:

tasks/{type}/{lang}/{major domain}/my_task.py

But since many datasets can have multiple domains I would not do that. Instead I would specify it (as you do) in a task list.
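
The idea of keeping tasks in language folders and selecting a benchmark through metadata can be sketched as follows. This is only an illustration: `TaskInfo` and `tasks_in_domain` are hypothetical names, not mteb classes or functions, the language folder given for each task is an assumption, and `SomeNewsTask` is a made-up entry included to show the filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInfo:
    """Hypothetical stand-in for a task's metadata; not the mteb class."""

    name: str
    language_folder: str  # assumed folder; where the task file would live
    domains: tuple = ()   # a task may carry several domains


def tasks_in_domain(tasks, domain):
    """Names of the tasks whose metadata lists `domain`, in input order."""
    return [task.name for task in tasks if domain in task.domains]


TASKS = [
    TaskInfo("LegalQuAD", "de", ("Legal", "Non-fiction")),
    TaskInfo("LeCaRDv2", "zh", ("Legal",)),
    TaskInfo("AILAStatutes", "en", ("Legal",)),
    TaskInfo("SomeNewsTask", "en", ("News",)),  # made-up non-legal task
]

if __name__ == "__main__":
    # The benchmark is defined by metadata, not by the folder layout.
    print(tasks_in_domain(TASKS, "Legal"))
```

Because the domain lives in the metadata, a task with several domains needs no duplicate folder entry, and a multilingual benchmark can still be assembled as one list.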

PS: You might consider adding the "Norwegian Courts" dataset (also legal, but not sure it is relevant). PR #315 updated its metadata

@KennethEnevoldsen
Contributor

> @KennethEnevoldsen do you know why the tests are failing? It seems unrelated to the PR 🧐

I think it is a brief network error (sadly, this happens when tests rely on external resources). Rerunning the tests should solve the issue; otherwise I will have a look at it.

@Muennighoff
Contributor

@ShuangLI59 can you still move the datasets to the language folders? Then I think we can merge!

@ShuangLI59
Contributor Author

@Muennighoff Yes, the datasets have been moved to the language folders.

@Muennighoff Muennighoff merged commit 6e3f419 into main Apr 6, 2024
5 checks passed
@ShuangLI59
Contributor Author

Hi @Muennighoff, I noticed that "GerDaLIR" should be changed to "GerDaLIRSmall" in scripts/run_mteb_law.py.
Could you help update this?

TASK_LIST_RETRIEVAL_LAW = [
    "LegalSummarization",
    "LegalBenchConsumerContractsQA",
    "LegalBenchCorporateLobbying",
    "AILACasedocs",
    "AILAStatutes",
    "LeCaRDv2",
    "LegalQuAD",
    "GerDaLIRSmall",
]
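
A stale name like this can be caught before a run by comparing the script's task list against the known task names. The sketch below is one possible check, not part of mteb: `suggest_fixes` is a hypothetical helper, and `REGISTERED` is simply the corrected list from this comment standing in for whatever registry of task names is available.

```python
from difflib import get_close_matches

# Stand-in for the set of task names that actually exist.
REGISTERED = [
    "LegalSummarization",
    "LegalBenchConsumerContractsQA",
    "LegalBenchCorporateLobbying",
    "AILACasedocs",
    "AILAStatutes",
    "LeCaRDv2",
    "LegalQuAD",
    "GerDaLIRSmall",
]


def suggest_fixes(task_list, registered_names):
    """Map each unregistered name in task_list to its closest registered name."""
    registered = list(registered_names)
    known = set(registered)
    return {
        name: get_close_matches(name, registered, n=1)
        for name in task_list
        if name not in known
    }


if __name__ == "__main__":
    # "GerDaLIR" is the outdated name this comment asks to replace.
    print(suggest_fixes(["LegalQuAD", "GerDaLIR"], REGISTERED))
```

An empty result means every name in the list is registered; an empty suggestion list for a name means nothing similar was found.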

@Muennighoff Muennighoff mentioned this pull request Apr 8, 2024
MartinBernstorff pushed a commit that referenced this pull request Apr 10, 2024
* add command

* add datasets

* reformat dataset

* Rephrase description

* Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py

* Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py

* Update mteb/__init__.py

* Update scripts/run_mteb_law.py

* Update scripts/run_mteb_law.py

* Update mteb/__init__.py

* Update mteb/tasks/Retrieval/__init__.py

* Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py

* Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py

* Update mteb/tasks/Retrieval/law/LegalQuADRetrieval.py

* Update mteb/tasks/Retrieval/law/LegalQuADRetrieval.py

* Update scripts/run_mteb_law.py

* Update mteb/tasks/Retrieval/law/LegalSummarizationRetrieval.py

* Update mteb/tasks/Retrieval/law/LegalSummarizationRetrieval.py

* Update mteb/tasks/Retrieval/law/LeCaRDv2Retrieval.py

* Update mteb/tasks/Retrieval/law/LeCaRDv2Retrieval.py

* Rename GerDaLIRRetrieval.py to GerDaLIRSmallRetrieval.py

* Update mteb/tasks/Retrieval/__init__.py

* Update GerDaLIRSmallRetrieval.py

Add metadata

* Update GerDaLIRSmallRetrieval.py

Update metadata

* Update AILACasedocsRetrieval.py

Update AILACasedocsRetrieval metadata

* Update AILAStatutesRetrieval.py

Update AILAStatutesRetrieval metadata

* Update LeCaRDv2Retrieval.py

Update LeCaRDv2Retrieval metadata

* Update LegalBenchConsumerContractsQARetrieval.py

Update LegalBenchConsumerContractsQARetrieval metadata

* Update LegalBenchCorporateLobbyingRetrieval.py

Update LegalBenchCorporateLobbyingRetrieval metadata

* Update LegalQuADRetrieval.py

Update LegalQuADRetrieval metadata

* Update LegalSummarizationRetrieval.py

Update LegalSummarizationRetrieval metadata

* Update AILACasedocsRetrieval.py

Update AILACasedocsRetrieval

* Update AILACasedocsRetrieval.py

Update AILACasedocsRetrieval metadata

* Update AILAStatutesRetrieval.py

Update AILAStatutesRetrieval metadata

* Update GerDaLIRSmallRetrieval.py

Update GerDaLIRSmallRetrieval metadata

* Update LeCaRDv2Retrieval.py

Update LeCaRDv2Retrieval metadata

* Update LegalBenchConsumerContractsQARetrieval.py

* Update LegalBenchCorporateLobbyingRetrieval.py

* Update LegalQuADRetrieval.py

* Update LegalSummarizationRetrieval.py

* Update AILACasedocsRetrieval.py

* Update AILAStatutesRetrieval.py

* Update GerDaLIRSmallRetrieval.py

* Update LeCaRDv2Retrieval.py

* move dataset language folder

* update order

---------

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>