Add law datasets #311
Looks great, amazing job! Tagging @KennethEnevoldsen to make sure you are okay with having a law folder.
This will be a new leaderboard tab detailed here: https://huggingface.co/spaces/mteb/leaderboard/discussions/90
I think later on we will separate languages and domains and maybe allow people to select combinations of them via dropdown boxes or similar 🧐
@KennethEnevoldsen do you know why the tests are failing? It seems unrelated to the PR 🧐
date=None,
form=None,
domains=None,
task_subtypes=None,
license=None,
socioeconomic_status=None,
annotations_creators=None,
dialect=None,
text_creation=None,
bibtex_citation=None,
n_samples=None,
avg_character_length=None,
These need to be filled out (for all datasets). Previous datasets don't have them, but future datasets should.
How should these be filled out? Will your team do this for us?
The guide describes how to fill these out. Generally, we expect it when new datasets are added to MTEB as the ones adding the dataset are more knowledgeable about the dataset.
Generally though here is a short guess:
date=, # your best guess of when the data was generated (from, to)
form="written", # assumed
domains=["Legal", "Non-fiction"], # assumed
task_subtypes=[],
license=, # needs to be specified
socioeconomic_status=["high"], # assumed but since it is law it is probably correct
annotations_creators=, # required
dialect=[], # assuming there are no dialects
text_creation="found", # assumed
bibtex_citation=, # if there is none just leave it as None otherwise please specify it
n_samples=, # check using python
avg_character_length=, # check using python
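For the two fields marked "check using python", a minimal sketch of the computation is shown below. It assumes the dataset's texts are already available as a plain list of strings; `describe_texts` and `example_texts` are hypothetical names used only for illustration and are not part of mteb. How the values should be reported per split, or for queries versus corpus, is a matter for the guide, not this sketch.

```python
from statistics import mean


def describe_texts(texts):
    """Return (n_samples, avg_character_length) for a list of strings."""
    n_samples = len(texts)
    # Guard against an empty list, for which mean() would raise an error.
    avg_character_length = mean(len(t) for t in texts) if texts else 0.0
    return n_samples, avg_character_length


# Hypothetical stand-in texts; replace with the texts of the actual dataset.
example_texts = ["Section 1 applies.", "The court held otherwise."]
n_samples, avg_character_length = describe_texts(example_texts)
print(f"n_samples={n_samples}, avg_character_length={avg_character_length}")
```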
@ShuangLI59 thanks for the submission. I believe this is a great addition to MTEB. We still need a few things before we can merge this, but they should be quick to add. Please also make sure to run the models specified on the checklist. This is important because we do not test runs on all datasets, so please report those as well to confirm that the datasets are integrated correctly.
I would move it to the "en" folder. Legal should be tagged as a domain in the metadata.
We should also add a way to filter by domain to the MTEB interface (that should be doable with the current metadata).
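As a rough illustration of filtering by domain using only the metadata, the sketch below works on plain dictionaries. It is not the MTEB interface: `TASKS` and `filter_by_domain` are hypothetical, the `domains` values are the assumed ones from the suggestion above, and `OlderTaskWithoutMetadata` is a made-up entry standing in for earlier datasets whose `domains` field is still `None`.

```python
# Hypothetical metadata records; only the `domains` field name comes from the PR.
TASKS = [
    {"name": "LegalQuADRetrieval", "domains": ["Legal", "Non-fiction"]},
    {"name": "AILAStatutesRetrieval", "domains": ["Legal", "Non-fiction"]},
    {"name": "OlderTaskWithoutMetadata", "domains": None},
]


def filter_by_domain(tasks, domain):
    """Return the names of tasks whose metadata lists the given domain."""
    # `or []` treats a missing or None `domains` field as "no domains".
    return [t["name"] for t in tasks if domain in (t.get("domains") or [])]


print(filter_by_domain(TASKS, "Legal"))
```

Because a task can list several domains, a membership test like this avoids having to pick one "major domain" per task.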
The thing is that it's a mix of English, German & Chinese datasets, so they would all be separate then even though they form one benchmark.
Ah, I didn't see that aspect. I think grouping them by language is what I would do for now, though. Potentially we could do tasks/{type}/{lang}/{major domain}/my_task.py, but since many datasets can have multiple domains I would not do that. Instead I would specify it (as you do) in a task list. PS: You might consider adding the "Norwegian Courts" dataset (also legal, but not sure it is relevant). PR #315 updated its metadata.
I think it is a brief network error (sadly, this happens when tests rely on external resources). Rerunning the test should solve the issue; otherwise I will have a look at it.
Update LeCaRDv2Retrieval metadata
Update LegalBenchConsumerContractsQARetrieval metadata
Update LegalBenchCorporateLobbyingRetrieval metadata
Update LegalQuADRetrieval metadata
Update LegalSummarizationRetrieval metadata
Update AILACasedocsRetrieval
Update AILACasedocsRetrieval metadata
Update AILAStatutesRetrieval metadata
Update GerDaLIRSmallRetrieval metadata
Update LeCaRDv2Retrieval metadata
@ShuangLI59 can you still move the datasets to the language folders? Then I think we can merge!
@Muennighoff Yes, the datasets have been moved to the language folders.
Hi @Muennighoff, I noticed that "GerDaLIR" should be changed to "GerDaLIRSmall" in scripts/run_mteb_law.py: TASK_LIST_RETRIEVAL_LAW = [
* add command * add datasets * reformat dataset * Rephrase description * Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py * Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py * Update mteb/__init__.py * Update scripts/run_mteb_law.py * Update scripts/run_mteb_law.py * Update mteb/__init__.py * Update mteb/tasks/Retrieval/__init__.py * Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py * Update mteb/tasks/Retrieval/law/GerDaLIRRetrieval.py * Update mteb/tasks/Retrieval/law/LegalQuADRetrieval.py * Update mteb/tasks/Retrieval/law/LegalQuADRetrieval.py * Update scripts/run_mteb_law.py * Update mteb/tasks/Retrieval/law/LegalSummarizationRetrieval.py * Update mteb/tasks/Retrieval/law/LegalSummarizationRetrieval.py * Update mteb/tasks/Retrieval/law/LeCaRDv2Retrieval.py * Update mteb/tasks/Retrieval/law/LeCaRDv2Retrieval.py * Rename GerDaLIRRetrieval.py to GerDaLIRSmallRetrieval.py * Update mteb/tasks/Retrieval/__init__.py * Update GerDaLIRSmallRetrieval.py Add metadata * Update GerDaLIRSmallRetrieval.py Update metadata * Update AILACasedocsRetrieval.py Update AILACasedocsRetrieval metadata * Update AILAStatutesRetrieval.py Update AILAStatutesRetrieval metadata * Update LeCaRDv2Retrieval.py Update LeCaRDv2Retrieval metadata * Update LegalBenchConsumerContractsQARetrieval.py Update LegalBenchConsumerContractsQARetrieval metadata * Update LegalBenchCorporateLobbyingRetrieval.py Update LegalBenchCorporateLobbyingRetrieval metadata * Update LegalQuADRetrieval.py Update LegalQuADRetrieval metadata * Update LegalSummarizationRetrieval.py Update LegalSummarizationRetrieval metadata * Update AILACasedocsRetrieval.py Update AILACasedocsRetrieval * Update AILACasedocsRetrieval.py Update AILACasedocsRetrieval metadata * Update AILAStatutesRetrieval.py Update AILAStatutesRetrieval metadata * Update GerDaLIRSmallRetrieval.py Update GerDaLIRSmallRetrieval metadata * Update LeCaRDv2Retrieval.py Update LeCaRDv2Retrieval metadata * Update LegalBenchConsumerContractsQARetrieval.py * 
Update LegalBenchCorporateLobbyingRetrieval.py * Update LegalQuADRetrieval.py * Update LegalSummarizationRetrieval.py * Update AILACasedocsRetrieval.py * Update AILAStatutesRetrieval.py * Update GerDaLIRSmallRetrieval.py * Update LeCaRDv2Retrieval.py * move dataset language folder * update order --------- Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Checklist for adding MMTEB dataset
mteb package.
mteb run -m {model_name} -t {task_name} command.
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
make test.
make lint.
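To run both checklist models on several tasks, the command lines can be generated from the template above. The sketch below only builds and prints the commands and does not execute them. `build_commands` is a hypothetical helper, and using `LegalQuADRetrieval` as the task name is an assumption based on the file names in this PR.

```python
import shlex

# The two models named in the checklist.
MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]


def build_commands(task_names, models=MODELS):
    """Build one `mteb run -m {model_name} -t {task_name}` command per pair."""
    return [
        shlex.join(["mteb", "run", "-m", model, "-t", task])
        for model in models
        for task in task_names
    ]


# Assumed task name; substitute the names of the tasks actually being added.
for command in build_commands(["LegalQuADRetrieval"]):
    print(command)
```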