Add new benchmark: Basque bench (EleutherAI#2153)
* Add basque_bench

* Add flores_eu group

* Update _flores_common_yaml

* Run linters, updated flores, mgsm, copa, and readme

* Apply suggestions from code review

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

---------

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
3 people authored and mariagrandury committed Oct 9, 2024
1 parent edbe084 commit a44913c
Showing 27 changed files with 507 additions and 0 deletions.
1 change: 1 addition & 0 deletions lm_eval/tasks/README.md
| [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English |
| [asdiv](asdiv/README.md) | Tasks involving arithmetic and mathematical reasoning challenges. | English |
| [babi](babi/README.md) | Tasks designed as question and answering challenges based on simulated stories. | English |
| [basque_bench](basque_bench/README.md) | Collection of tasks in Basque encompassing various evaluation areas. | Basque |
| [basqueglue](basqueglue/README.md) | Tasks designed to evaluate language understanding in Basque language. | Basque |
| [bbh](bbh/README.md) | Tasks focused on deep semantic understanding through hypothesization and reasoning. | English, German |
| [belebele](belebele/README.md) | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages) |
88 changes: 88 additions & 0 deletions lm_eval/tasks/basque_bench/README.md
# BasqueBench

### Paper

BasqueBench is a benchmark for evaluating language models on Basque tasks; that is, it evaluates a language model's ability to understand and generate Basque text. BasqueBench combines pre-existing open datasets with datasets developed exclusively for this benchmark. All the details of BasqueBench will be published in a paper soon.

The new evaluation datasets included in BasqueBench are:
| Task | Category | Homepage |
|:-------------:|:-----:|:-----:|
| MGSM_eu | Math | https://huggingface.co/datasets/HiTZ/MGSM-eu |
| WNLI_eu | Natural Language Inference | https://huggingface.co/datasets/HiTZ/wnli-eu |
| XCOPA_eu | Commonsense Reasoning | https://huggingface.co/datasets/HiTZ/XCOPA-eu |

The datasets included in BasqueBench that have been made public in previous publications are:

| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_eu | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| EusExams | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusExams |
| EusProficiency | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusProficiency |
| EusReading | Reading Comprehension | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusReading |
| EusTrivia | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusTrivia |
| FLORES_eu | Translation | [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) | https://huggingface.co/datasets/facebook/flores |
| QNLIeu | Natural Language Inference | [BasqueGLUE: A Natural Language Understanding Benchmark for Basque](https://aclanthology.org/2022.lrec-1.172/) | https://huggingface.co/datasets/orai-nlp/basqueGLUE |
| XNLIeu | Natural Language Inference | [XNLIeu: a dataset for cross-lingual NLI in Basque](https://arxiv.org/abs/2404.06996) | https://huggingface.co/datasets/HiTZ/xnli-eu |
| XStoryCloze_eu | Commonsense Reasoning | [Few-shot Learning with Multilingual Generative Language Models](https://aclanthology.org/2022.emnlp-main.616/) | https://huggingface.co/datasets/juletxara/xstory_cloze |


### Citation
Paper for BasqueBench coming soon.

### Groups and Tasks

#### Groups

- `basque_bench`: All tasks included in BasqueBench.
- `flores_eu`: All FLORES translation tasks from or to Basque.

#### Tasks

The following tasks evaluate models on the BasqueBench datasets using various scoring methods.
- `belebele_eus_Latn`
- `eus_exams_eu`
- `eus_proficiency`
- `eus_reading`
- `eus_trivia`
- `flores_eu`
- `flores_eu-ca`
- `flores_eu-de`
- `flores_eu-en`
- `flores_eu-es`
- `flores_eu-fr`
- `flores_eu-gl`
- `flores_eu-it`
- `flores_eu-pt`
- `flores_ca-eu`
- `flores_de-eu`
- `flores_en-eu`
- `flores_es-eu`
- `flores_fr-eu`
- `flores_gl-eu`
- `flores_it-eu`
- `flores_pt-eu`
- `mgsm_direct_eu`
- `mgsm_native_cot_eu`
- `qnlieu`
- `wnli_eu`
- `xcopa_eu`
- `xnli_eu`
- `xnli_eu_native`
- `xstorycloze_eu`

Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_eus_Latn`: Belebele Basque
- `qnlieu`: From BasqueGLUE
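
Any of the groups or tasks above can be selected by name when running the harness. A minimal sketch using the Python `simple_evaluate` entry point (the checkpoint, few-shot count, and batch size below are placeholders, not recommended settings):

```python
# Hedged example: evaluating the full BasqueBench group through the harness's Python API.
# "HiTZ/latxa-7b-v1.2" is only an illustrative checkpoint name; substitute your own model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # HuggingFace-backed model wrapper
    model_args="pretrained=HiTZ/latxa-7b-v1.2",   # placeholder checkpoint
    tasks=["basque_bench"],                       # or e.g. ["flores_eu"], ["eus_trivia"]
    num_fewshot=5,                                # placeholder few-shot setting
    batch_size=8,
)
print(results["results"])
```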


### Checklist

* [x] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation?
* [ ] Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
18 changes: 18 additions & 0 deletions lm_eval/tasks/basque_bench/basque_bench.yaml
group: basque_bench
task:
- belebele_eus_Latn
- xstorycloze_eu
- flores_eu
- eus_reading
- eus_proficiency
- eus_trivia
- eus_exams_eu
- qnlieu
- xnli_eu
- xnli_eu_native
- wnli_eu
- xcopa_eu
- mgsm_direct_eu
- mgsm_native_cot_eu
metadata:
  version: 1.0
27 changes: 27 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/_flores_common_yaml
tag: flores
dataset_path: facebook/flores
dataset_name: all
output_type: generate_until
#! The test split of flores is not publicly available! (See paper section 6.1)
training_split: dev
validation_split: dev
test_split: devtest
fewshot_split: dev
target_delimiter: ''
generation_kwargs:
  until:
    - "\n"
metric_list:
  - metric: bleu
    aggregation: bleu
    higher_is_better: true
  - metric: ter
    aggregation: ter
    higher_is_better: false
  - metric: chrf
    aggregation: chrf
    higher_is_better: true
metadata:
  version: 0.1
dataset_kwargs:
  trust_remote_code: true
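
The `bleu`, `ter`, and `chrf` entries above are corpus-level translation metrics (TER is the only one where lower is better). A minimal sketch of what they measure, assuming `sacrebleu` as the scoring backend and using toy hypothesis/reference strings:

```python
# Illustrative only: toy sentences, not FLORES data.
import sacrebleu

hyps = ["Kaixo mundua .", "Eguraldi ona dago gaur ."]        # system outputs
refs = [["Kaixo , mundua .", "Gaur eguraldi ona dago ."]]    # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hyps, refs)   # higher is better
chrf = sacrebleu.corpus_chrf(hyps, refs)   # higher is better
ter = sacrebleu.corpus_ter(hyps, refs)     # lower is better
print(bleu.score, chrf.score, ter.score)
```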
115 changes: 115 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/create_yamls_flores_eu.py
"""
Script to generate task YAMLs for the FLORES-200 dataset.
Based on `tasks/translation/utils.py`.
"""

import argparse
import itertools

import yaml
from langcodes import Language

# utils
flatten = lambda l: list(itertools.chain(*l))

# constants
_LANGUAGES = [
"ace_Arab", "bam_Latn", "dzo_Tibt", "hin_Deva", "khm_Khmr", "mag_Deva", "pap_Latn", "sot_Latn", "tur_Latn",
"ace_Latn", "ban_Latn", "ell_Grek", "hne_Deva", "kik_Latn", "mai_Deva", "pbt_Arab", "spa_Latn", "twi_Latn",
"acm_Arab", "bel_Cyrl", "eng_Latn", "hrv_Latn", "kin_Latn", "mal_Mlym", "pes_Arab", "srd_Latn", "tzm_Tfng",
"acq_Arab", "bem_Latn", "epo_Latn", "hun_Latn", "kir_Cyrl", "mar_Deva", "plt_Latn", "srp_Cyrl", "uig_Arab",
"aeb_Arab", "ben_Beng", "est_Latn", "hye_Armn", "kmb_Latn", "min_Arab", "pol_Latn", "ssw_Latn", "ukr_Cyrl",
"afr_Latn", "bho_Deva", "eus_Latn", "ibo_Latn", "kmr_Latn", "min_Latn", "por_Latn", "sun_Latn", "umb_Latn",
"ajp_Arab", "bjn_Arab", "ewe_Latn", "ilo_Latn", "knc_Arab", "mkd_Cyrl", "prs_Arab", "swe_Latn", "urd_Arab",
"aka_Latn", "bjn_Latn", "fao_Latn", "ind_Latn", "knc_Latn", "mlt_Latn", "quy_Latn", "swh_Latn", "uzn_Latn",
"als_Latn", "bod_Tibt", "fij_Latn", "isl_Latn", "kon_Latn", "mni_Beng", "ron_Latn", "szl_Latn", "vec_Latn",
"amh_Ethi", "bos_Latn", "fin_Latn", "ita_Latn", "kor_Hang", "mos_Latn", "run_Latn", "tam_Taml", "vie_Latn",
"apc_Arab", "bug_Latn", "fon_Latn", "jav_Latn", "lao_Laoo", "mri_Latn", "rus_Cyrl", "taq_Latn", "war_Latn",
"arb_Arab", "bul_Cyrl", "fra_Latn", "jpn_Jpan", "lij_Latn", "mya_Mymr", "sag_Latn", "taq_Tfng", "wol_Latn",
"arb_Latn", "cat_Latn", "fur_Latn", "kab_Latn", "lim_Latn", "nld_Latn", "san_Deva", "tat_Cyrl", "xho_Latn",
"ars_Arab", "ceb_Latn", "fuv_Latn", "kac_Latn", "lin_Latn", "nno_Latn", "sat_Olck", "tel_Telu", "ydd_Hebr",
"ary_Arab", "ces_Latn", "gaz_Latn", "kam_Latn", "lit_Latn", "nob_Latn", "scn_Latn", "tgk_Cyrl", "yor_Latn",
"arz_Arab", "cjk_Latn", "gla_Latn", "kan_Knda", "lmo_Latn", "npi_Deva", "shn_Mymr", "tgl_Latn", "yue_Hant",
"asm_Beng", "ckb_Arab", "gle_Latn", "kas_Arab", "ltg_Latn", "nso_Latn", "sin_Sinh", "tha_Thai", "zho_Hans",
"ast_Latn", "crh_Latn", "glg_Latn", "kas_Deva", "ltz_Latn", "nus_Latn", "slk_Latn", "tir_Ethi", "zho_Hant",
"awa_Deva", "cym_Latn", "grn_Latn", "kat_Geor", "lua_Latn", "nya_Latn", "slv_Latn", "tpi_Latn", "zsm_Latn",
"ayr_Latn", "dan_Latn", "guj_Gujr", "kaz_Cyrl", "lug_Latn", "oci_Latn", "smo_Latn", "tsn_Latn", "zul_Latn",
"azb_Arab", "deu_Latn", "hat_Latn", "kbp_Latn", "luo_Latn", "ory_Orya", "sna_Latn", "tso_Latn",
"azj_Latn", "dik_Latn", "hau_Latn", "kea_Latn", "lus_Latn", "pag_Latn", "snd_Arab", "tuk_Latn",
"bak_Cyrl", "dyu_Latn", "heb_Hebr", "khk_Cyrl", "lvs_Latn", "pan_Guru", "som_Latn", "tum_Latn"
]
LANGUAGE_PAIRS = [(a, b) for idx, a in enumerate(_LANGUAGES) for b in _LANGUAGES[idx + 1:]]

LANGUAGES_OF_INTEREST = ["cat_Latn", "spa_Latn", "eng_Latn", "glg_Latn", "eus_Latn", "ita_Latn", "deu_Latn", "por_Latn", "fra_Latn"]
MAIN_LANG = "eus_Latn"
LANGUAGE_PAIRS = [(a, b) for (a, b) in LANGUAGE_PAIRS if a in LANGUAGES_OF_INTEREST and b in LANGUAGES_OF_INTEREST and MAIN_LANG in (a, b)]

# auxiliary functions

code_to_language_name = lambda code: Language.make(language=Language.get(code)["language"]).display_name()
code_to_short_name = lambda code: Language.get(code)["language"]
jinja_var = lambda s: "{{" + s + "}}" # wrapper to avoid having to escape { } in format strings

def doc_to_text(src: str, tgt: str) -> str:
    src_name, tgt_name = map(code_to_language_name, [src, tgt])

    return f"""\
{src_name} sentence: {jinja_var('sentence_' + src)}
{tgt_name} sentence:"""


def doc_to_target(tgt: str) -> str:
    return f"{jinja_var('sentence_' + tgt)}"

# main function

def gen_lang_yamls(output_dir: str, overwrite: bool) -> None:
    """
    Generate a YAML file for each translation direction.
    """

    err = []
    for src, tgt in LANGUAGE_PAIRS:
        # do both translation directions for each lang pair
        for src, tgt in [(src, tgt), (tgt, src)]:
            lang_pair_name = f"{code_to_short_name(src)}-{code_to_short_name(tgt)}"
            yaml_file_name = f"flores_{lang_pair_name}.yaml"

            try:
                with open(f"{output_dir}/{yaml_file_name}", "w" if overwrite else "x", encoding="utf-8") as outfile:
                    print(f"Creating {yaml_file_name}...")
                    outfile.write("# File generated by `create-yamls.py`\n")
                    yaml.dump(
                        {
                            # "group": [f"{BENCH_NAME}_bench", f"{BENCH_NAME}_bench_flores"],
                            # "group": "flores_eu",
                            "include": "_flores_common_yaml",
                            "task": f"flores_{lang_pair_name}",
                            "doc_to_text": doc_to_text(src, tgt),
                            "doc_to_target": doc_to_target(tgt),
                        },
                        outfile,
                        sort_keys=False,
                    )

            except FileExistsError:
                err.append(yaml_file_name)

    if len(err) > 0:
        raise FileExistsError(
            "Files were not created because they already exist:"
            f" {', '.join(err)}"
            "\nUse flag --overwrite to overwrite them."
        )


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--overwrite", default=False, action="store_true", help="Overwrite files if they already exist")
    parser.add_argument("--output-dir", default=".", help="Directory to write yaml files to")
    args = parser.parse_args()

    gen_lang_yamls(output_dir=args.output_dir, overwrite=args.overwrite)


if __name__ == "__main__":
    main()
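
The per-direction YAML files that follow were produced by this script. Should they need to be regenerated (for example after extending `LANGUAGES_OF_INTEREST`), the generator can also be driven from Python instead of the command line; a minimal sketch, assuming it is run from inside `flores_eu/`:

```python
# Equivalent to: python create_yamls_flores_eu.py --output-dir . --overwrite
from create_yamls_flores_eu import gen_lang_yamls

gen_lang_yamls(output_dir=".", overwrite=True)  # rewrites each flores_{src}-{tgt}.yaml
```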
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_ca-eu.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_ca-eu
doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}}

  Basque sentence:'
doc_to_target: '{{sentence_eus_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_de-eu.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_de-eu
doc_to_text: 'German sentence: {{sentence_deu_Latn}}

  Basque sentence:'
doc_to_target: '{{sentence_eus_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_en-eu.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_en-eu
doc_to_text: 'English sentence: {{sentence_eng_Latn}}

  Basque sentence:'
doc_to_target: '{{sentence_eus_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_es-eu.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_es-eu
doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}}

  Basque sentence:'
doc_to_target: '{{sentence_eus_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_eu-ca.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_eu-ca
doc_to_text: 'Basque sentence: {{sentence_eus_Latn}}

  Catalan sentence:'
doc_to_target: '{{sentence_cat_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_eu-de.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_eu-de
doc_to_text: 'Basque sentence: {{sentence_eus_Latn}}

  German sentence:'
doc_to_target: '{{sentence_deu_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_eu-en.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_eu-en
doc_to_text: 'Basque sentence: {{sentence_eus_Latn}}

  English sentence:'
doc_to_target: '{{sentence_eng_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_eu-es.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_eu-es
doc_to_text: 'Basque sentence: {{sentence_eus_Latn}}

  Spanish sentence:'
doc_to_target: '{{sentence_spa_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_eu-fr.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_eu-fr
doc_to_text: 'Basque sentence: {{sentence_eus_Latn}}

  French sentence:'
doc_to_target: '{{sentence_fra_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_eu-gl.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_eu-gl
doc_to_text: 'Basque sentence: {{sentence_eus_Latn}}

  Galician sentence:'
doc_to_target: '{{sentence_glg_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_eu-it.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_eu-it
doc_to_text: 'Basque sentence: {{sentence_eus_Latn}}

  Italian sentence:'
doc_to_target: '{{sentence_ita_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_eu-pt.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_eu-pt
doc_to_text: 'Basque sentence: {{sentence_eus_Latn}}

  Portuguese sentence:'
doc_to_target: '{{sentence_por_Latn}}'
24 changes: 24 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_eu.yaml
group: flores_eu
task:
- flores_es-eu
- flores_eu-es
- flores_en-eu
- flores_eu-en
- flores_eu-pt
- flores_pt-eu
- flores_eu-it
- flores_it-eu
- flores_eu-fr
- flores_fr-eu
- flores_eu-ca
- flores_ca-eu
- flores_eu-gl
- flores_gl-eu
- flores_eu-de
- flores_de-eu
aggregate_metric_list:
  - metric: bleu
    aggregation: mean
    weight_by_size: false
metadata:
  version: 1.0
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_fr-eu.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_fr-eu
doc_to_text: 'French sentence: {{sentence_fra_Latn}}

  Basque sentence:'
doc_to_target: '{{sentence_eus_Latn}}'
7 changes: 7 additions & 0 deletions lm_eval/tasks/basque_bench/flores_eu/flores_gl-eu.yaml
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_gl-eu
doc_to_text: 'Galician sentence: {{sentence_glg_Latn}}

  Basque sentence:'
doc_to_target: '{{sentence_eus_Latn}}'
