Add new benchmark: Galician bench (#2155)

* Add galician_bench
* Update xnli_gl path
* Add flores_gl group
* Update _flores_common_yaml
* Updated some task groupings and readme
EleutherAI · Oct 3, 2024 · 0e76386
1 parent ea17b98

Showing 35 changed files with 944 additions and 0 deletions.
diff --git a/lm_eval/tasks/README.md b/lm_eval/tasks/README.md
@@ -42,6 +42,7 @@
 | [fda](fda/README.md) | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English |
 | [fld](fld/README.md) | Tasks involving free-form and directed dialogue understanding. | English |
 | [french_bench](french_bench/README.md) | Set of tasks designed to assess language model performance in French. | French|
+| [galician_bench](galician_bench/README.md) | Collection of tasks in Galician encompassing various evaluation areas. | Galician |
 | [glue](glue/README.md) | General Language Understanding Evaluation benchmark to test broad language abilities. | English |
 | [gpqa](gpqa/README.md) | Tasks designed for general public question answering and knowledge verification. | English |
 | [gsm8k](gsm8k/README.md) | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English |

diff --git a/lm_eval/tasks/galician_bench/README.md b/lm_eval/tasks/galician_bench/README.md
@@ -0,0 +1,80 @@
+# GalicianBench
+
+### Paper
+
+GalicianBench is a benchmark for evaluating language models on Galician tasks. That is, it evaluates the ability of a language model to understand and generate Galician text. GalicianBench combines pre-existing, open datasets with datasets developed exclusively for this benchmark. All the details of GalicianBench will be published in a paper soon.
+
+The new evaluation datasets included in GalicianBench are:
+| Task          | Category       | Homepage  |
+|:-------------:|:-----:|:-----:|
+| Belebele_gl | Reading Comprehension | https://huggingface.co/datasets/proxectonos/belebele_gl |
+| GalCoLA | Linguistic Acceptability | https://huggingface.co/datasets/proxectonos/galcola |
+| MGSM_gl | Math | https://huggingface.co/datasets/proxectonos/mgsm_gl |
+| Parafrases_gl | Paraphrasing | https://huggingface.co/datasets/proxectonos/parafrases_gl |
+| PAWS-gl | Paraphrasing | https://huggingface.co/datasets/proxectonos/PAWS-gl |
+| OpenBookQA_gl | Question Answering | https://huggingface.co/datasets/proxectonos/openbookqa_gl |
+| Summarization_gl | Summarization | https://huggingface.co/datasets/proxectonos/summarization_gl |
+| TruthfulQA_gl | Truthfulness | https://huggingface.co/datasets/proxectonos/truthfulqa_gl |
+| xnli_gl | NLI | https://huggingface.co/datasets/proxectonos/xnli_gl |
+| xstorycloze_gl | Commonsense Reasoning | https://huggingface.co/datasets/proxectonos/xstorycloze_gl |
+
+The datasets included in GalicianBench that have been made public in previous publications are:
+
+| Task          | Category       | Paper title          | Homepage  |
+|:-------------:|:-----:|:-------------:|:-----:|
+| FLORES_gl | Translation | [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://arxiv.org/abs/2106.03193) | https://huggingface.co/datasets/facebook/flores |
+
+
+### Citation
+Paper for GalicianBench coming soon.
+
+### Groups and Tasks
+
+#### Groups
+
+- `galician_bench`: All tasks included in GalicianBench.
+- `flores_gl`: All FLORES translation tasks from or to Galician.
+
+
+#### Tasks
+
+The following tasks evaluate models on the GalicianBench datasets using various scoring methods.
+  - `belebele_glg_Latn`
+  - `flores_gl`
+  - `flores_gl-ca`
+  - `flores_gl-de`
+  - `flores_gl-en`
+  - `flores_gl-es`
+  - `flores_gl-eu`
+  - `flores_gl-fr`
+  - `flores_gl-it`
+  - `flores_gl-pt`
+  - `flores_ca-gl`
+  - `flores_de-gl`
+  - `flores_en-gl`
+  - `flores_es-gl`
+  - `flores_eu-gl`
+  - `flores_fr-gl`
+  - `flores_it-gl`
+  - `flores_pt-gl`
+  - `galcola`
+  - `summarization_gl`
+  - `parafrases_gl`
+  - `paws_gl`
+  - `openbookqa_gl`
+  - `mgsm_direct_gl`
+  - `truthfulqa_gl`
+  - `xnli_gl`
+  - `xstorycloze_gl`
+
+### Checklist
+
+* [x] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation?
+    * [ ] Yes, original implementation contributed by author of the benchmark
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
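The groups and tasks named in this README are selected by name when an evaluation is launched. As a minimal sketch, assuming the harness's `lm_eval` command-line entry point and its `--model`, `--model_args`, `--tasks`, and `--num_fewshot` flags, the command for the whole `galician_bench` group could be assembled like this. The helper `build_lm_eval_command` and the model identifier are hypothetical placeholders, not part of this commit.

```python
import shlex


def build_lm_eval_command(pretrained, tasks, num_fewshot=None):
    """Build an argv list for the `lm_eval` CLI (hypothetical helper)."""
    argv = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),
    ]
    if num_fewshot is not None:
        argv += ["--num_fewshot", str(num_fewshot)]
    return argv


# "my-org/my-model" is a placeholder, not a real checkpoint.
cmd = build_lm_eval_command("my-org/my-model", ["galician_bench"], num_fewshot=5)
print(shlex.join(cmd))
```

Passing `["flores_gl"]` instead would restrict the run to the translation group, and a single task name such as `galcola` selects one task.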
diff --git a/lm_eval/tasks/galician_bench/belebele_glg_Latn.yaml b/lm_eval/tasks/galician_bench/belebele_glg_Latn.yaml
@@ -0,0 +1,7 @@
+task: belebele_glg_Latn
+include: ../belebele/_default_template_yaml
+dataset_path: proxectonos/belebele_gl
+fewshot_split: train
+test_split: train
+metadata:
+  version: 1.0
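This task file sets only a few keys and pulls the rest from the included template. The sketch below illustrates the effect under the assumption that `include` is resolved as a shallow override, where keys in the task file win over keys in the template. The `template` dictionary is a hypothetical stand-in, since the contents of `../belebele/_default_template_yaml` are not part of this diff.

```python
def resolve_include(template, task_config):
    """Shallow merge: keys in the task file override keys from the included template."""
    merged = dict(template)
    merged.update({k: v for k, v in task_config.items() if k != "include"})
    return merged


# Hypothetical template values, for illustration only.
template = {
    "dataset_path": "some/other_dataset",
    "test_split": "test",
    "output_type": "multiple_choice",
}

# Keys taken from belebele_glg_Latn.yaml in this commit.
task_config = {
    "task": "belebele_glg_Latn",
    "include": "../belebele/_default_template_yaml",
    "dataset_path": "proxectonos/belebele_gl",
    "fewshot_split": "train",
    "test_split": "train",
    "metadata": {"version": 1.0},
}

merged = resolve_include(template, task_config)
print(merged["dataset_path"], merged["test_split"], merged["output_type"])
```

Under that assumption, the Galician dataset path and the `train` split replace the template's values, while keys the task file does not mention are inherited unchanged.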
diff --git a/lm_eval/tasks/galician_bench/flores_gl/_flores_common_yaml b/lm_eval/tasks/galician_bench/flores_gl/_flores_common_yaml
@@ -0,0 +1,28 @@
+group: flores
+dataset_path: facebook/flores
+dataset_name: all
+output_type: generate_until
+#! The test split of flores is not publicly available! (See paper section 6.1)
+#! We are using `dev` and `devtest` splits, but they're mapped to train/validation/test in `data/flores/flores.py`.
+training_split: dev
+validation_split: dev
+test_split: devtest
+fewshot_split: dev
+target_delimiter: ''
+generation_kwargs:
+  until:
+    - "\n"
+metric_list:
+  - metric: bleu
+    aggregation: bleu
+    higher_is_better: true
+  - metric: ter
+    aggregation: ter
+    higher_is_better: false
+  - metric: chrf
+    aggregation: chrf
+    higher_is_better: true
+metadata:
+  version: 1.0
+dataset_kwargs:
+  trust_remote_code: true
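The `generation_kwargs.until` entry above ends each generation at the first newline, so only the first generated line is scored as the translation. A minimal sketch of that truncation follows. The function `truncate_at_stop` and the sample string are hypothetical and only illustrate the effect; this is not the harness's own code.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Hypothetical raw model output that runs on past the translation.
raw = " Bos días a todos.\nEnglish sentence: Another line"
print(repr(truncate_at_stop(raw, ["\n"])))
```

Everything after the first newline is discarded before BLEU, TER, and chrF are computed, which keeps a model that continues the few-shot pattern from being penalized for the extra text.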
diff --git a/lm_eval/tasks/galician_bench/flores_gl/create_yamls_flores_gl.py b/lm_eval/tasks/galician_bench/flores_gl/create_yamls_flores_gl.py
@@ -0,0 +1,115 @@
+"""
+Script to generate task YAMLs for the FLORES-200 dataset.
+Based on `tasks/translation/utils.py`.
+"""
+
+import argparse
+import yaml
+from langcodes import Language
+import itertools
+
+# utils
+flatten = lambda l: list(itertools.chain(*l))
+
+# constants
+_LANGUAGES = [
+"ace_Arab",  "bam_Latn",  "dzo_Tibt",  "hin_Deva",	"khm_Khmr",  "mag_Deva",  "pap_Latn",  "sot_Latn",	"tur_Latn",
+"ace_Latn",  "ban_Latn",  "ell_Grek",  "hne_Deva",	"kik_Latn",  "mai_Deva",  "pbt_Arab",  "spa_Latn",	"twi_Latn",
+"acm_Arab",  "bel_Cyrl",  "eng_Latn",  "hrv_Latn",	"kin_Latn",  "mal_Mlym",  "pes_Arab",  "srd_Latn",	"tzm_Tfng",
+"acq_Arab",  "bem_Latn",  "epo_Latn",  "hun_Latn",	"kir_Cyrl",  "mar_Deva",  "plt_Latn",  "srp_Cyrl",	"uig_Arab",
+"aeb_Arab",  "ben_Beng",  "est_Latn",  "hye_Armn",	"kmb_Latn",  "min_Arab",  "pol_Latn",  "ssw_Latn",	"ukr_Cyrl",
+"afr_Latn",  "bho_Deva",  "eus_Latn",  "ibo_Latn",	"kmr_Latn",  "min_Latn",  "por_Latn",  "sun_Latn",	"umb_Latn",
+"ajp_Arab",  "bjn_Arab",  "ewe_Latn",  "ilo_Latn",	"knc_Arab",  "mkd_Cyrl",  "prs_Arab",  "swe_Latn",	"urd_Arab",
+"aka_Latn",  "bjn_Latn",  "fao_Latn",  "ind_Latn",	"knc_Latn",  "mlt_Latn",  "quy_Latn",  "swh_Latn",	"uzn_Latn",
+"als_Latn",  "bod_Tibt",  "fij_Latn",  "isl_Latn",	"kon_Latn",  "mni_Beng",  "ron_Latn",  "szl_Latn",	"vec_Latn",
+"amh_Ethi",  "bos_Latn",  "fin_Latn",  "ita_Latn",	"kor_Hang",  "mos_Latn",  "run_Latn",  "tam_Taml",	"vie_Latn",
+"apc_Arab",  "bug_Latn",  "fon_Latn",  "jav_Latn",	"lao_Laoo",  "mri_Latn",  "rus_Cyrl",  "taq_Latn",	"war_Latn",
+"arb_Arab",  "bul_Cyrl",  "fra_Latn",  "jpn_Jpan",	"lij_Latn",  "mya_Mymr",  "sag_Latn",  "taq_Tfng",	"wol_Latn",
+"arb_Latn",  "cat_Latn",  "fur_Latn",  "kab_Latn",	"lim_Latn",  "nld_Latn",  "san_Deva",  "tat_Cyrl",	"xho_Latn",
+"ars_Arab",  "ceb_Latn",  "fuv_Latn",  "kac_Latn",	"lin_Latn",  "nno_Latn",  "sat_Olck",  "tel_Telu",	"ydd_Hebr",
+"ary_Arab",  "ces_Latn",  "gaz_Latn",  "kam_Latn",	"lit_Latn",  "nob_Latn",  "scn_Latn",  "tgk_Cyrl",	"yor_Latn",
+"arz_Arab",  "cjk_Latn",  "gla_Latn",  "kan_Knda",	"lmo_Latn",  "npi_Deva",  "shn_Mymr",  "tgl_Latn",	"yue_Hant",
+"asm_Beng",  "ckb_Arab",  "gle_Latn",  "kas_Arab",	"ltg_Latn",  "nso_Latn",  "sin_Sinh",  "tha_Thai",	"zho_Hans",
+"ast_Latn",  "crh_Latn",  "glg_Latn",  "kas_Deva",	"ltz_Latn",  "nus_Latn",  "slk_Latn",  "tir_Ethi",	"zho_Hant",
+"awa_Deva",  "cym_Latn",  "grn_Latn",  "kat_Geor",	"lua_Latn",  "nya_Latn",  "slv_Latn",  "tpi_Latn",	"zsm_Latn",
+"ayr_Latn",  "dan_Latn",  "guj_Gujr",  "kaz_Cyrl",	"lug_Latn",  "oci_Latn",  "smo_Latn",  "tsn_Latn",	"zul_Latn",
+"azb_Arab",  "deu_Latn",  "hat_Latn",  "kbp_Latn",	"luo_Latn",  "ory_Orya",  "sna_Latn",  "tso_Latn",
+"azj_Latn",  "dik_Latn",  "hau_Latn",  "kea_Latn",	"lus_Latn",  "pag_Latn",  "snd_Arab",  "tuk_Latn",
+"bak_Cyrl",  "dyu_Latn",  "heb_Hebr",  "khk_Cyrl",	"lvs_Latn",  "pan_Guru",  "som_Latn",  "tum_Latn"
+]
+LANGUAGE_PAIRS = [(a, b) for idx, a in enumerate(_LANGUAGES) for b in _LANGUAGES[idx + 1:]]
+
+LANGUAGES_OF_INTEREST = ["cat_Latn", "spa_Latn", "eng_Latn", "glg_Latn", "eus_Latn", "ita_Latn", "deu_Latn", "por_Latn", "fra_Latn"]
+MAIN_LANG = "glg_Latn"
+LANGUAGE_PAIRS = [(a, b) for (a, b) in LANGUAGE_PAIRS if a in LANGUAGES_OF_INTEREST and b in LANGUAGES_OF_INTEREST and MAIN_LANG in (a, b)]
+
+# auxiliary functions
+
+code_to_language_name = lambda code: Language.make(language=Language.get(code)["language"]).display_name()
+code_to_short_name = lambda code: Language.get(code)["language"]
+jinja_var = lambda s: "{{" + s + "}}" # wrapper to avoid having to escape { } in format strings
+
+def doc_to_text(src: str, tgt: str) -> str:
+    src_name, tgt_name = map(code_to_language_name, [src, tgt])
+
+    return f"""\
+{src_name} sentence: {jinja_var('sentence_' + src)}
+{tgt_name} sentence:"""
+
+def doc_to_target(tgt: str) -> str:
+
+    return f"{jinja_var('sentence_' + tgt)}"
+
+# main function
+
+def gen_lang_yamls(output_dir: str, overwrite: bool) -> None:
+    """
+    Generate a YAML file for each translation direction.
+    """
+
+    err = []
+    for src, tgt in LANGUAGE_PAIRS:
+
+        # do both translation directions for each lang pair
+        for src, tgt in [(src, tgt), (tgt, src)]:
+            lang_pair_name = f"{code_to_short_name(src)}-{code_to_short_name(tgt)}"
+            yaml_file_name = f"flores_{lang_pair_name}.yaml"
+
+            try:
+                with open( f"{output_dir}/{yaml_file_name}", "w" if overwrite else "x", encoding="utf-8") as outfile:
+                    print(f"Creating {yaml_file_name}...")
+                    outfile.write("# File generated by `create-yamls.py`\n")
+                    yaml.dump(
+                        {
+#                             "group": [f"{BENCH_NAME}_bench", f"{BENCH_NAME}_bench_flores"],
+#                            "group": "flores_gl",
+                            "include": "_flores_common_yaml",
+                            "task": f"flores_{lang_pair_name}",
+                            "doc_to_text": doc_to_text(src, tgt),
+                            "doc_to_target": doc_to_target(tgt),
+                        },
+                        outfile,
+                        sort_keys=False,
+                    )
+
+            except FileExistsError:
+                err.append(yaml_file_name)
+
+    if len(err) > 0:
+        raise FileExistsError(
+            "Files were not created because they already exist:"
+            f" {', '.join(err)}"
+            "\nUse flag --overwrite to overwrite them."
+        )
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--overwrite", default=False, action="store_true", help="Overwrite files if they already exist")
+    parser.add_argument( "--output-dir", default=".", help="Directory to write yaml files to" )
+    args = parser.parse_args()
+
+    gen_lang_yamls(output_dir=args.output_dir, overwrite=args.overwrite)
+
+if __name__ == "__main__":
+    main()
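The pair filtering in this script keeps only language pairs that include Galician, then emits both directions for each pair. The sketch below reproduces that count without the `langcodes` dependency. The `SHORT` mapping is a hand-written stand-in for `code_to_short_name`, and `itertools.combinations` replaces the index-based enumeration over the full `_LANGUAGES` list.

```python
from itertools import combinations

LANGUAGES_OF_INTEREST = [
    "cat_Latn", "spa_Latn", "eng_Latn", "glg_Latn", "eus_Latn",
    "ita_Latn", "deu_Latn", "por_Latn", "fra_Latn",
]
MAIN_LANG = "glg_Latn"

# Hand-written stand-in for code_to_short_name (which uses langcodes).
SHORT = {
    "cat_Latn": "ca", "spa_Latn": "es", "eng_Latn": "en", "glg_Latn": "gl",
    "eus_Latn": "eu", "ita_Latn": "it", "deu_Latn": "de", "por_Latn": "pt",
    "fra_Latn": "fr",
}

# Unordered pairs that include Galician, mirroring the LANGUAGE_PAIRS filter.
pairs = [(a, b) for a, b in combinations(LANGUAGES_OF_INTEREST, 2) if MAIN_LANG in (a, b)]

# Each pair is emitted in both translation directions.
directions = [d for a, b in pairs for d in ((a, b), (b, a))]

task_names = sorted(f"flores_{SHORT[src]}-{SHORT[tgt]}" for src, tgt in directions)
print(len(task_names))
print(task_names)
```

Eight partner languages times two directions gives sixteen task names, which matches the sixteen entries in the `flores_gl` group file.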
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_ca-gl.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_ca-gl.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_ca-gl
+doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}}
+
+  Galician sentence:'
+doc_to_target: '{{sentence_glg_Latn}}'
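In the generated YAML, the blank line inside the single-quoted `doc_to_text` scalar folds to a single newline, so the prompt consists of two lines. The sketch below shows how such a template turns a dataset record into a prompt and target. The `render` helper is a simplified regex stand-in for Jinja2 rendering, and the record is a hypothetical example rather than real FLORES data.

```python
import re


def render(template, doc):
    """Replace each {{field}} placeholder with doc[field] (simplified stand-in for Jinja2)."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(doc[m.group(1)]), template)


# Values of doc_to_text and doc_to_target after YAML parsing.
doc_to_text = "Catalan sentence: {{sentence_cat_Latn}}\nGalician sentence:"
doc_to_target = "{{sentence_glg_Latn}}"

# Hypothetical record; the templates expect one sentence_<code> field per language.
doc = {"sentence_cat_Latn": "Bon dia.", "sentence_glg_Latn": "Bos días."}

prompt = render(doc_to_text, doc)
target = render(doc_to_target, doc)

# target_delimiter is '' in _flores_common_yaml, so nothing is inserted between them.
print(prompt + target)
```

Because the delimiter is empty, few-shot examples built this way place the reference translation directly after the colon, with no separating space.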
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_de-gl.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_de-gl.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_de-gl
+doc_to_text: 'German sentence: {{sentence_deu_Latn}}
+
+  Galician sentence:'
+doc_to_target: '{{sentence_glg_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_en-gl.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_en-gl.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_en-gl
+doc_to_text: 'English sentence: {{sentence_eng_Latn}}
+
+  Galician sentence:'
+doc_to_target: '{{sentence_glg_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_es-gl.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_es-gl.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_es-gl
+doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}}
+
+  Galician sentence:'
+doc_to_target: '{{sentence_glg_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_eu-gl.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_eu-gl.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_eu-gl
+doc_to_text: 'Basque sentence: {{sentence_eus_Latn}}
+
+  Galician sentence:'
+doc_to_target: '{{sentence_glg_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_fr-gl.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_fr-gl.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_fr-gl
+doc_to_text: 'French sentence: {{sentence_fra_Latn}}
+
+  Galician sentence:'
+doc_to_target: '{{sentence_glg_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_gl-ca.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_gl-ca.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_gl-ca
+doc_to_text: 'Galician sentence: {{sentence_glg_Latn}}
+
+  Catalan sentence:'
+doc_to_target: '{{sentence_cat_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_gl-de.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_gl-de.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_gl-de
+doc_to_text: 'Galician sentence: {{sentence_glg_Latn}}
+
+  German sentence:'
+doc_to_target: '{{sentence_deu_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_gl-en.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_gl-en.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_gl-en
+doc_to_text: 'Galician sentence: {{sentence_glg_Latn}}
+
+  English sentence:'
+doc_to_target: '{{sentence_eng_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_gl-es.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_gl-es.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_gl-es
+doc_to_text: 'Galician sentence: {{sentence_glg_Latn}}
+
+  Spanish sentence:'
+doc_to_target: '{{sentence_spa_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_gl-eu.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_gl-eu.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_gl-eu
+doc_to_text: 'Galician sentence: {{sentence_glg_Latn}}
+
+  Basque sentence:'
+doc_to_target: '{{sentence_eus_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_gl-fr.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_gl-fr.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_gl-fr
+doc_to_text: 'Galician sentence: {{sentence_glg_Latn}}
+
+  French sentence:'
+doc_to_target: '{{sentence_fra_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_gl-it.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_gl-it.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_gl-it
+doc_to_text: 'Galician sentence: {{sentence_glg_Latn}}
+
+  Italian sentence:'
+doc_to_target: '{{sentence_ita_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_gl-pt.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_gl-pt.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_gl-pt
+doc_to_text: 'Galician sentence: {{sentence_glg_Latn}}
+
+  Portuguese sentence:'
+doc_to_target: '{{sentence_por_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_gl.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_gl.yaml
@@ -0,0 +1,24 @@
+group: flores_gl
+task:
+  - flores_es-gl
+  - flores_gl-es
+  - flores_en-gl
+  - flores_gl-en
+  - flores_eu-gl
+  - flores_gl-eu
+  - flores_pt-gl
+  - flores_gl-pt
+  - flores_it-gl
+  - flores_gl-it
+  - flores_fr-gl
+  - flores_gl-fr
+  - flores_ca-gl
+  - flores_gl-ca
+  - flores_gl-de
+  - flores_de-gl
+aggregate_metric_list:
+  - metric: bleu
+    aggregation: mean
+    weight_by_size: false
+metadata:
+  version: 1.0
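The `aggregate_metric_list` entry reports a single group-level BLEU as the mean over the sixteen directions, with `weight_by_size: false`. As a sketch of what that setting implies, assuming `weight_by_size: true` would mean a size-weighted mean, the two options compare as follows. The scores and sizes are made-up numbers for illustration.

```python
def aggregate_mean(scores, sizes, weight_by_size):
    """Mean of per-task scores, optionally weighted by the number of examples per task."""
    if weight_by_size:
        return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)
    return sum(scores) / len(scores)


# Hypothetical per-task BLEU scores and example counts.
scores = [30.0, 20.0]
sizes = [100, 300]

unweighted = aggregate_mean(scores, sizes, weight_by_size=False)
weighted = aggregate_mean(scores, sizes, weight_by_size=True)
print(unweighted, weighted)
```

With the unweighted setting every translation direction contributes equally to the group score, regardless of how many sentences each task contains.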
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_it-gl.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_it-gl.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_it-gl
+doc_to_text: 'Italian sentence: {{sentence_ita_Latn}}
+
+  Galician sentence:'
+doc_to_target: '{{sentence_glg_Latn}}'
diff --git a/lm_eval/tasks/galician_bench/flores_gl/flores_pt-gl.yaml b/lm_eval/tasks/galician_bench/flores_gl/flores_pt-gl.yaml
@@ -0,0 +1,7 @@
+# File generated by `create-yamls.py`
+include: _flores_common_yaml
+task: flores_pt-gl
+doc_to_text: 'Portuguese sentence: {{sentence_por_Latn}}
+
+  Galician sentence:'
+doc_to_target: '{{sentence_glg_Latn}}'