Read our paper:
Cross-Lingual Adaptation for In-Context Learning [PDF] (Not submitted for publication)
Details such as Python and package versions can be found in the generated pyproject.toml and poetry.lock files.
We recommend using an environment manager such as conda. After setting up your environment with the correct Python version, please proceed with the installation of the required packages.
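For example, with conda (Python 3.10 below is purely illustrative; use the version actually pinned in pyproject.toml):

```sh
# create and activate a fresh environment
# NOTE: python=3.10 is an assumption -- check pyproject.toml for the real version
conda create -n claficle python=3.10
conda activate claficle
```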
For poetry users, getting set up is as easy as running

```sh
poetry install
```
We also provide a requirements.txt file for pip users who do not wish to use poetry. In this case, simply run

```sh
pip install -r requirements.txt
```
This requirements.txt file is generated by running

```sh
sh gen_pip_reqs.sh
```
If you wish to run evaluation without first training the model, we provide our checkpoints via the Internet Archive at this link. Please unzip this folder and organize it such that the checkpoints are in the checkpoints/ folder at the root of this repository.
We do not provide the bare hr_to_lr MetaICL model checkpoint. For this checkpoint, please refer to the instructions on the MetaICL repo for downloading their metaicl model in the hr_to_lr setting. Once downloaded, rename it to metaicl.pt and place it in the relevant checkpoints directory.
The following table provides a reference for the models evaluated in our paper.
Model Name | Evaluation Languages | Description |
---|---|---|
`metaicl` | en | direct `hr_to_lr` checkpoint from the MetaICL repo |
`sandwich-{lang}` | fr, de | `metaicl` sandwiched in a translation API for `lang`, serving as a baseline |
`metaicl-gewechselt-{lang}-clm` | fr, de | `metaicl` adapted to `lang` (fr or de) using WECHSEL, zero-shot or with the additional recommended CLM training |
`gpt2-gewechselt-{lang}-clm` | not evaluated | gpt2 adapted to `lang` (fr or de) using WECHSEL with the additional recommended CLM training. Note: we do not evaluate this directly but only use it as a base |
`{base}-metaicla` | fr, de | a base (any of the `gpt2-gewechselt-{lang}-clm`) with a MetaICL adapter, trained the standard way |
`{base}-metaiclva` | fr, de | a base (any of the `gpt2-gewechselt-{lang}-clm`) with a MetaICL vessel adapter, trained with targeted distillation |
We use Hydra for configuring our project: each entry point reads its defaults from a YAML file under claficle/conf/ and accepts key=value overrides on the command line.
To download/process the data, run claficle/data/oscar.py or claficle/data/benchmark.py for OSCAR and our multilingual, multi-task benchmark respectively. You may have to configure or override claficle/conf/setup_data.yaml accordingly; see the sketch below. We suggest inspecting slurm/data/ for examples of how we ran these.
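For instance, something like the following (the `lang` override is purely illustrative; consult claficle/conf/setup_data.yaml for the actual keys):

```sh
# download/process OSCAR for a given language
# NOTE: the `lang` key is an assumption -- check claficle/conf/setup_data.yaml
python claficle/data/oscar.py lang=fr

# build our multilingual, multi-task benchmark
python claficle/data/benchmark.py
```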
Note that to process the French and German OSCAR data, you need the tokenizers trained during WECHSEL initialization. You can either download these along with our checkpoints or run WECHSEL initialization yourself via claficle/models/gewechselt.py, configured with claficle/conf/wechsel_init.yaml. We have examples of how we ran this in slurm/wechsel/.
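A minimal sketch of running the initialization yourself (the override name is an assumption; see claficle/conf/wechsel_init.yaml for the real configuration):

```sh
# WECHSEL-initialize a model for a target language
# NOTE: `target_lang` is an illustrative key, not necessarily the real one
python claficle/models/gewechselt.py target_lang=de
```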
Once the data is downloaded, run claficle/run/eval.py to perform evaluation, configured with claficle/conf/eval.yaml. Examples are in slurm/eval/.
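For example (both override names are illustrative; the actual options are defined in claficle/conf/eval.yaml):

```sh
# evaluate a trained checkpoint on French
# NOTE: `checkpoint` and `lang` are assumed key names
python claficle/run/eval.py checkpoint=metaicl-gewechselt-fr-clm lang=fr
```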
Of course, to run evaluation you need trained checkpoints. You can once again either download these or train them yourself. For geWECHSELt models, run claficle/run/train.py. For MetaICLVA, run claficle/run/distil.py. For MetaICLA, please refer to our MetaICL fork. As always, these are configured with the relevant files in claficle/conf/ and accompanied by examples of how we ran them in slurm/.
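As a rough sketch (any overrides you need are defined in the corresponding files in claficle/conf/; the scripts in slurm/ show how we actually invoked these):

```sh
# CLM training for a geWECHSELt model
python claficle/run/train.py

# targeted distillation for a MetaICLVA model
python claficle/run/distil.py
```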
The repository is structured as follows:

```
├── LICENSE
├── README.md          <- The top-level README
├── data/
│   ├── interim/       <- Intermediate data that has been transformed.
│   ├── processed/     <- The final, canonical data sets for modeling.
│   └── raw/           <- The original, immutable data dump.
├── checkpoints/       <- Trained and serialized models.
├── notebooks/         <- Jupyter notebooks.
├── slurm/             <- SLURM scripts.
├── logs/              <- Logs.
├── reports/           <- Generated analysis as HTML, PDF, LaTeX, etc.
├── pyproject.toml     <- Project metadata, handled by poetry.
├── poetry.lock        <- Dependency resolution and locking, handled by poetry.
├── requirements.txt   <- For non-poetry users.
├── gen_pip_reqs.sh    <- Generates the pip requirements.txt file.
└── claficle/          <- Source code for use in this project.
    ├── __init__.py    <- Makes claficle a Python module.
    ├── data/          <- Scripts to download or generate data.
    ├── models/        <- Model definitions.
    ├── run/           <- Scripts to train, evaluate and use models.
    ├── conf/          <- Config files.
    ├── utils/         <- Miscellaneous utils.
    └── visualization/ <- Scripts for visualization.
```
The project structure is largely based on the cookiecutter data-science template. This is purposely opinionated so that paths align across collaborators without having to edit config files. Users may find the cookiecutter data-science opinions page of relevance.
The top-level data/ and checkpoints/ directories are in version control only to show structure. Their contents will not be committed and are ignored via .gitignore.