Here is an overview of the steps we'll take to generate a DPO/ORPO dataset for a new language:
graph LR
A[Load Aya Dataset] --> B[Filter Aya to focus on target language]
B --> C[Generate new responses using Inference Endpoints]
C --> D[UltraFeedback is used to rank responses]
D --> E[Validate preferences using Argilla]
- CohereForAI/aya_dataset: a dataset of human-annotated prompt-completion pairs across 71 languages.
distilabel
: a library for generating synthetic datasets.- Hugging Face Inference Endpoints for hosting models used to generate new responses.
- Argilla: a data collaboration tool which we can use to label preferences in the generated dataset.
To make things a bit easier, we'll start by filtering the Aya dataset to focus on the language you are interested in. This will make it easier to generate new responses using inference endpoints. Aya includes data for 71 languages, so hopefully you'll be able to find the language you are interested in. If not, feel free to reach out on the Hugging Face Discord server to see if there are other resources available for your language and we can help you get started.
For this step you can use the dataset prep notebook to filter the Aya dataset to the language you are interested in. This notebook also gives you a chance to explore the Aya dataset a little bit more and get a sense of the data you'll be working with.
Identify any existing strong instruction-tuned models for your language. Some languages may already have nice benchmarks and leaderboards that you can use to identify a strong base model. Discussing this with the community on the Hugging Face Discord server can also be helpful.
Tip
If there are no existing benchmarks for your language, it is probably best to rely on vibe based checks initially. As part of the data generation pipeline for this project, you could also decide to generate multiple responses for each prompt. This will allow you to also generate some rankings for the responses and use these to identify a strong base model. Even if there are benchmarks available, the results of the vibe based checks can be useful especially if the benchmarks are not very robust.
In the Distilabel library pipeline we created for this project, we use Hugging Face Inference Endpoints for hosting models used to generate new responses. This allow us to run the scripts without needing to have GPUs available locally. There are two main options for using the Inference Endpoints:
- serverless
- dedicated endpoints
The serverless option doesn't require any setup but is restricted to a smaller subset of models. This blog post gives a good overview of the supported models.
Tip
If you are able to use the serverless option, it is the easiest way to get started. You can find more information on how to use the serverless option in the docs. You will have higher rate limits with a Pro account. If you are using a free account, you will have a lower rate limit.
The dedicated endpoints option allows a large number of the models on the Hugging Face Hub to be deployed. This option requires a bit more setup but is more flexible. You can find more information on how to set up an endpoint in the docs.
Warning
Be aware that you will be charged for the use of the Inference Endpoints. You can find more information on the pricing in the pricing docs. It is very strongly recommend to enable the autoscaling option when setting up the endpoint to avoid unnecessary charges.
Warning
If you have local GPUs available, you can also adapt this approach using other inference frameworks such as Ollama or vLLM.
Once you have filtered the Aya dataset to the language you are interested in and identified a strong base model, you can use the distilabel
library to generate a second response for each prompt in the filtered Aya dataset. This script will use the strong base model identified in step 2 to generate the second response. For this step it probably makes sense to run the script locally. Rather than on Google Colab or a similar platform, as you will be able to run the script for longer periods of time without timeouts.
You should clone this repository and install the required dependencies. You can do this by running the following commands:
git clone https://github.com/huggingface/data-is-better-together
cd dpo
You should then create a virtual environment and install the required dependencies. I have migrated to using uv
for managing virtual environments. You can install uv
and installing Python dependencies. You can use uv
to create a virtual environment and install the required dependencies by running the following commands:
uv venv
# On macOS and Linux.
source .venv/bin/activate
# On Windows.
.venv\Scripts\activate
You can then install the required dependencies by running the following command:
uv pip install -r requirements.txt
Note
You can use whatever method you prefer for managing virtual environments. If you are not familiar with uv
, you can use venv
or conda
instead.
We will use the distilabel
library to generate a second response for each prompt in the filtered Aya dataset. This script will use the strong base model identified in step 2 to generate the second response.
Use the aya_dpo_gen.py
script to generate a second response for each prompt in the filtered Aya dataset. This script will use the strong base model identified in step 2 to generate the second response. This script is fairly heavily commented so you should be able to follow along with what is happening. I will highlight a few key points.
At the top of the script, there is a section for configuration. You should update the configuration to match your setup. Here is an example of the configuration section:
# Model Configuration
MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
MAX_NEW_TOKENS = 2000 # Maximum number of new tokens to generate
# Inference Endpoints Configuration
# INFERENCE_ENDPOINTS_URL = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct" # Inference endpoints URL
# ENDPOINT_NAME = "meta-llama/Meta-Llama-3-70B-Instruct"
INPUT_BATCH_SIZE = 10 # Input batch size `for the model via the Inference Endpoints API, you can adjust this based on the model's requirements and the hardware you are using to deploy the model
# Argilla Configuration
ARGILLA_SPACE_URL = "https://dibt-demo-argilla-space.hf.space" # Argilla Space URL
ARGILLA_DATASET_NAME = "aya_dutch_dpo" # Argilla dataset name
ARGILLA_WORKSPACE_NAME = "admin" # Argilla workspace name
# Dataset Configuration
INPUT_DATASET_HUB_ID = "DIBT/aya_dataset_dutch_example" # Input dataset hub ID (created in the previous step)
OUTPUT_DATASET_HUB_ID = "DIBT/aya_dutch_dpo_raw" # Output dataset hub ID
SPLIT = "test" # Split of the dataset to use. Start with test whilst you are testing the pipeline and then switch to train when you are ready to generate the full dataset
Warning
If you have local GPUs available, you can also adapt this approach using other inference frameworks such as Ollama or vLLM.
If you are using the dedicated endpoints option, you will need to pass the Inference Endpoints URL and the endpoint name.
For testing purposes, you can set the split to test
. This will allow you to test the pipeline with a smaller subset of the data. Once you are happy with the pipeline, you can switch to train
to generate the full dataset.
Depending on the model you are using to generate the second response, you may need to provide a custom system prompt. For the Dutch example, I found that using Llama 3 without a system prompt was generating responses that were almost always in English. I added a custom system prompt to the script to steer the model to answering in Dutch. Here is an example of the custom system prompt I used for the Dutch example:
system_prompt = """Je bent een AI-assistent. Je primaire taal is Nederlands. Beantwoord de meeste vragen en prompts in het Nederlands, tenzij specifiek gevraagd wordt om een andere taal te gebruiken.
Als je gevraagd wordt om te vertalen tussen twee andere talen, bijvoorbeeld van Frans naar Engels, voer dan de gevraagde vertaling uit. Wanneer citaten of passages in een andere taal in een prompt worden gegeven, ga er dan van uit dat de gebruiker wil dat je ze begrijpt en ernaar verwijst bij het formuleren van je Nederlandse antwoord. Vertaal de anderstalige tekst zelf niet, tenzij dit specifiek wordt gevraagd."""
This translates to something like
You are an AI assistant. Your primary language is Dutch. Answer most questions and prompts in Dutch, unless specifically asked to use another language. If you are asked to translate between two other languages, for example from French to English, perform the requested translation. When quotes or passages in another language are given in a prompt, assume that the user wants you to understand them and refer to them when formulating your English response. Do not translate the foreign text yourself, unless specifically requested.
This custom system prompt is passed to the model along with the prompt and the model generates a response in Dutch. When you are setting up the pipeline for your language, you should test the responses generated by the model to ensure they are in the correct language. This is an area where some experimentation with the system prompt may be necessary. Running the script with a small subset of the data will allow you to test the responses and adjust the system prompt as needed. Once you are happy with the responses generated by the model, you can run the script with the full dataset.
Tip
Whilst it might be tempting to use a system prompt that tells the model to always respond in the target language there are prompts that sometimes require the model to respond in a different language. For example, in the Dutch dataset there are prompts that ask, in Dutch, for a bit of French text to be translated into English. Truly expecting a multilingual model to respond in the target language is a good way to get a sense of how well the model can handle the task.
Since we are running on Inference Endpoints, the script will use the Hugging Face Inference API to generate the responses. This means most of the heavy computation is happening on the Hugging Face servers and you don't need to have a powerful machine to run the script.
You can run the script by running the following command:
python aya_dpo_gen.py
The script will generate a second response for each prompt in the filtered Aya dataset. The responses will be saved to a Huggingface dataset at the end of the run. The script will also generate an LLM ranking for each response. It is an open question how well LLMs will do as judges for non English languages. For Dutch the LLM rankings seem pretty good using LLama 3. If after doing a few runs it seems like the LLM rankings are not very good, you could remove this step from the pipeline.
Tip
Whilst the LLM judge may not be very good it may still be worth keeping that in the pipeline. When you annotate the data in Argilla you can use the LLM ranking as a starting point and assign a human ranking. This will also give you a sense of how well the LLM judge is doing. This alone could be a valuable contribution to the community.
The distilabel pipeline will by default get pushed to the Hugging Face Hub. You will be able to do the following in the Argilla interface:
- View the generated responses
- Rate the responses
- Provide feedback on the responses
- Add your own responses (for example to slightly adjust the response generated by the model)
Additionally the interface allows you to filter by the following:
- the predicted language of the response
- the LLM ranking of each response
This will allow you to quickly filter the responses and focus on the ones that are most interesting to you.
Tip
If the LLM judge is doing a good job you may choose to only annotate a subset of the responses. If the LLM judge is not doing a good job you may choose to annotate all of the responses. This will give you a sense of how well the LLM judge is doing and how much work is required to get a good dataset. If you setup Argilla with HF OAuth enabled you can also share the dataset with the community and get help with the annotation!
This video shows some of the features available in the Argilla interface:
Firefox.mp4
The 02_load_from_argilla.ipynb notebook walks through the process of loading the annotated dataset from Argilla and generating the final DPO/ORPO dataset. This notebook will also give you a chance to explore the annotated dataset and get a sense of the preferences that have been assigned to each response. You will likely want to adapt this notebook to your use case but it should provide a good starting point.
To help others generate DPO/ORPO datasets for more languages, you can share your scripts with the community. You can do this by creating a pull request to the overall README for this project. You can already see some examples of how this has been done for the Dutch example. You can either directly add your scripts, notebooks etc to a subfolder of the dpo
folder or you can add a link to a repository where you have stored your scripts.
(optional) Train a model using the generated DPO/ORPO dataset and push forward the state of the art in your language 🚀
The Alignment Handbook has a great guide on how to train a model using a DPO/ORPO dataset.
Alternatively, if you prefer to take a more hands of approach you could try using AutoTrain to train a model using the generated DPO/ORPO dataset. AutoTrain is a tool that allows you to train a model using a dataset and a configuration file.