Multilingual DPO/ORPO Dataset Project

This project aims to encourage the creation of DPO/ORPO datasets for more languages. By providing these tools, we hope to foster a community of people building DPO/ORPO datasets for different languages. Currently, many languages do not have DPO/ORPO datasets openly shared on the Hugging Face Hub. The DIBT/preference_data_by_language Space gives an overview of language coverage: at the time of this commit, only 14 languages have DPO/ORPO datasets available on the Hugging Face Hub. By following this recipe, you can easily generate a DPO/ORPO dataset for a new language.

What are the goals of this project?

This project has the following goals:

  • An Argilla interface for ranking a response written by a human Aya annotator against a response generated by an LLM. See the aya_dutch_dpo example.
  • A "raw dataset" with LLM feedback for each prompt. See DIBT/aya_dutch_dpo_raw for an example.
  • A growing dataset with human-verified preferences for each response. See DIBT/aya_dutch_dpo for an example dataset.

Why do we need DPO/ORPO datasets for more languages?

What is Direct Preference Optimization (DPO)? In short, DPO is a technique for training models to optimize for human preferences.

Direct Preference Optimization (DPO) has emerged as a promising alternative for aligning Large Language Models (LLMs) to human or AI preferences. Unlike traditional alignment methods, which are based on reinforcement learning, DPO recasts the alignment formulation as a simple loss function that can be optimized directly on a dataset of preferences $\{(x, y_w, y_l)\}$, where $x$ is a prompt and $(y_w, y_l)$ are the preferred and dispreferred responses. source

Or, in other words, to train a model using DPO you need a dataset of prompts and responses where one response is preferred over the other. This type of data is also used for ORPO, another alignment technique we'll describe below.
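For reference, this is the loss from the DPO paper (not something specific to this project) that gets optimized directly on such a preference dataset, where $\pi_\theta$ is the model being trained, $\pi_{\text{ref}}$ a frozen reference model, $\beta$ a scaling hyperparameter, and $\sigma$ the logistic function:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$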

Example of a preference tuning dataset: each row contains a prompt and a "chosen" and "rejected" response.
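Concretely, a single row of such a dataset might look like the following (a hypothetical Dutch example with made-up values; the exact column names depend on the dataset):

```python
# Hypothetical row of a preference (DPO/ORPO) dataset.
example_row = {
    "prompt": "Wat is de hoofdstad van Nederland?",  # "What is the capital of the Netherlands?"
    "chosen": "De hoofdstad van Nederland is Amsterdam.",  # preferred response
    "rejected": "De hoofdstad van Nederland is Rotterdam.",  # factually wrong, so dispreferred
}
```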


DPO datasets are a powerful tool for fine-tuning language models to generate responses that are more aligned with human preferences, making them a valuable resource for improving the quality of chatbots and other generative models. However, DPO datasets are currently only available for a small number of languages. By generating more DPO datasets for different languages, we can help to improve the quality of generative models in a wider range of languages.

Recently, Odds Ratio Preference Optimization (ORPO) has been proposed as an alternative to DPO. ORPO is a novel approach to fine-tuning language models that incorporates preference alignment directly into the supervised fine-tuning (SFT) process by using the odds ratio to contrast favored and disfavored generation styles. By applying a minor penalty to the disfavored style during SFT, ORPO effectively guides the model toward the desired behavior without the need for an additional alignment step.
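Sketching the core idea from the ORPO paper (notation from Hong et al., 2024, not from this repository): the odds of a response are $\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$, and the usual SFT loss is augmented with a penalty based on the odds ratio between the chosen and rejected responses:

$$
\mathcal{L}_{\text{ORPO}} = \mathbb{E}_{(x, y_w, y_l)}\big[\mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}}\big], \qquad
\mathcal{L}_{\text{OR}} = -\log \sigma\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right)
$$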

tl;dr: if you have a DPO-style dataset and a strong base model, you can use ORPO to train a chat model. Recently, Argilla, KAIST, and Hugging Face used this approach to train HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1, a very strong chat model, using only 7k preference pairs!
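As a minimal sketch of what that looks like in practice, here is an ORPO fine-tuning loop using trl's ORPOTrainer. The base model and dataset repo id are placeholders, and the exact trl API may differ between versions, so treat this as a starting point rather than a reference implementation:

```python
# Minimal ORPO fine-tuning sketch using trl (assumed API; check the trl docs for your version).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A DPO-style preference dataset with "prompt", "chosen" and "rejected" columns.
train_dataset = load_dataset("your-org/your_dpo_dataset", split="train")  # placeholder repo id

trainer = ORPOTrainer(
    model=model,
    args=ORPOConfig(output_dir="orpo-chat-model", beta=0.1),  # beta weights the odds-ratio penalty
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```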

How can you contribute?

As part of Data Is Better Together, we're supporting the community in generating more DPO/ORPO datasets for different languages. If you would like to help, you can follow the steps below to generate a DPO/ORPO dataset for a language that you are interested in. There are already many language communities working together on the Hugging Face Discord server, so you can also join the server to collaborate with others on this project 🤗.

Project Overview

Aya, an open science initiative to accelerate multilingual AI progress, has released a dataset of human-annotated prompt-completion pairs across 71 languages. We can use this dataset to generate DPO/ORPO datasets for languages that don't currently have them.

Here are the steps we'll take to generate a DPO/ORPO dataset for a new language:

  • Start from the CohereForAI/aya_dataset.
  • Filter the Aya dataset to the language you are focusing on.
  • Use distilabel to generate a second response for each prompt in the filtered Aya dataset (see the sketch after this list).
  • (Optional) Send the generated dataset to Argilla for annotation where the community can choose which response is better.
  • (Optional) Train a model using the generated DPO/ORPO dataset and push forward the state of the art in your language 🚀🚀🚀
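
Below is a rough sketch of the first three steps. It assumes the distilabel 1.x API and a model served via Hugging Face Inference Endpoints; the model id, language filter, and target repo id are placeholders, and the exact pipeline used by this project lives in the scripts and notebooks in this repository:

```python
# Sketch: filter the Aya dataset to one language and generate a second response per prompt.
from datasets import load_dataset
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

# 1. Load the Aya dataset and keep only the language you are focusing on.
aya = load_dataset("CohereForAI/aya_dataset", split="train")
aya_filtered = aya.filter(lambda row: row["language"] == "Dutch")  # swap in your language

# 2. Build a distilabel pipeline that generates one extra response per Aya prompt.
with Pipeline(name="aya-dpo-generation") as pipeline:
    load_data = LoadDataFromDicts(
        data=[{"instruction": row["inputs"]} for row in aya_filtered],
    )
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model
        ),
    )
    load_data >> generate

if __name__ == "__main__":
    distiset = pipeline.run()
    # 3. Push the raw generations to the Hub so the community can inspect and annotate them.
    distiset.push_to_hub("your-username/aya_your_language_dpo_raw")  # placeholder repo id
```

The human-written Aya completion and the model-generated response then form the two candidate responses for each prompt; the annotation step in Argilla decides which one ends up as "chosen".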

You can find more detailed, step-by-step guidance on generating a DPO/ORPO dataset for a new language in the instructions.

I'm GPU-poor, can I still get involved?

Yes! The example scripts in this repository use Hugging Face Inference Endpoints for the inference component, so you can run them on your local machine without needing a GPU. We can also provide GPU grants for running the distilabel script if you need them; please reach out to us on the Hugging Face Discord server. Note: before providing a grant, we will want to see that you have a plan for how you will use it. In particular, you should already have set up an Argilla Space for your project and have identified the language you want to work on and the models you want to use.

Next steps

The notebooks and code currently only show how to generate the synthetic data and create a preference dataset annotation Space. The next steps are to collect human feedback on the synthetic data and then use it to train a model. We will cover this in a future notebook.