Code for the paper "Reward-Augmented Data Enhances Direct Preference Alignment of LLMs".
Authors: Shenao Zhang¹, Zhihan Liu¹, Boyi Liu², Yufeng Zhang, Yingxiang Yang², Yongfei Liu², Liyu Chen², Tao Sun², Zhaoran Wang¹.
¹Northwestern University, ²ByteDance
Install the package dependencies as follows:
python -m pip install .
To fine-tune Gemma-2-9b-it, upgrade transformers by running pip install --upgrade transformers.
Replace USERNAME in scripts/preprocess.py and scripts/reward_augmentation.py with your Hugging Face username.
First, preprocess the UltraFeedback dataset with the script below, keeping the quality scores of the responses:
python scripts/preprocess.py
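As a rough illustration of what this step produces (and of where the USERNAME placeholder is used), the sketch below builds scored chosen/rejected pairs from UltraFeedback. It is not the repo's script: the dataset field names, the pairing rule, and the output dataset name are assumptions, and scripts/preprocess.py remains the authoritative version.

```python
# Hypothetical sketch (not the repo's script): build chosen/rejected pairs from
# UltraFeedback while keeping each response's quality score.
# The field names ("instruction", "completions", "overall_score") and the output
# dataset name are assumptions and may differ from what scripts/preprocess.py uses.
from datasets import load_dataset

def to_preference_pair(example):
    # Rank this prompt's completions by their annotated overall quality score.
    ranked = sorted(example["completions"], key=lambda c: c["overall_score"], reverse=True)
    best, worst = ranked[0], ranked[-1]
    return {
        "prompt": example["instruction"],
        "chosen": best["response"],
        "rejected": worst["response"],
        # Keep the raw scores so they can be used for reward augmentation later.
        "chosen_score": best["overall_score"],
        "rejected_score": worst["overall_score"],
    }

raw = load_dataset("openbmb/UltraFeedback", split="train")
raw = raw.filter(lambda ex: len(ex["completions"]) >= 2)  # need at least a pair
pairs = raw.map(to_preference_pair, remove_columns=raw.column_names)
pairs.push_to_hub("USERNAME/ultrafeedback-with-scores")  # USERNAME: your HF username
```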
Then the reward-augmented preference data can be obtained by running:
python scripts/reward_augmentation.py
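Conceptually, this step conditions each prompt on a quality (reward) goal so that the response whose score matches the goal becomes the preferred one. The sketch below shows one plausible way to do this; the prompt template, field names, and output dataset name are assumptions, and the exact construction used by scripts/reward_augmentation.py follows the paper.

```python
# Hypothetical illustration (not the repo's script) of reward augmentation:
# each scored preference pair yields two reward-conditioned examples. Under the
# high goal the originally chosen response wins; under the low goal the
# originally rejected response wins. Template and field names are assumptions.
from datasets import load_dataset

GOAL_TEMPLATE = "{prompt}\n\n[Desired quality score: {goal}]"

def augment(batch):
    prompts, chosen, rejected = [], [], []
    for p, c, r, cs, rs in zip(batch["prompt"], batch["chosen"], batch["rejected"],
                               batch["chosen_score"], batch["rejected_score"]):
        prompts += [GOAL_TEMPLATE.format(prompt=p, goal=cs),
                    GOAL_TEMPLATE.format(prompt=p, goal=rs)]
        chosen += [c, r]
        rejected += [r, c]
    return {"prompt": prompts, "chosen": chosen, "rejected": rejected}

pairs = load_dataset("USERNAME/ultrafeedback-with-scores", split="train")
augmented = pairs.map(augment, batched=True, remove_columns=pairs.column_names)
augmented.push_to_hub("USERNAME/ultrafeedback-reward-augmented")  # assumed name
```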
Replace USERNAME in the recipe's config_full.yaml with your Hugging Face username.
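For orientation, the placeholder typically sits in the dataset entry of the recipe config. The excerpt below is hypothetical: the dataset name and surrounding fields are assumptions modeled on The Alignment Handbook's DPO recipes, and only the USERNAME part needs to be replaced.

```yaml
# Hypothetical excerpt of recipes/qwen2-7b-instruct-dpo-ra/dpo/config_full.yaml;
# field names follow The Alignment Handbook's DPO recipes, dataset name is assumed.
model_name_or_path: Qwen/Qwen2-7B-Instruct
dataset_mixer:
  USERNAME/ultrafeedback-reward-augmented: 1.0
dataset_splits:
- train_prefs
- test_prefs
```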
Then run standard DPO on the reward-augmented preference data, e.g., on the Qwen2-7B-Instruct model:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_dpo.py recipes/qwen2-7b-instruct-dpo-ra/dpo/config_full.yaml
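The trainer itself is unchanged; only the data differs. For reference, standard DPO minimizes, for each preference pair $(x, y_w, y_l)$ with policy $\pi_\theta$ and frozen reference model $\pi_{\text{ref}}$,

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right],$$

where $\beta$ controls the strength of the KL regularization toward the reference model.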
If you find this repo or the paper useful, please cite:
@article{zhang2024reward,
title={Reward-Augmented Data Enhances Direct Preference Alignment of LLMs},
author={Zhang, Shenao and Liu, Zhihan and Liu, Boyi and Zhang, Yufeng and Yang, Yingxiang and Liu, Yongfei and Chen, Liyu and Sun, Tao and Wang, Zhaoran},
journal={arXiv preprint arXiv:2410.08067},
year={2024}
}
This repo is built upon The Alignment Handbook. We thank the authors for their great work.