Merge branch 'develop' into semi_auto_trainer_llama2
haohongxiang committed Jan 28, 2024
2 parents b3e64c3 + 13fd655 commit eda936c
Showing 51 changed files with 5,246 additions and 143 deletions.
225 changes: 225 additions & 0 deletions examples/RLHF/README.md
@@ -0,0 +1,225 @@
# RLHF PPO

This directory provides code and a complete usage example for aligning LLMs with human preferences using the PPO reinforcement learning algorithm. The PPO implementation follows the one in [PKU-Alignment/safe-rlhf](https://github.com/PKU-Alignment/safe-rlhf) (PKU Beaver) and supports common PPO training-stabilization techniques such as reward normalization and an auxiliary pretraining loss; the example uses some of the datasets and models provided by PKU-Alignment/safe-rlhf. We will keep extending this work to support RLHF with better quality, lower cost, higher performance, and larger scale.

## Quick Start

The project is organized as follows:

```
.
├── reward_main.py           # reward model training script
├── reward_config.json       # reward model training configuration
├── reward_trainer.py        # reward model trainer
├── ppo_main.py              # RLHF training script
├── ppo_config.json          # RLHF training configuration
├── ppo_trainer.py           # RLHF trainer
├── data                     # dataset-related directory
│   ├── base.py              # dataset base classes and utilities
│   ├── alpaca.py            # alpaca (raw) dataset
│   ├── safe_rlhf.py         # safe_rlhf (raw) dataset
│   ├── preference.py        # preference dataset
│   ├── prompt_only.py       # prompt-only dataset
│   └── supervised.py        # supervised dataset
├── models                   # model-related directory
│   ├── score_model_utils.py # score model base classes and utilities
│   └── score_model.py       # score model definition
└── README.md
```

### Environment Setup

- Python >= 3.10
- PaddlePaddle >= 2.6.0
- PaddleNLP >= 2.6.0

In addition, install the following dependency: `pip install rich`

### Data Preparation

PPO training consists of three stages: Supervised Fine-Tuning, Reward Model Fine-Tuning, and RLHF (see the Training section below). Several datasets are involved; they are introduced below together with instructions for plugging in custom data.

#### Supervised Fine-Tuning Data
This is the same as in [LLM fine-tuning](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#2-%E7%B2%BE%E8%B0%83); prepare the data as described there.

#### Reward Model Fine-Tuning Data
The Reward Model Fine-Tuning stage requires human preference data. The example uses the [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) dataset released with PKU-Alignment/safe-rlhf; one sample is shown below. The `prompt`, `response_0`, `response_1`, and `better_response_id` fields are used to build the preference data (via its `safe` fields the dataset annotates helpfulness and harmlessness separately; this example uses the helpfulness annotation).

```json
{
    "prompt": "Is it possible to purchase a complete chess set at a pawn shop?",
    "response_0": "Yes, it is possible to purchase a complete chess set at a pawn shop.",
    "response_1": "Yes, it is possible to purchase a complete chess set at a pawn shop. Pawn shops may carry chess sets that have been either inherited or previously borrowed. There is also a possibility of finding individual chess pieces such as rooks, knights, bishops, queens, and kings. If you cannot find a complete set, you may be able to purchase the various chess pieces separately.",
    "is_response_0_safe": true,
    "is_response_1_safe": true,
    "better_response_id": 1,
    "safer_response_id": 1
}
```
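
Such a record maps onto the preference fields of the `RawSample` protocol introduced under Custom Data below. A minimal sketch (the bundled `safe_rlhf.py` already performs this conversion; `to_raw_sample` is a hypothetical helper, shown only for exposition):

```python
from data import RawSample

def to_raw_sample(record: dict) -> RawSample:
    # `better` is True when `answer` (response_0 here) beats `other_answer` (response_1).
    return RawSample(
        input=record["prompt"],
        answer=record["response_0"],
        other_answer=record["response_1"],
        better=record["better_response_id"] == 0,
    )
```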

#### RLHF Data
The RLHF stage uses prompt-only data; optionally, additional supervised data can be supplied to build an LM loss that constrains RLHF training. The example uses the [PKU-Alignment/PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) dataset (also a human preference dataset; only its `prompt` field is used, with duplicate prompts removed). Data from [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) is additionally used to build the extra loss term.

The example datasets above are downloaded and cached automatically during training.

#### Custom Data
Data definition is built around two preset classes, `RawSample` and `RawDataset`: `RawSample` specifies the sample-level protocol and `RawDataset` the dataset-level protocol. Data that follows these protocols can be consumed through the three preset dataset interfaces required by RLHF training: `SupervisedDataset`, `PreferenceDataset`, and `PromptOnlyDataset`.

A custom dataset needs to:
- Inherit from `RawDataset` and define the class attribute `NAME`, which registers the dataset.
- Implement `__init__` (load the data), `__getitem__` (fetch the sample at a given index and return it as a `RawSample`), and `__len__` (return the dataset size).

For example:

```python
from datasets import load_dataset
from data import RawDataset, RawSample

class MyRawDataset(RawDataset):
    NAME = 'my-dataset-name'

    def __init__(self, path=None) -> None:
        # Load a dataset from Hugging Face or any other data source
        # self.data = load_dataset(path or 'my-organization/my-dataset')['train']
        self.data = [{
            'col1': 'question',
            'col2': 'answer1',
            'col3': 'answer2',
            'col4': 1,  # score of answer1
            'col5': 2   # score of answer2
        }] * 10  # dummy data for example

    def __getitem__(self, index: int) -> RawSample:
        data = self.data[index]
        # Construct a `RawSample` dictionary from your custom dataset item
        return RawSample(
            input=data['col1'],
            answer=data['col2'],
            other_answer=data['col3'],
            better=float(data['col4']) > float(data['col5']),
        )

    def __len__(self) -> int:
        return len(self.data)  # dataset size
```

`RawSample` is a superset of the several sample types used across RLHF training, shown below, and thus bridges the sample types required by each training stage. When preparing custom data, use the `(input, answer)` fields of `RawSample` for SFT data, the `(input, answer, other_answer, better)` fields for human preference data, and the `(input)` field for prompt-only data.

```python
class RawSample(TypedDict, total=False):
    """Raw sample type.
    For SupervisedDataset, should provide (input, answer) or (dialogue).
    For PreferenceDataset, should provide (input, answer, other_answer, better).
    For SafetyPreferenceDataset, should provide (input, answer, other_answer, safer, is_safe, is_other_safe).
    For PromptOnlyDataset, should provide (input).
    When input is a list, it would be processed as a dialogue.
    """

    # Texts
    input: NotRequired[str | list[str]]  # either `input` or `dialogue` should be provided
    """User input text."""
    answer: NotRequired[str]
    """Assistant answer text."""
    other_answer: NotRequired[str]
    """Other assistant answer text via resampling."""
    dialogue: NotRequired[list[str]]  # either `input` or `dialogue` should be provided
    """Dialogue history."""

    # Flags
    better: NotRequired[bool]
    """Whether ``answer`` is better than ``other_answer``."""
    safer: NotRequired[bool]
    """Whether ``answer`` is safer than ``other_answer``."""
    is_safe: NotRequired[bool]
    """Whether ``answer`` is safe."""
    is_other_safe: NotRequired[bool]
    """Whether ``other_answer`` is safe."""
```
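
For instance, one literal sample of each kind (an illustrative sketch; the texts are made up):

```python
from data import RawSample

# SFT: (input, answer)
sft_sample = RawSample(input='What is PPO?', answer='PPO is a policy-gradient RL algorithm ...')

# Human preference: (input, answer, other_answer, better)
preference_sample = RawSample(
    input='What is PPO?',
    answer='A detailed, helpful explanation ...',
    other_answer='A terse reply.',
    better=True,  # `answer` is preferred over `other_answer`
)

# Prompt only: (input)
prompt_sample = RawSample(input='What is PPO?')
```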

A dataset defined this way can then be used through the preset interfaces by its `NAME`. The following datasets are currently built in: `"PKU-SafeRLHF/train", "PKU-SafeRLHF/test", "PKU-SafeRLHF-30K/train", "PKU-SafeRLHF-30K/test", "PKU-SafeRLHF-10K/train", "alpaca"`. Multiple datasets can also be combined with specified proportions, so you can prepare as many datasets as needed for each training stage. For example:

```python
from paddlenlp.transformers import AutoTokenizer
from data import PreferenceDataset

tokenizer = AutoTokenizer.from_pretrained('facebook/llama-7b')
dataset = PreferenceDataset({
    'alpaca': 0.75,
    'my-dataset-name': 0.5
}, tokenizer)
```
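
The other stage-specific interfaces are constructed the same way (a sketch, assuming they share the `({name: proportion}, tokenizer)` signature shown above):

```python
from data import PromptOnlyDataset, SupervisedDataset

# Assumed to mirror the PreferenceDataset constructor shown above.
sft_dataset = SupervisedDataset({'alpaca': 1.0}, tokenizer)
prompt_dataset = PromptOnlyDataset({'PKU-SafeRLHF/train': 1.0}, tokenizer)
```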

### Training

The complete PPO training pipeline consists of the following three stages, illustrated below (figure from [DeepSpeed-Chat](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat)):

<p align="center">
<img src="https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/assets/image/ppo_trainer.png?raw=true" align="middle" width = "600" />
</p>

1. Supervised Fine-Tuning (SFT)

This is the same as [LLM fine-tuning](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#2-%E7%B2%BE%E8%B0%83); follow that guide to train the SFT model and use the resulting checkpoint here.

2. Reward Model Fine-Tuning

Train the reward model with the `reward_main.py` script and the `reward_config.json` configuration:

```
python -u -m paddle.distributed.launch reward_main.py ./reward_config.json
```

Most parameters in `reward_config.json` have the same meaning as in [LLM fine-tuning](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#2-%E7%B2%BE%E8%B0%83) and are not repeated here. One minor difference is that `train_datasets`/`eval_datasets` specify the training and evaluation sets using the `NAME` under which each dataset is registered. Reward model training additionally supports the following parameters (defaults follow PKU-Alignment/PKU-SafeRLHF); an illustrative config fragment follows the list:

- `normalize_score_during_training`: whether to normalize rewards during training; defaults to `False`.
- `normalizer_type`: how the normalizer computes the mean and variance; one of `"RunningMeanStd", "ExponentialMovingAverage"`.
- `normalizer_momentum`: the momentum used by the `ExponentialMovingAverage` normalizer; defaults to `0.9`.
- `loss_type`: whether to train the reward model with a token-level or sequence-level loss; one of `"token-wise", "sequence-wise"`; defaults to `"sequence-wise"`.
- `regularization`: the coefficient of the reward regularization term in the training objective; defaults to `0.001`.
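
Put together, the reward-specific part of `reward_config.json` might look like this (a sketch with illustrative values; the exact dataset-specification format and the remaining fields follow the shipped config and the LLM fine-tuning arguments):

```json
{
    "train_datasets": "PKU-SafeRLHF-30K/train",
    "eval_datasets": "PKU-SafeRLHF-30K/test",
    "normalize_score_during_training": false,
    "normalizer_type": "ExponentialMovingAverage",
    "normalizer_momentum": 0.9,
    "loss_type": "sequence-wise",
    "regularization": 0.001
}
```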

3. RLHF

The RLHF stage requires four models: an actor model, a reference model, a critic model, and a reward model. The actor and reference models are initialized from the SFT model (with the reference model frozen), and the critic and reward models are initialized from the reward model (with the reward model frozen); note that if SFT used LoRA, merge the LoRA weights first. The example uses the SFT model ([PKU-Alignment/alpaca-7b-reproduced](https://huggingface.co/PKU-Alignment/alpaca-7b-reproduced)) and reward model ([PKU-Alignment/beaver-7b-v1.0-reward](https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-reward); note that this model targets only helpfulness, not harmlessness) provided by PKU-Alignment/safe-rlhf. Run RLHF training with the `ppo_main.py` script and the `ppo_config.json` configuration:

```
python -u -m paddle.distributed.launch ppo_main.py ./ppo_config.json
```

Most parameters in `ppo_config.json` have the same meaning as in [LLM fine-tuning](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#2-%E7%B2%BE%E8%B0%83) and are not repeated here; the key parameters are listed below (defaults follow PKU-Alignment/PKU-SafeRLHF), with a sketch of how the clipping and KL settings come into play after the list:

- `train_datasets`: the training set, specified via the `NAME` a dataset is registered under.
- `eval_datasets`: the evaluation set, specified via the registered `NAME`.
- `ptx_datasets`: the dataset used for the ptx loss, specified via the registered `NAME`; if not provided, the ptx loss is disabled.
- `actor_model_name_or_path`: the model name or directory from which the actor and reference models are initialized/frozen.
- `reward_model_name_or_path`: the model name or directory of the reward model.
- `reward_critic_model_name_or_path`: the model name or directory of the critic model; if not provided, the critic model is initialized from `reward_model_name_or_path`.
- `per_device_prompt_batch_size`: per-device batch size for reading the prompt-only dataset during rollout generation.
- `per_device_train_batch_size`: per-device batch size for generating from prompts and for training.
- `num_return_sequences`: the number of responses generated per prompt, i.e. `GenerationConfig.num_return_sequences`; all responses are used for training.
- `temperature`: the sampling `temperature`, i.e. `GenerationConfig.temperature`.
- `top_p`: the top-p-filtering threshold for sampling, i.e. `GenerationConfig.top_p`.
- `repetition_penalty`: the repetition penalty coefficient for sampling, i.e. `GenerationConfig.repetition_penalty`.
- `update_iters`: the number of times a batch of generated data is reused.
- `kl_coeff`: the coefficient of the KL penalty applied to the reward.
- `clip_range_score`: the threshold at which the reward-model score is clipped.
- `clip_range_value`: if the critic's (value function's) new value for the current sequence deviates from the old value in the experience buffer by more than this range, it is clipped.
- `clip_range_ratio`: clips the ratio of the new probability of the current sequence to the old probability in the experience buffer to `(1-clip_range_ratio, 1+clip_range_ratio)` (PPO-Clip).
- `ptx_coeff`: the coefficient of the pretraining (ptx) loss term.
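
To make these hyperparameters concrete, here is a minimal, schematic sketch of how they typically enter the PPO computation (it follows the common safe-rlhf-style formulation; function names and shapes are illustrative, not the exact code in `ppo_trainer.py`):

```python
import paddle

def shaped_rewards(score, log_probs, ref_log_probs, kl_coeff, clip_range_score):
    # Per-token KL penalty, plus the clipped reward-model score added on the final token.
    rewards = -kl_coeff * (log_probs - ref_log_probs)
    rewards[:, -1] += paddle.clip(score, -clip_range_score, clip_range_score)
    return rewards

def actor_loss(log_probs, old_log_probs, advantages, clip_range_ratio):
    # PPO-Clip: bound the new/old policy probability ratio to (1 - eps, 1 + eps).
    ratio = paddle.exp(log_probs - old_log_probs)
    clipped = paddle.clip(ratio, 1.0 - clip_range_ratio, 1.0 + clip_range_ratio)
    return -paddle.minimum(ratio * advantages, clipped * advantages).mean()

def critic_loss(values, old_values, returns, clip_range_value):
    # Bound how far new value predictions may move from the buffered old values.
    clipped = paddle.clip(values, old_values - clip_range_value, old_values + clip_range_value)
    return 0.5 * paddle.maximum((values - returns) ** 2, (clipped - returns) ** 2).mean()
```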

In addition, all [parameters supported by `TrainingArguments`](https://paddlenlp.readthedocs.io/zh/latest/trainer.html#trainingarguments) are shared by actor model and critic model training (e.g. `sharding_stage`), except that `critic_learning_rate`/`critic_weight_decay`/`critic_lr_scheduler_type`/`critic_warmup_ratio`/`critic_recompute` are provided to configure critic model training separately. The actor model and critic model checkpoints are saved under the `policy` and `value` folders of the directory given by `output_dir`, respectively.
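
A minimal `ppo_config.json` fragment tying these together might look like the following (illustrative values only; consult the shipped `ppo_config.json` for the actual defaults):

```json
{
    "train_datasets": "PKU-SafeRLHF/train",
    "ptx_datasets": "alpaca",
    "actor_model_name_or_path": "PKU-Alignment/alpaca-7b-reproduced",
    "reward_model_name_or_path": "PKU-Alignment/beaver-7b-v1.0-reward",
    "kl_coeff": 0.02,
    "clip_range_ratio": 0.2,
    "ptx_coeff": 16.0,
    "critic_learning_rate": 5e-6,
    "output_dir": "./checkpoints/ppo"
}
```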

The RLHF training in this example, at the data and model scale used here, has been verified on 4 and 8 NVIDIA A100 80G GPUs with sharding stage3.

### Inference

After training, the checkpoints in the `policy` folder under `output_dir` can be used directly for inference as described in the [LLM inference](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#4-%E6%8E%A8%E7%90%86) section; please refer to that content.

## Acknowledgements

We drew on the excellent design and implementation of [PKU-Alignment/safe-rlhf](https://github.com/PKU-Alignment/safe-rlhf) (PKU Beaver), and we thank its authors.

## References
- Zheng R, Dou S, Gao S, et al. Secrets of RLHF in Large Language Models Part I: PPO. arXiv preprint arXiv:2307.04964, 2023.
- Dai J, Pan X, Sun R, et al. Safe RLHF: Safe Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2310.12773, 2023.
33 changes: 33 additions & 0 deletions examples/RLHF/data/__init__.py
@@ -0,0 +1,33 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from paddle.io import Dataset

from .alpaca import *
from .base import *
from .preference import *
from .prompt_only import *
from .safe_rlhf import *
from .supervised import *


class DummyDataset(Dataset):
    """Placeholder dataset of a given length whose samples are empty dicts."""

    def __init__(self, length: int) -> None:
        self.length = length

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, index: int):
        return {}
42 changes: 42 additions & 0 deletions examples/RLHF/data/alpaca.py
@@ -0,0 +1,42 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
# Copyright 2023 PKU-Alignment Team. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Stanford Alpaca dataset for supervised instruction fine-tuning."""

from __future__ import annotations

from datasets import load_dataset

from .base import RawDataset, RawSample

__all__ = ["AlpacaDataset"]


class AlpacaDataset(RawDataset):
    NAME: str = "alpaca"
    ALIASES: tuple[str, ...] = ("stanford-alpaca",)

    def __init__(self, path: str | None = None) -> None:
        self.data = load_dataset(path or "tatsu-lab/alpaca", split="train")

    def __getitem__(self, index: int) -> RawSample:
        data = self.data[index]
        input = (  # pylint: disable=redefined-builtin
            " ".join((data["instruction"], data["input"])) if data["input"] else data["instruction"]
        )
        answer = data["output"]
        return RawSample(input=input, answer=answer)

    def __len__(self) -> int:
        return len(self.data)
(The remaining changed files are not shown.)
