Merge branch 'develop' into support_chatglm_dybatch_v1
xiaoxiaohehe001 committed Aug 26, 2023
2 parents 55d1f6d + 1a69081 commit 2ffa396
Showing 99 changed files with 4,625 additions and 1,322 deletions.
18 changes: 10 additions & 8 deletions llm/README.md
@@ -10,6 +10,7 @@
| [GPT-3](./gpt-3) | ✅ | ✅ | ✅ | WIP | ✅ | WIP |
| [OPT](./opt) | WIP | ✅ | ✅ | WIP | ✅ | WIP |
| [GLM](./glm) | N/A | ✅ | ✅ | WIP | ✅ | WIP |
+| [Qwen](./qwen) | N/A | ✅ | ✅ | ✅ | ✅ | WIP |


# Introduction to the LLM End-to-End Toolchain
@@ -29,12 +30,13 @@

- PaddlePaddle >= 2.5.1
- PaddleNLP >= 2.6.0
+- tiktoken (required only for Qwen)

## 2. Pre-training
The [LLaMA v1/v2](./llama) and [GPT-3](./gpt-3) directories provide the data preparation and training details for model pre-training; support for pre-training more models will follow.

## 3. Fine-tuning
-At present, the unified fine-tuning script only supports [LLaMA v1/v2](./llama), [ChatGLM-6B](./chatglm), [ChatGLM2-6B](./chatglm2), [Bloom](./bloom), and [OPT](./opt); for other models, see the corresponding model directory for fine-tuning details. Below, we use **Llama 2** as an example of running SFT, LoRA, and Prefix Tuning with the unified script. For more on LoRA and Prefix Tuning, see the [PEFT documentation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/peft.md).
+At present, the unified fine-tuning script only supports [LLaMA v1/v2](./llama), [ChatGLM-6B](./chatglm), [ChatGLM2-6B](./chatglm2), [Bloom](./bloom), [OPT](./opt), and [Qwen](./qwen); for other models, see the corresponding model directory for fine-tuning details. Below, we use **Llama 2** as an example of running SFT, LoRA, and Prefix Tuning with the unified script. For more on LoRA and Prefix Tuning, see the [PEFT documentation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/peft.md).

### 3.1 Fine-tuning data format

@@ -122,15 +124,15 @@ python -u -m paddle.distributed.launch --gpus "0,1" finetune_generation.py ./

<details><summary>&emsp; Data arguments (DataArgument) </summary><div>


- `dataset_name_or_path`: Local dataset directory or built-in dataset name. Defaults to None.
- `task_name`: Selects a specific task from a built-in dataset. Defaults to None.
-- `src_length`: Maximum length of the model input context. Defaults to 1024.
-- `tgt_length`: Maximum length of the generated text. Defaults to 1024.
- `eval_with_do_generation`: Whether to call model.generate during evaluation. Defaults to False. When set to True, the metrics are BLEU4/Rouge, and it is recommended to set `metric_for_best_model` to bleu4; when set to False, the metrics are ppl and accuracy.
- `save_generation_output`: Whether to save the generated text to `generated_output.json` when `eval_with_do_generation` is True. Defaults to False.
-- `intokens`: Whether to use the InTokens data stream (reduces redundant padding computation and greatly improves effective-token throughput). Defaults to False. When `eval_with_do_generation` is True, evaluation does not support the InTokens data stream.
-- `intokens_max_length`: Maximum training length of the InTokens data stream. Defaults to 2048.
+- `intokens`: Whether to use the InTokens data stream (reduces redundant padding computation and greatly improves effective-token throughput). Defaults to False. When `eval_with_do_generation` is True, evaluation does not support the InTokens data stream.
+- `src_length`: Maximum number of input-context tokens. Defaults to 1024.
+- `max_length`: Maximum number of model input tokens (context plus generated content). Defaults to 2048. When `intokens` is True, this is also the maximum training length of the InTokens data stream; it is usually recommended to set it to the model's maximum allowed input length, set `per_device_train_batch_size` to 1, and control the batch size via `gradient_accumulation_steps` (see the config sketch after this list).
+- `lazy`: If False, use `MapDataset`; if True, use `IterDataset`. Defaults to False. True is recommended for large datasets, since `IterDataset` avoids reading all data into memory at once; note that `max_steps` must then be set, and `evaluation_strategy` and `save_strategy` must be set to `steps`.

</div></details>
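For orientation, a minimal fine-tuning configuration using the renamed arguments might look like the sketch below (model name, paths, and values are illustrative, not prescriptive):

```json
{
    "model_name_or_path": "meta-llama/Llama-2-7b",
    "dataset_name_or_path": "./data",
    "output_dir": "./checkpoints/llama_sft_ckpts",
    "src_length": 1024,
    "max_length": 2048,
    "intokens": true,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "lazy": false,
    "do_train": true,
    "do_eval": true
}
```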


@@ -295,8 +297,8 @@ python predictor.py \

- `model_name_or_path`: Required. Name of a pre-trained model or path to a local model, used to warm-start the model and tokenizer. Defaults to None.
- `batch_size`: Batch size. Defaults to 8. Larger values use more GPU memory; smaller values use less.
-- `src_length`: Maximum length of the model input context. Defaults to 1024.
-- `max_length`: Maximum model input length during inference, i.e. the longest possible generation is `max_length-len(input_ids)`. Defaults to 2048.
+- `src_length`: Maximum number of input-context tokens. Defaults to 1024.
+- `max_length`: Maximum number of model input tokens (context plus generated content). Defaults to 2048 (see the usage sketch after this list).
- `lora_path`: Path to LoRA parameters and configuration, used to initialize the LoRA weights. Defaults to None.
- `prefix_path`: Path to Prefix Tuning parameters and configuration, used to initialize the Prefix Tuning weights. Defaults to None.
- `top_k`: The number of highest-probability tokens kept for top-k filtering in the "sampling" strategy. Defaults to 1, which is equivalent to greedy decoding.
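As a usage sketch assembled from the flags above (the model name is illustrative):

```shell
python predictor.py \
    --model_name_or_path meta-llama/Llama-2-7b \
    --batch_size 4 \
    --src_length 1024 \
    --max_length 2048 \
    --top_k 1
```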
19 changes: 13 additions & 6 deletions llm/argument.py
@@ -18,17 +18,24 @@
class DataArgument:
    dataset_name_or_path: str = field(default=None, metadata={"help": "Name or path for dataset"})
    task_name: str = field(default=None, metadata={"help": "Additional name to select a more specific task."})
-   src_length: int = field(default=1024, metadata={"help": "The max length of source text."})
-   tgt_length: int = field(default=1024, metadata={"help": "The max length of target text."})
-   intokens: bool = field(default=False, metadata={"help": "Whether to use InTokens data stream"})
+   src_length: int = field(default=1024, metadata={"help": "The maximum length of source(context) tokens."})
+   max_length: int = field(
+       default=2048,
+       metadata={
+           "help": "The maximum length that model input tokens can have. When intokens is set to True, it's also the maximum length for InTokens data stream"
+       },
+   )
    eval_with_do_generation: bool = field(default=False, metadata={"help": "Whether to do generation for evaluation"})
    save_generation_output: bool = field(
        default=False,
        metadata={"help": "Whether to save generated text to file when eval_with_do_generation set to True."},
    )
+   intokens: bool = field(default=False, metadata={"help": "Whether to use InTokens data stream"})
-   intokens_max_length: int = field(
-       default=2048,
-       metadata={"help": "The max length for InTokens data stream. Only effective when intokens is True"},
+   lazy: bool = field(
+       default=False,
+       metadata={
+           "help": "Whether to return `MapDataset` or an `IterDataset`. True for `IterDataset`. False for `MapDataset`."
+       },
    )


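A quick sketch of how this dataclass is typically consumed — assuming PaddleNLP's `PdArgumentParser` and a JSON config, as the llm scripts do; the config path is illustrative and the file must contain only `DataArgument` fields:

```python
from paddlenlp.trainer import PdArgumentParser

from argument import DataArgument  # llm/argument.py, shown above

# Parse a JSON config into the dataclass, mirroring how
# llm/finetune_generation.py loads its arguments.
parser = PdArgumentParser((DataArgument,))
(data_args,) = parser.parse_json_file("data_args.json")

# src_length caps the prompt; max_length caps prompt + generated tokens.
assert data_args.src_length <= data_args.max_length
print(data_args.src_length, data_args.max_length, data_args.lazy)
```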
1 change: 1 addition & 0 deletions llm/bloom/README.md
@@ -19,3 +19,4 @@ BLOOM is an autoregressive large language model (LLM) trained on large amounts of text data
| bigscience/bloomz-7b1-mt |
| bigscience/bloomz-7b1-p3 |
| bigscience/bloomz-7b1 |
+| bellegroup/belle-7b-2m |
2 changes: 1 addition & 1 deletion llm/bloom/gptq_argument.json
@@ -4,7 +4,7 @@
    "per_device_eval_batch_size": 8,
    "eval_accumulation_steps":16,
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "dataset_name_or_path": "./data",
2 changes: 1 addition & 1 deletion llm/bloom/lora_argument.json
@@ -13,7 +13,7 @@
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
2 changes: 1 addition & 1 deletion llm/bloom/pt_argument.json
@@ -13,7 +13,7 @@
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
2 changes: 1 addition & 1 deletion llm/bloom/ptq_argument.json
@@ -4,7 +4,7 @@
    "per_device_eval_batch_size": 8,
    "eval_accumulation_steps":16,
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "dataset_name_or_path": "./data",
2 changes: 1 addition & 1 deletion llm/bloom/sft_argument.json
@@ -13,7 +13,7 @@
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
2 changes: 1 addition & 1 deletion llm/chatglm/gptq_argument.json
@@ -4,7 +4,7 @@
    "per_device_eval_batch_size": 8,
    "eval_accumulation_steps":16,
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "dataset_name_or_path": "./data",
2 changes: 1 addition & 1 deletion llm/chatglm/lora_argument.json
@@ -13,7 +13,7 @@
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
2 changes: 1 addition & 1 deletion llm/chatglm/pt_argument.json
@@ -13,7 +13,7 @@
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
2 changes: 1 addition & 1 deletion llm/chatglm/ptq_argument.json
@@ -4,7 +4,7 @@
    "per_device_eval_batch_size": 8,
    "eval_accumulation_steps":16,
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "dataset_name_or_path": "./data",
2 changes: 1 addition & 1 deletion llm/chatglm/sft_argument.json
@@ -13,7 +13,7 @@
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
2 changes: 1 addition & 1 deletion llm/chatglm2/gptq_argument.json
@@ -4,7 +4,7 @@
    "per_device_eval_batch_size": 8,
    "eval_accumulation_steps":16,
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "dataset_name_or_path": "./data",
2 changes: 1 addition & 1 deletion llm/chatglm2/lora_argument.json
@@ -13,7 +13,7 @@
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
2 changes: 1 addition & 1 deletion llm/chatglm2/pt_argument.json
@@ -13,7 +13,7 @@
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
2 changes: 1 addition & 1 deletion llm/chatglm2/ptq_argument.json
@@ -4,7 +4,7 @@
    "per_device_eval_batch_size": 8,
    "eval_accumulation_steps":16,
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "dataset_name_or_path": "./data",
2 changes: 1 addition & 1 deletion llm/chatglm2/sft_argument.json
@@ -13,7 +13,7 @@
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
-   "tgt_length": 1024,
+   "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
5 changes: 3 additions & 2 deletions llm/data.py
@@ -59,9 +59,10 @@ def tokenize_example(tokenizer, example, data_args):
        truncation_side="left",
        add_special_tokens=True,
    )
+   tgt_max_length = data_args.max_length - len(tokenized_source["input_ids"])
    tokenized_target = tokenizer(
        target,
-       max_length=data_args.tgt_length,
+       max_length=tgt_max_length,
        truncation=True,
        truncation_side="right",
        add_special_tokens=False,
@@ -70,7 +71,7 @@ def tokenize_example(tokenizer, example, data_args):
    tokenized_target_input_ids = tokenized_target["input_ids"]
    # Add eos_token_id at the end of sequence if the sentence is not truncated.
    # Attention! In some cases(ex. ChatGLMv2), tokenized eos_token is not equal to eos_token_id.
-   if len(tokenized_target_input_ids) < data_args.tgt_length:
+   if len(tokenized_target_input_ids) < tgt_max_length:
        tokenized_target_input_ids += [tokenizer.eos_token_id]

    return tokenized_source, tokenized_target_input_ids
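The effect of this change: the target budget now shrinks as the source grows, so the combined sequence can never exceed `max_length`. A standalone sketch of the arithmetic, with illustrative numbers in place of a real tokenizer:

```python
max_length = 2048       # data_args.max_length (prompt + response budget)
src_token_count = 1500  # len(tokenized_source["input_ids"]) after left-truncation

# Budget left for the response; under the old scheme the target was capped
# by an independent tgt_length, so source + target could overflow the model.
tgt_max_length = max_length - src_token_count
print(tgt_max_length)  # 548
```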
40 changes: 23 additions & 17 deletions llm/finetune_generation.py
@@ -101,9 +101,9 @@ def main():
    if hasattr(model_config, "use_flash_attention"):
        model_config.use_flash_attention = model_args.use_flash_attention
    if hasattr(model_config, "max_position_embeddings"):
-       if model_config.max_position_embeddings < data_args.src_length + data_args.tgt_length:
+       if model_config.max_position_embeddings < data_args.max_length:
            raise ValueError(
-               f"The src_length + tgt_length ({data_args.src_length + data_args.tgt_length}) must be smaller than max_position_embeddings({model_config.max_position_embeddings})."
+               f"The max_length ({data_args.max_length}) must be smaller than max_position_embeddings({model_config.max_position_embeddings})."
            )
    model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
@@ -113,18 +113,18 @@ def main():
    # Load tokenizer & dataset
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
    if isinstance(tokenizer, LlamaTokenizer):
-       tokenizer.pad_token = tokenizer.eos_token if tokenizer.eos_token else "<pad>"
+       tokenizer.pad_token_id = tokenizer.eos_token_id

    if data_args.dataset_name_or_path is None:
        raise ValueError(f"Please specify dataset name or path (got {data_args.dataset_name_or_path})")
    elif os.path.exists(os.path.join(data_args.dataset_name_or_path, "train.json")) and os.path.exists(
        os.path.join(data_args.dataset_name_or_path, "dev.json")
    ):
        train_ds = load_dataset(
-           read_local_dataset, path=os.path.join(data_args.dataset_name_or_path, "train.json"), lazy=False
+           read_local_dataset, path=os.path.join(data_args.dataset_name_or_path, "train.json"), lazy=data_args.lazy
        )
        dev_ds = load_dataset(
-           read_local_dataset, path=os.path.join(data_args.dataset_name_or_path, "dev.json"), lazy=False
+           read_local_dataset, path=os.path.join(data_args.dataset_name_or_path, "dev.json"), lazy=data_args.lazy
        )
    else:
        if data_args.task_name is not None:
@@ -140,7 +140,7 @@ def main():
    else:
        trans_func = partial(get_convert_example(model), tokenizer=tokenizer, data_args=data_args)
    if data_args.intokens:
-       if model.base_model_prefix not in ["llama", "bloom", "chatglm"]:
+       if model.base_model_prefix not in ["llama", "bloom", "chatglm"] and training_args.pipeline_parallel_degree < 1:
            raise NotImplementedError("InTokens data stream is only implemented for LLaMA, Bloom and ChatGLM so far.")
    train_ds = train_ds.map(partial(trans_func, is_test=False, intokens=data_args.intokens))
    eval_intokens = data_args.intokens
@@ -151,19 +151,27 @@ def main():
            eval_intokens = False
    dev_ds = dev_ds.map(partial(trans_func, is_test=data_args.eval_with_do_generation, intokens=eval_intokens))
    if data_args.intokens:
+       if data_args.lazy:
+           from paddlenlp.datasets import InTokensIterableDataset
+
+           intoken_dataset = InTokensIterableDataset
+       else:
+           from paddlenlp.datasets import InTokensMapDataset
+
+           intoken_dataset = InTokensMapDataset
-       from paddlenlp.datasets import InTokensMapDataset

        logger.info("Creating InTokens Data Stream. This may take a few minutes.")
-       train_ds = InTokensMapDataset(
+       train_ds = intoken_dataset(
            train_ds,
            tokenizer=tokenizer,
-           max_length=data_args.intokens_max_length,
+           max_length=data_args.max_length,
        )
        if eval_intokens:
-           dev_ds = InTokensMapDataset(
+           dev_ds = intoken_dataset(
                dev_ds,
                tokenizer=tokenizer,
-               max_length=data_args.intokens_max_length,
+               max_length=data_args.max_length,
            )

    if model_args.prefix_tuning:
@@ -232,6 +240,8 @@ def compute_metrics_do_generation(eval_preds):
        }

    # Create trainer
+   max_length = data_args.max_length if training_args.pipeline_parallel_degree > 1 else None
+   padding = "max_length" if training_args.pipeline_parallel_degree > 1 else True
    trainer = CausalLMTrainer(
        model=model,
        args=training_args,
@@ -241,13 +251,9 @@ def compute_metrics_do_generation(eval_preds):
        compute_metrics=compute_metrics_do_generation if data_args.eval_with_do_generation else compute_metrics,
        data_collator=DataCollatorForSeq2Seq(
            tokenizer=tokenizer,
-           max_length=data_args.src_length + data_args.tgt_length
-           if training_args.pipeline_parallel_degree > 1
-           else -1,
-           padding="max_length" if training_args.pipeline_parallel_degree > 1 else True,
-           max_label_length=data_args.src_length + data_args.tgt_length
-           if training_args.pipeline_parallel_degree > 1
-           else None,
+           max_length=max_length,
+           padding=padding,
+           max_label_length=max_length,
            return_tensors="np",
        ),
        do_generation=data_args.eval_with_do_generation,
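One way to read the new collator wiring: pipeline-parallel stages need static tensor shapes, so every batch is padded to a fixed `max_length`; otherwise batches are padded dynamically to their longest sequence. A standalone sketch of that decision (not PaddleNLP API):

```python
def collator_settings(pipeline_parallel_degree: int, max_length: int):
    """Mirror the padding decision made before building DataCollatorForSeq2Seq."""
    if pipeline_parallel_degree > 1:
        # Static shapes: pad everything to the same fixed length.
        return {"max_length": max_length, "padding": "max_length"}
    # Dynamic shapes: pad each batch to its longest sequence.
    return {"max_length": None, "padding": True}

print(collator_settings(4, 2048))  # {'max_length': 2048, 'padding': 'max_length'}
print(collator_settings(1, 2048))  # {'max_length': None, 'padding': True}
```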