DrownFish19 released this on 28 Jun 03:05 · 5 commits to release/3.0-beta since this release
We are pleased to announce that the PaddlePaddle LLM suite has released v3.0.0-beta: embrace large models with a fully upgraded experience. The main work is as follows:
- Unified the LLM toolchain, enabling end-to-end integration of domestic compute chips;
- Full support for industrial-grade LLM workflows, including PaddlePaddle's 4D parallel configuration, efficient fine-tuning strategies, efficient alignment algorithms, and high-performance inference (a configuration sketch follows this list);
- Self-developed RsLoRA+ algorithm with superior convergence, the Unified Checkpoint storage mechanism with automatic scaling, and generalized support for FastFFN and FusedQKV to accelerate LLM training and inference;
- Continued support and updates for mainstream models, with efficient solutions.
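As a rough, non-authoritative sketch of how the items above combine in practice, the snippet below writes a fine-tuning configuration that pairs 4D parallel degrees with the Unified Checkpoint switch. The field names mirror PaddleNLP's Trainer arguments but are assumptions here and should be checked against this release's fine-tuning docs; such a JSON file is normally handed to the LLM fine-tuning entry script.

```python
# Minimal sketch (field names are assumptions, not taken from these notes):
# a 4D-parallel fine-tuning configuration with Unified Checkpoint enabled.
import json

finetune_config = {
    "model_name_or_path": "meta-llama/Meta-Llama-3-8B",  # any model shipped in this release
    "tensor_parallel_degree": 2,      # tensor parallel (TP) degree, assumed name
    "pipeline_parallel_degree": 2,    # pipeline parallel (PP) degree, assumed name
    "sharding_parallel_degree": 2,    # sharded data parallel group size, assumed name
    "sharding": "stage1",             # sharding stage, assumed name
    "unified_checkpoint": True,       # enable the Unified Checkpoint storage mechanism, assumed name
}

# Dump to JSON so it can be passed to the fine-tuning entry script.
with open("llama3_sft_config.json", "w") as f:
    json.dump(finetune_config, f, indent=2, ensure_ascii=False)
```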
LLM Fine-tuning, Alignment, Training, and Inference Optimizations
- PEFT:
- DPO:
- Domestic chip support:
- Performance optimizations:
- Others
  - Added model memory monitoring in #8269
New Models
- Added Gemma models in #8082
  - google/gemma-7b
  - google/gemma-7b-it
  - google/gemma-2b
  - google/gemma-2b-it
- Added Llama 3 models
  - meta-llama/Meta-Llama-3-8B
  - meta-llama/Meta-Llama-3-8B-Instruct
  - meta-llama/Meta-Llama-3-70B
  - meta-llama/Meta-Llama-3-70B-Instruct
- Added Qwen2 models in #8338 #8584 #8601 (see the loading sketch after this list)
  - Qwen/Qwen1.5-0.5B
  - Qwen/Qwen1.5-0.5B-Chat
  - Qwen/Qwen1.5-1.8B
  - Qwen/Qwen1.5-1.8B-Chat
  - Qwen/Qwen1.5-4B
  - Qwen/Qwen1.5-4B-Chat
  - Qwen/Qwen1.5-7B
  - Qwen/Qwen1.5-7B-Chat
  - Qwen/Qwen1.5-14B
  - Qwen/Qwen1.5-14B-Chat
  - Qwen/Qwen1.5-32B
  - Qwen/Qwen1.5-32B-Chat
  - Qwen/Qwen1.5-72B
  - Qwen/Qwen1.5-72B-Chat
  - Qwen/Qwen1.5-110B
  - Qwen/Qwen1.5-110B-Chat
  - Qwen/Qwen1.5-MoE-A2.7B
  - Qwen/Qwen1.5-MoE-A2.7B-Chat
  - Qwen/Qwen2-0.5B
  - Qwen/Qwen2-0.5B-Instruct
  - Qwen/Qwen2-1.5B
  - Qwen/Qwen2-1.5B-Instruct
  - Qwen/Qwen2-7B
  - Qwen/Qwen2-7B-Instruct
  - Qwen/Qwen2-72B
  - Qwen/Qwen2-72B-Instruct
  - Qwen/Qwen2-57B-A14B
  - Qwen/Qwen2-57B-A14B-Instruct
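The checkpoints listed above can presumably be loaded through PaddleNLP's Auto classes. The sketch below assumes the usual `AutoTokenizer`/`AutoModelForCausalLM` entry points, a `(ids, scores)` return from `generate`, and the `max_new_tokens` argument; exact names may vary across versions.

```python
# Minimal sketch, assuming PaddleNLP's Auto classes cover the newly added checkpoints.
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # any entry from the list above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="float16")

# Tokenize a prompt into Paddle tensors and generate a short continuation.
inputs = tokenizer("Hello, PaddleNLP 3.0!", return_tensors="pd")
output_ids, _ = model.generate(**inputs, max_new_tokens=32)  # (ids, scores) return assumed
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```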
Base Framework Upgrades
- Feature optimizations:
- AutoParallel optimizations:
- Distributed capability optimizations:
- Chat capability optimizations:
  - Added chat template in #8226 (a usage sketch follows this list)
- Others
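As an illustration of the chat template added in #8226, the sketch below assumes the tokenizer exposes an `apply_chat_template` method that accepts a plain query string plus a `tokenize` switch, and that the chosen checkpoint ships a chat template; consult this release's tokenizer documentation for the exact interface.

```python
# Minimal sketch of rendering a single-turn query with the chat template (assumed API).
from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
# tokenize=False returns the rendered prompt string instead of token ids (assumed behavior).
prompt = tokenizer.apply_chat_template("Write a one-line greeting.", tokenize=False)
print(prompt)
```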
Bug Fixes
- Fixed a bug when the sharding degree is less than 100 in #8146
- Fixed TP/PP parameter merging in #8239
- Fixed the inconsistency between tensor.shape and paddle.shape(tensor) in #8260
- Fixed a bug with fp16 + delay_scale_loss_scale + sharding_stage1_overlap in #8314
- Added pipelines usage documentation and hints in #8292 #8308 #8202 #8353
- Fixed the tokenizer input for the text feature extraction task in #8331
- Fixed import errors in #8332 #8367
Structural Adjustments
PaddleNLP file structure reorganization in #8609 #8613 #8605 #8614 #8617 #8626 #8618 #8625 #8619 #8629 #8601 #8627 #8666
What's Changed
- [dist]pip requirements-dev.txt by @Liujie0926 in #8258
- add scaling by @lugimzzz in #8256
- [LLM]Support Gemma model by @Southpika in #8082
- [BugFix] Try except sequence parallel utils by @DesmonDay in #8189
- Update CodeCov GitHub Action by @sijunhe in #8268
- [AutoParallel] Open recompute strategy for llama model by @zhangbo9674 in #8265
- Fix sharding < 100 limitation bug by @sneaxiy in #8146
- use tensor.shape but not paddle.shape(tensor) by @wanghuancoder in #8260
- [dist CI]update paddlenlp install for CI by @Liujie0926 in #8267
- [Bug Fix]Fix merge parameters in pp by @Southpika in #8239
- [LLM] add memory stats to logger of trainer by @SylarTiaNII in #8269
- Add p2p_comm_overlap for Llama-2-70b benchmark. by @Xreki in #8276
- add a100 test ground truth by @zhiqiu in #8249
- [paddle-pipelines] faq semantic search question answering readme by @w5688414 in #8292
- [paddle-pipelines] Add pipelines documentation by @w5688414 in #8308
- Support llama-3 by @ZHUI in #8307
- [Distributed] [CustomDevices] Adapt SP on lora && polish MC2 APIs by @SylarTiaNII in #8303
- fix bug for fp16 + delay_scale_loss_scale + sharding_stage1_overlap by @FeixLiu in #8314
- [paddle-pipelines] Update mkdocs by @w5688414 in #8310
- [benchmark]update llama2_ips by @Liujie0926 in #8322
- [dist CI]fix before_hook by @Liujie0926 in #8283
- benchmark llama worker=1 by @wanghuancoder in #8305
- 【AutoParallel】Add llama2 UT for auto-parallel by @heavyrain-lzy in #8300
- Add system env log for llama test by @zhangbo9674 in #8321
- [LLM] Support fuse attention q, k, v weights by @DrownFish19 in #8202
- [Distributed] fix lora by @SylarTiaNII in #8325
- fix try import by @w5688414 in #8332
- [DEV] Support sync params in tensor parallel config by @From00 in #8311
- cherry pick paddlenlp 2.8 by @w5688414 in #8323
- textfeature_queryinput by @cxa-unique in #8331
- [BugFix] Fix gpu ci by @ZHUI in #8337
- [Trainer] Fix sharding overlap bug by @DesmonDay in #8333
- [Tokenizer]Add Chat template by @Southpika in #8226
- [AutoParallel]Refine lr warm_up configuration strategy for llama by @zhangbo9674 in #8329
- Add num_hidden_layer config for llama run_pretrain by @zhangbo9674 in #8288
- [XPU] llama add xpu support by @dynamicheart in #8282
- add eliminate_transpose arg by @zhiqiu in #8339
- change llama/modeling.py to opt npu performance by @Galaxy1458 in #8342
- Update llm docs requirements by @w5688414 in #8336
- Disable eval and predict for llama-2 benchmark. by @Xreki in #8366
- update by @Galaxy1458 in #8359
- [LLM] fix lora target modules on llama by @SylarTiaNII in #8372
- [paddle-pipelines] Update offline ann by @w5688414 in #8353
- refine benchmark bert ips stat by @wanghuancoder in #8361
- [BugFix] Update truncate in distributed training by @KB-Ding in #8362
- [dist benchmark]Fix llama2 benchmark by @Liujie0926 in #8376
- Revert "update" by @ZHUI in #8389
- Fix test init by @ZHUI in #8377
- [Performance] Optimize unified checkpoint save/load speed. by @ZHUI in #8204
- [npu model bug]fix_global_bug by @Galaxy1458 in #8399
- [Bugfix] Fix fast tokenizer import error by @w5688414 in #8367
- [bugfix] fix uie by @w5688414 in #8379
- fit for llama3 for auto_parallel by @zhiqiu in #8395
- [DistDataloader] Update implementation, add nested.py by @DesmonDay in #8380
- [LLM] Fix fuse or split with same key by @DrownFish19 in #8378
- [UC] Fix compatible with npu by @ZHUI in #8409
- pre copy pinned data to gpu by @wanghuancoder in #8386
- Refine position_ids for auto parallel training of llama by @zhangbo9674 in #8363
- [Distributed] enable tensor_parallel_output for finetuning by @SylarTiaNII in #8370
- fix type promotion problem. by @zxcd in #8414
- Fix ckpt done by @gongel in #8402
- [LLM] rename logits_tensor_parallel_output to avoid conflict by @SylarTiaNII in #8419
- [Trainer] fix distdataloader by @DesmonDay in #8420
- fix safe open. by @ZHUI in #8422
- adapt new type promotion rule for Paddle 2.6 by @zxcd in #8421
- [BugFix] Fix llama3 eot_id by @ZHUI in #8371
- add npu-llama-opt0-script by @Galaxy1458 in #8401
- [LLM] add assertion for enable_stage1_overlap in lora mode by @SylarTiaNII in #8425
- [NPU]Custom fusion operator unification by @Galaxy1458 in #8431
- delete csrc/generation/reset_need_stop_value.cc by @yuanlehome in #8413
- Update llama_npu_opt_lora.sh by @Galaxy1458 in #8439
- [CI]add scripts for unittest by @Liujie0926 in #8433
- fix npu sft ckpt load bug and no FA bug by @NINGBENZHE in #8438
- Fix CI bugs by @ZHUI in #8430
- Fix/test gpu by @ZHUI in #8452
- Support fused_attention_qkv for auto_parallel llama by @zhangbo9674 in #8432
- [BugFix] Fix load rng compatibility. by @ZHUI in #8450
- update by @Galaxy1458 in #8448
- [GCU] Support llama for GCU by @EnflameGCU in #8445
- [bugfix] fix erniedoc by @w5688414 in #8393
- [benchmark]Add llama2 auto by @Liujie0926 in #8424
- Add llama2-70b for test_tipc by @zhangbo9674 in #8455
- Fix ci tests. by @ZHUI in #8471
- [NPU] support npu llama2-13B export & inference by @ronny1996 in #8442
- [LLM] fix bug when loss is None in llama modeling.py by @cqulilujia in #8459
- fix rotary_emb for llama by @EnflameGCU in #8470
- [Ops] RoPE kernel support theta input by @yinfan98 in #8440
- Support Sharding Overlap by @iosmers in #8473
- Revert "Support Sharding Overlap (#8473)" by @SylarTiaNII in #8491
- fix run_benchmark for llama2_70b in auto_parallel by @fightfat in #8484
- 【AutoParallel】Add split_backward for vpp by @heavyrain-lzy in #8479
- Quick fix from_pretrained. by @ZHUI in #8486
- Fix rng_state in llm models by @zhangyuqin1998 in #8396
- [AutoParallel] Support qwen for auto_parallel by @GhostScreaming in #8312
- modify block_multihead_attention api by @ming1753 in #8456
- [LLM] disable part of MC2 in lora by @SylarTiaNII in #8505
- Update model_utils.py by @ZHUI in #8509
- Update merge_lora_params.py by @Galaxy1458 in #8514
- [fea] moe support by @bo-ke in #8498
- Add Sharding V1 broadcast and V2 allgather overlap optimize by @iosmers in #8499
- [fix] Broadcast optimizer state using broadcast_dp without shard-resh… by @bo-ke in #8522
- Update README.md by @wawltor in #8524
- [Safetensors] Fix fast safe open slice. by @ZHUI in #8512
- Update Benchmark scripts by @iosmers in #8519
- fix eval. by @ZHUI in #8529
- [BugFix][NPU] fix llama attn_mask astype error by @tianhaodongbd in #8528
- fused_ln:Added implementation for the HIP platform by @asr-sheep1 in #8472
- [CI] Update pip source. by @ZHUI in #8540
- [PIP] Update run_ci.sh by @ZHUI in #8552
- add mteb evaluation by @cxa-unique in #8538
- [Cherry-pick] Add release grad & sharding format & decorate_exclude_layers by @ForFishes in #8545
- Add RingFlashAttention for context parallel by @zhangyuqin1998 in #8383
- fix codecov conflicts by @greycooker in #8555
- support fused weights for export_model by @ronny1996 in #8554
- 【benchmark】 add llama-7b_auto_dp2mp2pp2 benchmark script for cinn by @mmglove in #8423
- Fix memory leak bug by @sneaxiy in #8546
- Update sequence_parallel for predict by @DesmonDay in #8551
- [GPT][CE] Update modeling.py by @ZHUI in #8548
- add fuse_attention_ffn support for qwen by @deepllz in #8526
- Update generation_utils.py by @carryyu in #8502
- fix llama export by @ronny1996 in #8561
- Update llama_npu_opt_lora.sh by @Galaxy1458 in #8562
- [FIX DDP] fix ddp by @ZHUI in #8549
- [AutoParallel] Add benchmark for llama-7b-dy2st. by @GhostScreaming in #8559
- [Cherry pick] Sharding reshard function enhancement by @sneaxiy in #8544
- [BugFix] Fix test_long_sequence_strategies by @ZHUI in #8568
- Fix/ci pip by @ZHUI in #8541
- Add async save for optimizer by @ForFishes in #8557
- add llama & qwen dpo by @lugimzzz in #8474
- [LLM] support Qwen2 by @DrownFish19 in #8338
- [LLM] Fix Qwen2 by @DrownFish19 in #8584
- fix autotuner benchmark error and fix llama2 dy2st benchmark by @fightfat in #8587
- fix autotuner resume case by @Difers in #8259
- Enable test with re-try. by @ZHUI in #8590
- [xpu] add xpu custom ops support for llama2-7b by @NeroLoh in #8515
- xpu devices support llama-7b basic mode inference (turn on BlockAtten… by @zhink in #8588
- Add Pipeline Parallel for PPO training and support generation with InferenceModel by @guoshengCS in #7953
- [xpu] change xpu setup.py to paddlenlp_ops by @NeroLoh in #8595
- Clean RLHF main script by @guoshengCS in #8596
- Fix dataset with empty char. by @ZHUI in #8469
- XPU open ir pass by @zhink in #8598
- [bug fix] fix sharding stage1 allgather overlap bug, which requires forbidding pin memory by @iosmers in #8594
- Add main process print function by @ForFishes in #8604
- [Feature] Optimize config saving. by @ZHUI in #8490
- to_json_string compatibility upgrade by @sneaxiy in #8608
- [PaddleNLP 3.0] [Release] Refactor examples by @DrownFish19 in #8609
- finetune support continue_training by @tianhaodongbd in #8615
- [PaddleNLP 3.0] Refactor/3 part1- remove fast tokenizer. by @ZHUI in #8613
- Repo adjustment by @wtmlon in #8605
- [PaddleNLP 3.0] Refactor, merge examples/language_model model_zoo to legacy/model_zoo by @ZHUI in #8614
- [PaddleNLP 3.0] Refactor RLHF by @gongel in #8617
- Remove delay_scale_loss and release_grads for llama-2 13B's benchmark. by @Xreki in #8623
- [PaddleNLP 3.0] Fix dead link by @ZHUI in #8626
- Update PaddleNLP to fix PPO by @sneaxiy in #8618
- [LLM] support sparse attention for LLAMA by @GuoxiaWang in #8592
- remove fast generation by @wtmlon in #8625
- fix npu llama by @zhink in #8628
- [PaddleNLP 3.0] Refactor/3 part3, move pipelines. by @ZHUI in #8619
- [PaddleNLP 3.0] update dataset preprocess by @DrownFish19 in #8629
- [LLM] Support prefix tuning and lora for qwen2 by @DrownFish19 in #8601
- modify path of model_zoo in ci_case_auto.sh and ci_case_dy.sh by @jeff41404 in #8633
- 【benchmark】 fix model_zoo path by @mmglove in #8643
- [PaddleNLP 3.0] [LLM] change llm content by @lugimzzz in #8627
- [LLM] Add sequence_parallel support for qwen by @Difers in #8558
- [NPU][LLM] add README & reformat llama scripts by @SylarTiaNII in #8642
- align llama auto_parallel dataloader with manual_parallel by @zhiqiu in #8639
- fix fast_ln compile error by @deepllz in #8650
- Apache License by @DrownFish19 in #8658
- Fix different length for numpy>=1.24.x by @DrownFish19 in #8655
- [LLM][NPU] fix on readme by @SylarTiaNII in #8659
- [DOC] Fix dead link by @DrownFish19 in #8662
- fix benchmark dir because of PR#8627 by @fightfat in #8649
- fix llama alibi pretrain by @lugimzzz in #8668
- inference support llama3(wint8|4/a8w8) by @yuanlehome in #8630
- 【benchmark】 fix benchmark script by @mmglove in #8648
- [cpu]llama avx model inference supports by @bukejiyu in #8634
- 【AutoParallel】Change benchmark config for llama2-7b by @heavyrain-lzy in #8667
- support flashmask by @lugimzzz in #8670
- [PaddleNLP 3.0] Update README.md by @DrownFish19 in #8666
- adjust llm readme by @lugimzzz in #8672
- Update export model by @DesmonDay in #8671
- Update version by @gongel in #8675
- Sft flash mask by @wtmlon in #8664
- Update version by @gongel in #8676
New Contributors
- @Southpika made their first contribution in #8082
- @cxa-unique made their first contribution in #8331
- @dynamicheart made their first contribution in #8282
- @EnflameGCU made their first contribution in #8445
- @cqulilujia made their first contribution in #8459
- @yinfan98 made their first contribution in #8440
- @zhangyuqin1998 made their first contribution in #8396
- @ming1753 made their first contribution in #8456
- @asr-sheep1 made their first contribution in #8472
- @NeroLoh made their first contribution in #8515
- @bukejiyu made their first contribution in #8634
Full Changelog: v2.8.1...v3.0.0-beta0