DrownFish19 released this on 28 Jun 03:05 · 5 commits to release/3.0-beta since this release
We are pleased to announce that the PaddlePaddle LLM suite has released v3.0.0-beta: embrace large models with a fully upgraded experience. The main work is as follows:
- Unified the LLM toolchain, enabling end-to-end integration of domestic compute chips;
- Full support for industrial-grade LLM workflows, including PaddlePaddle's 4D parallel configuration, efficient fine-tuning strategies, efficient alignment algorithms, and high-performance inference (a configuration sketch follows this list);
- Self-developed RsLoRA+ algorithm with superior convergence, the Unified Checkpoint storage mechanism with automatic scaling, and generalized support for FastFFN and FusedQKV to accelerate LLM training and inference;
- Continued support and updates for mainstream models, with efficient solutions.
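As a rough, non-authoritative sketch of how the items above combine in practice, the snippet below writes a fine-tuning configuration that pairs 4D parallel degrees with the Unified Checkpoint switch. The field names mirror PaddleNLP's Trainer arguments but are assumptions here and should be checked against this release's fine-tuning docs; such a JSON file is normally handed to the LLM fine-tuning entry script.

```python
# Minimal sketch (field names are assumptions, not taken from these notes):
# a 4D-parallel fine-tuning configuration with Unified Checkpoint enabled.
import json

finetune_config = {
    "model_name_or_path": "meta-llama/Meta-Llama-3-8B",  # any model shipped in this release
    "tensor_parallel_degree": 2,      # tensor parallel (TP) degree, assumed name
    "pipeline_parallel_degree": 2,    # pipeline parallel (PP) degree, assumed name
    "sharding_parallel_degree": 2,    # sharded data parallel group size, assumed name
    "sharding": "stage1",             # sharding stage, assumed name
    "unified_checkpoint": True,       # enable the Unified Checkpoint storage mechanism, assumed name
}

# Dump to JSON so it can be passed to the fine-tuning entry script.
with open("llama3_sft_config.json", "w") as f:
    json.dump(finetune_config, f, indent=2, ensure_ascii=False)
```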
LLM Fine-tuning, Alignment, Training, and Inference Optimizations
- PEFT:
- DPO:
- Domestic chip support:
- Performance optimizations:
- Others
  - Added model memory monitoring in #8269
New Models
- Added Gemma models in #8082
  - google/gemma-7b
  - google/gemma-7b-it
  - google/gemma-2b
  - google/gemma-2b-it
- Added Llama 3 models
  - meta-llama/Meta-Llama-3-8B
  - meta-llama/Meta-Llama-3-8B-Instruct
  - meta-llama/Meta-Llama-3-70B
  - meta-llama/Meta-Llama-3-70B-Instruct
- Added Qwen2 models in #8338 #8584 #8601 (see the loading sketch after this list)
  - Qwen/Qwen1.5-0.5B
  - Qwen/Qwen1.5-0.5B-Chat
  - Qwen/Qwen1.5-1.8B
  - Qwen/Qwen1.5-1.8B-Chat
  - Qwen/Qwen1.5-4B
  - Qwen/Qwen1.5-4B-Chat
  - Qwen/Qwen1.5-7B
  - Qwen/Qwen1.5-7B-Chat
  - Qwen/Qwen1.5-14B
  - Qwen/Qwen1.5-14B-Chat
  - Qwen/Qwen1.5-32B
  - Qwen/Qwen1.5-32B-Chat
  - Qwen/Qwen1.5-72B
  - Qwen/Qwen1.5-72B-Chat
  - Qwen/Qwen1.5-110B
  - Qwen/Qwen1.5-110B-Chat
  - Qwen/Qwen1.5-MoE-A2.7B
  - Qwen/Qwen1.5-MoE-A2.7B-Chat
  - Qwen/Qwen2-0.5B
  - Qwen/Qwen2-0.5B-Instruct
  - Qwen/Qwen2-1.5B
  - Qwen/Qwen2-1.5B-Instruct
  - Qwen/Qwen2-7B
  - Qwen/Qwen2-7B-Instruct
  - Qwen/Qwen2-72B
  - Qwen/Qwen2-72B-Instruct
  - Qwen/Qwen2-57B-A14B
  - Qwen/Qwen2-57B-A14B-Instruct
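The checkpoints listed above can presumably be loaded through PaddleNLP's Auto classes. The sketch below assumes the usual `AutoTokenizer`/`AutoModelForCausalLM` entry points, a `(ids, scores)` return from `generate`, and the `max_new_tokens` argument; exact names may vary across versions.

```python
# Minimal sketch, assuming PaddleNLP's Auto classes cover the newly added checkpoints.
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # any entry from the list above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="float16")

# Tokenize a prompt into Paddle tensors and generate a short continuation.
inputs = tokenizer("Hello, PaddleNLP 3.0!", return_tensors="pd")
output_ids, _ = model.generate(**inputs, max_new_tokens=32)  # (ids, scores) return assumed
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```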
Base Framework Upgrades
- Feature optimizations:
- AutoParallel optimizations:
- Distributed capability optimizations:
- Chat capability optimizations:
  - Added chat template in #8226 (a usage sketch follows this list)
- Others
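As an illustration of the chat template added in #8226, the sketch below assumes the tokenizer exposes an `apply_chat_template` method that accepts a plain query string plus a `tokenize` switch, and that the chosen checkpoint ships a chat template; consult this release's tokenizer documentation for the exact interface.

```python
# Minimal sketch of rendering a single-turn query with the chat template (assumed API).
from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
# tokenize=False returns the rendered prompt string instead of token ids (assumed behavior).
prompt = tokenizer.apply_chat_template("Write a one-line greeting.", tokenize=False)
print(prompt)
```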
Bug Fixes
- Fixed a bug when the sharding degree is less than 100 in #8146
- Fixed TP/PP parameter merging in #8239
- Fixed the inconsistency between tensor.shape and paddle.shape(tensor) in #8260
- Fixed a bug with fp16 + delay_scale_loss_scale + sharding_stage1_overlap in #8314
- Added pipelines usage documentation and hints in #8292 #8308 #8202 #8353
- Fixed the tokenizer input for the text feature extraction task in #8331
- Fixed import errors in #8332 #8367
Structural Adjustments
PaddleNLP file structure reorganization in #8609 #8613 #8605 #8614 #8617 #8626 #8618 #8625 #8619 #8629 #8601 #8627 #8666
What's Changed
- [dist]pip requirements-dev.txt by @Liujie0926 in #8258
- add scaling by @lugimzzz in #8256
- [LLM]Support Gemma model by @Southpika in #8082
- [BugFix] Try except sequence parallel utils by @DesmonDay in #8189
- Update CodeCov GitHub Action by @sijunhe in #8268
- [AutoParallel] Open recompute strategy for llama model by @zhangbo9674 in #8265
- Fix sharding < 100 limitation bug by @sneaxiy in #8146
- use tensor.shape but not paddle.shape(tensor) by @wanghuancoder in #8260
- [dist CI]update paddlenlp install for CI by @Liujie0926 in #8267
- [Bug Fix]Fix merge parameters in pp by @Southpika in #8239
- [LLM] add memory stats to logger of trainer by @SylarTiaNII in #8269
- Add p2p_comm_overlap for Llama-2-70b benchmark. by @Xreki in #8276
- add a100 test ground truth by @zhiqiu in #8249
- [paddle-pipelines] faq semantic search question answering readme by @w5688414 in #8292
- [paddle-pipelines] Add pipelines documentation by @w5688414 in #8308
- Support llama-3 by @ZHUI in #8307
- [Distributed] [CustomDevices] Adapt SP on lora && polish MC2 APIs by @SylarTiaNII in #8303
- fix bug for fp16 + delay_scale_loss_scale + sharding_stage1_overlap by @FeixLiu in #8314
- [paddle-pipelines] Update mkdocs by @w5688414 in #8310
- [benchmark]update llama2_ips by @Liujie0926 in #8322
- [dist CI]fix before_hook by @Liujie0926 in #8283
- benchmark llama worker=1 by @wanghuancoder in #8305
- 【AutoParallel】Add llama2 UT for auto-parallel by @heavyrain-lzy in #8300
- Add system env log for llama test by @zhangbo9674 in #8321
- [LLM] Support fuse attention q, k, v weights by @DrownFish19 in #8202
- [Distributed] fix lora by @SylarTiaNII in #8325
- fix try import by @w5688414 in #8332
- [DEV] Support sync params in tensor parallel config by @From00 in #8311
- cherry pick paddlenlp 2.8 by @w5688414 in #8323
- textfeature_queryinput by @cxa-unique in #8331
- [BugFix] Fix gpu ci by @ZHUI in #8337
- [Trainer] Fix sharding overlap bug by @DesmonDay in #8333
- [Tokenizer]Add Chat template by @Southpika in #8226
- [AutoParallel]Refine lr warm_up configuration strategy for llama by @zhangbo9674 in #8329
- Add num_hidden_layer config for llama run_pretrain by @zhangbo9674 in #8288
- [XPU] llama add xpu support by @dynamicheart in #8282
- add eliminate_transpose arg by @zhiqiu in #8339
- change llama/modeling.py to opt npu performance by @Galaxy1458 in #8342
- Update llm docs requirements by @w5688414 in #8336
- Disable eval and predict for llama-2 benchmark. by @Xreki in #8366
- update by @Galaxy1458 in #8359
- [LLM] fix lora target modules on llama by @SylarTiaNII in #8372
- [paddle-pipelines] Update offline ann by @w5688414 in #8353
- refine benchmark bert ips stat by @wanghuancoder in #8361
- [BugFix] Update truncate in distributed training by @KB-Ding in #8362
- [dist benchmark]Fix llama2 benchmark by @Liujie0926 in #8376
- Revert "update" by @ZHUI in #8389
- Fix test init by @ZHUI in #8377
- [Performance] Optimize unified checkpoint save/load speed. by @ZHUI in #8204
- [npu model bug]fix_global_bug by @Galaxy1458 in #8399
- [Bugfix] Fix fast tokenizer import error by @w5688414 in #8367
- [bugfix] fix uie by @w5688414 in #8379
- fit for llama3 for auto_parallel by @zhiqiu in #8395
- [DistDataloader] Update implementation, add nested.py by @DesmonDay in #8380
- [LLM] Fix fuse or split with same key by @DrownFish19 in #8378
- [UC] Fix compatible with npu by @ZHUI in #8409
- pre copy pinned data to gpu by @wanghuancoder in #8386
- Refine position_ids for auto parallel training of llama by @zhangbo9674 in #8363
- [Distributed] enable tensor_parallel_output for finetuning by @SylarTiaNII in #8370
- fix type promotion problem. by @zxcd in #8414
- Fix ckpt done by @gongel in #8402
- [LLM] rename logits_tensor_parallel_output to avoid conflict by @SylarTiaNII in #8419
- [Trainer] fix distdataloader by @DesmonDay in #8420
- fix safe open. by @ZHUI in #8422
- adapt new type promotion rule for Paddle 2.6 by @zxcd in #8421
- [BugFix] Fix llama3 eot_id by @ZHUI in #8371
- add npu-llama-opt0-script by @Galaxy1458 in #8401
- [LLM] add assertion for enable_stage1_overlap in lora mode by @SylarTiaNII in #8425
- [NPU]Custom fusion operator unification by @Galaxy1458 in #8431
- delete csrc/generation/reset_need_stop_value.cc by @yuanlehome in #8413
- Update llama_npu_opt_lora.sh by @Galaxy1458 in #8439
- [CI]add scripts for unittest by @Liujie0926 in #8433
- fix npu sft ckpt load bug and no FA bug by @NINGBENZHE in #8438
- Fix CI bugs by @ZHUI in #8430
- Fix/test gpu by @ZHUI in #8452
- Support fused_attention_qkv for auto_parallel llama by @zhangbo9674 in #8432
- [BugFix] Fix load rng compatibility. by @ZHUI in #8450
- update by @Galaxy1458 in #8448
- [GCU] Support llama for GCU by @EnflameGCU in #8445
- [bugfix] fix erniedoc by @w5688414 in #8393
- [benchmark]Add llama2 auto by @Liujie0926 in #8424
- Add llama2-70b for test_tipc by @zhangbo9674 in #8455
- Fix ci tests. by @ZHUI in #8471
- [NPU] support npu llama2-13B export & inference by @ronny1996 in #8442
- [LLM] fix bug when loss is None in llama modeling.py by @cqulilujia in #8459
- fix rotary_emb for llama by @EnflameGCU in #8470
- [Ops] RoPE kernel support theta input by @yinfan98 in #8440
- Support Sharding Overlap by @iosmers in #8473
- Revert "Support Sharding Overlap (#8473)" by @SylarTiaNII in #8491
- fix run_benchmark for llama2_70b in auto_parallel by @fightfat in #8484
- 【AutoParallel】Add split_backward for vpp by @heavyrain-lzy in #8479
- Quick fix from_pretrained. by @ZHUI in #8486
- Fix rng_state in llm models by @zhangyuqin1998 in #8396
- [AutoParallel] Support qwen for auto_parallel by @GhostScreaming in #8312
- modify block_multihead_attention api by @ming1753 in #8456
- [LLM] disable part of MC2 in lora by @SylarTiaNII in #8505
- Update model_utils.py by @ZHUI in #8509
- Update merge_lora_params.py by @Galaxy1458 in #8514
- [fea] moe support by @bo-ke in #8498
- Add Sharding V1 broadcast and V2 allgather overlap optimize by @iosmers in #8499
- [fix] Broadcast optimizer state using broadcast_dp without shard-resh… by @bo-ke in #8522
- Update README.md by @wawltor in #8524
- [Safetensors] Fix fast safe open slice. by @ZHUI in #8512
- Update Benchmark scripts by @iosmers in #8519
- fix eval. by @ZHUI in #8529
- [BugFix][NPU] fix llama attn_mask astype error by @tianhaodongbd in #8528
- fused_ln:Added implementation for the HIP platform by @asr-sheep1 in #8472
- [CI] Update pip source. by @ZHUI in #8540
- [PIP] Update run_ci.sh by @ZHUI in #8552
- add mteb evaluation by @cxa-unique in #8538
- [Cherry-pick] Add release grad & sharding format & decorate_exclude_layers by @ForFishes in #8545
- Add RingFlashAttention for context parallel by @zhangyuqin1998 in #8383
- fix codecov conflicts by @greycooker in #8555
- support fused weights for export_model by @ronny1996 in #8554
- 【benchmark】 add llama-7b_auto_dp2mp2pp2 benchmark script for cinn by @mmglove in #8423
- Fix memory leak bug by @sneaxiy in #8546
- Update sequence_parallel for predict by @DesmonDay in #8551
- [GPT][CE] Update modeling.py by @ZHUI in #8548
- add fuse_attention_ffn support for qwen by @deepllz in #8526
- Update generation_utils.py by @carryyu in #8502
- fix llama export by @ronny1996 in #8561
- Update llama_npu_opt_lora.sh by @Galaxy1458 in #8562
- [FIX DDP] fix ddp by @ZHUI in #8549
- [AutoParallel] Add benchmark for llama-7b-dy2st. by @GhostScreaming in #8559
- [Cherry pick] Sharding reshard function enhancement by @sneaxiy in #8544
- [BugFix] Fix test_long_sequence_strategies by @ZHUI in #8568
- Fix/ci pip by @ZHUI in #8541
- Add async save for optimizer by @ForFishes in #8557
- add llama & qwen dpo by @lugimzzz in #8474
- [LLM] support Qwen2 by @DrownFish19 in #8338
- [LLM] Fix Qwen2 by @DrownFish19 in #8584
- fix autotuner benchmark error and fix llama2 dy2st benchmark by @fightfat in #8587
- fix autotuner resume case by @Difers in #8259
- Enable test with re-try. by @ZHUI in #8590
- [xpu] add xpu custom ops support for llama2-7b by @NeroLoh in #8515
- xpu devices support llama-7b basic mode inference (turn on BlockAtten… by @zhink in #8588
- Add Pipeline Parallel for PPO training and support generation with InferenceModel by @guoshengCS in #7953
- [xpu] change xpu setup.py to paddlenlp_ops by @NeroLoh in #8595
- Clean RLHF main script by @guoshengCS in #8596
- Fix dataset with empty char. by @ZHUI in #8469
- XPU open ir pass by @zhink in #8598
- [bug fix] fix sharding stage1 allgather overlap bug, which requires forbidding pin memory by @iosmers in #8594
- Add main process print function by @ForFishes in #8604
- [Feature] Optimize config saving. by @ZHUI in #8490
- to_json_string compatibility upgrade by @sneaxiy in #8608
- [PaddleNLP 3.0] [Release] Refactor examples by @DrownFish19 in #8609
- finetune support continue_training by @tianhaodongbd in #8615
- [PaddleNLP 3.0] Refactor/3 part1- remove fast tokenizer. by @ZHUI in #8613
- Repo adjustment by @wtmlon in #8605
- [PaddleNLP 3.0] Refactor, merge examples/language_model model_zoo to legacy/model_zoo by @ZHUI in #8614
- [PaddleNLP 3.0] Refactor RLHF by @gongel in #8617
- Remove delay_scale_loss and release_grads for llama-2 13B's benchmark. by @Xreki in #8623
- [PaddleNLP 3.0] Fix dead link by @ZHUI in #8626
- Update PaddleNLP to fix PPO by @sneaxiy in #8618
- [LLM] support sparse attention for LLAMA by @GuoxiaWang in #8592
- remove fast generation by @wtmlon in #8625
- fix npu llama by @zhink in #8628
- [PaddleNLP 3.0] Refactor/3 part3, move pipelines. by @ZHUI in #8619
- [PaddleNLP 3.0] update dataset preprocess by @DrownFish19 in #8629
- [LLM] Support prefix tuning and lora for qwen2 by @DrownFish19 in #8601
- modify path of model_zoo in ci_case_auto.sh and ci_case_dy.sh by @jeff41404 in #8633
- 【benchmark】 fix model_zoo path by @mmglove in #8643
- [PaddleNLP 3.0] [LLM] change llm content by @lugimzzz in #8627
- [LLM] Add sequence_parallel support for qwen by @Difers in #8558
- [NPU][LLM] add README & reformat llama scripts by @SylarTiaNII in #8642
- align llama auto_parallel dataloader with manual_parallel by @zhiqiu in #8639
- fix fast_ln compile error by @deepllz in #8650
- Apache License by @DrownFish19 in #8658
- Fix different length for numpy>=1.24.x by @DrownFish19 in #8655
- [LLM][NPU] fix on readme by @SylarTiaNII in #8659
- [DOC] Fix dead link by @DrownFish19 in #8662
- fix benchmark dir because of PR#8627 by @fightfat in #8649
- fix llama alibi pretrain by @lugimzzz in #8668
- inference support llama3(wint8|4/a8w8) by @yuanlehome in #8630
- 【benchmark】 fix benchmark script by @mmglove in #8648
- [cpu]llama avx model inference supports by @bukejiyu in #8634
- 【AutoParallel】Change benchmark config for llama2-7b by @heavyrain-lzy in #8667
- support flashmask by @lugimzzz in #8670
- [PaddleNLP 3.0] Update README.md by @DrownFish19 in #8666
- adjust llm readme by @lugimzzz in #8672
- Update export model by @DesmonDay in #8671
- Update version by @gongel in #8675
- Sft flash mask by @wtmlon in #8664
- Update version by @gongel in #8676
New Contributors
- @Southpika made their first contribution in #8082
- @cxa-unique made their first contribution in #8331
- @dynamicheart made their first contribution in #8282
- @EnflameGCU made their first contribution in #8445
- @cqulilujia made their first contribution in #8459
- @yinfan98 made their first contribution in #8440
- @zhangyuqin1998 made their first contribution in #8396
- @ming1753 made their first contribution in #8456
- @asr-sheep1 made their first contribution in #8472
- @NeroLoh made their first contribution in #8515
- @bukejiyu made their first contribution in #8634
Full Changelog: v2.8.1...v3.0.0-beta0