Releases: PaddlePaddle/PaddleNLP
v3.0.0-beta1
What's Changed
- [DCU] high performance LLM train and inference for DCU by @yuguo-Jack in #8580
- fix benchmark dir and add CUDA_DEVICE_MAX_CONNECTIONS to qwen by @fightfat in #8678
- bug fix by @wtmlon in #8687
- [XPU] add lora optimization by @dynamicheart in #8527
- [pir save] Modiy export llama model file in pir mode by @xiaoguoguo626807 in #8689
- [AutoParallel] Change `max_steps` in Llama2-7b config for auto-parallel. by @heavyrain-lzy in #8679
- [benchmark] Change the mirror source for pip by @mmglove in #8699
- update loss base of auto-parallel tests by @zhiqiu in #8701
- Add new mistral by @wtmlon in #7425
- [Safetensors] Fix safetensors shape by @DesmonDay in #8702
- [BUG] Round num_samples down to prevent prefetch from exceeding the maximum dataset length by @JunnYu in #8690
- xpu use allgather by @FeixLiu in #8697
- add fast_rmsnorm by @deepllz in #8680
- enable use_fast_layer_norm for llama2 benchmark by @deepllz in #8714
- fix xpu gather for unified ckpt by @FeixLiu in #8710
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8712
- fix fast_ln backward by @deepllz in #8719
- finetune support use_fast_layer_norm by @tianhaodongbd in #8717
- bug fix by @FeixLiu in #8730
- disable lora by @lugimzzz in #8674
- [Safetensors] Fix mmap for Windows system by @DrownFish19 in #8734
- correct broken links in readme by @jzhang533 in #8741
- revert benchmark fix by @ronny1996 in #8747
- [LLM] Add Yuan model by @zhaogf01 in #8654
- fix nlp dir and auto_parallel_ci exit -6 by @fightfat in #8744
- [LLM] Update sequence parallel linear import by @DrownFish19 in #8706
- [Bug fixes] Fix ring attention by @zhangyuqin1998 in #8740
- update a100 loss by @zhiqiu in #8708
- [PaddleNLP 3.0] Update README by @DrownFish19 in #8681
- [AutoParallel] update loss for global clip by @JZ-LIANG in #8750
- [NPU] Fix sequence parallel lib import by @DrownFish19 in #8760
- [DEV] Update develop version show by @DrownFish19 in #8754
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8766
- add benchmark baichuan2 scripts by @fightfat in #8683
- Add the missing truncation=True in llm/predictor.py by @lszxb in #8768
- fix the ce for the unittest by @wawltor in #8772
- Enable parallel_config to use commas as delimiters. by @Difers in #8677
- fix incorrect token counting in `llm/predictor.py` by @lszxb in #8769
- Refine savable by @ZHUI in #8758
- [CodeStyle] remove markdownlint-cli by @DrownFish19 in #8779
- [XPU] use allgather and fp32 multinomial for XPU by @houj04 in #8787
- fix version show by @DrownFish19 in #8791
- [BUG] Add 20 redundant data in post pretrain by @JunnYu in #8789
- vera-pissa method added by @TranscenderNing in #8722
- update version by @DrownFish19 in #8792
- [Inference LLM] refine some code in llama wint8/4 by @yuanlehome in #8796
- [DCU] Llama a8w8 inference performance optimization by @Deleter-D in #8800
- [Prediction] Update LLM prediction. by @DesmonDay in #8778
- [Trainer] Add enable_sp_async_reduce_scatter by @DesmonDay in #8803
- [AutoParallel] Refine auto_trainer save load by @zhangbo9674 in #8767
- [MoE] Optimizer parameter broadcast by @DesmonDay in #8810
- [Doc] Update README by @DrownFish19 in #8817
- support Llama3.1 8B 128K generation on single GPU 80GB by @GuoxiaWang in #8811
- add paddle nv-embed-v1 by @Li-Z-Q in #8785
- fix pad_token_id bug by @yuanlehome in #8814
- [DCU] fix llama inference bug on DCU by @Deleter-D in #8815
- [Doc] Add LLaMA3.1 by @DrownFish19 in #8824
- [BUG] Fix build train valid test datasets by @JunnYu in #8826
- Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file by @Hanyonggong in #8799
- fix tune_cublaslt_gemm compile bug by @yuanlehome in #8844
- [AutoParallel] Refine save and load ckpt for auto_trainer by @zhangbo9674 in #8828
- [Unified Checkpoint] update merge tensor parallel by @DesmonDay in #8856
- [Trainer] update clear_grad by @DesmonDay in #8829
- [Unified Checkpoint] Fix tie_word_embeddings by @DesmonDay in #8795
- [Inference LLM] support static c8 by @yuanlehome in #8833
- support sft mapdataset by @greycooker in #8840
- Cherry pick some changes from incubate branch by @sneaxiy in #8862
- support nested list of dict inputs by @deepllz in #8876
- Fix the bug with issues code 8641. by @smallbenxiong in #8880
- Fix the issue of P-tuning official sample error by @guangyunms in #8884
- modify Paddlemix qwen dytostatic by @xiaoguoguo626807 in #8869
- [llm]fix zeropadding by @lugimzzz in #8895
- Fix the error raised by the fast_ln operator when dynamic semi-auto parallel is enabled by @Wennie396 in #8891
- enable_sp_async_reduce_scatter for qwen_72b && llama2_70b by @deepllz in #8897
- Update run_pretrain.py by @ZHUI in #8902
- [doc] Update readme by @DrownFish19 in #8905
- [AutoParallel] Bugfix auto parallel FA by @JZ-LIANG in #8903
- [Readme] Update README.md by @ZHUI in #8908
- [cherry-pick] Optimize async save by @ForFishes in #8878
- [LLM Inference] Refactor BlockInferencePredictor by @yuanlehome in #8879
- [Fix] modify tensorboard requirements by @greycooker in #8904
- [LLM Inference] Support qwen2 by @yuanlehome in #8893
- modify dict include none to aviod pir dytostatic bug in while op by @xiaoguoguo626807 in #8898
- [LLM]Update yuan model by @zhaogf01 in #8786
- update qwen && baichuan benchmark config by @deepllz in #8920
- [doc] Update README by @DrownFish19 in #8922
- [ New features]Trainer support dict parameter by @greycooker in #8446
- set logging_step to 5 with baichuan && qwen benchmark by @deepllz in #8928
- [Cherry-pick]fix pipeline eval by @gongel in #8924
- fix test_wint8 ut by @yuanlehome in #8930
- [LLM Inference] support llama3.1 by @yuanlehome in #8929
- Fix tokens count for benchmark by @DrownFish19 in #893...
v3.0.0-beta0
We are pleased to announce v3.0.0-beta of the PaddlePaddle LLM toolkit: embracing large models with a fully upgraded experience. The main work includes:
- Unified LLM toolchain with end-to-end support for domestic compute chips;
- Full support for PaddlePaddle 4D parallel configuration, efficient fine-tuning strategies, efficient alignment algorithms, and high-performance inference across industrial-grade LLM workflows;
- Self-developed RsLoRA+ algorithm with extremely fast convergence, the auto-scaling Unified Checkpoint storage mechanism, and generalized FastFFN/FusedQKV support to accelerate LLM training and inference;
- Continued support and updates for mainstream models, with efficient solutions provided.
LLM fine-tuning, alignment, training, and inference optimizations
- PEFT:
- DPO:
- Domestic chip support:
- Performance optimizations:
- Other
- Added model memory monitoring in #8269
New models
- Added Gemma models in #8082
- google/gemma-7b
- google/gemma-7b-it
- google/gemma-2b
- google/gemma-2b-it
- Added Llama3 models
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3-70B
- meta-llama/Meta-Llama-3-70B-Instruct
- Added Qwen2 models in #8338 #8584 #8601
- Qwen/Qwen1.5-0.5B
- Qwen/Qwen1.5-0.5B-Chat
- Qwen/Qwen1.5-1.8B
- Qwen/Qwen1.5-1.8B-Chat
- Qwen/Qwen1.5-4B
- Qwen/Qwen1.5-4B-Chat
- Qwen/Qwen1.5-7B
- Qwen/Qwen1.5-7B-Chat
- Qwen/Qwen1.5-14B
- Qwen/Qwen1.5-14B-Chat
- Qwen/Qwen1.5-32B
- Qwen/Qwen1.5-32B-Chat
- Qwen/Qwen1.5-72B
- Qwen/Qwen1.5-72B-Chat
- Qwen/Qwen1.5-110B
- Qwen/Qwen1.5-110B-Chat
- Qwen/Qwen1.5-MoE-A2.7B
- Qwen/Qwen1.5-MoE-A2.7B-Chat
- Qwen/Qwen2-0.5B
- Qwen/Qwen2-0.5B-Instruct
- Qwen/Qwen2-1.5B
- Qwen/Qwen2-1.5B-Instruct
- Qwen/Qwen2-7B
- Qwen/Qwen2-7B-Instruct
- Qwen/Qwen2-72B
- Qwen/Qwen2-72B-Instruct
- Qwen/Qwen2-57B-A14B
- Qwen/Qwen2-57B-A14B-Instruct
Framework upgrades
- Feature optimizations:
- AutoParallel optimizations
- Distributed capability optimizations:
- Chat capability optimizations:
- Added chat template in #8226
- Other
Bug fixes
- Fixed a bug when the sharding degree is less than 100 in #8146
- Fixed TP/PP parameter merging in #8239
- Fixed inconsistency between tensor.shape and paddle.shape(tensor) in #8260
- Fixed a bug with fp16 + delay_scale_loss_scale + sharding_stage1_overlap in #8314
- Added pipelines documentation and usage hints in #8292 #8308 #8202 #8353
- Fixed tokenizer input in the text feature extraction task in #8331
- Fixed import errors in #8332 #8367
Restructuring
PaddleNLP file structure adjustments in #8609 #8613 #8605 #8614 #8617 #8626 #8618 #8625 #8619 #8629 #8601 #8627 #8666
What's Changed
- [dist]pip requirements-dev.txt by @Liujie0926 in #8258
- add scaling by @lugimzzz in #8256
- [LLM]Support Gemma model by @Southpika in #8082
- [BugFix] Try except sequence parallel utils by @DesmonDay in #8189
- Update CodeCov GitHub Action by @sijunhe in #8268
- [AutoParallel] Open recompute strategy for llama model by @zhangbo9674 in #8265
- Fix sharding < 100 limitation bug by @sneaxiy in #8146
- use tensor.shape bug not paddle.shape(tensor) by @wanghuancoder in #8260
- [dist CI]update paddlenlp install for CI by @Liujie0926 in #8267
- [Bug Fix]Fix merge parameters in pp by @Southpika in #8239
- [LLM] add memory stats to logger of trainer by @SylarTiaNII in #8269
- Add p2p_comm_overlap for Llama-2-70b benchmark. by @Xreki in #8276
- add a100 test ground truth by @zhiqiu in #8249
- [paddle-pipelines] faq semantic search question answering reamde by @w5688414 in #8292
- [paddle-pipelines] Add pipelines documentation by @w5688414 in #8308
- Support llama-3 by @ZHUI in #8307
- [Distributed] [CustomDevices] Adapt SP on lora && polish MC2 APIs by @SylarTiaNII in #8303
- fix bug for fp16 + delay_scale_loss_scale + sharding_stage1_overlap by @FeixLiu in #8314
- [paddle-pipelines] Update mkdocs by @w5688414 in #8310
- [benchmark]update llama2_ips by @Liujie0926 in #8322
- [dist CI]fix before_hook by @Liujie0926 in #8283
- benchmark llama worker=1 by @wanghuancoder in #8305
- 【AutoParallel】Add llama2 UT for auto-parallel by @heavyrain-lzy in #8300
- Add system env log for llama test by @zhangbo9674 in #8321
- [LLM] Support fuse attention q, k, v weights by @DrownFish19 in #8202
- [Distributed] fix lora by @SylarTiaNII in #8325
- fix try import by @w5688414 in https://github.com/PaddlePaddle/Pa...
v2.8.1
What's Changed
- [Trainer] Fix sharding overlap bug by @DesmonDay in #8334
- [Cherry-pick] update truncate by @KB-Ding in #8375
- [BugFix] Fix llama3 `eot_id`. by @ZHUI in #8373
- [Trainer] update distributed dataloader by @DesmonDay in #8426
- [BugFix] Fix load rng compatibility. by @ZHUI in #8451
- Cherry pick/fast_safe_open by @ZHUI in #8458
- 【cherry pick】adapter new type promotion rule for Paddle 2.6 by @zxcd in #8463
- Quick fix from pretrained. by @ZHUI in #8487
- Release/2.8 by @Galaxy1458 in #8437
- Fix from_pretrained `os.path.split` by @DesmonDay in #8508
- [fea] Cherry-picked MOE updates from develop by @bo-ke in #8531
- [LLM] relocate tensor_parallel_output to avoid conflict (#8419) by @DesmonDay in #8533
- Update sequence_parallel for predict by @DesmonDay in #8547
- Cp/fix by @ZHUI in #8569
- Do not save moe_group by @DesmonDay in #8570
- [Release] 2.8.1 by @ZHUI in #8636
Full Changelog: v2.8.0...v2.8.1
v2.8.0
We are pleased to announce v2.8.0 of the PaddlePaddle LLM toolkit. This release deeply optimizes the toolkit's fine-tuning and alignment capabilities and improves LLM training and inference on domestic compute hardware:
- Specialized fine-tuning and efficient alignment: the self-developed, fast-converging RsLoRA+ algorithm greatly improves PEFT convergence speed and training quality; high-performance generation acceleration is integrated into the RLHF PPO algorithm, removing the generation bottleneck in PPO training and delivering substantially leading PPO training performance.
- Faster LLM training: generalized support for FastFFN, FusedQKV, and other training performance optimizations makes LLM training faster and more stable.
LLM fine-tuning, alignment, training, and inference optimizations
- Fine-tuning
- Inference
- Added static-graph inference for QWenVL #7808
New models
- Added static-graph inference for QWenVL #7808
- Added Deberta and Deberta-v2 models #8227
- deepset/deberta-v3-large-squad2
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
- microsoft/deberta-base
- Added Mixtral of experts #7803
- mistralai/Mixtral-8x7B-Instruct-v0.1
- mistralai/Mixtral-8x7B-v0.1
- Added LLaMA3 #8315
- meta-llama/Meta-llama-3-8b
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-llama-3-70b
- meta-llama/Meta-Llama-3-70B-Instruct
Framework upgrades
- Trainer upgrades
- AutoParallel upgrades
- Other
Other support
- Added a matryoshka representation learning retrieval strategy, saving compute and storage resources. #8165
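The matryoshka idea is that an encoder is trained so that the leading dimensions of its embedding are usable on their own as a cheaper, lower-dimensional embedding. A minimal numpy sketch of the truncate-and-renormalize step (shapes and names here are illustrative, not PaddleNLP's API):

```python
import numpy as np

def truncate_embedding(emb, dim):
    """Keep only the first `dim` dimensions and L2-renormalize.

    For a matryoshka-trained encoder, this prefix is itself a usable
    embedding for retrieval at a fraction of the storage cost.
    """
    sub = emb[..., :dim]
    norm = np.linalg.norm(sub, axis=-1, keepdims=True)
    return sub / np.clip(norm, 1e-12, None)

# A full 768-d embedding can be shrunk to 256-d for cheaper indexing.
full = np.random.randn(4, 768)
small = truncate_embedding(full, 256)
assert small.shape == (4, 256)
```

The renormalization keeps cosine-similarity scoring meaningful after truncation.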
问题修复
- 日志级别修改,并增加timelog计时日志,兼容不同设备。#8261
- 修复pipeline并行中随机初始化的shared weights不一致的问题,覆盖GPT/OPT等模型。#7772
- 关闭CI及单测中从huggingface hub下载的逻辑 #7798 #8198
- 修复llm的gradio开启chat template时候重复拼接query 和 history的问题。#7992
- 修复GPT模型下载key error问题。#8253
- 修复LlamaRotaryEmbedding #7882
- 修复allreduce dtype的问题 #7876
- 修复框架侧dev分支清理 paddle.jit.dy2static.utils_helperAPI的问题 #7989
- 修复read-data timer在ignore_data_skip=False and skip_profile_timer=False 的问题。#8177
- 修复Wandb单测问题 #8066 #8056
- 修复Trainer同时解析json与命令行列表参数报错问题#7860
- 修复Gradio UI 中的推理问题 #7740 #7788
- 修复 Tokenizer 相关的基础问题 #7797 7870
- 修复 custom devices上loading rng state的问题。#7894
- 修复自动并行打印BF16的loss编码错乱的问题#7874
- 采用float初始化模型,修复静态图自动并行AMP报错问题#8033#8199
- 修复ShardDataloader接口在PipeLine Parallelism下使用错误问题#8014
- 修复llama在custom devices的精度问题。#7895
- 修复NPU AICPU算子问题 #7976
- 修复FusedLinearWithGradAdd少传参数的问题。#8178
What's Changed
- [Unified Checkpoint] Add unified checkpoint training args doc. by @DesmonDay in #7756
- [AutoParallel] Auto Trans PP to VPP by @zhaoyinglia in #7747
- Add codecov check by @zjjlivein in #7760
- [CE] Delete gpt_for_sequence_classification by @ZHUI in #7757
- [DOC] Update trainer.md by @ZHUI in #7761
- [Release] Change version to 2.7.0 by @ZHUI in #7764
- [benchmark]close skip_memory_metrics for ips by @Liujie0926 in #7732
- [Release] Update release.yml to release tags by @ZHUI in #7765
- [AutoParallel] Add Sequence Parallel for Static LLaMA by @JZ-LIANG in #7746
- [New Features] support dynamic src_length by @wj-Mcat in #7740
- Fix unified_checkpoint bug by @DrownFish19 in #7770
- [DONE] aistudio, hf hub, bos update download by @JunnYu in #7608
- [Trainer] Fix dist dataloader eval by @DesmonDay in #7777
- [Paddle-pipelines] Update convert_files_to_dicts_splitter by @w5688414 in #7748
- [PEFT]fix lora model tp when existing other trainable module by @lugimzzz in #7781
- [Paddle-Pipelines] update faiss by @qingzhong1 in #7793
- Fix shared weights sync for PipelineLayer by @DrownFish19 in #7772
- [tests] download slow by @JunnYu in #7798
- [INFER][LLM] Support qwen in fined grained dybatch v1 by @DanGuge in #7644
- Add CE for Distributed Hybrid Parallel by @iosmers in #7782
- add MP2-SP2-pp4-vpp2-SD2-stage1-mbs2-acc8 ce by @tianhaodongbd in #7774
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7806
- pipeline parallel benchmark by @zhangting2020 in #7759
- [Bug fixes] fix br gradio by @wj-Mcat in #7788
- delete useless code for write_cache_kv.cu by @yuanlehome in #7812
- [llm]support qlora pp by @lugimzzz in #7801
- Trainer support simultaneously parse JSON files and cmd arguments. by @greycooker in #7768
- [LLM] Support block_attention/cachekv quant for llama by @RichardWooSJTU in #7649
- [Bug Fix] fix paddle multipy_fwd_func warning message by @BeingGod in #7818
- [llm]fix lora by @lugimzzz in #7824
- fused rms spmd by @liuzhenhai93 in #7830
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7827
- [neural search][fix bug of evaluate.py] by @ZeyuTeng96 in #7832
- [neural search] fix the bug of reading files when calculating the recall scores by @shenghwa in #7836
- [Bug fixes] update chatglm tokenizer by @wj-Mcat in #7797
- [semantic_indexing] fix bug of evaluate.py by @ZeyuTeng96 in #7843
- [faq] fix bug of evaluate.py by @ZeyuTeng96 in #7840
- [text_classification_retrieval_based] fix bug of evaluate.py by @ZeyuTeng96 in #7844
- [LLM] add Qwen-7B-Chat to PaddleNLP unit test by @ziangqin-baidu in #7823
- Support 5.2 bloom by @zhoutianzi666 in #7846
- [unified checkpoint] Fix last checkpoint save by @DrownFish19 in #7854
- [unified checkpoint] fix checkpoint names by @DrownFish19 in #7795
- [New Features]add ranks testing for test_predictor by @wj-Mcat in #7800
- [Auto Parallel] Support dynamic semi-auto training in Llama2 model by @haohongxiang in #7851
- [CI] add ci approval pipelines by @zjjlivein in #7859
- [fix] fix a bug of trainer/argparser.py by @greycooker in #7860
- [Improvement] fix ops improting in utils by @wj-Mcat in #7865
- [Add CE] Add CE for Hybrid Parallism by @iosmers in #7817
- [Unified Checkpoint] Cherry pick empty cache. by @ZHUI in #7868
- Add PPO training. by @guoshengCS in #7305
- Update reward_main.py by @wawltor in #7880
- Update ppo_main.py by @wawltor in #7881
- [LLM] revert benchmark codes by @RichardWooSJTU in #7871
- [LLM]support QWenVL second part by @DanGuge in #7808
- [Bug Fixes] update chatglm1 tokenizer by @wj-Mcat in #7870
- 【AutoParallel】Support 'master_grad' in Llama in static auto-parallelism by @heavyrain-lzy in #7658
- [Bug Fix] fix slice bug in LlamaRotaryEmbedding by @MarioLulab in #7882
- 【AutoParallel】Support bf16 loss in static by @heavyrain-lzy in #7874
- [Bug Fix] fix allreduce tensor dtype by @BeingGod in #7876
- [CE] Add Qwen into CE process by @ziangqin-baidu in #7887
- [Hackathon 5th No.73] ToT by @ErnestinaQiu in #7660
- [CustomDevice] fix loading rng state on custom devices by @SylarTiaNII in #7894
- [LLM] ...
v2.7.2
This release fixes a number of minor issues.
What's Changed
- [Unified Checkpoint] fix checkpoint names by @DrownFish19 in #7794
- [Unified Checkpoint] Fix last checkpoint save by @DrownFish19 in #7810
- [PEFT] Cherry pick lora fix by @lugimzzz in #7826
- [Unified Checkpoint] Fix unified checkpoint by empty cache. by @ZHUI in #7855
- [Fix Download] update converted logic & fix hf hub download subfolder bug by @JunnYu in #7911
- [Cherry-pick] logger level by @KB-Ding in #7920
- [Cherry-pick] RuntimeTimer for the toolkit (#7913) by @KB-Ding in #7921
- [Release] 2.7.2 for paddlenlp bugfix. by @ZHUI in #7892
Full Changelog: v2.7.1...v2.7.2
v2.7.1
This release fixes a number of minor issues.
What's Changed
- Fixed several issues encountered when resuming training @ZHUI in #7771
- Fixed GPT initialization under pipeline mode @DrownFish19 in #7775
- Fixed dist dataloader evaluation issues. @DesmonDay in #7778
Full Changelog: v2.7.0...v2.7.1
PaddleNLP 2.7.0 Release Note
We are pleased to announce v2.7.0 of the PaddlePaddle LLM toolkit. This release deeply optimizes the toolkit's LLM capabilities, with major improvements in usability, performance, and stability.
Highlights of this release:
- Unified LLM toolchain entry point. The implementations of pretraining, fine-tuning, compression, inference, and deployment are consolidated under the PaddleNLP/llm directory.
- Brand-new LLM toolchain documentation, guiding users end to end from getting started to production deployment. See: https://paddlenlp.readthedocs.io/zh/latest/llm/finetune.html
- Unified Checkpoint storage mechanism. Model weights, optimizer weights, and other state are stored in a unified safetensors format regardless of the distributed strategy, with support for dynamic scaling when resuming training, greatly improving checkpoint portability.
- Upgraded efficient fine-tuning: efficient fine-tuning can now be combined with LoRA, and algorithms such as QLoRA are supported.
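For readers unfamiliar with LoRA, the technique replaces a full weight update with a trainable low-rank correction on top of a frozen base weight. A minimal numpy sketch (shapes, `alpha`, and `r` are illustrative defaults, not PaddleNLP's configuration):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0, r=8):
    """LoRA: y = x @ W + (alpha / r) * (x @ A) @ B.

    W is the frozen base weight; only the low-rank factors A (d_in x r)
    and B (r x d_out) are trained, shrinking the trainable parameter
    count from d_in * d_out to r * (d_in + d_out).
    """
    return x @ W + (alpha / r) * (x @ A) @ B

d_in, d_out, r = 64, 64, 8
x = np.random.randn(2, d_in)
W = np.random.randn(d_in, d_out)
A = np.random.randn(d_in, r) * 0.01
B = np.zeros((r, d_out))  # B starts at zero, so initially y == x @ W
np.testing.assert_allclose(lora_forward(x, W, A, B), x @ W)
```

Initializing B to zero is the standard choice: training starts from the exact base model and the low-rank path grows from there.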
End-to-end LLM training and inference
- Pretraining
  - Unified the pretraining entry point to llm/run_pretrain.py.
  - Added pretraining support for models such as qwen, with flash attention support.
- Fine-tuning
  - LoRA can now be combined with Linear quantization
  - Pipeline-parallel models can now be trained together with LoRA
  - Added the NEFTune method
  - Added QLoRA support
- Compression
  - Supports PTQ and QAT quantization, including A8W8, WINT8, WINT4, and A8W4
  - Supports quantization algorithms such as SmoothQuant, GPTQ, and AWQ
Unified Checkpoint
- Large models are usually trained with multi-card distributed setups, so checkpointed model weights are saved as shards, e.g. split according to tensor parallelism or pipeline parallelism. Storing checkpoints directly according to the distributed strategy is straightforward, but has drawbacks:
  - It is unfriendly to downstream inference: when users want to run inference from an intermediate checkpoint, they must merge the sharded weights manually.
  - It handles resumed training poorly when the distributed strategy or the number of training nodes changes; users often need to process the checkpoint manually, adding operational complexity.
- To address these problems and reduce user effort, we upgraded the LLM storage framework with a unified storage scheme, Unified Checkpoint. Its core idea is to store model weights, optimizer weights, and other state in a unified safetensors format, without distinguishing distributed strategies at save time, improving checkpoint portability.
- Unified Checkpoint provides the following features:
  - Weight storage is independent of the distributed strategy and uses the unified safetensors format;
  - Flexible support for scaling training up or down, adapting to switches between different distributed training strategies.
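The scaling behavior can be illustrated with a toy numpy sketch: merge tensor-parallel shards into one strategy-agnostic weight at save time, then re-slice for whatever parallel degree the resumed job uses. Shapes and the shard layout here are hypothetical; the real implementation persists safetensors files.

```python
import numpy as np

# Two tensor-parallel shards of one linear weight, as saved per card (TP=2).
shard0 = np.arange(8, dtype=np.float32).reshape(4, 2)
shard1 = np.arange(8, 16, dtype=np.float32).reshape(4, 2)

# Unified save: merge along the TP axis so the stored weight is
# independent of the distributed strategy that produced it.
merged = np.concatenate([shard0, shard1], axis=1)  # shape (4, 4)

# Resuming with a different TP degree just re-slices the merged weight,
# e.g. scaling out from TP=2 to TP=4.
new_shards = np.split(merged, 4, axis=1)
assert len(new_shards) == 4 and new_shards[0].shape == (4, 1)
```

Because the saved artifact is a single merged tensor per weight, the same file serves direct inference, TP=2 resumption, or TP=4 resumption without manual conversion.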
New models
- moka-ai/m3e-base retrieval model
- BAAI/bge-small-zh-v1.5 retrieval model
Framework upgrades
- Trainer upgrades
  - Supports "--skip_memory_metrics 0" to display real-time GPU and host memory usage
  - Supports "--unified_checkpoint" and "--unified_checkpoint_config" for model saving under hybrid parallelism and restarts with dynamic scaling.
  - Added the PretrainModelPipe base class to support pipeline-parallel training.
Other support
- Added display of the paddlenlp commit id via paddlenlp.version.commit
- Supports AI Studio download and saving to the AI Studio hub
Bug fixes
- Fixed several dist_dataloader issues
- Fixed several dynamic-to-static model conversion issues
- Fixed several GPT training bugs and removed GPT2; fixed some seed-setting issues
- Fixed several baichuan issues under pipeline parallelism.
New Contributors
- @Wennie396 made their first contribution in #6897
- @Wong4j made their first contribution in #7008
- @yuanlehome made their first contribution in #7080
- @Xreki made their first contribution in #7105
- @Tom-Zheng made their first contribution in #7092
- @TimeYWL made their first contribution in #7122
- @From00 made their first contribution in #7168
- @RichardWooSJTU made their first contribution in #7186
- @heavyrain-lzy made their first contribution in #7269
- @LokeZhou made their first contribution in #7337
- @JZ-LIANG made their first contribution in #7301
- @WAI-clear made their first contribution in #7402
- @tianhaodongbd made their first contribution in #7293
- @zzjjay made their first contribution in #7504
- @anexplore made their first contribution in #7558
- @niuliling123 made their first contribution in #7528
- @zxcd made their first contribution in #7577
- @MayYouBeProsperous made their first contribution in #7575
- @iosmers made their first contribution in #7613
- @AndSonder made their first contribution in #7343
- @zhink made their first contribution in #7679
- @kingTLE made their first contribution in #7708
Full Changelog: v2.6.1...v2.7.0
v2.6.1
What's Changed
v2.6.1 includes numerous bug fixes that improve the stability of LLM models and related components. Beyond bug fixes, the main new features are:
- LLM: added the qwen model; the InTokens data flow is now compatible with Pipeline Parallel; LLM fine-tuning supports loading from multiple training files and warm starts; enhanced the different recompute granularities for the LLaMA model
- Trainer: added the hybrid_parallel_topo_order option and fixed model saving with sharding stage3.
- Paddle-pipelines: added support for ERNIE-Bot-turbo and ERNIE-embedding, updated the hierarchical search example, and enhanced the ChatPaper UI
- Megatron datasets: added support for loading megatron datasets, covering the ernie-1.0 and T5 data formats
New Contributors
- @xiezheng-XD made their first contribution in #6764
- @carryyu made their first contribution in #6676
- @xiaoxiaohehe001 made their first contribution in #6798
- @MARD1NO made their first contribution in #6865
- @zhoutianzi666 made their first contribution in #6905
- @lchdl made their first contribution in #6964
- @LaiXinyi823 made their first contribution in #6659
Full Changelog: v2.6.0...v2.6.1
v2.6.0
PaddleNLP 2.6: a major upgrade into the era of large models!
We are excited to announce that PaddleNLP 2.6 is now fully upgraded and officially released! This release marks our formal entry into the era of large models. PaddleNLP 2.6 introduces a brand-new end-to-end PaddlePaddle LLM toolchain covering pretraining, fine-tuning, compression, inference, and deployment, providing users with a complete large-model solution.
The toolchain fully supports mainstream large models such as LLaMA 1/2, BLOOM, ChatGLM 1/2, GLM, and OPT, letting users try a variety of models at low cost with a single set of tools.
To support this toolchain, we made extensive upgrades on the underlying framework side:
- We upgraded the Trainer API into a 4D-parallel distributed Trainer, making model training more efficient.
- We implemented the efficient fine-tuning algorithms LoRA and Prefix Tuning, enabling fine-tuning of hundred-billion-parameter models on a single machine.
- Building on PaddleSlim's self-developed quantization algorithms, we achieved lossless quantization across all supported large models.
These upgrades make it easier for our users to train, optimize, and deploy models in the large-model era. We look forward to your trials and feedback as we advance PaddleNLP together. Between versions 2.5 and 2.6, PaddleNLP gained 40 new contributors; thank you all for supporting PaddleNLP's open-source work!
New Contributors
- @zws-2019 made their first contribution in #5167
- @qiuwenbogdut made their first contribution in #5098
- @kuizhiqing made their first contribution in #5347
- @46319943 made their first contribution in #5419
- @jiaohuix made their first contribution in #5465
- @kangguangli made their first contribution in #5438
- @vivienfanghuagood made their first contribution in #5563
- @zhiboniu made their first contribution in #5470
- @cyber-pioneer made their first contribution in #5598
- @invokerbyxv made their first contribution in #5622
- @megemini made their first contribution in #5658
- @zhenyun-li made their first contribution in #5683
- @solrex made their first contribution in #5736
- @nemonameless made their first contribution in #5487
- @Yulv-git made their first contribution in #5709
- @wangxinxin08 made their first contribution in #5773
- @AlphaHinex made their first contribution in #5815
- @houj04 made their first contribution in #5820
- @Joker1718 made their first contribution in #5816
- @pkuzyc made their first contribution in #5538
- @jadepeng made their first contribution in #5841
- @KB-Ding made their first contribution in #5886
- @parap1uie-s made their first contribution in #5775
- @zirui made their first contribution in #5866
- @GOH-Gu made their first contribution in #5951
- @yangjianfengo1 made their first contribution in #6069
- @zhangting2020 made their first contribution in #5922
- @rogerserper made their first contribution in #6192
- @wtmlon made their first contribution in #6258
- @qingzhong1 made their first contribution in #6251
- @BeingGod made their first contribution in #6307
- @zhiqiu made their first contribution in #6347
- @DesmonDay made their first contribution in #6435
- @cyk1337 made their first contribution in #6447
- @lxp521125 made their first contribution in #6491
- @littsk made their first contribution in #6425
- @RachelXu7 made their first contribution in #6572
- @wanghuancoder made their first contribution in #6539
- @DrownFish19 made their first contribution in #6570
- @GhostScreaming made their first contribution in #6673
Full Changelog: v2.5.2...v2.6.0