v3.0.0-beta1
Pre-release
What's Changed
- [DCU] High-performance LLM training and inference for DCU by @yuguo-Jack in #8580
- fix benchmark dir and add CUDA_DEVICE_MAX_CONNECTIONS to qwen by @fightfat in #8678
- bug fix by @wtmlon in #8687
- [XPU] add lora optimization by @dynamicheart in #8527
- [pir save] Modify export llama model file in pir mode by @xiaoguoguo626807 in #8689
- [AutoParallel] Change max_steps in Llama2-7b config for auto-parallel. by @heavyrain-lzy in #8679
- [benchmark] Change the mirror source for pip by @mmglove in #8699
- update loss base of auto-parallel tests by @zhiqiu in #8701
- Add new mistral by @wtmlon in #7425
- [Safetensors] Fix safetensors shape by @DesmonDay in #8702
- [BUG] Round num_samples down to prevent prefetch from exceeding the maximum dataset length... by @JunnYu in #8690
- xpu use allgather by @FeixLiu in #8697
- add fast_rmsnorm by @deepllz in #8680
- enable use_fast_layer_norm for llama2 benchmark by @deepllz in #8714
- fix xpu gather for unified ckpt by @FeixLiu in #8710
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8712
- fix fast_ln backward by @deepllz in #8719
- finetune support use_fast_layer_norm by @tianhaodongbd in #8717
- bug fix by @FeixLiu in #8730
- disable lora by @lugimzzz in #8674
- [Safetensors] Fix mmap for Windows system by @DrownFish19 in #8734
- correct broken links in readme by @jzhang533 in #8741
- revert benchmark fix by @ronny1996 in #8747
- [LLM] Add Yuan model by @zhaogf01 in #8654
- fix nlp dir and auto_parallel_ci exit -6 by @fightfat in #8744
- [LLM] Update sequence parallel linear import by @DrownFish19 in #8706
- [Bug fixes] Fix ring attention by @zhangyuqin1998 in #8740
- update a100 loss by @zhiqiu in #8708
- [PaddleNLP 3.0] Update README by @DrownFish19 in #8681
- [AutoParallel] update loss for global clip by @JZ-LIANG in #8750
- [NPU] Fix sequence parallel lib import by @DrownFish19 in #8760
- [DEV] Update develop version show by @DrownFish19 in #8754
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8766
- add benchmark baichuan2 scripts by @fightfat in #8683
- Add the missing truncation=True in llm/predictor.py by @lszxb in #8768
- fix the ce for the unittest by @wawltor in #8772
- Enable parallel_config to use commas as delimiters. by @Difers in #8677
- fix incorrect token counting in llm/predictor.py by @lszxb in #8769
- Refine savable by @ZHUI in #8758
- [CodeStyle] remove markdownlint-cli by @DrownFish19 in #8779
- [XPU] use allgather and fp32 multinomial for XPU by @houj04 in #8787
- fix version show by @DrownFish19 in #8791
- [BUG] Add 20 redundant data in post pretrain by @JunnYu in #8789
- vera-pissa method added by @TranscenderNing in #8722
- update version by @DrownFish19 in #8792
- [Inference LLM] refine some code in llama wint8/4 by @yuanlehome in #8796
- [DCU] Llama a8w8 inference performance optimization by @Deleter-D in #8800
- [Prediction] Update LLM prediction. by @DesmonDay in #8778
- [Trainer] Add enable_sp_async_reduce_scatter by @DesmonDay in #8803
- [AutoParallel] Refine auto_trainer save load by @zhangbo9674 in #8767
- [MoE] Optimizer parameter broadcast by @DesmonDay in #8810
- [Doc] Update README by @DrownFish19 in #8817
- support Llama3.1 8B 128K generation on single GPU 80GB by @GuoxiaWang in #8811
- add paddle nv-embed-v1 by @Li-Z-Q in #8785
- fix pad_token_id bug by @yuanlehome in #8814
- [DCU] fix llama inference bug on DCU by @Deleter-D in #8815
- [Doc] Add LLaMA3.1 by @DrownFish19 in #8824
- [BUG] Fix build train valid test datasets by @JunnYu in #8826
- Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file by @Hanyonggong in #8799
- fix tune_cublaslt_gemm compile bug by @yuanlehome in #8844
- [AutoParallel] Refine save and load ckpt for auto_trainer by @zhangbo9674 in #8828
- [Unified Checkpoint] update merge tensor parallel by @DesmonDay in #8856
- [Trainer] update clear_grad by @DesmonDay in #8829
- [Unified Checkpoint] Fix tie_word_embeddings by @DesmonDay in #8795
- [Inference LLM] support static c8 by @yuanlehome in #8833
- support sft mapdataset by @greycooker in #8840
- Cherry pick some changes from incubate branch by @sneaxiy in #8862
- support nested list of dict inputs by @deepllz in #8876
- Fix the bug reported in issue #8641 by @smallbenxiong in #8880
- Fix the issue of P-tuning official sample error by @guangyunms in #8884
- modify Paddlemix qwen dytostatic by @xiaoguoguo626807 in #8869
- [llm]fix zeropadding by @lugimzzz in #8895
- Fix the error raised by the fast_ln operator when dynamic semi-auto parallel is enabled by @Wennie396 in #8891
- enable_sp_async_reduce_scatter for qwen_72b && llama2_70b by @deepllz in #8897
- Update run_pretrain.py by @ZHUI in #8902
- [doc] Update readme by @DrownFish19 in #8905
- [AutoParallel] Bugfix auto parallel FA by @JZ-LIANG in #8903
- [Readme] Update README.md by @ZHUI in #8908
- [cherry-pick] Optimize async save by @ForFishes in #8878
- [LLM Inference] Refactor BlockInferencePredictor by @yuanlehome in #8879
- [Fix] modify tensorboard requirements by @greycooker in #8904
- [LLM Inference] Support qwen2 by @yuanlehome in #8893
- modify dict including None to avoid pir dytostatic bug in while op by @xiaoguoguo626807 in #8898
- [LLM]Update yuan model by @zhaogf01 in #8786
- update qwen && baichuan benchmark config by @deepllz in #8920
- [doc] Update README by @DrownFish19 in #8922
- [New Features] Trainer supports dict parameter by @greycooker in #8446
- set logging_step to 5 with baichuan && qwen benchmark by @deepllz in #8928
- [Cherry-pick]fix pipeline eval by @gongel in #8924
- fix test_wint8 ut by @yuanlehome in #8930
- [LLM Inference] support llama3.1 by @yuanlehome in #8929
- Fix tokens count for benchmark by @DrownFish19 in #8938
- [bug fix] fix create_optimizer_and_scheduler for auto_parallel by @zhangyuqin1998 in #8937
- [LLM Inference] fix _get_tensor_parallel_mappings in llama by @yuanlehome in #8939
- [Unified Checkpoint] Fix load best checkpoint by @DesmonDay in #8935
- fix bug by @yuanlehome in #8947
- [LLM Inference] move llm.utils.utils.py to paddlenlp.utils.llm_utils.py by @yuanlehome in #8946
- support amp in pir dy2st mode. by @winter-wang in #8485
- [Trainer] Fix distributed dataloader by @DesmonDay in #8932
- [Tokenizer] Add Fast Tokenizer by @DrownFish19 in #8832
- [ZeroPadding] add greedy_zero_padding by @DesmonDay in #8933
- [NEW Model] Add mamba by @JunnYu in #8513
- [BUG] fix mamba tokenizer by @JunnYu in #8958
- [NEW Model] add jamba by @JunnYu in #8517
- [LLM Inference] add --use_fake_parameter option for ptq fake scales and fix compute error of total_max_length by @yuanlehome in #8955
- [LLM Inference] support qwen2 a8w8c8 inference by @ckl117 in #8925
- fix JambaModelIntegrationTest by @JunnYu in #8965
- [Fix] Enable tensor parallel tests. by @ZHUI in #8757
- [CI] Fix by @DrownFish19 in #8793
- [Unified Checkpoint] update async save by @DesmonDay in #8801
- [AutoParallel] Support save model for auto trainer by @zhangbo9674 in #8927
- fix qwen benchmark by @deepllz in #8969
- [ZeroPadding] padding to max_length for sequence parallel by @DrownFish19 in #8973
- add amp unit test case for auto_parallel ci. by @winter-wang in #8966
- [New Version] Upgrade to 3.0 b1 by @ZHUI in #8977
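Among the changes above, #8677 makes parallel_config accept commas as delimiters in addition to spaces. A minimal sketch of what such delimiter-tolerant parsing can look like; the function name and the example flag strings here are purely illustrative, not the actual PaddleNLP implementation:

```python
def parse_parallel_config(raw: str) -> set:
    """Split a parallel_config-style string into a set of option flags.

    Hypothetical helper: accepts commas, spaces, or a mix of both as
    delimiters, so "a,b" and "a b" yield the same flag set.
    """
    # Normalize commas to spaces, then split on whitespace; drop empty tokens
    # produced by trailing or doubled delimiters.
    return {token for token in raw.replace(",", " ").split() if token}


# Both spellings parse to the same set of flags.
print(parse_parallel_config("enable_mp_async_allreduce,enable_mp_skip_c_identity"))
print(parse_parallel_config("enable_mp_async_allreduce enable_mp_skip_c_identity"))
```

Normalizing the delimiter before splitting keeps older space-delimited configs working unchanged while allowing the comma form.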
New Contributors
- @yuguo-Jack made their first contribution in #8580
- @ruisunyc made their first contribution in #8698
- @xiaoguoguo626807 made their first contribution in #8689
- @lizexu123 made their first contribution in #8712
- @jzhang533 made their first contribution in #8741
- @zhaogf01 made their first contribution in #8654
- @lszxb made their first contribution in #8768
- @TranscenderNing made their first contribution in #8722
- @Deleter-D made their first contribution in #8800
- @Li-Z-Q made their first contribution in #8785
- @Hanyonggong made their first contribution in #8799
- @smallbenxiong made their first contribution in #8880
- @guangyunms made their first contribution in #8884
- @winter-wang made their first contribution in #8485
- @ckl117 made their first contribution in #8925
Full Changelog: v3.0.0-beta0...v3.0.0-beta1