update llm infer docs #9314
Conversation
Thanks for your contribution!
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #9314      +/-  ##
===========================================
- Coverage    53.44%   52.41%    -1.04%
===========================================
  Files          664      661        -3
  Lines       109935   108376     -1559
===========================================
- Hits         58757    56801     -1956
- Misses      51178    51575      +397
```

☔ View full report in Codecov by Sentry.
llm/docs/predict/inference.md
@@ -94,6 +95,8 @@ PaddleNLP provides multiple parameters for configuring the inference model and optimizing inference performance

- `block_attn`: Whether to use Block Attention for inference; defaults to False. Block Attention is designed and implemented based on the ideas of PageAttention. While retaining high-performance inference and dynamic insertion, it can dynamically allocate storage for the cache KV, greatly saving GPU memory and improving inference throughput.

- `append_attn`: Building on the Block Attention implementation, Append Attention further optimizes the Attention module by drawing on FlashInfer's implementation, and adds high-performance C4 support, greatly improving inference performance.
It would be best to explain the relationship between the two: are they mutually exclusive, or can they be combined?
I suggest describing the advantages of append_attn and which scenarios it is better suited for. The text only says it draws on FlashInfer, but users may not know what FlashInfer is.
They are mutually exclusive. append_attn should be suitable in all scenarios, as it is an upgraded version of block_attn.
Please add that to the docs as well. The main goal is for users to be able to read and understand it clearly.
done
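The resolution above states that `block_attn` and `append_attn` are mutually exclusive, with `append_attn` being the upgraded path. As a minimal sketch (not PaddleNLP's actual code; the `InferenceConfig` class and `attention_backend` helper are hypothetical), the relationship could be enforced like this:

```python
from dataclasses import dataclass


@dataclass
class InferenceConfig:
    """Hypothetical config mirroring the two attention flags discussed above."""
    block_attn: bool = False
    append_attn: bool = False

    def __post_init__(self):
        # append_attn builds on block_attn's paged cache-KV design,
        # so the two backends are a choose-one, not a stack.
        if self.block_attn and self.append_attn:
            raise ValueError(
                "block_attn and append_attn are mutually exclusive; "
                "append_attn is the upgraded version of block_attn."
            )


def attention_backend(cfg: InferenceConfig) -> str:
    """Return which attention backend the flags select."""
    if cfg.append_attn:
        return "append_attn"
    if cfg.block_attn:
        return "block_attn"
    return "default"
```

This is only an illustration of the reviewer's point that users should pick exactly one of the two flags.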
@@ -43,27 +43,27 @@ BF16 inference

```shell
# dynamic graph inference
-python ./predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1
+python ./predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1
```
The docs are user-facing, so please also add CI for the append_attn functionality to avoid cases where users cannot get it to run.
There are some issues with the CI that are being debugged; it will be submitted in the next PR.
LGTM
PR types
Others
PR changes
Docs
Description
update llm infer docs