PaddlePaddle · yuanlehome · Oct 25, 2024 · Oct 24, 2024 · Oct 25, 2024 · ZHUI
diff --git a/llm/docs/predict/inference.md b/llm/docs/predict/inference.md
@@ -19,7 +19,7 @@ PaddleNLP大模型推理提供压缩、推理、服务全流程体验 ：
 
 - 支持多硬件大模型推理，包括[昆仑XPU](../../xpu/llama/README.md)、[昇腾NPU](../../npu/llama/README.md)、[海光K100](../dcu_install.md)、[燧原GCU](../../gcu/llama/README.md)、[X86 CPU](../cpu_install.md)等
 
-- 提供面向服务器场景的部署服务，支持连续批处理(continuous batching)、流式输出等功能，支持gRPC、HTTP协议的服务接口
+- 提供面向服务器场景的部署服务，支持连续批处理(continuous batching)、流式输出等功能，HTTP协议的服务接口
 
 
 ## 1. 模型支持
@@ -28,9 +28,10 @@ PaddleNLP 中已经添加高性能推理模型相关实现，已验证过的模
 | Models | Example Models |
 |--------|----------------|
 |Llama 3.1, Llama 3, Llama 2|`meta-llama/Meta-Llama-3.1-8B`, `meta-llama/Meta-Llama-3.1-8B-Instruct`, `meta-llama/Meta-Llama-3.1-405B`, `meta-llama/Meta-Llama-3.1-405B-Instruct`,`meta-llama/Meta-Llama-3-8B`, `meta-llama/Meta-Llama-3-8B-Instruct`, `meta-llama/Meta-Llama-3-70B`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-Guard-3-8B`, `Llama-2-7b, meta-llama/Llama-2-7b-chat`, `meta-llama/Llama-2-13b`, `meta-llama/Llama-2-13b-chat`, `meta-llama/Llama-2-70b`, `meta-llama/Llama-2-70b-chat`|
-|Qwen 2| `Qwen/Qwen2-0.5B`, `Qwen/Qwen2-0.5B-Instruct`, `Qwen/Qwen2-1.5B`, `Qwen/Qwen2-1.5B-Instruct`, `Qwen/Qwen2-7B`, `Qwen/Qwen2-7B-Instruct`, `Qwen/Qwen2-72B`, `Qwen/Qwen2-72B-Instruct`, `Qwen/Qwen2-57B-A14B`, `Qwen/Qwen2-57B-A14B-Instruct`|
-|Qwen-Moe| `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, `Qwen/Qwen2-57B-A14B`, `Qwen/Qwen2-57B-A14B-Instruct`|
-|Mixtral| `mistralai/Mixtral-8x7B-Instruct-v0.1`|
+|Qwen 2| `Qwen/Qwen2-0.5B`, `Qwen/Qwen2-0.5B-Instruct`, `Qwen/Qwen2-1.5B`, `Qwen/Qwen2-1.5B-Instruct`, `Qwen/Qwen2-7B`, `Qwen/Qwen2-7B-Instruct`, `Qwen/Qwen2-72B`, `Qwen/Qwen2-72B-Instruct`, `Qwen/Qwen2-57B-A14B`, `Qwen/Qwen2-57B-A14B-Instruct`, `Qwen/Qwen2-Math-1.5B-Instruct`|
+|Qwen 2.5| `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen2.5-14B-Instruct`, `Qwen/Qwen2.5-Math-1.5B-Instruct`, `Qwen/Qwen2.5-Coder-1.5B-Instruct`|
+|Qwen-MoE| `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, `Qwen/Qwen2-57B-A14B`, `Qwen/Qwen2-57B-A14B-Instruct`|
+|Mixtral| `mistralai/Mixtral-8x7B-Instruct-v0.1`, `mistralai/Mixtral-8x22B-Instruct-v0.1`|
 |ChatGLM 3, ChatGLM 2| `THUDM/chatglm3-6b`, `THUDM/chatglm2-6b`|
 |Baichuan 2, Baichuan|`baichuan-inc/Baichuan2-7B-Base`, `baichuan-inc/Baichuan2-7B-Chat`, `baichuan-inc/Baichuan2-13B-Base`, `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, `baichuan-inc/Baichuan-13B-Base`, `baichuan-inc/Baichuan-13B-Chat`|
 
@@ -94,6 +95,8 @@ PaddleNLP 提供了多种参数，用于配置推理模型和优化推理性能
 
 - `block_attn`: 是否使用 Block Attention 推理， 默认值为False。Block Attention 是基于 PageAttention 的思想设计并实现的，在保持高性能推理和动态插入的基础上可以动态地为 cachekv 分配存储空间，极大地节省显存并提升推理的吞吐。
 
+- `append_attn`: Append Attention 在 Block Attention 实现的基础上，进一步借鉴 FlashInfer 的实现对 Attention 模块进行了优化，并增加了C4的高性能支持，极大地提升了推理性能。
+
 - `block_size`: 如果使用 Block Attention 推理，指定一个 Block 可以存储的 token 数量，默认值为64。
 
 
@@ -186,5 +189,5 @@ python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --
 
 ## 致谢
 
-我们参考[PageAttention](https://github.com/vllm-project/vllm)的page分块的思想实现了generation阶段的block attention。基于[Flash Decoding](https://github.com/Dao-AILab/flash-attention)的KV分块思想实现了长sequence场景下的推理加速。基于[Flash Attention2](https://github.com/Dao-AILab/flash-attention)实现了prefill阶段的attention加速。FP8 GEMM基于[CUTLASS](https://github.com/NVIDIA/cutlass)的高性能模板库实现。有部分算子如gemm_dequant参考了[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)和[FasterTransformer](https://github.com/NVIDIA/FasterTransformer.git)的实现和优化思路。
+我们参考[FlashInfer框架](https://github.com/flashinfer-ai/flashinfer)，在FlashInfer的基础上，实现了append attention。参考[PageAttention](https://github.com/vllm-project/vllm)的page分块的思想实现了generation阶段的block attention。基于[Flash Decoding](https://github.com/Dao-AILab/flash-attention)的KV分块思想实现了长sequence场景下的推理加速。基于[Flash Attention2](https://github.com/Dao-AILab/flash-attention)实现了prefill阶段的attention加速。FP8 GEMM基于[CUTLASS](https://github.com/NVIDIA/cutlass)的高性能模板库实现。有部分算子如gemm_dequant参考了[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)和[FasterTransformer](https://github.com/NVIDIA/FasterTransformer.git)的实现和优化思路。
 
diff --git a/llm/docs/predict/llama.md b/llm/docs/predict/llama.md
@@ -43,27 +43,27 @@ BF16推理
 
 ```shell
 # 动态图推理
-python ./predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1
+python ./predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1
 
 # 动转静导出模型
-python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1
+python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1
 
 # 静态图推理
-python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1
+python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1
 
 ```
 
 WINT8推理
 
 ```shell
 # 动态图推理
-python predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type weight_only_int8
+python predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type weight_only_int8
 
 # 动转静导出模型
-python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type weight_only_int8
+python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type weight_only_int8
 
 # 静态图推理
-python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type weight_only_int8
+python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type weight_only_int8
 ```
 
 下面量化推理所需要的模型需要根据[大模型量化教程](../quantization.md)产出，如 checkpoints/llama_ptq_ckpts，或者使用所提供的预先量化好的模型，如 meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8。
@@ -72,38 +72,38 @@ INT8-A8W8推理
 
 ```shell
 # 动态图推理
-python predict/predictor.py --model_name_or_path checkpoints/llama_ptq_ckpts --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8
+python predict/predictor.py --model_name_or_path checkpoints/llama_ptq_ckpts --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type a8w8
 
 # 动转静导出模型
-python predict/export_model.py --model_name_or_path checkpoints/llama_ptq_ckpts --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8
+python predict/export_model.py --model_name_or_path checkpoints/llama_ptq_ckpts --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type a8w8
 
 # 静态图推理
-python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type a8w8
+python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type a8w8
 ```
 
 INT8-A8W8C8推理
 
 ```shell
 # 动态图推理
-python predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8 --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
+python predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8 --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type a8w8 --cachekv_int8_type static
 
 # 动转静导出模型
-python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
+python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type a8w8 --cachekv_int8_type static
 
 # 静态图推理
-python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
+python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type a8w8 --cachekv_int8_type static
 ```
 
 FP8-A8W8推理
 ```shell
 # 动态图推理
-python predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8-FP8 --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
+python predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8-FP8 --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type a8w8_fp8
 
 # 动转静导出模型
-python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8-FP8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
+python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8-FP8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type a8w8_fp8
 
 # 静态图推理
-python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
+python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type a8w8_fp8
 ```
 
 405B INT8-A8W8C8 TP8推理
@@ -121,8 +121,8 @@ generation_config = GenerationConfig.from_pretrained("meta-llama/Meta-Llama-3.1-
 
 ```shell
 # 导出模型 (可在predict/export_model.py中设置paddle.set_device("cpu")，通过内存导出模型)
-python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3.1-405B-Instruct --output_path /path/to/a8w8c8_tp8 --inference_model 1 --block_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static --use_fake_parameter 1
+python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3.1-405B-Instruct --output_path /path/to/a8w8c8_tp8 --inference_model 1 --append_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static --use_fake_parameter 1
 
 # 推理
-python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/predictor.py --model_name_or_path /path/to/a8w8c8_tp8 --mode static --inference_model 1 --block_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static
+python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/predictor.py --model_name_or_path /path/to/a8w8c8_tp8 --mode static --inference_model 1 --append_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static
 ```
diff --git a/llm/docs/predict/mixtral.md b/llm/docs/predict/mixtral.md
@@ -31,7 +31,7 @@ python -m paddle.distributed.launch \
     --dtype bfloat16 \
     --mode "dynamic" \
     --inference_model \
-    --block_attn
+    --append_attn
 
 # 动转静导出模型
 export DEVICES=0,1
@@ -42,7 +42,7 @@ python -m paddle.distributed.launch \
     --output_path ./inference \
     --dtype bfloat16 \
     --inference_model \
-    --block_attn
+    --append_attn
 
 # 静态图推理
 # 需要设置下面的环境变量，否则会导致多卡推理阻塞
@@ -55,7 +55,7 @@ python -m paddle.distributed.launch \
     --dtype bfloat16 \
     --mode "static" \
     --inference_model \
-    --block_attn
+    --append_attn
 
 ```
 
@@ -71,7 +71,7 @@ python -m paddle.distributed.launch \
     --quant_type "weight_only_int8" \
     --mode "dynamic" \
     --inference_model \
-    --block_attn
+    --append_attn
 
 # 动转静导出模型
 export DEVICES=0,1
@@ -83,7 +83,7 @@ python -m paddle.distributed.launch \
     --dtype bfloat16 \
     --quant_type weight_only_int8 \
     --inference_model \
-    --block_attn
+    --append_attn
 
 # 静态图推理
 export FLAGS_dynamic_static_unified_comm=false
@@ -96,5 +96,5 @@ python -m paddle.distributed.launch \
     --quant_type weight_only_int8 \
     --mode "static" \
     --inference_model \
-    --block_attn
+    --append_attn
 ```
diff --git a/llm/docs/predict/qwen.md b/llm/docs/predict/qwen.md
@@ -39,27 +39,27 @@ BF16推理
 
 ```shell
 # 动态图推理
-python ./predict/predictor.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1
+python ./predict/predictor.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1
 
 # 动转静导出模型
-python predict/export_model.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1
+python predict/export_model.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1
 
 # 静态图推理
-python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1
+python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1
 
 ```
 
 WINT8推理
 
 ```shell
 # 动态图推理
-python predict/predictor.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type weight_only_int8
+python predict/predictor.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type weight_only_int8
 
 # 动转静导出模型
-python predict/export_model.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type weight_only_int8
+python predict/export_model.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type weight_only_int8
 
 # 静态图推理
-python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type weight_only_int8
+python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type weight_only_int8
 ```
 
 下面量化推理所需要的模型需要根据[大模型量化教程](../quantization.md)产出，如 checkpoints/qwen_ptq_ckpts，或者使用所提供的预先量化好的模型，如 Qwen/Qwen2-1.5B-Instruct-A8W8C8。
@@ -68,36 +68,36 @@ INT8-A8W8推理
 
 ```shell
 # 动态图推理
-python predict/predictor.py --model_name_or_path checkpoints/qwen_ptq_ckpts --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8
+python predict/predictor.py --model_name_or_path checkpoints/qwen_ptq_ckpts --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type a8w8
 
 # 动转静导出模型
-python predict/export_model.py --model_name_or_path checkpoints/qwen_ptq_ckpts --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8
+python predict/export_model.py --model_name_or_path checkpoints/qwen_ptq_ckpts --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type a8w8
 
 # 静态图推理
-python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type a8w8
+python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type a8w8
 ```
 
 INT8-A8W8C8推理
 
 ```shell
 # 动态图推理
-python predict/predictor.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct-A8W8C8 --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
+python predict/predictor.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct-A8W8C8 --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type a8w8 --cachekv_int8_type static
 
 # 动转静导出模型
-python predict/export_model.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct-A8W8C8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
+python predict/export_model.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct-A8W8C8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type a8w8 --cachekv_int8_type static
 
 # 静态图推理
-python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
+python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type a8w8 --cachekv_int8_type static
 ```
 
 FP8-A8W8推理
 ```shell
 # 动态图推理
-python predict/predictor.py --model_name_or_path Qwen/Qwen2-7B-Instruct-A8W8-FP8 --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
+python predict/predictor.py --model_name_or_path Qwen/Qwen2-7B-Instruct-A8W8-FP8 --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type a8w8_fp8
 
 # 动转静导出模型
-python predict/export_model.py --model_name_or_path Qwen/Qwen2-7B-Instruct-A8W8-FP8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
+python predict/export_model.py --model_name_or_path Qwen/Qwen2-7B-Instruct-A8W8-FP8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type a8w8_fp8
 
 # 静态图推理
-python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
+python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type a8w8_fp8
 ```