PaddlePaddle · jiangjiajun · Jan 4, 2023 · Jan 3, 2023 · Jan 3, 2023 · Jan 3, 2023
diff --git a/docs/cn/faq/heterogeneous_computing_on_timvx_npu.md b/docs/cn/faq/heterogeneous_computing_on_timvx_npu.md
@@ -0,0 +1,36 @@
+[English](../../en/faq/heterogeneous_computing_on_timvx_npu.md) | 中文
+
+# 在芯原系列 NPU 上实现异构计算
+在芯原系列 NPU 上，例如 RV1126 或者 A311D 上部署全量化模型时，有可能会有精度下降的问题，那么就需要在 NPU 和 ARM CPU 上进行异构计算，FastDeploy 中的异构计算是通过 subgraph.txt 配置文件来完成的，如果在更换全量化模型后，发现精度有较大的下降，可以参考本文档来定义异构配置文件。
+
+异构配置文件的更新步骤：
+1. 确定模型量化后在 ARM CPU 上的精度。
+- 如果在 ARM CPU 上，精度都无法满足，那量化本身就是失败的，此时可以考虑修改训练集或者更改量化方法。
+- 只需要修改 demo 中的代码，将 NPU 推理的部分改为使用 ARM CPU int8 推理，便可实现使用ARM CPU进行计算
+    ```cpp
+    // 方式一：使用 NPU 进行推理
+    fastdeploy::RuntimeOption option;
+    option.UseTimVX(); // 开启 TIMVX 进行 NPU 推理
+    option.SetLiteSubgraphPartitionPath(subgraph_file); // 加载异构计算配置文件
+
+    // 方式二：使用 ARM CPU int8 推理（替换上面方式一的代码）
+    fastdeploy::RuntimeOption option;
+    option.UseLiteBackend();
+    option.EnableLiteInt8();
+    ```
+    如果 ARM CPU 计算结果精度达标，则继续下面的步骤。
+
+2. 获取整网拓扑信息。
+- 回退第一步中的修改，使用 NPU 进行推理的 API 接口，加载异构计算配置文件的开关保持关闭。
+- 将所有的日志信息写入到 log.txt中，在 log.txt 中搜索关键字 "subgraph operators" 随后的一段便是整个模型的拓扑信息
+- 它的格式如下：
+    - 每行记录由 “算子类型:输入张量名列表:输出张量名列表” 组成（即以冒号分隔算子类型、输入和输出张量名列表），以逗号分隔输入、输出张量名列表中的每个张量名；
+    - 示例说明：
+    ```
+        op_type0:var_name0,var_name1:var_name2 # 表示将算子类型为 op_type0、输入张量为var_name0 和 var_name1、输出张量为 var_name2 的节点强制运行在 ARM CPU 上
+    ```
+
+3. 修改异构配置文件
+- 将所有的 Subgraph operators 写到 subgraph.txt 中，并打开加载异构计算配置文件的接口
+- 逐行删除、成片删除、二分法，发挥开发人员的耐心，找到引入 NPU 精度异常的 layer，将其留在 subgraph.txt 中
+- 在 txt 中的结点都是需要异构到 ARM CPU 上的 layer，不用特别担心性能问题，Paddle Lite 的 ARM kernel 性能也是非常卓越的
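As an illustration of the record format described in step 2 (`op_type:inputs:outputs`, with commas between tensor names), the following is a minimal C++ sketch that splits one such line into its three fields. It sits outside the patch itself. `SubgraphRecord`, `SplitNames`, and `ParseRecord` are hypothetical helper names, not FastDeploy or Paddle Lite APIs, and the sketch assumes that tensor names contain no colons or commas and that the line carries no trailing comment.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for one line of subgraph.txt.
struct SubgraphRecord {
  std::string op_type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// Split a comma-separated list of tensor names, skipping empty pieces.
static std::vector<std::string> SplitNames(const std::string& text) {
  std::vector<std::string> names;
  std::stringstream stream(text);
  std::string name;
  while (std::getline(stream, name, ',')) {
    if (!name.empty()) names.push_back(name);
  }
  return names;
}

// Parse "op_type:in0,in1:out0". Returns false (and leaves *out untouched)
// unless the line has exactly three colon-separated fields and a
// non-empty operator type.
static bool ParseRecord(const std::string& line, SubgraphRecord* out) {
  const std::size_t first = line.find(':');
  if (first == std::string::npos) return false;
  const std::size_t second = line.find(':', first + 1);
  if (second == std::string::npos) return false;
  if (line.find(':', second + 1) != std::string::npos) return false;

  SubgraphRecord record;
  record.op_type = line.substr(0, first);
  if (record.op_type.empty()) return false;
  record.inputs = SplitNames(line.substr(first + 1, second - first - 1));
  record.outputs = SplitNames(line.substr(second + 1));
  *out = record;
  return true;
}
```

A helper like this can be used to sanity-check a hand-edited subgraph.txt before copying it to the board, for example by rejecting lines that do not parse.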
diff --git a/docs/en/faq/heterogeneous_computing_on_timvx_npu.md b/docs/en/faq/heterogeneous_computing_on_timvx_npu.md
@@ -0,0 +1,36 @@
+English | [中文](../../cn/faq/heterogeneous_computing_on_timvx_npu.md)
+
+# Heterogeneous Computing on VeriSilicon Series NPUs
+When you deploy a fully quantized model on a chip with a VeriSilicon NPU, such as RV1126 or A311D, accuracy may drop. In that case, run the model heterogeneously, with part of it on the NPU and part on the ARM CPU. FastDeploy configures heterogeneous computing through a subgraph.txt file. If accuracy drops significantly after you switch to a new fully quantized model, follow this document to define that configuration file.
+
+Steps for updating the heterogeneous configuration file:
+1. Measure the accuracy of the quantized model on the ARM CPU.
+- If the accuracy is unacceptable even on the ARM CPU, the quantization itself has failed. In that case, consider modifying the dataset or changing the quantization method.
+- To run on the ARM CPU, change only a few lines in the demo: replace the NPU inference options with the ARM CPU int8 options.
+    ```cpp
+    // Option A: run inference on the NPU.
+    fastdeploy::RuntimeOption option;
+    option.UseTimVX(); // Enable TIM-VX for NPU inference
+    option.SetLiteSubgraphPartitionPath(subgraph_file); // Load the heterogeneous computing configuration file
+
+    // Option B: run int8 inference on the ARM CPU (use this instead of Option A).
+    fastdeploy::RuntimeOption option;
+    option.UseLiteBackend();
+    option.EnableLiteInt8();
+    ```
+    If the accuracy on the ARM CPU meets your requirements, continue with the following steps.
+
+2. Obtain the topology of the entire network.
+- Revert the change from step 1 so that the NPU inference API is used again, but keep the loading of the heterogeneous computing configuration file turned off.
+- Write all log output to log.txt and search it for the keyword "subgraph operators". The block that follows is the topology of the entire model.
+- The topology has the following format:
+    - Each record occupies one line in the form "operator type:input tensor names:output tensor names". Colons separate the three fields, and commas separate the tensor names within each list.
+    - Example:
+    ```
+        op_type0:var_name0,var_name1:var_name2 # Forces the node with operator type op_type0, input tensors var_name0 and var_name1, and output tensor var_name2 to run on the ARM CPU
+    ```
+
+3. Modify the heterogeneous configuration file.
+- Copy all of the subgraph operators into subgraph.txt and enable the call that loads the heterogeneous computing configuration file.
+- Remove entries line by line, in blocks, or by bisection. This takes patience. Once you find the layers that cause the NPU accuracy loss, keep only those in subgraph.txt.
+- Every node listed in subgraph.txt is a layer that runs on the ARM CPU instead of the NPU. Performance need not be a major worry, because the ARM kernels in Paddle Lite also perform very well.
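The search in step 3 can be sketched in C++ as a chunked reduction: try dropping a block of entries from the CPU list, keep the block out if accuracy still holds, and otherwise retry with smaller blocks. This sits outside the patch and is only one possible way to organize the manual process. `ShrinkCpuList` and the `accuracy_ok` callback are hypothetical names. In practice the callback would write the candidate entries to subgraph.txt, rerun the demo on the board, and compare the results against the ARM CPU baseline. The sketch also assumes that moving more operators to the ARM CPU never makes accuracy worse.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Callback type: returns true when accuracy is acceptable with exactly
// these subgraph.txt lines forced onto the ARM CPU (hypothetical hook).
using AccuracyCheck = std::function<bool(const std::vector<std::string>&)>;

// Shrink the list of CPU-forced entries by removing chunks whose removal
// keeps accuracy acceptable, halving the chunk size after each pass.
static std::vector<std::string> ShrinkCpuList(std::vector<std::string> kept,
                                              const AccuracyCheck& accuracy_ok) {
  // Precondition from step 1: everything on the ARM CPU is accurate enough.
  assert(accuracy_ok(kept));

  std::size_t chunk = (kept.size() + 1) / 2;
  while (chunk >= 1 && !kept.empty()) {
    std::size_t start = 0;
    while (start < kept.size()) {
      const std::size_t end = std::min(start + chunk, kept.size());
      // Candidate list = kept entries without the range [start, end).
      std::vector<std::string> candidate(kept.begin(), kept.begin() + start);
      candidate.insert(candidate.end(), kept.begin() + end, kept.end());
      if (accuracy_ok(candidate)) {
        kept = candidate;  // The chunk can go back to the NPU.
      } else {
        start = end;       // The chunk holds at least one sensitive layer.
      }
    }
    if (chunk == 1) break;
    chunk = (chunk + 1) / 2;
  }
  return kept;
}
```

Each call to the callback costs one full on-device run, so starting with large chunks keeps the number of runs low when only a few layers are sensitive.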
diff --git a/examples/vision/classification/paddleclas/a311d/cpp/README.md b/examples/vision/classification/paddleclas/a311d/cpp/README.md
@@ -3,12 +3,13 @@
 
 ## 部署准备
 ### FastDeploy 交叉编译环境准备
-- 1. 软硬件环境满足要求，以及交叉编译环境的准备，请参考：[FastDeploy 交叉编译环境准备](../../../../../../docs/cn/build_and_install/a311d.md#交叉编译环境搭建)  
+1. 软硬件环境满足要求，以及交叉编译环境的准备，请参考：[FastDeploy 交叉编译环境准备](../../../../../../docs/cn/build_and_install/a311d.md#交叉编译环境搭建)  
 
 ### 量化模型准备
-- 1. 用户可以直接使用由 FastDeploy 提供的量化模型进行部署。
-- 2. 用户可以使用 FastDeploy 提供的[一键模型自动化压缩工具](../../../../../../tools/common_tools/auto_compression/)，自行进行模型量化, 并使用产出的量化模型进行部署。(注意: 推理量化后的分类模型仍然需要FP32模型文件夹下的inference_cls.yaml文件, 自行量化的模型文件夹内不包含此 yaml 文件, 用户从 FP32 模型文件夹下复制此 yaml 文件到量化后的模型文件夹内即可.)
-- 更多量化相关相关信息可查阅[模型量化](../../quantize/README.md)
+1. 用户可以直接使用由 FastDeploy 提供的量化模型进行部署。
+2. 用户可以使用 FastDeploy 提供的[一键模型自动化压缩工具](../../../../../../tools/common_tools/auto_compression/)，自行进行模型量化, 并使用产出的量化模型进行部署。(注意: 推理量化后的分类模型仍然需要FP32模型文件夹下的inference_cls.yaml文件, 自行量化的模型文件夹内不包含此 yaml 文件, 用户从 FP32 模型文件夹下复制此 yaml 文件到量化后的模型文件夹内即可.)
+
+更多量化相关信息可查阅[模型量化](../../quantize/README.md)
 
 ## 在 A311D 上部署量化后的 ResNet50_Vd 分类模型
 请按照以下步骤完成在 A311D 上部署 ResNet50_Vd 量化模型：

diff --git a/examples/vision/classification/paddleclas/rv1126/cpp/README.md b/examples/vision/classification/paddleclas/rv1126/cpp/README.md
@@ -3,12 +3,13 @@
 
 ## 部署准备
 ### FastDeploy 交叉编译环境准备
-- 1. 软硬件环境满足要求，以及交叉编译环境的准备，请参考：[FastDeploy 交叉编译环境准备](../../../../../../docs/cn/build_and_install/rv1126.md#交叉编译环境搭建)  
+1. 软硬件环境满足要求，以及交叉编译环境的准备，请参考：[FastDeploy 交叉编译环境准备](../../../../../../docs/cn/build_and_install/rv1126.md#交叉编译环境搭建)  
 
 ### 量化模型准备
-- 1. 用户可以直接使用由 FastDeploy 提供的量化模型进行部署。
-- 2. 用户可以使用 FastDeploy 提供的[一键模型自动化压缩工具](../../../../../../tools/common_tools/auto_compression/)，自行进行模型量化, 并使用产出的量化模型进行部署。(注意: 推理量化后的分类模型仍然需要FP32模型文件夹下的inference_cls.yaml文件, 自行量化的模型文件夹内不包含此 yaml 文件, 用户从 FP32 模型文件夹下复制此 yaml 文件到量化后的模型文件夹内即可.)
-- 更多量化相关相关信息可查阅[模型量化](../../quantize/README.md)
+1. 用户可以直接使用由 FastDeploy 提供的量化模型进行部署。
+2. 用户可以使用 FastDeploy 提供的[一键模型自动化压缩工具](../../../../../../tools/common_tools/auto_compression/)，自行进行模型量化, 并使用产出的量化模型进行部署。(注意: 推理量化后的分类模型仍然需要FP32模型文件夹下的inference_cls.yaml文件, 自行量化的模型文件夹内不包含此 yaml 文件, 用户从 FP32 模型文件夹下复制此 yaml 文件到量化后的模型文件夹内即可.)
+
+更多量化相关信息可查阅[模型量化](../../quantize/README.md)
 
 ## 在 RV1126 上部署量化后的 ResNet50_Vd 分类模型
 请按照以下步骤完成在 RV1126 上部署 ResNet50_Vd 量化模型：

diff --git a/examples/vision/detection/paddledetection/a311d/cpp/README.md b/examples/vision/detection/paddledetection/a311d/cpp/README.md
@@ -4,13 +4,15 @@
 
 ## 部署准备
 ### FastDeploy 交叉编译环境准备
-- 1. 软硬件环境满足要求，以及交叉编译环境的准备，请参考：[FastDeploy 交叉编译环境准备](../../../../../../docs/cn/build_and_install/a311d.md#交叉编译环境搭建)  
+1. 软硬件环境满足要求，以及交叉编译环境的准备，请参考：[FastDeploy 交叉编译环境准备](../../../../../../docs/cn/build_and_install/a311d.md#交叉编译环境搭建)  
 
 ### 模型准备
-- 1. 用户可以直接使用由 FastDeploy 提供的量化模型进行部署。
-- 2. 用户可以先使用 PaddleDetection 自行导出 Float32 模型，注意导出模型模型时设置参数：use_shared_conv=False，更多细节请参考：[PP-YOLOE](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.4/configs/ppyoloe)
-- 3. 用户可以使用 FastDeploy 提供的[一键模型自动化压缩工具](../../../../../../tools/common_tools/auto_compression/)，自行进行模型量化, 并使用产出的量化模型进行部署。（注意: 推理量化后的检测模型仍然需要FP32模型文件夹下的 infer_cfg.yml 文件，自行量化的模型文件夹内不包含此 yaml 文件，用户从 FP32 模型文件夹下复制此yaml文件到量化后的模型文件夹内即可。）
-- 更多量化相关相关信息可查阅[模型量化](../../quantize/README.md)
+1. 用户可以直接使用由 FastDeploy 提供的量化模型进行部署。
+2. 用户可以先使用 PaddleDetection 自行导出 Float32 模型，注意导出模型时设置参数：use_shared_conv=False，更多细节请参考：[PP-YOLOE](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.4/configs/ppyoloe)
+3. 用户可以使用 FastDeploy 提供的[一键模型自动化压缩工具](../../../../../../tools/common_tools/auto_compression/)，自行进行模型量化, 并使用产出的量化模型进行部署。（注意: 推理量化后的检测模型仍然需要FP32模型文件夹下的 infer_cfg.yml 文件，自行量化的模型文件夹内不包含此 yaml 文件，用户从 FP32 模型文件夹下复制此yaml文件到量化后的模型文件夹内即可。）
+4. 模型需要异构计算，异构计算文件可以参考：[异构计算](./../../../../../../docs/cn/faq/heterogeneous_computing_on_timvx_npu.md)，由于 FastDeploy 已经提供了模型，可以先测试我们提供的异构文件，验证精度是否符合要求。
+
+更多量化相关信息可查阅[模型量化](../../quantize/README.md)
 
 ## 在 A311D 上部署量化后的 PP-YOLOE  检测模型
 请按照以下步骤完成在 A311D 上部署 PP-YOLOE  量化模型：

diff --git a/examples/vision/detection/paddledetection/rv1126/cpp/README.md b/examples/vision/detection/paddledetection/rv1126/cpp/README.md
@@ -4,13 +4,15 @@
 
 ## 部署准备
 ### FastDeploy 交叉编译环境准备
-- 1. 软硬件环境满足要求，以及交叉编译环境的准备，请参考：[FastDeploy 交叉编译环境准备](../../../../../../docs/cn/build_and_install/rv1126.md#交叉编译环境搭建)  
+1. 软硬件环境满足要求，以及交叉编译环境的准备，请参考：[FastDeploy 交叉编译环境准备](../../../../../../docs/cn/build_and_install/rv1126.md#交叉编译环境搭建)  
 
 ### 模型准备
-- 1. 用户可以直接使用由 FastDeploy 提供的量化模型进行部署。
-- 2. 用户可以先使用 PaddleDetection 自行导出 Float32 模型，注意导出模型模型时设置参数：use_shared_conv=False，更多细节请参考：[PP-YOLOE](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.4/configs/ppyoloe)
-- 3. 用户可以使用 FastDeploy 提供的[一键模型自动化压缩工具](../../../../../../tools/common_tools/auto_compression/)，自行进行模型量化, 并使用产出的量化模型进行部署。（注意: 推理量化后的检测模型仍然需要FP32模型文件夹下的 infer_cfg.yml 文件，自行量化的模型文件夹内不包含此 yaml 文件，用户从 FP32 模型文件夹下复制此yaml文件到量化后的模型文件夹内即可。）
-- 更多量化相关相关信息可查阅[模型量化](../../quantize/README.md)
+1. 用户可以直接使用由 FastDeploy 提供的量化模型进行部署。
+2. 用户可以先使用 PaddleDetection 自行导出 Float32 模型，注意导出模型时设置参数：use_shared_conv=False，更多细节请参考：[PP-YOLOE](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.4/configs/ppyoloe)
+3. 用户可以使用 FastDeploy 提供的[一键模型自动化压缩工具](../../../../../../tools/common_tools/auto_compression/)，自行进行模型量化, 并使用产出的量化模型进行部署。（注意: 推理量化后的检测模型仍然需要FP32模型文件夹下的 infer_cfg.yml 文件，自行量化的模型文件夹内不包含此 yaml 文件，用户从 FP32 模型文件夹下复制此yaml文件到量化后的模型文件夹内即可。）
+4. 模型需要异构计算，异构计算文件可以参考：[异构计算](./../../../../../../docs/cn/faq/heterogeneous_computing_on_timvx_npu.md)，由于 FastDeploy 已经提供了模型，可以先测试我们提供的异构文件，验证精度是否符合要求。
+
+更多量化相关信息可查阅[模型量化](../../quantize/README.md)
 
 ## 在 RV1126 上部署量化后的 PP-YOLOE  检测模型
 请按照以下步骤完成在 RV1126 上部署 PP-YOLOE  量化模型：

diff --git a/examples/vision/detection/yolov5/a311d/README.md b/examples/vision/detection/yolov5/a311d/README.md
@@ -1,8 +1,5 @@
 # YOLOv5 量化模型在 A311D 上的部署
-目前 FastDeploy 已经支持基于 Paddle Lite 部署 YOLOv5 量化模型到 A311D 上。
-
-模型的量化和量化模型的下载请参考：[模型量化](../quantize/README.md)
-
+目前 FastDeploy 已经支持基于 Paddle Lite 部署 [YOLOv5](https://github.com/ultralytics/yolov5/releases/tag/v6.1) 量化模型到 A311D 上。
 
 ## 详细部署文档
 

diff --git a/examples/vision/detection/yolov5/a311d/cpp/README.md b/examples/vision/detection/yolov5/a311d/cpp/README.md
@@ -4,12 +4,31 @@
 
 ## 部署准备
 ### FastDeploy 交叉编译环境准备
-- 1. 软硬件环境满足要求，以及交叉编译环境的准备，请参考：[FastDeploy 交叉编译环境准备](../../../../../../docs/cn/build_and_install/a311d.md#交叉编译环境搭建)  
+1. 软硬件环境满足要求，以及交叉编译环境的准备，请参考：[FastDeploy 交叉编译环境准备](../../../../../../docs/cn/build_and_install/a311d.md#交叉编译环境搭建)  
 
 ### 量化模型准备
-- 1. 用户可以直接使用由 FastDeploy 提供的量化模型进行部署。
-- 2. 用户可以使用 FastDeploy 提供的[一键模型自动化压缩工具](../../../../../../tools/common_tools/auto_compression/),自行进行模型量化, 并使用产出的量化模型进行部署。
-- 更多量化相关相关信息可查阅[模型量化](../../quantize/README.md)
+可以直接使用由 FastDeploy 提供的量化模型进行部署，也可以按照如下步骤准备量化模型：
+1. 按照 [YOLOv5](https://github.com/ultralytics/yolov5/releases/tag/v6.1) 官方导出方式导出 ONNX 模型，或者直接使用如下命令下载
+```bash
+wget https://paddle-slim-models.bj.bcebos.com/act/yolov5s.onnx
+```
+2. 准备 300 张左右量化用的图片，也可以使用如下命令下载我们准备好的数据。
+```bash
+wget https://bj.bcebos.com/fastdeploy/models/COCO_val_320.tar.gz
+tar -xf COCO_val_320.tar.gz
+```
+3. 使用 FastDeploy 提供的[一键模型自动化压缩工具](../../../../../../tools/common_tools/auto_compression/),自行进行模型量化, 并使用产出的量化模型进行部署。
+```bash
+fastdeploy compress --config_path=./configs/detection/yolov5s_quant.yaml --method='PTQ' --save_dir='./yolov5s_ptq_model_new/'
+```
+4. YOLOv5 模型需要异构计算，异构计算文件可以参考：[异构计算](./../../../../../../docs/cn/faq/heterogeneous_computing_on_timvx_npu.md)，由于 FastDeploy 已经提供了 YOLOv5 模型，可以先测试我们提供的异构文件，验证精度是否符合要求。
+```bash
+# 先下载我们提供的模型，解压后将其中的 subgraph.txt 文件拷贝到新量化的模型目录中
+wget https://bj.bcebos.com/fastdeploy/models/yolov5s_ptq_model.tar.gz
+tar -xvf yolov5s_ptq_model.tar.gz
+```
+
+更多量化相关信息可查阅[模型量化](../../quantize/README.md)
 
 ## 在 A311D 上部署量化后的 YOLOv5 检测模型
 请按照以下步骤完成在 A311D 上部署 YOLOv5 量化模型：

diff --git a/examples/vision/detection/yolov5/rv1126/README.md b/examples/vision/detection/yolov5/rv1126/README.md
@@ -1,8 +1,5 @@
 # YOLOv5 量化模型在 RV1126 上的部署
-目前 FastDeploy 已经支持基于 Paddle Lite 部署 YOLOv5 量化模型到 RV1126 上。
-
-模型的量化和量化模型的下载请参考：[模型量化](../quantize/README.md)
-
+目前 FastDeploy 已经支持基于 Paddle Lite 部署 [YOLOv5](https://github.com/ultralytics/yolov5/releases/tag/v6.1) 量化模型到 RV1126 上。
 
 ## 详细部署文档