Add Retrieval based multi label classification #3656
Conversation
train.txt (training set file), dev.txt (development set file), test.txt (optional, test set file); in each file the text and the label-category name are separated by a tab character `'\t'`. The training set is the data used to train the model. The development set is the data used to evaluate model performance; training parameters and the model can be tuned according to the model's accuracy on it. The test set is used to test model performance; when no test set is available, the development set can be used instead.
- train.txt/test.txt file format:
**Reviewer:** This format looks like it only has a single label per line?
**Author:** Since a text can carry multiple labels, it needs to be split into multiple single-label rows, i.e., multiple text-label lines, one pair per line. An explanation has been added.
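The splitting described in the reply above can be sketched in a few lines of Python (the sample text, labels, and function name below are made up for illustration):

```python
def split_multilabel(text, labels):
    """Turn one multi-labeled text into several single-label,
    tab-separated lines, as expected in train.txt/test.txt."""
    return ["{}\t{}".format(text, label) for label in labels]

# One text with two labels becomes two single-label lines.
for line in split_multilabel("这是一条示例文本", ["标签A", "标签B"]):
    print(line)
```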
```
pandas==0.25.1
paddlenlp>=2.3.4
paddlepaddle-gpu>=2.3.0
hnswlib>=0.5.2
```
**Reviewer:** paddlenlp should already depend on numpy and visualdl by default, and paddlepaddle-gpu can be installed in several different ways, so putting it directly in requirements.txt to be installed via pip may not be appropriate.
**Author:** Removed.
Paddle Serving can be deployed in two ways: the first is the Pipeline approach, the second is the C++ approach. Both are described below:
#### Pipeline approach
**Reviewer:** The deployment documentation feels unclear here; it may need to spell out the logical relationship between Paddle Inference prediction, Paddle Serving deployment, and the Pipeline approach.
**Author:** Model deployment consists of several parts: dynamic-to-static export, the vector engine, Paddle Inference, and Paddle Serving. To speed up prediction, the trained model is usually converted to a static graph, which can then be run with Paddle Inference. The vector engine stores the labels as vectors for fast retrieval. On top of Paddle Inference, the model can further be turned into a service with Paddle Serving, so that it can be called over HTTP or RPC. Paddle Serving supports two service forms, Pipeline and C++: Pipeline is more flexible and easier to modify, while the C++ deployment is more involved to set up but more efficient.
An explanation has been added.
<a name="分类流程"></a>
## 9. Classification pipeline
**Reviewer:** Does this classification pipeline belong to the service deployment section, or something else?
**Author:** It ties the whole classification flow together; once serving and the other services are deployed, it gives users an example of invoking the end-to-end flow.
The following explanation has been added:
To demonstrate the retrieval-based text classification flow, we use the Python script below to run the entire pipeline. The classification system uses a client-server mode: the model that extracts vectors is deployed on the server, and a client is started to query it and obtain the classification results.
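As a toy illustration of the retrieval idea described above (not the project's actual serving code): label names are stored as vectors, and a query vector is classified by nearest-neighbor search. Pure-Python cosine similarity stands in for the vector engine, and the label names and embeddings are made up:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "vector engine": label name -> embedding (made-up 3-d vectors).
label_vectors = {
    "sports": [0.9, 0.1, 0.0],
    "finance": [0.1, 0.9, 0.2],
}

def classify(query_vec, top_k=1):
    """Return the top_k labels whose vectors are closest to the query."""
    scored = sorted(label_vectors.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [label for label, _ in scored[:top_k]]

print(classify([0.85, 0.2, 0.05]))  # → ['sports']
```

In the real system the query embedding comes from the served model and the nearest-neighbor search is handled by the vector engine (e.g. hnswlib or Milvus) instead of a linear scan.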
**Reviewer:** Is this essentially an integration of the server-side deployment steps above?
**Author:** Retrieval-based text classification needs the vector engine and the serving model to extract vectors, so the modules are a bit more involved; providing a call example here makes things clearer.
sh scripts/export_to_serving.sh
```
Paddle Serving can be deployed in two ways: the first is the Pipeline approach, the second is the C++ approach. Both are described below:
**Reviewer:** I don't see a description of the C++ approach.
**Author:** Fixed.
```
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 1
# Hardware IDs. When devices is "" or omitted, prediction runs on CPU;
# when devices is "0" or "0,1,2", prediction runs on the listed GPU cards
devices: '2'
```
**Reviewer:** Suggest defaulting to devices: '0'.
**Author:** Fixed.
```
import os
from milvus import MetricType, IndexType

MILVUS_HOST = '10.21.226.173'
```
**Reviewer:** MILVUS_HOST should be set to a local address, and ideally be a configurable parameter.
**Author:** An explanation has been added in README.md:
Edit the IP and port configured in utils/config.py. This project uses port 8530, while Milvus defaults to 19530; adjust as needed:
```
MILVUS_HOST='your milvus ip'
MILVUS_PORT = 8530
```
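To make the host configurable rather than hard-coded, utils/config.py could read the connection settings from environment variables. This is a minimal sketch; the environment-variable names and defaults are assumptions, not the project's actual code:

```python
import os

# Milvus connection settings, overridable via environment variables so no
# machine-specific IP is hard-coded (variable names and defaults assumed).
MILVUS_HOST = os.environ.get("MILVUS_HOST", "127.0.0.1")
MILVUS_PORT = int(os.environ.get("MILVUS_PORT", "8530"))
```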
```
import os
import sys
from tqdm import tqdm
import numpy as np
```
**Reviewer:** Import order should be: standard library > third-party > local, with a blank line between each group.
**Author:** Blank lines added.
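As a sketch of the convention the reviewer is pointing at (PEP 8 import grouping), the imports from the snippet above would be arranged like this; the local module name is hypothetical:

```python
# Standard library imports first
import os
import sys

# Third-party imports next, separated by a blank line
import numpy as np
from tqdm import tqdm

# Local imports last (hypothetical module name)
# from utils import config
```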
```
# See the License for the specific language governing permissions and
# limitations under the License.

from milvus import *
```
**Reviewer:** Import order should be: standard library > third-party > local, with a blank line between each group.
**Author:** Blank lines added.
**Reviewer:** LGTM