
Add Retrieval based multi label classification #3656

Merged — 9 commits merged into PaddlePaddle:develop on Nov 16, 2022

Conversation

w5688414
Contributor

@w5688414 w5688414 commented Nov 2, 2022

PR types

  • New features

PR changes

  • APIs

Description

  • Add retrieval-based multi-label classification

@w5688414 w5688414 requested review from wawltor and lugimzzz November 2, 2022 16:26
@w5688414 w5688414 self-assigned this Nov 2, 2022

train.txt (training set file), dev.txt (development set file), and test.txt (optional, test set file); in each file the text and the label name are separated by a tab character `'\t'`. The training set is the data used to train the model. The development set is used to evaluate the model's performance; training parameters and the model can be tuned based on its accuracy on the development set. The test set measures final model performance; when no test set is available, the development set can be used instead.

- train.txt/test.txt file format:
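As an illustration, each line holds one text and one label name separated by a tab; the sample texts and label names below are hypothetical, and the small parser is only a sketch of how such a line would be read:

```python
# Hypothetical sample lines in the train.txt / test.txt format:
# one text and one label name per line, separated by a tab ('\t').
sample_lines = [
    "这家餐厅的牛肉面很好吃\t美食",
    "新款手机的电池续航明显提升\t数码",
]

def parse_line(line):
    """Split one dataset line into a (text, label) pair."""
    text, label = line.rstrip("\n").split("\t")
    return text, label

pairs = [parse_line(line) for line in sample_lines]
print(pairs[0][1])  # the label name of the first example
```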
Contributor

This format looks like it only supports a single label?

Contributor Author

@w5688414 w5688414 Nov 16, 2022

Since a text can have multiple labels, it needs to be split into multiple single-label examples, i.e. one text-label pair per line. A note has been added.
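The splitting the author describes can be sketched as follows (the sample text and label names are hypothetical):

```python
def expand_multi_label(text, labels):
    """Expand one multi-label example into several single-label
    dataset lines, one tab-separated text-label pair per line."""
    return ["{}\t{}".format(text, label) for label in labels]

# Hypothetical example: one text annotated with two label names
# becomes two lines in the train.txt format.
lines = expand_multi_label("新款手机的电池续航明显提升", ["数码", "科技"])
for line in lines:
    print(line)
```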

pandas==0.25.1
paddlenlp>=2.3.4
paddlepaddle-gpu>=2.3.0
hnswlib>=0.5.2
Contributor

paddlenlp should already depend on numpy and visualdl by default, and paddlepaddle-gpu can be installed in several different ways, so listing it directly in requirements.txt for installation via pip may not be appropriate.

Contributor Author

Removed.


Paddle Serving can be deployed in two ways: the first is the Pipeline approach and the second is the C++ approach. The usage of each is described below:

#### Pipeline approach
Contributor

The deployment documentation here doesn't feel clear enough; it probably needs to explain the logical relationship between Paddle Inference prediction, Paddle Serving deployment, and the Pipeline approach.

Contributor Author

Model deployment consists of several parts: dynamic-to-static export, the vector engine, Paddle Inference inference, and Paddle Serving service deployment. To speed up prediction, the trained model is usually converted to a static graph, after which Paddle Inference can run inference on it. The vector engine stores the labels in vector form so they can be retrieved quickly. In addition, a Paddle Inference model can be further turned into a service with Paddle Serving, so it can be called over HTTP or RPC. Paddle Serving supports two service forms, Pipeline and C++: Pipeline is more flexible and easier to modify, while C++ deployment is more involved to set up but more efficient.

A note has been added.


<a name="分类流程"></a>

## 9. Classification pipeline
Contributor

Does this classification pipeline belong to the service deployment section, or to something else?

Contributor Author

It ties the whole classification process together; think of it as an example that shows users how to invoke the entire pipeline once serving and the other services are deployed.

The following note has been added:
To demonstrate the retrieval-based text classification pipeline, we use the Python script below to run the whole flow. The classification system uses a client-server model: the model that extracts vectors is deployed on the server side, and a client is then started to query it and obtain the classification results.
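A minimal sketch of the client side of such a flow; the payload field names and response shape here are assumptions for illustration, not the actual PaddleNLP client API:

```python
import json

def build_request(texts):
    """Build the JSON payload a hypothetical vector-extraction
    service would receive: a batch of raw texts."""
    return json.dumps({"feed": [{"text": t} for t in texts]})

def parse_response(body):
    """Parse a hypothetical service response into a list of
    (label, score) pairs, one per retrieved label."""
    result = json.loads(body)
    return [(item["label"], item["score"]) for item in result["value"]]

payload = build_request(["新款手机的电池续航明显提升"])
# The payload would then be POSTed to the serving endpoint over HTTP,
# and the response parsed with parse_response().
print(payload)
```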

Contributor

So this is essentially an integration of the server-side deployment steps above?

Contributor Author

Text classification based on semantic indexing needs the vector engine and the serving model to extract vectors, so the modules are a bit more complex; giving an end-to-end invocation example here makes things clearer.

sh scripts/export_to_serving.sh
```

Paddle Serving can be deployed in two ways: the first is the Pipeline approach and the second is the C++ approach. The usage of each is described below:
Contributor

I don't see any description of the C++ approach.

Contributor Author

Fixed.

# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 1
# Compute device IDs. When devices is "" or omitted, prediction runs on CPU; when devices is "0" or "0,1,2", prediction runs on GPU, listing the GPU cards to use
devices: '2'
Contributor

Suggest defaulting to devices: '0'.

Contributor Author

Fixed.

import os
from milvus import MetricType, IndexType

MILVUS_HOST = '10.21.226.173'
Contributor

MILVUS_HOST should be set to a local address, and ideally be a configurable parameter.

Contributor Author

A note has been added to the README.md file.

Edit the IP and port settings in utils/config.py. This project uses port 8530, while Milvus defaults to 19530, so adjust as needed:

MILVUS_HOST='your milvus ip'
MILVUS_PORT = 8530
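One way to make these values configurable, as the reviewer suggests, is to read them from environment variables with a local default; this is only a sketch, and the environment variable names are assumptions:

```python
import os

# Milvus connection settings for utils/config.py.
# Default to a local Milvus instance and override via environment
# variables when deploying elsewhere. Port 8530 matches this
# project's setup (Milvus itself defaults to 19530).
MILVUS_HOST = os.environ.get("MILVUS_HOST", "127.0.0.1")
MILVUS_PORT = int(os.environ.get("MILVUS_PORT", "8530"))
```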

import os
import sys
from tqdm import tqdm
import numpy as np
Contributor

Import order: standard library > third-party > local; the import groups should be separated by blank lines.
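The convention reads, for example, as follows (the third-party and local lines are shown as comments purely to illustrate the grouping):

```python
# Standard library imports come first.
import os
import sys

# Then third-party imports, after one blank line, e.g.:
# from tqdm import tqdm
# import numpy as np

# Then local imports, after another blank line, e.g.:
# from utils.config import MILVUS_HOST, MILVUS_PORT

print(os.name, sys.platform)
```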

Contributor Author

Blank lines added.

# See the License for the specific language governing permissions and
# limitations under the License.

from milvus import *
Contributor

Import order: standard library > third-party > local; the import groups should be separated by blank lines.

Contributor Author

Blank lines added.

Contributor

@lugimzzz lugimzzz left a comment

LGTM

@w5688414 w5688414 merged commit 5af28bc into PaddlePaddle:develop Nov 16, 2022