Add Retrieval based multi label classification #3656
Conversation
train.txt (training set file), dev.txt (development set file), test.txt (optional, test set file); in each file the text and the label-category name are separated by a tab character `'\t'`. The training set is the data used to train the model. The development set is the data used to evaluate model performance; training parameters and the model can be tuned according to the model's accuracy on it. The test set is used to test model performance; when no test set is available, the development set can be used instead.
- train.txt/test.txt file format:
**Reviewer:** This format looks like it only has a single label per line?
**Author:** Since a text can carry multiple labels, it needs to be split into multiple single-label rows, i.e., multiple text-label lines, one pair per line. An explanation has been added.
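The splitting described in the reply above can be sketched in a few lines of Python (the sample text, labels, and function name below are made up for illustration):

```python
def split_multilabel(text, labels):
    """Turn one multi-labeled text into several single-label,
    tab-separated lines, as expected in train.txt/test.txt."""
    return ["{}\t{}".format(text, label) for label in labels]

# One text with two labels becomes two single-label lines.
for line in split_multilabel("这是一条示例文本", ["标签A", "标签B"]):
    print(line)
```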
```
pandas==0.25.1
paddlenlp>=2.3.4
paddlepaddle-gpu>=2.3.0
hnswlib>=0.5.2
```
**Reviewer:** paddlenlp should already depend on numpy and visualdl by default, and paddlepaddle-gpu can be installed in several different ways, so putting it directly in requirements.txt to be installed via pip may not be appropriate.
**Author:** Removed.
Paddle Serving can be deployed in two ways: the first is the Pipeline approach, the second is the C++ approach. Both are described below:
#### Pipeline approach
**Reviewer:** The deployment documentation feels unclear here; it may need to spell out the logical relationship between Paddle Inference prediction, Paddle Serving deployment, and the Pipeline approach.
**Author:** Model deployment consists of several parts: dynamic-to-static export, the vector engine, Paddle Inference, and Paddle Serving. To speed up prediction, the trained model is usually converted to a static graph, which can then be run with Paddle Inference. The vector engine stores the labels as vectors for fast retrieval. On top of Paddle Inference, the model can further be turned into a service with Paddle Serving, so that it can be called over HTTP or RPC. Paddle Serving supports two service forms, Pipeline and C++: Pipeline is more flexible and easier to modify, while the C++ deployment is more involved to set up but more efficient.
An explanation has been added.
<a name="分类流程"></a>
## 9. Classification pipeline
**Reviewer:** Does this classification pipeline belong to the service deployment section, or something else?
**Author:** It ties the whole classification flow together; once serving and the other services are deployed, it gives users an example of invoking the end-to-end flow.
The following explanation has been added:
To demonstrate the retrieval-based text classification flow, we use the Python script below to run the entire pipeline. The classification system uses a client-server mode: the model that extracts vectors is deployed on the server, and a client is started to query it and obtain the classification results.
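As a toy illustration of the retrieval idea described above (not the project's actual serving code): label names are stored as vectors, and a query vector is classified by nearest-neighbor search. Pure-Python cosine similarity stands in for the vector engine, and the label names and embeddings are made up:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "vector engine": label name -> embedding (made-up 3-d vectors).
label_vectors = {
    "sports": [0.9, 0.1, 0.0],
    "finance": [0.1, 0.9, 0.2],
}

def classify(query_vec, top_k=1):
    """Return the top_k labels whose vectors are closest to the query."""
    scored = sorted(label_vectors.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [label for label, _ in scored[:top_k]]

print(classify([0.85, 0.2, 0.05]))  # → ['sports']
```

In the real system the query embedding comes from the served model and the nearest-neighbor search is handled by the vector engine (e.g. hnswlib or Milvus) instead of a linear scan.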
**Reviewer:** Is this essentially an integration of the server-side deployment steps above?
**Author:** Retrieval-based text classification needs the vector engine and the serving model to extract vectors, so the modules are a bit more involved; providing a call example here makes things clearer.
sh scripts/export_to_serving.sh
```
Paddle Serving can be deployed in two ways: the first is the Pipeline approach, the second is the C++ approach. Both are described below:
**Reviewer:** I don't see a description of the C++ approach.
**Author:** Fixed.
```
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 1
# Hardware IDs. When devices is "" or omitted, prediction runs on CPU;
# when devices is "0" or "0,1,2", prediction runs on the listed GPU cards
devices: '2'
```
**Reviewer:** Suggest defaulting to devices: '0'.
**Author:** Fixed.
```
import os
from milvus import MetricType, IndexType

MILVUS_HOST = '10.21.226.173'
```
**Reviewer:** MILVUS_HOST should be set to a local address, and ideally be a configurable parameter.
**Author:** An explanation has been added in README.md:
Edit the IP and port configured in utils/config.py. This project uses port 8530, while Milvus defaults to 19530; adjust as needed:
```
MILVUS_HOST='your milvus ip'
MILVUS_PORT = 8530
```
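To make the host configurable rather than hard-coded, utils/config.py could read the connection settings from environment variables. This is a minimal sketch; the environment-variable names and defaults are assumptions, not the project's actual code:

```python
import os

# Milvus connection settings, overridable via environment variables so no
# machine-specific IP is hard-coded (variable names and defaults assumed).
MILVUS_HOST = os.environ.get("MILVUS_HOST", "127.0.0.1")
MILVUS_PORT = int(os.environ.get("MILVUS_PORT", "8530"))
```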
```
import os
import sys
from tqdm import tqdm
import numpy as np
```
**Reviewer:** Import order should be: standard library > third-party > local, with a blank line between each group.
**Author:** Blank lines added.
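As a sketch of the convention the reviewer is pointing at (PEP 8 import grouping), the imports from the snippet above would be arranged like this; the local module name is hypothetical:

```python
# Standard library imports first
import os
import sys

# Third-party imports next, separated by a blank line
import numpy as np
from tqdm import tqdm

# Local imports last (hypothetical module name)
# from utils import config
```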
```
# See the License for the specific language governing permissions and
# limitations under the License.

from milvus import *
```
**Reviewer:** Import order should be: standard library > third-party > local, with a blank line between each group.
**Author:** Blank lines added.
**Reviewer:** LGTM