-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the README description of Pipelines & Neural Search #3353
Merged
Merged
Changes from 5 commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
096677b
Fix the README description
w5688414 ac0fd90
Update Pipelines README.md
w5688414 b868be6
Merge branch 'develop' into pip25
w5688414 45f67dc
Update Docker README.md
w5688414 a6ad85a
Merge branch 'pip25' of https://github.com/w5688414/PaddleNLP into pip25
w5688414 c5f17d7
Add more details for ranking model
w5688414 0e0d10a
Merge branch 'develop' into pip25
w5688414 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -51,7 +51,7 @@ | |
<img src="https://user-images.githubusercontent.com/12107462/191469309-42a54a67-a3a3-4e43-b6b1-b12be81ddf3d.png" width="800px"> | ||
</div> | ||
|
||
以上是nerual_search的系统流程图,其中左侧为召回环节,核心是语义向量抽取模块;右侧是排序环节,核心是排序模型。召回环节需要用户通过自己的语料构建向量索引库,用户发起query了之后,就可以检索出最近的向量,然后找出向量对应的文本;排序环节主要是对召回的文本进行重新排序。下面我们分别介绍召回中的语义向量抽取模块,以及排序模型。 | ||
以上是nerual_search的系统流程图,其中左侧为召回环节,核心是语义向量抽取模块;右侧是排序环节,核心是排序模型。召回环节需要用户通过自己的语料构建向量索引库,用户发起query了之后,就可以检索出相似度最近的向量,然后找出该向量对应的文本;排序环节主要是对召回的文本进行重新排序。下面我们分别介绍召回中的语义向量抽取模块,以及排序模型。 | ||
|
||
|
||
#### 2.2.2 召回模块 | ||
|
@@ -74,7 +74,7 @@ | |
|
||
#### 2.2.3 排序模块 | ||
|
||
排序模块基于前沿的预训练模型 ERNIE-Gram,训练 Pair-wise 语义匹配模型。召回模型负责从海量(千万级)候选文本中快速(毫秒级)筛选出与 Query 相关性较高的 TopK Doc,排序模型会在召回模型筛选出的 TopK Doc 结果基础之上针对每一个 (Query, Doc) Pair 对进行两两匹配计算相关性,排序效果更精准。 | ||
排序模块有2种选择,第一种基于前沿的预训练模型 ERNIE-Gram,训练 Pair-wise 语义匹配模型;第二种是基于RocketQA模型训练的Cross Encoder模型,用户可以根据自身数据情况进行实验选择最优的方式。召回模型负责从海量(千万级)候选文本中快速(毫秒级)筛选出与 Query 相关性较高的 TopK Doc,排序模型会在召回模型筛选出的 TopK Doc 结果基础之上针对每一个 (Query, Doc) Pair 对进行两两匹配计算相关性,排序效果更精准。 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 这里是不是说明,如何选择两种方式? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 已经修改 |
||
|
||
## 3. 文献检索实践 | ||
|
||
|
@@ -86,7 +86,7 @@ | |
|
||
首先是利用 ERNIE模型进行 Domain-adaptive Pretraining,在得到的预训练模型基础上,进行无监督的 SimCSE 训练,最后利用 In-batch Negatives 方法进行微调,得到最终的语义索引模型,把建库的文本放入模型中抽取特征向量,然后把抽取后的向量放到语义索引引擎 milvus 中,利用 milvus 就可以很方便得实现召回了。 | ||
|
||
**排序**:使用 ERNIE-Gram 的单塔结构对召回后的数据精排序。 | ||
**排序**:使用 ERNIE-Gram 的单塔结构/RocketQA的Cross Encoder对召回后的数据精排序。 | ||
|
||
#### 3.1.2 评估指标 | ||
|
||
|
@@ -110,14 +110,14 @@ | |
|
||
(3)使用文献的的query, title, keywords,构造带正标签的数据集,不包含负标签样本,基于 In-batch Negatives 策略进行训练; | ||
|
||
(4)在排序阶段,使用点击(作为正样本)和展现未点击(作为负样本)数据构造排序阶段的训练集,使用ERNIE-Gram模型进行精排训练。 | ||
(4)在排序阶段,使用点击(作为正样本)和展现未点击(作为负样本)数据构造排序阶段的训练集,进行精排训练。 | ||
|
||
| 阶段 |模型 | 训练集 | 评估集(用于评估模型效果) | 召回库 |测试集 | | ||
| ------------ | ------------ |------------ | ------------ | ------------ | ------------ | | ||
| 召回 | Domain-adaptive Pretraining | 2kw | - | - | - | | ||
| 召回 | 无监督预训练 - SimCSE | 798w | 20000 | 300000| 1000 | | ||
| 召回 | 有监督训练 - In-batch Negatives | 3998 | 20000 |300000 | 1000 | | ||
| 排序 | 有监督训练 - ERNIE-Gram单塔 Pairwise| 1973538 | 57811 | - | 1000 | | ||
| 排序 | 有监督训练 - ERNIE-Gram单塔 Pairwise/RocketQA Cross Encoder| 1973538 | 57811 | - | 1000 | | ||
|
||
我们将除 Domain-adaptive Pretraining 之外的其他数据集全部开源,下载地址: | ||
|
||
|
@@ -187,7 +187,7 @@ query2 \t 用户点击的title2 | |
...... | ||
``` | ||
|
||
2. 对于排序模型的训练,排序模型目前提供了2种,第一种是Pairwise训练的方式,第二种是RocketQA的排序模型,对于第一种排序模型,需要准备训练集`train_pairwise.csv`,验证集`dev_pairwise.csv`两个文件, | ||
2. 对于排序模型的训练,排序模型目前提供了2种,第一种是Pairwise训练的方式,第二种是RocketQA的排序模型,对于第一种排序模型,需要准备训练集`train_pairwise.csv`,验证集`dev_pairwise.csv`两个文件,除此之外还可以准备测试集文件`test.csv`或者`test_pairwise.csv`。 | ||
|
||
训练数据集`train_pairwise.csv`的格式如下: | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
相似度最近 -> 相似度最高?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
已修改