support other ext tasks except aso task and fix sentiment analysis based on SKEP #4357

Merged: 152 commits, Jan 10, 2023
37d27f9
initial commit
1649759610 Jun 24, 2022
d65208e
refine readme
1649759610 Jun 24, 2022
ac4d644
refine codestyle
1649759610 Jun 24, 2022
3f433b9
refine readme
1649759610 Jun 24, 2022
d3f6ada
refine readme
1649759610 Jun 24, 2022
54ed34b
fix model saving bug
1649759610 Jun 26, 2022
63b0a76
Merge branch 'develop' into develop
Jul 6, 2022
4669194
initial commit
1649759610 Jul 11, 2022
f6f93e1
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Jul 11, 2022
fb3ade1
Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…
1649759610 Jul 11, 2022
a83a902
initial commit
1649759610 Jul 11, 2022
7bd988a
initial commit
1649759610 Jul 12, 2022
700810a
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Jul 12, 2022
68e025a
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Jul 21, 2022
02a997b
use common metric instead of eval_metrics.py and remove unuseful code
1649759610 Jul 28, 2022
1500e5f
Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…
1649759610 Jul 28, 2022
faaf5f5
Merge branch 'develop' into develop
1649759610 Aug 1, 2022
6a512a0
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Aug 2, 2022
a99fc68
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Aug 12, 2022
4b5fa30
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Nov 1, 2022
b837b67
mv stage project to ASO_analysis
1649759610 Nov 7, 2022
f415740
add unified sentiment analysis
1649759610 Nov 7, 2022
41b020d
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Nov 7, 2022
252860d
Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…
1649759610 Nov 7, 2022
aed2d64
refine readme
1649759610 Nov 7, 2022
8899a12
refine readme
1649759610 Nov 7, 2022
425a273
refnie readme
1649759610 Nov 7, 2022
acd9add
add unified sentiment analysis
1649759610 Nov 7, 2022
4796016
refine readme
1649759610 Nov 7, 2022
e857c6a
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Nov 21, 2022
88649f4
initial commit
1649759610 Nov 25, 2022
4141916
Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…
1649759610 Nov 25, 2022
12628a7
initial commit
1649759610 Nov 25, 2022
70ba157
refine readme
1649759610 Nov 28, 2022
531428e
add taskflow for sentiment analysis with UIE
1649759610 Nov 28, 2022
8e8b853
refine Readme
1649759610 Nov 29, 2022
410f0e4
refine readme.md
1649759610 Nov 30, 2022
64edd33
support sentiment analysis (UIE) with inputing by file format
1649759610 Nov 30, 2022
1f9637c
refine readme
1649759610 Nov 30, 2022
93509cd
delete predict scripts
1649759610 Nov 30, 2022
6473b10
refine readme
1649759610 Nov 30, 2022
09d0f12
delete unuseful files
1649759610 Nov 30, 2022
e546952
add pipeline for sentiment_analysis
1649759610 Dec 6, 2022
447fc14
merging with the newest code
1649759610 Dec 6, 2022
83252b0
merging code with the newest code
1649759610 Dec 6, 2022
f57df8c
fix to convert data without synonyms
1649759610 Dec 9, 2022
249a8a9
add senta pipeline
1649759610 Dec 9, 2022
cd3f4e7
refine readme
1649759610 Dec 9, 2022
a1de96d
drop functions: inputting file and saving results
1649759610 Dec 9, 2022
a5f83b1
add UIE-seta-[base, medium, mini, micro, nano]
1649759610 Dec 9, 2022
1da02d0
modify .gitignore to trace deploy code
1649759610 Dec 9, 2022
c4c135a
add deploy with SimpleServer
1649759610 Dec 9, 2022
363963b
add debug mode
1649759610 Dec 12, 2022
a0c8608
fix debug mode
1649759610 Dec 12, 2022
5afa387
update the loading method of UIE
1649759610 Dec 12, 2022
7109f24
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Dec 12, 2022
bb7da20
refine readme
1649759610 Dec 12, 2022
1af076f
Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…
1649759610 Dec 12, 2022
8c104f2
fix bug caused by version updating
1649759610 Dec 12, 2022
332a1e4
fix hard coding for model name.
1649759610 Dec 12, 2022
5e550bd
refine codestyle
1649759610 Dec 13, 2022
98c7084
modify readme according the way of 'step by step'
1649759610 Dec 13, 2022
7ec344f
refine codestyl
1649759610 Dec 13, 2022
6af902d
change saving txt to json files
1649759610 Dec 13, 2022
a7f57ea
download font automatically when not input font_path
1649759610 Dec 13, 2022
7c331d2
change readme in the way 'step by step'
1649759610 Dec 13, 2022
d78f73b
add model prediction by batch
1649759610 Dec 13, 2022
e6f0359
add uie-senta-x to support_schema_list
1649759610 Dec 13, 2022
e17d624
update sentiment analysis in taskflow
1649759610 Dec 13, 2022
485044c
add prediction with saved offline model
1649759610 Dec 14, 2022
267509c
change the exception exposure way
1649759610 Dec 14, 2022
30e3044
add description for visual schema
1649759610 Dec 14, 2022
7c88879
delete comments
1649759610 Dec 14, 2022
3e8c777
remove comments
1649759610 Dec 14, 2022
f3429b7
remove unused code and comments
1649759610 Dec 14, 2022
6c2a712
convert uie-senta-x model params to fit ernie/uie
1649759610 Dec 14, 2022
6f15864
refine readme for sentiment analysis
1649759610 Dec 16, 2022
55276fa
add running time
1649759610 Dec 16, 2022
6c4fd93
refine readme for senta pipeline
1649759610 Dec 16, 2022
d0729c0
change uie-base to uie-senta-base
1649759610 Dec 16, 2022
46fc8fe
load uie-senta-x with auto module
1649759610 Dec 16, 2022
ee5938b
add deploy with SimpleServer
1649759610 Dec 16, 2022
46c417c
refine codestyle
1649759610 Dec 16, 2022
96bd92f
refine readme
1649759610 Dec 16, 2022
fbf6567
add uie-senta-x to support_schema_list
1649759610 Dec 16, 2022
ac7c5ed
fix hard coding for mdoel anme
1649759610 Dec 16, 2022
d21452b
refine codestyle
1649759610 Dec 16, 2022
4128753
refine codestyl
1649759610 Dec 16, 2022
4d072d7
refine codestyle
1649759610 Dec 16, 2022
661e944
refine codestyle
1649759610 Dec 16, 2022
77c090f
refine codestyle
1649759610 Dec 16, 2022
128d154
refine codestyle
1649759610 Dec 16, 2022
8c59f76
refine codestyle
1649759610 Dec 16, 2022
8651ae2
refine codestyle
1649759610 Dec 16, 2022
e14ff4a
refine codestyle
1649759610 Dec 16, 2022
aae9da1
fix senta response
1649759610 Dec 16, 2022
bb441ca
add uie_senta_x
1649759610 Dec 16, 2022
d99c204
refine codestyle
1649759610 Dec 19, 2022
5650404
remove lambda expressions
1649759610 Dec 19, 2022
1614c6c
add link of senta pipeline
1649759610 Dec 19, 2022
5f89ceb
refine codestyle
1649759610 Dec 19, 2022
9a9be2c
remove local path
1649759610 Dec 19, 2022
87782d3
Merge branch 'develop' into develop
1649759610 Dec 20, 2022
288aaab
fix typos
1649759610 Dec 20, 2022
92f1278
refine readme
1649759610 Dec 20, 2022
0190ca1
Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…
1649759610 Dec 20, 2022
485f3cf
refine readme
1649759610 Dec 22, 2022
9400392
Merge branch 'automodel' into develop
1649759610 Dec 22, 2022
432e5e2
load uie-senta-x with automodel
1649759610 Dec 22, 2022
422ac9a
remove commented code
1649759610 Dec 22, 2022
98779c9
restore auto
1649759610 Dec 22, 2022
99fe1cb
Merge branch 'develop' into develop
1649759610 Dec 22, 2022
22394fa
add link of hotel dataset to readme.
1649759610 Dec 26, 2022
48b20c5
add link for downloading test_hotel.txt
1649759610 Dec 26, 2022
93073c5
fix url problem for server and client
1649759610 Dec 26, 2022
9d1243f
Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…
1649759610 Dec 26, 2022
075a4ae
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Dec 26, 2022
5783f19
refine readme
1649759610 Dec 26, 2022
e61ad7b
Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…
1649759610 Dec 26, 2022
19265f3
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Dec 28, 2022
ff664d2
fix for senta_examples.py
1649759610 Dec 28, 2022
69932d3
update visualization function
1649759610 Dec 29, 2022
f922241
update visualization function
1649759610 Dec 29, 2022
0b5c224
refine readme and update visualization description
1649759610 Dec 29, 2022
7d618a8
update visualization function
1649759610 Dec 29, 2022
a62d8c5
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Dec 29, 2022
ce9825a
refine readme and update visualization function
1649759610 Dec 29, 2022
a3ec63a
change logger in PaddleNLP to log information
1649759610 Dec 29, 2022
c344b8c
fix running time for skep and uie
1649759610 Dec 30, 2022
df1ad55
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Dec 30, 2022
7f00796
fix bug to solve tokenizer updating problem
1649759610 Dec 30, 2022
1ea8757
Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…
1649759610 Dec 30, 2022
3123147
refine label-studio readme
1649759610 Jan 3, 2023
88060da
refine label-studio readme
1649759610 Jan 3, 2023
ab18fc8
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Jan 3, 2023
fc52a00
Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…
1649759610 Jan 3, 2023
ac89d49
refine label-studio readme
1649759610 Jan 3, 2023
579631b
optimize example construction for a, o, as, ao extraction task
1649759610 Jan 5, 2023
b4364cf
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Jan 5, 2023
30ea32f
Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…
1649759610 Jan 5, 2023
4bfab1c
add the labeling method for ext task: a, as, ao and so on.
1649759610 Jan 5, 2023
091e216
add note for visual_analysis.py
1649759610 Jan 5, 2023
0a5b991
change link for downloading data and refine log output
1649759610 Jan 6, 2023
070bfda
refine log output
1649759610 Jan 6, 2023
c34bedb
refine readme
1649759610 Jan 6, 2023
e5a94c2
expose options interface
1649759610 Jan 6, 2023
e624418
refine readme
1649759610 Jan 6, 2023
a021936
modify typos
1649759610 Jan 6, 2023
b754bb1
expose options for customing sentiment analysis
1649759610 Jan 6, 2023
561f607
README.md
1649759610 Jan 9, 2023
7a2b5ca
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Jan 10, 2023
1792554
Merge branch 'PaddlePaddle:develop' into develop
1649759610 Jan 10, 2023
4 changes: 3 additions & 1 deletion applications/sentiment_analysis/ASO_analysis/demo.py
@@ -13,6 +13,7 @@
 # limitations under the License.

 import argparse
+import re

 import paddle
 from utils import decoding, load_dict
@@ -54,9 +55,10 @@ def predict(args, ext_model, cls_model, tokenizer, ext_id2label, cls_id2label):

     while True:
         input_text = input("input text: \n")
+        input_text = re.sub(" +", "", input_text.strip())
         if not input_text:
             continue
-        if input_text == "quit":
+        if input_text == "quit" or input_text == "exit":
             break

         input_text = input_text.strip().replace(" ", "")
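
The loop change above normalizes each line before the exit check, so input such as ` exit ` still ends the session. A minimal sketch of that order of operations follows. `classify_input` is a hypothetical helper written for illustration, not a function in `demo.py`; it only makes the three branches easy to check without loading a model.

```python
import re


def classify_input(raw_text):
    """Mirror the checks in the demo loop and return (action, cleaned_text).

    Hypothetical helper for illustration only.
    """
    # Strip both ends, then delete every run of ASCII spaces, as the diff does.
    text = re.sub(" +", "", raw_text.strip())
    if not text:
        return "skip", text  # empty line: prompt again
    if text in ("quit", "exit"):
        return "stop", text  # leave the loop
    return "predict", text  # hand the cleaned text to the models


print(classify_input("   "))
print(classify_input(" exit "))
print(classify_input("房间 很 干净"))
```

Because the spaces are already removed at the top of the loop, the later `input_text.strip().replace(" ", "")` line in this hunk appears to be redundant.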
@@ -16,6 +16,7 @@
 import copy
 import json
 import os
+import re
 from collections import defaultdict
 from functools import partial

@@ -146,6 +147,11 @@ def convert_example_to_feature_cls(example, tokenizer, label2id, max_seq_len=512
     return encoded_inputs


+def remove_blanks(example):
+    example["text"] = re.sub(" +", "", example["text"])
Collaborator:

I'm curious here: why change the original input text?

Contributor Author:

This removes the spaces from the original text. The current tokenizer ignores spaces when encoding, so the length of input_ids != the length of the original text, which causes some matching problems.

+    return example
+
+
 class Predictor(object):
     def __init__(self, args):
         self.args = args
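
The review thread above gives the motivation for `remove_blanks`: if the tokenizer drops spaces, the token sequence is shorter than the text, and per-character predictions no longer line up with the original string. The sketch below illustrates that mismatch. `toy_char_tokenize` is a hypothetical stand-in for a character-level tokenizer that skips spaces; it is not the SKEP tokenizer and does not add special tokens.

```python
import re


def toy_char_tokenize(text):
    """Hypothetical stand-in: one token per character, spaces dropped."""
    return [ch for ch in text if ch != " "]


def remove_blanks(example):
    # Same normalization as in the diff: delete every run of ASCII spaces.
    example["text"] = re.sub(" +", "", example["text"])
    return example


raw = {"text": "非常好的酒店 大厨手艺非常棒"}
tokens_raw = toy_char_tokenize(raw["text"])
# One space was dropped, so token positions no longer equal character positions.
mismatch = len(tokens_raw) != len(raw["text"])

cleaned = remove_blanks(dict(raw))
tokens_clean = toy_char_tokenize(cleaned["text"])
# After cleaning, token i corresponds to character i of the text.
aligned = len(tokens_clean) == len(cleaned["text"])

print(mismatch, aligned)
```

With the spaces removed up front, a predicted label at position `i` can be mapped straight back to `text[i]`, which is what the decoding step needs.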
@@ -202,6 +208,7 @@ def create_predictor(self, model_path):

     def predict_ext(self, args):
         datasets = load_dataset("text", data_files={"test": args.test_path})
+        datasets["test"] = datasets["test"].map(remove_blanks)
         trans_func = partial(
             convert_example_to_feature_ext,
             tokenizer=self.tokenizer,
7 changes: 7 additions & 0 deletions applications/sentiment_analysis/ASO_analysis/predict.py
@@ -15,6 +15,7 @@
 import argparse
 import copy
 import json
+import re
 from collections import defaultdict
 from functools import partial

@@ -46,11 +47,17 @@ def concate_aspect_and_opinion(text, aspect, opinions):
     return aspect_text


+def remove_blanks(example):
+    example["text"] = re.sub(" +", "", example["text"])
+    return example
+
Collaborator:

Same as above.

Contributor Author:

Same as above.


 def predict_ext(args):
     # load dict and dataset
     model_name = "skep_ernie_1.0_large_ch"
     ext_label2id, ext_id2label = load_dict(args.ext_label_path)
     datasets = load_dataset("text", data_files={"test": args.test_path})
+    datasets["test"] = datasets["test"].map(remove_blanks)

     tokenizer = SkepTokenizer.from_pretrained(model_name)
     trans_func = partial(
4 changes: 2 additions & 2 deletions applications/sentiment_analysis/README.md
@@ -36,6 +36,6 @@ PaddleNLP情感分析应用立足真实企业用户对情感分析方面的需

 ## **3. Quick Start**

-- 👉 [UIE-based sentiment analysis solution](./unified_sentiment_extraction/README)
+- 👉 [UIE-based sentiment analysis solution](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/sentiment_analysis/unified_sentiment_extraction)

-- 👉 [SKEP-based sentiment analysis solution](./ASO_analysis/README)
+- 👉 [SKEP-based sentiment analysis solution](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/sentiment_analysis/ASO_analysis)
@@ -313,7 +313,7 @@ python3 -m pip install wordcloud==1.8.2.2
 <a name="4.2.1"></a>

 #### **4.2.1 Data Description**
-Organize the input data as shown below, with one text review per line. Click [here](https://paddlenlp.bj.bcebos.com/datasets/sentiment_analysis/hotel/test_hotel.txt) to download hotel-scenario test data for analysis.
+Organize the input data as shown below, with one text review per line. Click [here](https://paddlenlp.bj.bcebos.com/datasets/sentiment_analysis/hotel/test_hotel.tar.gz) to download hotel-scenario test data for analysis.

 ```
 非常好的酒店 不枉我们爬了近一个小时的山,另外 大厨手艺非常棒 竹筒饭 竹筒鸡推荐入住的客人必须要点,
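@@ -345,7 +345,6 @@ python batch_predict.py \
 - ``model``: name of the model used for sentiment analysis. Choose one of: ['uie-senta-base', 'uie-senta-medium', 'uie-senta-mini', 'uie-senta-micro', 'uie-senta-nano'].
 - ``load_from_dir``: directory of an offline model to load, such as a model saved after training. If not specified, the model named by `model` is downloaded automatically.
 - ``schema``: the schema description used for information extraction with the UIE model.
-- ``prompt_prefix``: the prompt prefix for the classification task. This parameter only takes effect for classification-type tasks. Defaults to "情感倾向".
 - ``batch_size``: batch size used during prediction. Adjust it to your GPU memory, and lower it if you run out of memory. Defaults to 16.
 - ``max_seq_len``: maximum sequence length the model can process. Defaults to 512.
 - ``aspects``: aspects given in advance. If set, the model runs sentiment analysis only on these aspects, for example analyzing their opinion words.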
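@@ -362,21 +361,22 @@ python batch_predict.py \

 **4.2.3.1 Generate sentiment analysis results in one step**

-Based on the sentiment analysis results generated above, you can use the `visual_analysis.py` script to visualize them. The final images are saved in the directory specified by `save_dir`. For example:
+Based on the sentiment analysis results generated above, you can use the `visual_analysis.py` script to visualize them. The final images are saved in the directory specified by `save_dir`. You must specify the task type of the results being visualized: set task_type to ``cls`` for sentence-level sentiment classification, or to ``ext`` for aspect-level sentiment analysis. For example:

 ```
 python visual_analysis.py \
     --file_path "./outputs/test_hotel.json" \
-    --save_dir "./outputs/images"
+    --save_dir "./outputs/images" \
+    --task_type "ext"
 ```

 Configurable parameters:
 - ``file_path``: path where the sentiment analysis results are saved.
 - ``save_dir``: directory in which to save the images.
+- ``task_type``: the task type. Use ``cls`` for sentence-level sentiment classification and ``ext`` for aspect-level sentiment analysis. Defaults to ``ext``.
 - ``font_path``: path to a font file, used to help display Chinese in the generated wordcloud images. If empty, a Heiti (黑体) font is downloaded automatically to display Chinese characters.
-- ``aspect_prompt``: prompt text for aspects. Defaults to `评价维度`.
-- ``opinion_prompt``: prompt text for opinion words. Defaults to `观点词`.
-- ``sentiment_prompt``: prompt text for sentiment classification. When classifying sentiment for aspects, set it to `情感倾向[正向,负向,未提及]`; for sentence-level sentiment classification, set it to `情感倾向[正向,负向]`.

+**Note**: when the `visual_analysis.py` script starts, by default it deletes any existing `save_dir` directory and the files in it, then regenerates the visualization images in that directory.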
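
The note above means any images already in `save_dir` are lost on each run. A minimal sketch of that delete-then-recreate behavior is shown here. `reset_save_dir` is a hypothetical name, and the actual implementation in `visual_analysis.py` is not shown in this diff, so treat this as an assumption about how such a reset is usually written.

```python
import os
import shutil
import tempfile


def reset_save_dir(save_dir):
    """Delete save_dir if it exists, then recreate it empty (assumed behavior)."""
    if os.path.exists(save_dir):
        shutil.rmtree(save_dir)
    os.makedirs(save_dir)
    return save_dir


# Demonstrate inside a throwaway temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    save_dir = os.path.join(tmp, "images")
    os.makedirs(save_dir)
    with open(os.path.join(save_dir, "old.png"), "w") as f:
        f.write("stale")
    reset_save_dir(save_dir)
    remaining = os.listdir(save_dir)

print(remaining)
```

If earlier images need to be kept, pointing `--save_dir` at a fresh directory for each run avoids the deletion.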

 The figures below show some of the images produced by analyzing the hotel-scenario data:

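@@ -495,64 +495,60 @@ vs.plot_opinion_with_aspect(aspect, sr.aspect_opinion, save_path, image_type="hi
 <img src=https://user-images.githubusercontent.com/35913314/203001847-8e41709b-0f5a-4673-8aca-5c4fb7705d4a.png />
 </div>

-For convenience, this project provides 300+ annotated samples from the hotel scenario. Click [label_studio.json](https://paddlenlp.bj.bcebos.com/datasets/sentiment_analysis/hotel/label_studio.json) to download them. Note that this data is only suitable for `extraction`-type tasks.
+For convenience, this project provides 300+ annotated samples from the hotel scenario. Click [here](https://paddlenlp.bj.bcebos.com/datasets/sentiment_analysis/hotel/label_studio.tar.gz) to download them. Note that this data is only suitable for `extraction`-type tasks.


 <a name="5.1.1"></a>

 #### **5.1.1 Sample construction: sentence-level sentiment classification**

-For sentence-level sentiment classification, you can configure the `prompt_prefix` and `options` parameters and build the training data with the following command:
+Sentence-level sentiment classification supports two classes by default, ``正向`` (positive) and ``负向`` (negative). Build the training data with the following command:

 ```shell
 python label_studio.py \
     --label_studio_file ./data/label_studio.json \
     --task_type cls \
     --save_dir ./data \
     --splits 0.8 0.1 0.1 \
-    --prompt_prefix "情感倾向" \
-    --options "正向" "负向"
+    --options "正向" "负向" \
+    --is_shuffle True \
+    --seed 1000
 ```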
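
The `--splits 0.8 0.1 0.1`, `--is_shuffle`, and `--seed` flags in the command above describe a seeded shuffle followed by an 8:1:1 partition. The sketch below shows one common way to do that. `split_dataset` is a hypothetical function, and the rounding rule (truncate train and dev, give the remainder to test) is an assumption; `label_studio.py` may partition differently.

```python
import random


def split_dataset(examples, splits=(0.8, 0.1, 0.1), is_shuffle=True, seed=1000):
    """Shuffle with a fixed seed, then cut into train/dev/test by ratio."""
    items = list(examples)
    if is_shuffle:
        # A private Random instance keeps the shuffle reproducible.
        random.Random(seed).shuffle(items)
    n_train = int(len(items) * splits[0])
    n_dev = int(len(items) * splits[1])
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # whatever is left goes to the test set
    return train, dev, test


train, dev, test = split_dataset(range(300))
print(len(train), len(dev), len(test))
```

Fixing the seed means the same export file always yields the same three subsets, which keeps evaluation numbers comparable across runs.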

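 Parameters:
-- ``label_studio_file``: the annotation file exported from label studio
-- ``task_type``: the task type. Two types are available: extraction and classification.
+- ``label_studio_file``: the sentence-level sentiment classification annotation file exported from label studio
+- ``task_type``: the task type. Two types are available, extraction and classification, set with ``ext`` and ``cls`` respectively. Because this is a sentence-level sentiment classification task, set it to ``cls``
 - ``save_dir``: directory where the training data is saved. Defaults to the ``data`` directory.
 - ``splits``: the proportions of the training and validation sets when splitting the dataset. The default [0.8, 0.1, 0.1] splits the data into training, validation, and test sets at a ratio of ``8:1:1``.
-- ``prompt_prefix``: the prompt prefix for the classification task. This parameter only takes effect for classification-type tasks. Defaults to "情感倾向".
-- ``options``: the class labels of the classification task. This parameter only takes effect for classification-type tasks. Here it must be set to ["正向", "负向"].
+- ``options``: the options for the sentiment polarity classification task. Sentence-level sentiment classification supports two classes by default: ``正向`` (positive) and ``负向`` (negative). Aspect-level sentiment analysis supports three classes by default: ``正向``, ``负向``, and ``未提及`` (not mentioned), where ``未提及`` means the aspect being analyzed is not mentioned in the original review, so its polarity cannot be determined. If your use case needs other polarity options, set them through ``options``. Note that if you customize ``options``, the file specified by ``label_studio_file`` must contain annotations for the new options.
+- ``is_shuffle``: whether to shuffle the dataset randomly. Defaults to True.
+- ``seed``: random seed. Defaults to 1000.

+**Note**: ``options`` does not have to be specified manually. If it is omitted, the defaults are used: ``"正向" "负向"`` for sentence-level sentiment classification and ``"正向" "负向" "未提及"`` for aspect-level sentiment analysis.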
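
The defaulting rule in the note above can be sketched as a small lookup keyed by task type. `resolve_options` and `DEFAULT_OPTIONS` are hypothetical names used only for illustration; how `label_studio.py` actually implements the fallback is not shown in this diff.

```python
# Default polarity options per task type, as stated in the README text.
DEFAULT_OPTIONS = {
    "cls": ["正向", "负向"],  # sentence-level: positive, negative
    "ext": ["正向", "负向", "未提及"],  # aspect-level: adds "not mentioned"
}


def resolve_options(task_type, options=None):
    """Return user-supplied options, or the default for the task type."""
    if task_type not in DEFAULT_OPTIONS:
        raise ValueError("task_type must be 'cls' or 'ext', got: {}".format(task_type))
    if options:
        return list(options)
    return list(DEFAULT_OPTIONS[task_type])


print(resolve_options("cls"))
print(resolve_options("ext"))
print(resolve_options("cls", ["正向", "负向", "中性"]))
```

As the parameter description says, custom options only help if the exported annotation file actually contains labels for them.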

<a name="5.1.2"></a>

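 #### **5.1.2 Sample construction: aspect extraction tasks**

-For extraction-style tasks, such as aspect extraction, opinion extraction, and aspect classification, use the following command to convert the label-studio export into model training data:
+For extraction-style tasks, such as aspect-opinion extraction, aspect-sentiment polarity-opinion word extraction, and aspect classification, use the following command to convert the label-studio export into model training data:

 ```shell
 python label_studio.py \
     --label_studio_file ./data/label_studio.json \
     --task_type ext \
     --save_dir ./data \
     --splits 0.8 0.1 0.1 \
-    --prompt_prefix "情感倾向" \
     --options "正向" "负向" "未提及" \
-    --separator "##" \
     --negative_ratio 5 \
     --is_shuffle True \
     --seed 1000
 ```

-Parameters:
-- ``label_studio_file``: the annotation file exported from label studio.
-- ``task_type``: the task type. Two types are available: extraction and classification.
-- ``save_dir``: directory where the training data is saved. Defaults to the ``data`` directory.
-- ``splits``: the proportions of the training and validation sets when splitting the dataset. The default [0.8, 0.1, 0.1] splits the data into training, validation, and test sets at a ratio of ``8:1:1``.
-- ``prompt_prefix``: the prompt prefix for the classification task. This parameter only takes effect for classification-type tasks. Defaults to "情感倾向".
-- ``options``: the class labels of the classification task. This parameter only takes effect for classification-type tasks. Defaults to ["正向", "负向", "未提及"].
-- ``separator``: the separator between the entity category/aspect and the classification label. This parameter only takes effect for entity/aspect classification tasks. Defaults to "##".
-- ``negative_ratio``: the maximum ratio of negative examples. This parameter only takes effect for extraction-type tasks; constructing a suitable number of negatives can improve model performance. The number of negatives depends on the actual number of labels: maximum number of negatives = negative_ratio * number of positives. This parameter only applies to the training set and defaults to 5. To keep the evaluation metrics accurate, the validation and test sets construct all negatives by default.
-- ``is_shuffle``: whether to shuffle the dataset randomly. Defaults to True.
-- ``seed``: random seed. Defaults to 1000.
+Key parameters:
+- ``label_studio_file``: the aspect-extraction annotation file exported from label studio.
+- ``task_type``: the task type. Two types are available, extraction and classification, set with ``ext`` and ``cls`` respectively. Because this is an aspect extraction task, set it to ``ext``.
+- ``negative_ratio``: for each sample, at most ``negative_ratio`` negative samples are generated for each subtask (aspect-level opinion extraction and aspect-level sentiment classification). If an aspect synonym table or an implicit opinion vocabulary is also provided, the two are combined to generate more negative samples, which strengthens aspect aggregation and implicit opinion extraction.
+The other parameters are as described above and are not repeated here.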
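
To make the `negative_ratio` cap concrete, the sketch below builds negative examples for one sample by picking aspects that do not occur in it and pairing each with an empty result list. Everything here is an assumption for illustration: the function name, the `content`/`prompt`/`result_list` record layout, the prompt wording, and the candidate pool are hypothetical, and the real sampling logic in `label_studio.py` is not shown in this diff.

```python
import random


def build_negative_examples(text, sample_aspects, all_aspects, negative_ratio, seed=1000):
    """Create at most negative_ratio negatives for one sample and one subtask."""
    rng = random.Random(seed)
    # Only aspects absent from this sample can serve as negatives.
    candidates = sorted(set(all_aspects) - set(sample_aspects))
    k = min(negative_ratio, len(candidates))
    chosen = rng.sample(candidates, k)
    # An empty result_list tells the model there is nothing to extract.
    return [
        {"content": text, "prompt": "{}的观点词".format(aspect), "result_list": []}
        for aspect in chosen
    ]


pool = ["房间", "服务", "位置", "价格", "早餐", "设施", "卫生"]
negatives = build_negative_examples("房间很干净", ["房间"], pool, negative_ratio=5)
print(len(negatives))
```

A larger pool, for example one widened with synonyms or implicit opinion words, gives the sampler more candidates, which matches the parameter description's remark about generating more negatives.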

<a name="5.1.3"></a>

@@ -585,14 +581,12 @@ python label_studio.py \
 ```shell
 python label_studio.py \
     --label_studio_file ./data/label_studio.json \
-    --synonym_file ./data/synonyms.json \
+    --synonym_file ./data/synonyms.txt \
     --task_type ext \
     --save_dir ./data \
     --splits 0.8 0.1 0.1 \
-    --prompt_prefix "情感倾向" \
     --options "正向" "负向" "未提及" \
-    --separator "##" \
-    -- negative_ratio 5 \
+    --negative_ratio 5 \
     --is_shuffle True \
     --seed 1000
 ```
@@ -621,14 +615,12 @@ python label_studio.py \
 ```shell
 python label_studio.py \
     --label_studio_file ./data/label_studio.json \
-    --implicit_file ./data/implicit_opinions.json \
+    --implicit_file ./data/implicit_opinions.txt \
     --task_type ext \
     --save_dir ./data \
     --splits 0.8 0.1 0.1 \
-    --prompt_prefix "情感倾向" \
     --options "正向" "负向" "未提及" \
-    --separator "##" \
-    -- negative_ratio 5 \
+    --negative_ratio 5 \
     --is_shuffle True \
     --seed 1000
 ```
@@ -28,8 +28,9 @@ def main(args):
     """
     start_time = time.time()
     # read file
+    logger.info("Trying to load dataset: {}".format(args.file_path))
     if not os.path.exists(args.file_path):
-        raise ValueError("something with wrong for your file_path, it may be not exists.")
+        raise ValueError("something with wrong for your file_path, it may not exist.")
     examples = load_txt(args.file_path)

     # define Taskflow for sentiment analysis
@@ -55,6 +56,7 @@ def main(args):
     )

     # predict with Taskflow
+    logger.info("Start to perform sentiment analysis for your dataset, this may take some time.")
     results = senta(examples)

     # save results
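
The hunks above check that the input file exists and then load it as a list of reviews before prediction. A minimal sketch of that loading step is given here, assuming one review per line as the README describes. `read_reviews` is a hypothetical function; the real `load_txt` helper is not shown in this diff and may behave differently, for example in how it treats blank lines.

```python
import os
import tempfile


def read_reviews(file_path):
    """Return non-empty lines of a UTF-8 text file, one review per line."""
    if not os.path.exists(file_path):
        # Same exception type as the script uses for a missing input file.
        raise ValueError("file_path does not exist: {}".format(file_path))
    with open(file_path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Write a small sample file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_hotel.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("房间很干净\n\n服务态度一般\n")
    examples = read_reviews(path)

print(examples)
```

The resulting list is the shape of input that the diff passes to `senta(examples)`.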