Commit 7be95b5

Merge pull request PaddlePaddle#674 from linjieccc/textcnn
[TextCNN] Add TextCNN example for sentiment analysis
linjieccc authored Jul 6, 2021 · 2 parents 58db4d0 + e50ae68
Showing 7 changed files with 793 additions and 0 deletions.

examples/sentiment_analysis/textcnn/README.md (192 additions)

# Chinese Dialogue Emotion Recognition with a TextCNN Model

Sentiment analysis aims to automatically identify and extract subjective information such as tendencies, stances, evaluations, and opinions from text. One of its tasks is dialogue emotion recognition: given user text from an intelligent dialogue system, automatically determine the emotion category of the text and give a corresponding confidence score. The emotion types are positive, negative, and neutral.

This example shows how to fine-tune a pretrained TextCNN model on a chatbot dataset to perform Chinese dialogue emotion recognition.

## Quick Start

### Code Structure

The main code structure of this project is as follows:

```text
textcnn/
├── deploy                 # deployment
│   └── python
│       └── predict.py     # Python inference deployment example
├── data.py                # data processing script
├── export_model.py        # exports dynamic-graph parameters to static-graph parameters
├── model.py               # model architecture script
├── predict.py             # model prediction script
├── README.md              # this document
└── train.py               # training script for the dialogue emotion recognition task
```

### Data Preparation

We provide a labeled chatbot dataset consisting of a training set (train.tsv), a development set (dev.tsv), and a test set (test.tsv).
The full dataset can be downloaded and extracted with the following commands:

```shell
wget https://paddlenlp.bj.bcebos.com/datasets/RobotChat.tar.gz
tar xvf RobotChat.tar.gz
```
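
Each `.tsv` file starts with a header line, followed by one example per line as a tab-separated label and text pair. This layout is inferred from `read_custom_data` in data.py; the integer label encoding shown here (e.g. 0 = negative, 1 = neutral, 2 = positive) is an assumption:

```text
label	text
0	你再骂我我真的不跟你聊了
2	我喜欢画画也喜欢唱歌
```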

### Vocabulary Download

Before training, download the vocabulary file robot_chat_word_dict.txt, which is used to build the word-to-id mapping.

```shell
wget https://paddlenlp.bj.bcebos.com/robot_chat_word_dict.txt
```

**NOTE:** The choice of vocabulary depends on the application data; choose a vocabulary that matches your actual data.
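
A minimal sketch of loading the vocabulary and using it for tokenization, assuming PaddleNLP's `Vocab` and `JiebaTokenizer` utilities and `[UNK]`/`[PAD]` token names (both assumptions about the vocabulary file):

```python
from paddlenlp.data import JiebaTokenizer, Vocab

# Build the word-to-id mapping from the downloaded vocabulary file.
vocab = Vocab.load_vocabulary('./robot_chat_word_dict.txt',
                              unk_token='[UNK]', pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)

# Segment a sentence with jieba and map the resulting words to ids.
print(tokenizer.encode('我喜欢画画也喜欢唱歌'))
```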

### Pretrained Model Download

We provide a TextCNN model pretrained by Baidu on a massive corpus; download it as follows:

```shell
wget https://paddlenlp.bj.bcebos.com/models/textcnn.pdparams
```

### Model Training

After downloading the vocabulary and the pretrained model, you can fine-tune on the chatbot dataset. The commands below train the model on the training set (train.tsv) and validate it on the development set (dev.tsv); `--init_from_ckpt=./textcnn.pdparams` points to the pretrained TextCNN parameters.

On CPU:

```shell
python train.py --vocab_path=./robot_chat_word_dict.txt \
--init_from_ckpt=./textcnn.pdparams \
--device=cpu \
--lr=5e-5 \
--batch_size=64 \
--epochs=10 \
--save_dir=./checkpoints \
--data_path=./RobotChat
```

On GPU:

```shell
unset CUDA_VISIBLE_DEVICES
python -m paddle.distributed.launch --gpus "0" train.py \
--vocab_path=./robot_chat_word_dict.txt \
--init_from_ckpt=./textcnn.pdparams \
--device=gpu \
--lr=5e-5 \
--batch_size=64 \
--epochs=10 \
--save_dir=./checkpoints \
--data_path=./RobotChat
```

On XPU:

```shell
python train.py --vocab_path=./robot_chat_word_dict.txt \
--init_from_ckpt=./textcnn.pdparams \
--device=xpu \
--lr=5e-5 \
--batch_size=64 \
--epochs=10 \
--save_dir=./checkpoints \
--data_path=./RobotChat
```

The parameters above are:

* `vocab_path`: Path to the vocabulary file.
* `init_from_ckpt`: Checkpoint path from which to resume training.
* `device`: Device to train on; one of cpu, gpu, or xpu. When training on GPU, the `gpus` argument specifies the GPU card id.
* `lr`: Learning rate; defaults to 5e-5.
* `batch_size`: Mini-batch size; defaults to 64.
* `epochs`: Number of training epochs; defaults to 10.
* `save_dir`: Directory where trained models are saved.
* `data_path`: Path to the dataset.


The program automatically runs training, evaluation, and testing, and saves checkpoints to the specified `save_dir` during training, for example:
```text
checkpoints/
├── 0.pdopt
├── 0.pdparams
├── 1.pdopt
├── 1.pdparams
├── ...
└── final.pdparams
```

**NOTE:**

* To resume training, set init_from_ckpt to the checkpoint name without a file extension. For example, with `--init_from_ckpt=checkpoints/0` the program automatically loads both the model parameters `checkpoints/0.pdparams` and the optimizer state `checkpoints/0.pdopt` (see the sketch after this note).
* After dynamic-graph training finishes, the dynamic-graph parameters can also be exported as static-graph parameters; see export_model.py for the code. The static-graph parameters are saved to the path given by `output_path`. Run it as follows:

```shell
python export_model.py --vocab_path=./robot_chat_word_dict.txt --params_path=./checkpoints/final.pdparams --output_path=./static_graph_params
```

Here `params_path` is the path of the parameters saved during dynamic-graph training, and `output_path` is the export path for the static-graph parameters.
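
For reference, the resume behavior described in the first note boils down to loading both checkpoint files. A minimal sketch, assuming Paddle's standard `paddle.load`/`set_state_dict` APIs (the real network and optimizer are constructed in train.py; a toy layer stands in here):

```python
import paddle

# Stand-ins for the real network and optimizer built in train.py.
model = paddle.nn.Linear(4, 2)
optimizer = paddle.optimizer.Adam(learning_rate=5e-5,
                                  parameters=model.parameters())

# --init_from_ckpt=checkpoints/0 restores the model parameters and the
# optimizer state saved after epoch 0.
model.set_state_dict(paddle.load('checkpoints/0.pdparams'))
optimizer.set_state_dict(paddle.load('checkpoints/0.pdopt'))
```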

The exported model can then be used for deployment; deploy/python/predict.py provides a Python inference example. Run it as follows:

```shell
python deploy/python/predict.py --model_file=static_graph_params.pdmodel --params_file=static_graph_params.pdiparams
```
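
Internally, such a script loads the exported static graph through the Paddle Inference API. A condensed sketch of that flow (the real deploy/python/predict.py also handles tokenization, which is omitted here; the input ids are dummy values):

```python
import numpy as np
from paddle import inference

# Load the exported static-graph model and build a predictor.
config = inference.Config('static_graph_params.pdmodel',
                          'static_graph_params.pdiparams')
config.disable_gpu()  # or config.enable_use_gpu(100, 0) to run on GPU 0
predictor = inference.create_predictor(config)

# Feed a batch of already-tokenized word ids (dummy values here).
input_handle = predictor.get_input_handle(predictor.get_input_names()[0])
input_handle.copy_from_cpu(np.array([[10, 20, 30]], dtype='int64'))
predictor.run()

# Fetch the logits; shape is [batch_size, num_classes].
output_handle = predictor.get_output_handle(predictor.get_output_names()[0])
logits = output_handle.copy_to_cpu()
```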

### Model Prediction

Run prediction:

On CPU:

```shell
python predict.py --vocab_path=./robot_chat_word_dict.txt \
--device=cpu \
--params_path=./checkpoints/final.pdparams
```

On GPU:

```shell
export CUDA_VISIBLE_DEVICES=0
python predict.py --vocab_path=./robot_chat_word_dict.txt \
--device=gpu \
--params_path=./checkpoints/final.pdparams
```

On XPU:

```shell
python predict.py --vocab_path=./robot_chat_word_dict.txt \
--device=xpu \
--params_path=./checkpoints/final.pdparams
```

Example data to be predicted:

```text
你再骂我我真的不跟你聊了
你看看我附近有什么好吃的
我喜欢画画也喜欢唱歌
```

After the input is processed by the `preprocess_prediction_data` function, calling the `predict` function outputs the predictions:


```text
Data: 你再骂我我真的不跟你聊了 Label: negative
Data: 你看看我附近有什么好吃的 Label: neutral
Data: 我喜欢画画也喜欢唱歌 Label: positive
```
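
The preprocessing half of this flow can be pieced together from data.py. A sketch, assuming the vocabulary setup shown earlier (the model construction and the `predict` call itself live in model.py and predict.py and are omitted):

```python
import paddle
from paddlenlp.data import JiebaTokenizer, Pad, Vocab

from data import preprocess_prediction_data

vocab = Vocab.load_vocabulary('./robot_chat_word_dict.txt',
                              unk_token='[UNK]', pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)

data = ['你再骂我我真的不跟你聊了', '你看看我附近有什么好吃的', '我喜欢画画也喜欢唱歌']
examples = preprocess_prediction_data(data, tokenizer)

# Pad the id sequences to a common length and batch them for the model.
batch = paddle.to_tensor(Pad(axis=0, pad_val=0)(examples))
```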

## Reference

TextCNN reference paper:

- [EMNLP2014-Convolutional Neural Networks for Sentence Classification](https://aclanthology.org/D14-1181.pdf)
examples/sentiment_analysis/textcnn/data.py (98 additions)

# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np
import paddle
from paddlenlp.datasets import load_dataset


def create_dataloader(dataset,
mode='train',
batch_size=1,
batchify_fn=None,
trans_fn=None):
"""
Create dataloader.
Args:
dataset(obj:`paddle.io.Dataset`): Dataset instance.
mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging
the sample list, None for only stack each fields of sample in axis
0(same as :attr::`np.stack(..., axis=0)`).
trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
Returns:
dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
"""
if trans_fn:
dataset = dataset.map(trans_fn)

    shuffle = (mode == 'train')
if mode == "train":
sampler = paddle.io.DistributedBatchSampler(
dataset=dataset, batch_size=batch_size, shuffle=shuffle)
else:
sampler = paddle.io.BatchSampler(
dataset=dataset, batch_size=batch_size, shuffle=shuffle)
dataloader = paddle.io.DataLoader(
dataset, batch_sampler=sampler, collate_fn=batchify_fn)
return dataloader
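
# Example (a sketch, assuming paddlenlp.data's collate helpers): a typical
# batchify_fn for this dataloader pads the id sequences and stacks the labels:
#
#     from paddlenlp.data import Pad, Stack, Tuple
#     batchify_fn = Tuple(
#         Pad(axis=0, pad_val=0),  # pad input_ids to the batch max length
#         Stack(dtype='int64'),    # stack the labels into one array
#     )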

def preprocess_prediction_data(data, tokenizer, pad_token_id=0, max_ngram_filter_size=3):
    """
    Processes the prediction data into the format used in training.
    Args:
        data (obj:`list[str]`): The prediction data, where each element is a raw text string.
        tokenizer(obj:`paddlenlp.data.JiebaTokenizer`): Uses jieba to segment the Chinese text and map
            the resulting words to ids.
        pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
        max_ngram_filter_size (obj:`int`, optional, defaults to 3): The largest n-gram filter size in the
            TextCNN model. It should match the `ngram_filter_sizes` setting of TextCNN; e.g. if
            ngram_filter_sizes=(1, 2, 3), then max_ngram_filter_size=3.
    Returns:
        examples (obj:`list`): The processed data, where each element is a `list[int]` of word ids.
    """
examples = []
for text in data:
ids = tokenizer.encode(text)
seq_len = len(ids)
        # The sequence length must be at least the maximum ngram filter size in
        # the TextCNN model, so shorter sequences are padded up to that size.
if seq_len < max_ngram_filter_size:
ids.extend([pad_token_id] * (max_ngram_filter_size - seq_len))
examples.append(ids)
return examples

def convert_example(example, tokenizer):
    """Converts an example into word ids and a label array for training."""
input_ids = tokenizer.encode(example["text"])
input_ids = np.array(input_ids, dtype='int64')

label = np.array(example["label"], dtype="int64")
return input_ids, label

def read_custom_data(filename):
    """Reads the tab-separated dataset file and yields examples."""
    with open(filename, 'r', encoding='utf-8') as f:
        # Skip the header line
        next(f)
for line in f:
data = line.strip().split("\t")
label, text = data
yield {"text": text, "label": label}
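
# Example (a sketch of how train.py presumably wires these helpers together;
# the paths and the batchify_fn sketched above are assumptions):
#
#     from functools import partial
#     from paddlenlp.data import JiebaTokenizer, Vocab
#
#     vocab = Vocab.load_vocabulary('robot_chat_word_dict.txt',
#                                   unk_token='[UNK]', pad_token='[PAD]')
#     tokenizer = JiebaTokenizer(vocab)
#     train_ds = load_dataset(read_custom_data,
#                             filename='RobotChat/train.tsv', lazy=False)
#     train_loader = create_dataloader(
#         train_ds, mode='train', batch_size=64,
#         batchify_fn=batchify_fn,
#         trans_fn=partial(convert_example, tokenizer=tokenizer))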