Commit
fix embedding and bigbird doc (PaddlePaddle#78)
* add __all__, acknowledgement and citation

* 1. Remove unused Mask code. 2. Polish the embedding and BigBird README.md. 3. Change the dataset API in the embedding example scripts

* add link to token embedding desc

* add more detail of embedding training

* update use_gpu->select_device

* add attn_mask

* device->select_device

* remove useless space

* select_device->device; add paddle.distributed.launch gpus, log_dir args description
joey12300 authored Mar 9, 2021
1 parent 0668035 commit a19bcf8
Showing 9 changed files with 85 additions and 374 deletions.
15 changes: 13 additions & 2 deletions examples/language_model/bigbird/README.md
@@ -2,6 +2,7 @@

## Model Introduction
[Big Bird](https://arxiv.org/abs/2007.14062) (Transformers for Longer Sequences) is a pre-trained model for long sequences proposed by researchers at Google. It uses a sparse attention mechanism that reduces computational and memory complexity from quadratic to linear, greatly improving predictive performance on long-sequence tasks.
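To make the linear-complexity claim concrete, here is a small illustrative sketch (not the model's actual implementation; the block size and the global/window/random block counts are assumptions chosen for illustration) that counts how many attention-score entries dense attention computes versus a Big Bird-style block-sparse pattern:

```python
# Illustrative sketch only: compares the number of attention-score entries
# computed by dense attention (O(n^2)) with a Big Bird-style block-sparse
# pattern (global + window + random blocks, O(n)). Parameters are assumptions.
def dense_entries(seq_len):
    return seq_len * seq_len

def block_sparse_entries(seq_len, block_size=64, num_global=2,
                         num_window=3, num_random=3):
    num_blocks = seq_len // block_size
    # Each query block attends to a fixed number of key blocks,
    # so total work grows linearly with sequence length.
    blocks_per_row = min(num_blocks, num_global + num_window + num_random)
    return num_blocks * blocks_per_row * block_size * block_size

for n in (512, 1024, 4096):
    print(n, dense_entries(n), block_sparse_entries(n))
```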

This project is the PaddlePaddle implementation of Big Bird, covering model training, model evaluation, and more. Below is a brief directory structure and description for this example:

@@ -32,7 +33,7 @@


### Data Preparation
According to the paper, Big Bird's pre-training data is currently constructed mainly from four corpora: Books, CC-News, Stories, and Wikipedia. Users can download and clean the corresponding data as needed. A sample dataset is provided in the data directory.
According to the paper, Big Bird's pre-training data is currently constructed mainly from four corpora: Books, CC-News, Stories, and Wikipedia. Users can download and clean the corresponding data as needed. A sample dataset is already provided in the data directory.


### Pre-training Task
@@ -41,7 +42,7 @@

```shell
unset CUDA_VISIBLE_DEVICES
python -m paddle.distributed.launch --gpus "0" run_pretrain.py --model_name_or_path bigbird-base-uncased \
python -m paddle.distributed.launch --gpus "0" --log_dir log run_pretrain.py --model_name_or_path bigbird-base-uncased \
--input_dir "./data" \
--output_dir "output" \
--batch_size 4 \
```

@@ -56,6 +57,8 @@ python -m paddle.distributed.launch --gpus "0" run_pretrain.py --model_name_or_

The parameters are described as follows:

- `gpus`: a paddle.distributed.launch argument that selects the GPUs to use. Single-card format: "0"; multi-card format: "0,1,2".
- `log_dir`: a paddle.distributed.launch argument that sets the directory for training logs; defaults to `log`. (Note: to launch run_pretrain.py several times in the same directory, set a different log_dir each time, otherwise the logs are redirected to the same files.)
- `model_name_or_path`: the model to train, or a previously trained checkpoint.
- `input_dir`: the input data; a directory may be given, in which case all files in it are included.
- `output_dir`: the output directory.
@@ -101,3 +104,11 @@ python run_classifier.py --model_name_or_path bigbird-base-uncased-finetune \
| Task | Metric | Result |
|:-----:|:----------------------------:|:-----------------:|
| IMDB | Accuracy | 0.9449 |

### Acknowledgements

* Thanks to the [Google research team](https://github.com/google-research/bigbird) for providing the open-source implementation of BigBird and the pre-trained models.

### References

* Zaheer, Manzil, et al. "Big Bird: Transformers for Longer Sequences." Advances in Neural Information Processing Systems, 2020.
6 changes: 5 additions & 1 deletion examples/language_model/bigbird/run_classifier.py
@@ -103,6 +103,10 @@ def _create_dataloader(mode, tokenizer, max_encoder_length, pad_val=0):

def main():
# Initialization for the parallel environment
assert args.device in [
"cpu", "gpu", "xpu"
], "Invalid device! Available device should be cpu, gpu, or xpu."

paddle.set_device(args.device)
set_seed(args)
# Define the model and metric
@@ -168,7 +172,7 @@ def do_train(model, criterion, metric, optimizer, train_data_loader,
model_to_save.save_pretrained(output_dir)

if global_steps >= args.max_steps:
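# Returning here (rather than `break`) exits do_train entirely once
# max_steps is reached; a bare `break` only left the inner batch loop,
# so the outer epoch loop kept running.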
break
return
metric.reset()


36 changes: 19 additions & 17 deletions examples/word_embedding/README.md
@@ -2,7 +2,7 @@

## Introduction

PaddleNLP ships with a number of public pre-trained embeddings, which users can load through the `paddlenlp.embeddings.TokenEmbedding` interface to improve training results. The text classification example below demonstrates the gains `paddlenlp.embeddings.TokenEmbedding` brings to training.
PaddleNLP ships with a number of public pre-trained embeddings, which users can load through the `paddlenlp.embeddings.TokenEmbedding` interface to improve training results. The text classification example below, built on the open-source sentiment classification dataset ChnSentiCorp, demonstrates the gains `paddlenlp.embeddings.TokenEmbedding` brings to training. For more `paddlenlp.embeddings.TokenEmbedding` usage, see the [TokenEmbedding usage guide](../../paddlenlp/embeddings/README.md).
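As a quick illustration of the interface mentioned above, a minimal sketch of loading a pre-trained embedding (the embedding name is this example's default; `search` and `cosine_sim` are shown as commonly used convenience methods, not an exhaustive API):

```python
from paddlenlp.embeddings import TokenEmbedding

# Load a public pre-trained embedding; the name is this example's default.
token_embedding = TokenEmbedding(
    embedding_name="w2v.baidu_encyclopedia.target.word-word.dim300")

# Look up word vectors and compare two words.
vectors = token_embedding.search(["中国", "北京"])  # one 300-d vector per word
score = token_embedding.cosine_sim("中国", "北京")
print(vectors.shape, score)
```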


## Quick Start
@@ -17,37 +17,39 @@ PaddleNLP已预置多个公开的预训练Embedding,用户可以通过使用`p

- python >= 3.6
- paddlepaddle >= 2.0.0
- paddlenlp >= 2.0.0rc
- paddlenlp >= 2.0.0rc8

```
pip install paddlenlp==2.0.0rc
pip install paddlenlp==2.0.0rc8
```

### Download the Vocabulary

Download the vocabulary file dict.txt, which is used to build the word-to-id mapping.

```bash
wget https://paddlenlp.bj.bcebos.com/data/dict.txt
```

### Start Training

Using the public Chinese sentiment classification dataset ChnSentiCorp as the example dataset, run the command below to train the model on the training set (train.tsv) and evaluate it on the validation set (dev.tsv).
Using the public Chinese sentiment classification dataset ChnSentiCorp as the example dataset, run the command below to train the model on the training set (train.tsv) and evaluate it on the validation set (dev.tsv). During training, the vocabulary dict.txt is downloaded automatically and used to tokenize the dataset and build the data samples.
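The automatic download mentioned above corresponds to logic this commit adds to train.py; a minimal sketch of that pattern, with the URL and helper names taken from the diff below:

```python
import os

from paddlenlp.utils.downloader import get_path_from_url

import data  # this example's data.py

WORD_DICT_URL = "https://paddlenlp.bj.bcebos.com/data/dict.txt"

vocab_path = "./dict.txt"
if not os.path.exists(vocab_path):
    # Download dict.txt into the current directory on first run.
    get_path_from_url(WORD_DICT_URL, "./")
vocab = data.load_vocab(vocab_path)
```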

Start training:

```shell
# Use paddlenlp.embeddings.TokenEmbedding
python train.py --vocab_path='./dict.txt' --use_gpu=True --lr=5e-4 --batch_size=64 --epochs=20 --use_token_embedding=True --vdl_dir='./vdl_dir'
python train.py --device='gpu' \
    --lr=5e-4 \
    --batch_size=64 \
    --epochs=20 \
    --use_token_embedding=True \
    --vdl_dir='./vdl_dir'

# Use paddle.nn.Embedding
python train.py --vocab_path='./dict.txt' --use_gpu=True --lr=1e-4 --batch_size=64 --epochs=20 --use_token_embedding=False --vdl_dir='./vdl_dir'
python train.py --device='gpu' \
    --lr=1e-4 \
    --batch_size=64 \
    --epochs=20 \
    --use_token_embedding=False \
    --vdl_dir='./vdl_dir'
```

The parameters above are:

* `vocab_path`: path to the vocabulary file.
* `use_gpu`: whether to use the GPU for training; defaults to `True`.
* `device`: the device to train on; currently 'gpu', 'cpu', and 'xpu' are supported. Defaults to `gpu`.
* `lr`: learning rate; defaults to 5e-4.
* `batch_size`: number of examples per batch; defaults to 64.
* `epochs`: number of training epochs; defaults to 5.
4 changes: 2 additions & 2 deletions examples/word_embedding/data.py
@@ -64,13 +64,13 @@ def convert_example(example, vocab, unk_token_id=1, is_test=False):
"""

input_ids = []
for token in tokenizer.cut(example[0]):
for token in tokenizer.cut(example['text']):
token_id = vocab.get(token, unk_token_id)
input_ids.append(token_id)
valid_length = len(input_ids)

if not is_test:
label = np.array(example[-1], dtype="int64")
label = np.array(example["labels"], dtype="int64")
return input_ids, valid_length, label
else:
return input_ids, valid_length
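# Hypothetical usage sketch, assuming data.py's module-level tokenizer is
# initialized and `vocab` is the word-to-id dict from data.load_vocab, with
# the dict-style sample fields ('text', 'labels') consumed above:
#   example = {"text": "这家酒店很不错", "labels": 1}
#   input_ids, valid_length, label = convert_example(example, vocab)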
92 changes: 33 additions & 59 deletions examples/word_embedding/train.py
@@ -18,77 +18,48 @@

import paddle
import paddle.nn as nn
import paddlenlp as nlp
from paddlenlp.datasets import ChnSentiCorp
import paddlenlp
from paddlenlp.utils.downloader import get_path_from_url
from paddlenlp.embeddings import TokenEmbedding
from paddlenlp.data import JiebaTokenizer, Vocab
import data
from paddlenlp.datasets import load_dataset

parser = argparse.ArgumentParser(__doc__)
parser.add_argument(
"--epochs", type=int, default=5, help="Number of epoches for training.")
parser.add_argument(
'--use_gpu',
type=eval,
default=True,
help="Whether use GPU for training, input should be True or False")
parser.add_argument(
"--lr", type=float, default=5e-4, help="Learning rate used to train.")
parser.add_argument(
"--save_dir",
type=str,
default='./checkpoints/',
help="Directory to save model checkpoint")
parser.add_argument(
"--batch_size",
type=int,
default=64,
help="Total examples' number of a batch for training.")
parser.add_argument(
"--vocab_path",
type=str,
default="./dict.txt",
help="The directory to dataset.")
parser.add_argument(
"--init_from_ckpt",
type=str,
default=None,
help="The path of checkpoint to be loaded.")
parser.add_argument(
"--use_token_embedding",
type=eval,
default=True,
help="Whether use pretrained embedding")
parser.add_argument(
"--embedding_name",
type=str,
default="w2v.baidu_encyclopedia.target.word-word.dim300",
help="The name of pretrained embedding")
parser.add_argument(
"--vdl_dir", type=str, default="vdl_dir/", help="VisualDL log directory")
import data

# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=5, help="Number of epoches for training.")
parser.add_argument("--device", type=str, default="gpu", help="Select cpu, gpu, xpu devices to train model.")
parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate used to train.")
parser.add_argument("--save_dir", type=str, default='./checkpoints/', help="Directory to save model checkpoint")
parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.")
parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.")
parser.add_argument("--use_token_embedding", type=eval, default=True, help="Whether use pretrained embedding")
parser.add_argument("--embedding_name", type=str, default="w2v.baidu_encyclopedia.target.word-word.dim300", help="The name of pretrained embedding")
parser.add_argument("--vdl_dir", type=str, default="vdl_dir/", help="VisualDL log directory")
args = parser.parse_args()
# yapf: enable

WORD_DICT_URL = "https://paddlenlp.bj.bcebos.com/data/dict.txt"


def create_dataloader(dataset,
trans_fn=None,
mode='train',
batch_size=1,
use_gpu=False,
pad_token_id=0):
"""
Creates dataloader.
Args:
dataset(obj:`paddle.io.Dataset`): Dataset instance.
mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
use_gpu(obj:`bool`, optional, defaults to obj:`False`): Whether to use gpu to run.
pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
Returns:
dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
"""
if trans_fn:
dataset = dataset.apply(trans_fn, lazy=True)
dataset = dataset.map(trans_fn, lazy=True)

shuffle = True if mode == 'train' else False
sampler = paddle.io.BatchSampler(
@@ -133,7 +104,7 @@ def __init__(self,
padding_idx = vocab_size - 1
self.embedder = nn.Embedding(
vocab_size, emb_dim, padding_idx=padding_idx)
self.bow_encoder = nlp.seq2vec.BoWEncoder(emb_dim)
self.bow_encoder = paddlenlp.seq2vec.BoWEncoder(emb_dim)
self.fc1 = nn.Linear(self.bow_encoder.get_output_dim(), hidden_size)
self.fc2 = nn.Linear(hidden_size, fc_hidden_size)
self.dropout = nn.Dropout(p=0.3, axis=1)
@@ -158,26 +129,29 @@ def forward(self, text, seq_len=None):


if __name__ == '__main__':
paddle.set_device('gpu') if args.use_gpu else paddle.set_device('cpu')
assert args.device in [
"cpu", "gpu", "xpu"
], "Invalid device! Available device should be cpu, gpu, or xpu."
paddle.set_device(args.device)

# Loads vocab.
if not os.path.exists(args.vocab_path):
raise RuntimeError('The vocab_path can not be found in the path %s' %
args.vocab_path)
vocab = data.load_vocab(args.vocab_path)
vocab_path = "./dict.txt"
if not os.path.exists(vocab_path):
# download in current directory
get_path_from_url(WORD_DICT_URL, "./")
vocab = data.load_vocab(vocab_path)

if '[PAD]' not in vocab:
vocab['[PAD]'] = len(vocab)
# Loads dataset.
train_ds, dev_ds, test_ds = ChnSentiCorp.get_datasets(
['train', 'dev', 'test'])
train_ds, dev_ds, test_ds = load_dataset(
"chnsenticorp", splits=["train", "dev", "test"], lazy=False)

# Constructs the network.
num_classes = len(train_ds.get_labels())
model = BoWModel(
vocab_size=len(vocab),
num_classes=num_classes,
vocab_path=args.vocab_path,
num_classes=len(train_ds.label_list),
vocab_path=vocab_path,
use_token_embedding=args.use_token_embedding)
if args.use_token_embedding:
vocab = model.embedder.vocab
2 changes: 2 additions & 0 deletions paddlenlp/embeddings/README.md
@@ -98,6 +98,8 @@ print(score) # 8.611071

## Training

Below is a simple example of building a network with `TokenEmbedding`. For more on using `TokenEmbedding` in a training workflow, see [Word Embedding with PaddleNLP](../../examples/word_embedding/README.md).

```python
in_words = paddle.to_tensor([0, 2, 3])
input_embeddings = token_embedding(in_words)
```
1 change: 1 addition & 0 deletions paddlenlp/transformers/__init__.py
@@ -14,6 +14,7 @@

from .model_utils import PretrainedModel, register_base_model
from .tokenizer_utils import PretrainedTokenizer
from .attention_utils import create_bigbird_rand_mask_idx_list

from .bert.modeling import *
from .bert.tokenizer import *