Compatibility issue between paddle 2.4.1 and paddlenlp #4593

Closed
VBPython opened this issue Jan 27, 2023 · 14 comments

@VBPython
Describe the Bug

On AI Studio, when I run the ernie SKEP (paddlenlp) pretrained model on paddle 2.2.2, training runs at 0.8 step/s. Under the same conditions (V100, 32 GB GPU memory), after upgrading paddle to 2.4.1 it becomes extremely slow: of four runs, one reached only 0.01 step/s and the other three ran out of memory outright. Please help check whether this is a compatibility problem between paddle 2.4.1 and paddlenlp 2.5.0, or between those frameworks and AI Studio. By convention we upgrade paddle and paddlenlp to the latest versions before training, so a compatibility issue like this would have a large impact. Thanks.
The code is attached below:

```python
!pip install --upgrade paddlenlp -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install --upgrade paddlepaddle==2.4.1 -i https://pypi.tuna.tsinghua.edu.cn/simple

from paddlenlp.datasets import load_dataset
train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])

from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

# Load the model by specifying its name
model = SkepForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="skep_ernie_1.0_large_ch",
    num_classes=len(train_ds.label_list))

# Likewise, load the matching tokenizer by name; it preprocesses the text,
# e.g. splitting it into tokens and converting tokens to ids.
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")

import os
from functools import partial

import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader

def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    # Convert raw data into the model's input format; encoded_inputs is a dict
    # containing input_ids, token_type_ids, etc.
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    # input_ids: vocabulary ids of the tokens after tokenization
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids: whether a token belongs to sentence 1 or sentence 2 (segment ids)
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        # label: sentiment polarity class
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid: id of each example
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid

# Batch size
batch_size = 32
# Maximum text sequence length
max_seq_length = 256

# Convert the data into the model's input format
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

# Assemble examples into batches:
# - pad variable-length sequences to the longest length in the batch
# - stack the per-example labels together
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()                                            # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

import time

from utils import evaluate

# Number of training epochs
epochs = 1
# Directory for saving model checkpoints during training
ckpt_dir = "skep_ckpt"
# len(train_data_loader) is the number of steps in one epoch
num_training_steps = len(train_data_loader) * epochs

# Adam optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5,
    parameters=model.parameters())

# Cross-entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()

# Accuracy metric
metric = paddle.metric.Accuracy()

# Start training
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        # Feed the data to the model
        logits = model(input_ids, token_type_ids)
        # Compute the loss
        loss = criterion(logits, labels)
        # Predicted class probabilities
        probs = F.softmax(logits, axis=1)
        # Compute accuracy
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()

        # Backpropagate and update the parameters
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # Evaluate the current model
            evaluate(model, criterion, metric, dev_data_loader)
            # Save the model parameters
            model.save_pretrained(save_dir)
            # Save the tokenizer vocabulary, etc.
            tokenizer.save_pretrained(save_dir)
```

Additional Supplementary Information

No response

@paddle-bot
paddle-bot bot commented Jan 27, 2023

Hi! We've received your issue and will arrange for technicians to answer your question as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment & version info, and error messages. You may also look for an answer in the official API docs, the FAQ, past GitHub issues, and the AI community. Have a nice day!

@LDOUBLEV

LDOUBLEV commented Jan 28, 2023

!pip install --upgrade paddlepaddle==2.4.1 -i https://pypi.tuna.tsinghua.edu.cn/simple

This installs the CPU build of PaddlePaddle. Install the GPU build instead: pip install paddlepaddle-gpu==2.4.1. See https://www.paddlepaddle.org.cn/

@VBPython
Author

!pip install --upgrade paddlepaddle==2.4.1 -i https://pypi.tuna.tsinghua.edu.cn/simple

This installs the CPU build of PaddlePaddle. Install the GPU build instead: pip install paddlepaddle-gpu==2.4.1. See https://www.paddlepaddle.org.cn/

On AI Studio I changed it to:
!pip install --upgrade paddlepaddle-gpu==2.4.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
Training now runs normally, but executing:
import paddle
print(paddle.__version__)
still shows 2.2.2.
What is going on? Did the upgrade to 2.4.1 fail after all? Please let me know, thanks.

@VBPython
Author

Also, after installation, the following error sometimes appears:
Error: Can not import paddle core while this file exists: /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
What causes this, and how can it be fixed?

@LDOUBLEV

Run pip list | grep paddle to check whether paddle installed correctly.

paddle may have been installed under a different Python version.

If multiple versions of paddle are installed, uninstall the others and keep only the version you want.
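One stdlib-only way to spot this kind of mismatch (the `diagnose` helper below is illustrative, not part of paddle or pip): compare the interpreter that `pip` installs into (`sys.executable`) with the file the imported module actually loads from. If the two point at different environments, `pip list` and the imported version can disagree.

```python
import importlib
import sys

def diagnose(module_name):
    """Show which interpreter is running and where a module is imported from.

    If sys.executable (the environment pip installs into) and the module's
    __file__ live in different site-packages trees, `pip list` and
    `module.__version__` can disagree.
    """
    mod = importlib.import_module(module_name)
    return {
        "interpreter": sys.executable,
        "module_file": getattr(mod, "__file__", "<built-in>"),
    }

# Using the stdlib `json` module as a stand-in for paddle:
info = diagnose("json")
print(info["interpreter"])
print(info["module_file"])
```

Run the same check on `paddle` in the notebook; the two paths should live under the same environment prefix.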

@VBPython
Author

On AI Studio, running:
!pip list | grep paddle
shows:
paddle-bfloat 0.1.7
paddle2onnx 1.0.5
paddlefsl 1.1.0
paddlehub 2.0.4
paddlenlp 2.5.0
paddlepaddle-gpu 2.4.1 (previously 2.2.2)
tb-paddle 0.3.6
which suggests the upgrade succeeded.
But running:
import paddle
print(paddle.__version__)
still prints:
2.2.2
What is going on?

@LDOUBLEV

It probably did not install successfully.
Open a terminal in the AI Studio project and run nvcc -V to check whether the CUDA version is 10.1.
(screenshot)

paddle 2.4.1 requires CUDA >= 10.2.
If you selected paddle 2.2.2 when creating the project, the default CUDA is likely 10.1, which does not meet the requirement of the 2.4.1 build.
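That check can be sketched in a few lines of Python (the sample string below is an assumed example of the release line `nvcc -V` prints; run `nvcc -V` yourself to get the real one): parse the release number and compare it against the CUDA >= 10.2 requirement stated above.

```python
import re

# Assumed sample of the release line `nvcc -V` prints on a CUDA 10.1 image
SAMPLE = "Cuda compilation tools, release 10.1, V10.1.243"

def cuda_release(nvcc_output):
    """Extract the CUDA release as a (major, minor) tuple from `nvcc -V` output."""
    m = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(m.group(1)), int(m.group(2))) if m else None

def supports_paddle_241(release):
    """paddle 2.4.1 GPU builds require CUDA >= 10.2 (per the comment above)."""
    return release is not None and release >= (10, 2)

print(cuda_release(SAMPLE))                       # (10, 1)
print(supports_paddle_241(cuda_release(SAMPLE)))  # False
```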

@VBPython
Author

It is indeed 10.1, so the installation must have failed.
Well, that brings us back to the topic of this issue: the training seems to fail whenever it runs on paddle 2.4.1 (the earlier successful runs were on 2.2.2).
We built a similar project in BML CodeLab and ran:

```python
# 1. Update the environment
!pip install --upgrade paddlenlp -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install --upgrade paddlepaddle-gpu==2.4.1 -i https://pypi.tuna.tsinghua.edu.cn/simple

# 2. Prepare the datasets
from paddlenlp.datasets import load_dataset
train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])
```

which raised:
Error: Can not import paddle core while this file exists: /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/libpaddle.so

I checked; only one paddle version is installed.

@LDOUBLEV

Error: Can not import paddle core while this file exists: /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/libpaddle.so

This error means paddle was not installed successfully; we have run into it before.

First run nvcc -V to check whether your CUDA version meets the requirement, then install the paddle build that matches your CUDA version.

For example, CUDA 11.2:
python -m pip install paddlepaddle-gpu==2.4.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
CUDA 11.6:
python -m pip install paddlepaddle-gpu==2.4.1.post116 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html

It works fine on my end.
(screenshot)
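The two commands can be folded into a small lookup, sketched below. The mapping covers only the CUDA versions named in this comment; it is not an exhaustive table of paddle wheel tags.

```python
# CUDA release -> matching paddle 2.4.1 GPU wheel spec (only the versions
# mentioned above; other CUDA versions need their own post-tag).
WHEEL_FOR_CUDA = {
    (11, 2): "paddlepaddle-gpu==2.4.1.post112",
    (11, 6): "paddlepaddle-gpu==2.4.1.post116",
}

def install_command(cuda_release):
    """Build the pip command for a (major, minor) CUDA release, or None if unknown."""
    wheel = WHEEL_FOR_CUDA.get(cuda_release)
    if wheel is None:
        return None
    return ("python -m pip install %s "
            "-f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html" % wheel)

print(install_command((11, 2)))
```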

@VBPython
Author

Following the method above, I restructured the code as follows:

```python
# 1. Update the environment
!pip install --upgrade paddlenlp -i https://pypi.tuna.tsinghua.edu.cn/simple
!python -m pip install paddlepaddle-gpu==2.4.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html

# 2. Prepare the datasets
from paddlenlp.datasets import load_dataset
train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])

# 3. Import the pretrained model and tokenizer
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
model = SkepForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="skep_ernie_1.0_large_ch",
    num_classes=len(train_ds.label_list)
)
tokenizer = SkepTokenizer.from_pretrained(
    pretrained_model_name_or_path="skep_ernie_1.0_large_ch"
)

# 4. Prepare the dataset
import os
import numpy as np

# Run the examples through the tokenizer
def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    encoded_inputs = tokenizer(text=example["text"], max_seq_length=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid

# 5. Set up the model
from functools import partial
from paddlenlp.data import Tuple, Pad, Stack
from utils import create_dataloader

epochs = 1
batch_size = 32
max_seq_length = 256
trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
    Stack()
): [data for data in fn(samples)]

train_data_loader = create_dataloader(train_ds, trans_fn=trans_func, mode="train",
                                      batch_size=batch_size, batchify_fn=batchify_fn)
dev_data_loader = create_dataloader(dev_ds, trans_fn=trans_func, mode="dev",
                                    batch_size=batch_size, batchify_fn=batchify_fn)

import time
import paddle
import paddle.nn.functional as F
from utils import evaluate

# Number of training epochs
epochs = 1
# Directory for saving model checkpoints during training
saved_dir = "skep_ckpt"
# len(train_data_loader) is the number of steps in one epoch
num_training_steps = len(train_data_loader) * epochs

# Loss function
criterion = paddle.nn.loss.CrossEntropyLoss()

# Optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5,
    parameters=model.parameters()
)

# Accuracy metric
metric = paddle.metric.Accuracy()

# 6. Start training
# CUDA_VISIBLE_DEVICES=0,1,2,3
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        # Compute accuracy
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global_step: %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc, 10 / (time.time() - tic_train))
            )
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

        if global_step % 100 == 0:
            save_dir = os.path.join(saved_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            evaluate(model, criterion, metric, dev_data_loader)
            model.save_pretrained(save_dir)
            tokenizer.save_pretrained(save_dir)
```

The paddle installation should be fine now, but the problem in the title persists: GPU memory blows up as soon as training starts. So it still looks like a compatibility issue between paddle 2.4.1 and paddlenlp. The environment is BML CodeLab (V100, 32 GB GPU memory).

@LDOUBLEV

Could you clean up the code formatting? It is very hard to read on GitHub. Alternatively, post the AI Studio project link directly.

Just put the code between ``` fences:
(screenshot)

code is here

@VBPython
Author

Sorry about that. The AI Studio project link is:
Shared link to the project "文本情感分析" (Text Sentiment Analysis), valid for three days: https://aistudio.baidu.com/studio/project/partial/verify/5416082/a6a54f89dd034dd4a5c29930a950811c

@LDOUBLEV LDOUBLEV assigned guoshengCS and unassigned LDOUBLEV Jan 29, 2023
@guoshengCS guoshengCS transferred this issue from PaddlePaddle/Paddle Jan 31, 2023
@chenxiaozeng
Contributor

chenxiaozeng commented Feb 6, 2023

Shared link to the project "文本情感分析" (Text Sentiment Analysis), valid for three days:
https://aistudio.baidu.com/studio/project/partial/verify/5416082/8075d88d387f44adaec7a18264093e51

@chenxiaozeng chenxiaozeng assigned sijunhe and unassigned guoshengCS Feb 6, 2023
@sijunhe
Collaborator

sijunhe commented Feb 6, 2023

@VBPython There is a small bug in the code: encoded_inputs=tokenizer(text=example["text"], max_seq_length=max_seq_length) uses the wrong API. It should be changed to encoded_inputs=tokenizer(text=example["text"], max_length=max_seq_length, truncation=True) so that the input is actually truncated.
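To see why the missing truncation can matter for the OOM reported in this issue: self-attention score tensors grow quadratically with sequence length, so a single untruncated long example costs far more memory than max_seq_length=256 would suggest. A back-of-the-envelope sketch (the batch size matches the script; the head count and the 2048-token example length are assumed for illustration):

```python
def attention_score_elements(seq_len, batch_size=32, num_heads=16):
    """Elements in one layer's attention-score tensor: batch * heads * L * L."""
    return batch_size * num_heads * seq_len * seq_len

truncated = attention_score_elements(256)     # inputs cut to the intended max length
untruncated = attention_score_elements(2048)  # a long review left untruncated
print(untruncated // truncated)  # 64: an 8x longer input costs 64x the score memory
```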

@paddle-bot paddle-bot bot closed this as completed Feb 13, 2024