Add NLP model interpretation #1752
Delete all empty files such as [.gitkeep].
[//]: shenyaozong(shenyaozong@baidu.com)

Discussion
Please delete this section. Since this is public-facing, internal information such as icode and the Hi group should not be exposed.
done
    return args


def dataLoad(args):
dataLoad -> data_load
The function naming style should be consistent.
done
PaddleNLP will release a minor version that fixes the loading problem for the English RoBERTa model. At that point the usage here can be simplified, so users no longer need to copy a large chunk of code and models into the development directory.
@@ -0,0 +1,33 @@
backports.entry-points-selectable==1.1.1
For users who reach examples, you can assume that paddlenlp and the CPU or GPU version of paddle are already installed successfully.
So I would like to confirm: once paddlepaddle and paddlenlp are installed, which additional dependencies still need to be installed?
BTW, pinning exact versions in requirements will very likely cause version conflicts and compatibility problems with the user's environment, which is not good practice for requirements. Unless absolutely necessary, use >= version specifiers.
Removed the paddle-related dependencies and changed == to >=.
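The relaxation described above (exact pins rewritten as minimum versions) could be applied mechanically. The following is only a sketch: `relax_pins` is a hypothetical helper, not something from this PR, and it assumes simple `name==version` lines.

```python
import re

# Hypothetical helper: rewrite exact pins ("pkg==1.2.3") as minimum-version
# specifiers ("pkg>=1.2.3"). Lines without an exact pin pass through unchanged.
# The (?!=) guard leaves the rarely used "===" operator alone.
_PIN = re.compile(r"^(\s*[A-Za-z0-9_.\-\[\]]+\s*)==(?!=)(\s*[^\s#;]+)")


def relax_pins(requirements_text):
    """Return the requirements text with every '==' pin turned into '>='."""
    lines = [_PIN.sub(r"\1>=\2", line) for line in requirements_text.splitlines()]
    return "\n".join(lines)
```

For example, `relax_pins("backports.entry-points-selectable==1.1.1")` yields the `>=1.1.1` form that appears in the later revision of the requirements file.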
@@ -0,0 +1,33 @@
backports.entry-points-selectable==1.1.1
Where is the backport dependency needed?
It is needed in tokenizer_util.py under each task, when loading lru_cache.
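For context, `lru_cache` has been part of the standard library's `functools` since Python 3.2, so a backport is only relevant on legacy interpreters. The sketch below shows a common import-with-fallback pattern. The fallback import path belongs to the `backports.functools_lru_cache` package and is an assumption here, not necessarily what tokenizer_util.py does; `token_length` is a hypothetical function used only to show the decorator.

```python
# On Python 3 the first import always succeeds, so no extra dependency is needed.
try:
    from functools import lru_cache
except ImportError:  # assumption: only reached on legacy interpreters
    from backports.functools_lru_cache import lru_cache


@lru_cache(maxsize=None)
def token_length(token):
    """Hypothetical cached helper: repeated calls with the same token hit the cache."""
    return len(token)
```

If the code targets Python 3 only, the `try`/`except` can be reduced to the single `functools` import.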
query_att = attention[0]
title_att = attention[1]

model.clear_gradients()
Since paddle 2.0, the model.clear_grad() interface is the recommended default, consistent with torch. Please replace this function globally.
Replaced globally.
print('query_att: %s' % query_att.shape)
print('title_att: %s' % title_att.shape)

# print([query_att, query, title_att, title])
Delete meaningless comments.
done
# yapf: enable


def interpreter(model,
Function names are usually designed as verb-noun phrases or plain verbs. A plain noun is used here, so the name does not seem to convey the function's meaning precisely.
Changed to the plain verb interpret.
@@ -0,0 +1,57 @@
TASK=similarity
Script file names should be uniformly lowercase, as required by the Baidu coding standard.
Script file names are now uniformly lowercase.
dev_ds = Senti_data().read(
    os.path.join(args.data_dir, 'dev'), args.language)
dev_ds.map(map_fn, batched=True)
Generally, Python code is organized by module. In my view, the code directory does not form a meaningful module from a code-hosting perspective, so I suggest removing this directory level.
Removed the code directory.
The README has been updated accordingly.
sys.path.append('../task/similarity')
from LIME.lime_text import LimeTextExplainer
from roberta.tokenizer import RobertaTokenizer
Later on, this can be called directly from PaddleNLP.
    return args


class Similarity_data(DatasetBuilder):
For class coding style, this should be
class SimilarityData
done
import logging
import argparse

import paddle as P
The official paddle code does not publicly recommend using
import paddle as P
This is open to discussion; import paddle.functional as F could be used.
Mainly from the perspective of the API system and user experience, import torch as T is rarely seen. This comment is open to discussion.
Changed to a plain import paddle.
    return args


class Senti_data(DatasetBuilder):
Per the Google coding style, Python class names use UpperCamelCase.
Senti_data -> SentiData
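The two renames requested in this review (snake_case for functions, UpperCamelCase for classes) can be illustrated with a pair of small converters. Both helpers are hypothetical and are shown only to make the conventions concrete; they are not part of the PR.

```python
import re


def to_snake_case(name):
    """Function style: insert '_' before each interior capital, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def to_upper_camel(name):
    """Class style: drop underscores and capitalize the first letter of each part."""
    return "".join(part[:1].upper() + part[1:] for part in name.split("_") if part)
```

Applied to the names in this thread, `to_snake_case("dataLoad")` gives `data_load` and `to_upper_camel("Senti_data")` gives `SentiData`.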
done
@@ -0,0 +1,30 @@
backports.entry-points-selectable>=1.1.1
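Most of these dependencies are pulled in automatically when paddlenlp is installed, so users do not need to install them separately. Please create a clean Python 3.8 environment with conda and then confirm whether the code introduces any additional dependencies.
My understanding is that packages such as visualdl, LAC, and spacy are genuinely new extra dependencies,
but numpy, pandas, and the like are mostly pulled in automatically when paddle and paddlenlp are installed correctly, so they do not need to be called out.
Take six, for example: everything targets py3 by default now, so libraries like six are unnecessary. Please review the dependency requirements here and simplify them. List only what must be installed, on top of paddlenlp and paddle, to run this module, rather than importing every dependency of your current runtime environment.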
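One way to carry out the check suggested above is to run a small script in the clean environment and see which extra modules fail to resolve. This is a sketch under assumptions: the module list is illustrative (taken from the names mentioned in the comment, and assuming each package's import name matches), and `find_missing` is a hypothetical helper.

```python
import importlib.util

# Assumption: import names of the extra packages needed beyond paddle and
# paddlenlp. Adjust the list to whatever the module actually imports.
EXTRA_MODULES = ["visualdl", "LAC", "spacy"]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(EXTRA_MODULES)
    print("missing extra dependencies:", missing or "none")
```

Only the names reported as missing after installing paddle and paddlenlp would then belong in the requirements file.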
done
@@ -0,0 +1,608 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Let me confirm: did you do any additional development or insertion work in this tokenizer file? For instance, if this code is accessible inside the paddlenlp library, do you still need to copy it here? It is rather redundant.
All offline roberta tokenizers have been replaced with calls to the online paddlenlp version.
The minor release 2.2.5, due next week, fixes the missing RoBERTa en model. You can try installing the latest paddlenlp develop version locally via setup, which would cut a lot of code and documentation.
Regarding files such as modeling.py and generation_utils.py under the roberta directory in every directory: can they all reuse PaddleNLP's internal files, instead of each task submitting its own duplicated, very large copy?
@@ -0,0 +1,630 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Can this modeling code likewise reuse PaddleNLP's?
@@ -0,0 +1,1023 @@
# !/usr/bin/env python3
I am thinking the generation_utils.py and modeling.py files here could be deleted as well? My understanding is that they all duplicate files in the paddlenlp library?
# -*- coding:utf-8 -*-
##########################################################
# Copyright (c) 2019 Baidu.com, Inc. All Rights Reserved #
##########################################################
Please also keep the copyright consistent.
done
eos_token_id = eos_token_id if eos_token_id is not None else getattr(
    self, 'eos_token_id', None)
pad_token_id = pad_token_id if pad_token_id is not None else getattr(
    self, 'pad_token_id', None)
Please also confirm whether paddlenlp.transformers.generation_utils can be used directly; this looks like a slightly earlier version of it.
Removed generation_utils.py, model_utils.py, utils.py, and vocab.py from the roberta directory; everything now calls the paddlenlp interfaces.
return None # Overwrite for models with output embeddings

@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
Please also confirm whether this file can use paddlenlp.transformers.model_utils directly.
Removed generation_utils.py, model_utils.py, utils.py, and vocab.py from the roberta directory; everything now calls the paddlenlp interfaces.
questions,
contexts,
stride=args.doc_stride,
max_seq_len=args.max_seq_len)
The tokenizer in paddlenlp has been updated, with some breaking changes to align with HF behavior: the tokenizer used to return a list of dict and now returns a dict of list, https://github.com/PaddlePaddle/PaddleNLP/pull/1713/files#diff-86a1b461121c41dca0e85147910f19e6018e3e23aa374b28ddc3f60751c0fd3e .
If you want to avoid changing the other code, you can simply add return_dict=False here
to keep the other code working; a little adaptation is still needed.
Added a type check on the tokenizer result and manually convert the result to the format of the old version.
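The adaptation described above (check the result type, then convert a dict of lists back to the older list of dicts) might look roughly like the sketch below. `to_list_of_dict` is a hypothetical name, and the sketch assumes every field in the dict-of-list result has one entry per example; it is not the code from this PR.

```python
def to_list_of_dict(encoded):
    """Normalize a tokenizer result to the older list-of-dict layout."""
    if isinstance(encoded, list):
        return encoded  # already in the old format, nothing to do
    # New format: a dict-like object mapping each field to a list with one
    # entry per example. Regroup it into one dict per example.
    keys = list(encoded.keys())
    n_examples = len(encoded[keys[0]]) if keys else 0
    return [{key: encoded[key][i] for key in keys} for i in range(n_examples)]
```

For instance, `{"input_ids": [[1, 2], [3]], "token_type_ids": [[0, 0], [0]]}` becomes a two-element list whose first item is `{"input_ids": [1, 2], "token_type_ids": [0, 0]}`, so downstream code written for the old layout keeps working.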
max_position_embeddings=512,
type_vocab_size=16,
initializer_range=0.02,
pad_token_id=0):
Note the differences from paddlenlp/transformers/roberta/modeling.py; layer_norm_eps can be added here as well.
Added.
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
If an optimizer is used, it is still recommended to use optimizer.clear_grad().
Replaced model.clear_gradients() -> opt.clear_grad().
# start_logits: (bsz, seq); end_logits: (bsz, seq); cls_logits: (bsz, 2)
# attention: list((bsz, head, seq, seq) * 12); embedded: (bsz, seq, emb)
_, start_logits, end_logits, cls_logits, attentions, embedded = model.forward_interpret(
    *fwd_args, **fwd_kwargs)
What we originally had in mind was for users to obtain any intermediate results they want via register_forward_post_hook. For example, the hook can put the needed content into a global list, which turns the various needs for intermediate results into plugins and makes better reuse of the code in the current model.
We had a similar need when working on distillation and put together a rough plugin-style version: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/distill_utils.py#L189 ,
We will also look into how to expose this kind of functionality as an API so that it is easier for everyone to use.
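The hook-based approach described above can be illustrated without any framework. In the sketch below, `ToyLayer` is a hypothetical stand-in for a framework layer, not Paddle's implementation; it only mimics the calling convention of a forward post hook, `hook(layer, input, output)`, to show how intermediate results end up in a global list.

```python
class ToyLayer:
    """Hypothetical stand-in for a layer that supports forward post hooks."""

    def __init__(self, fn):
        self._fn = fn
        self._post_hooks = []

    def register_forward_post_hook(self, hook):
        self._post_hooks.append(hook)

    def __call__(self, x):
        out = self._fn(x)
        # Run every hook after the forward computation. A hook may return a
        # replacement output; returning None leaves the output unchanged.
        for hook in self._post_hooks:
            replaced = hook(self, x, out)
            if replaced is not None:
                out = replaced
        return out


captured = []  # global list that collects intermediate results


def save_output(layer, inputs, output):
    captured.append(output)


layer = ToyLayer(lambda xs: [v * 2 for v in xs])
layer.register_forward_post_hook(save_output)
result = layer([1, 2, 3])
```

With this pattern the model's own forward code stays untouched, and each kind of intermediate result (attentions, embeddings, and so on) is collected by registering a separate hook on the relevant sublayer.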
We have also built a similar interface for the second open-source release; it is not available in the current one.
version_2_with_negative: bool=False,
n_best_size: int=20,
max_answer_length: int=30,
cls_threshold: float=0.5):
So this differs from what is currently in paddlenlp.metric, right? If it is a fairly common metric, we can later look at adding it to paddlenlp and shipping it in the package.
Added some classes specific to this project and modified two existing classes from paddlenlp.metric.
answerable_probs[1]
])

# Only keep the best `n_best_size` predictions.
Please also watch the formatting, and please double-check whether pre-commit is being used.
I usually run pre-commit at the end of each CR round. pre-commit has now been used to fix the code style issues.
@@ -0,0 +1,12089 @@
[PAD]
Could larger data and vocabulary files like this be hosted on bos and provided as links?
Providing a link would require users to download the file separately and place it in the right directory, which is cumbersome, so the dictionary was not stored separately. Could this be improved in the next version? Time is tight at the moment.
from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model
from roberta.transformer import TransformerEncoderLayer, TransformerEncoder

__all__ = [
I also see multiple copies of roberta/modeling.py and roberta/transformer.py. Are they identical?
transformer.py is identical; modeling.py is not.
* upload NLP interpretation
* fix problems and relocate project
* remove abandoned picture
* remove abandoned picture
* fix dead link in README
* fix dead link in README
* fix code style problems
* fix CR round 1
* remove .gitkeep files
* fix code style
* fix file encoding problem
* fix code style
* delete duplicated files due to directory rebuild
* fix CR round 2
* fix code style
* fix ernie tokenizer
* fix code style
* fix problem from CR round 1
* fix bugs
* fix README
* remove duplicated files
* deal with diff of old and new tokenizer results
* fix CR round 4
* fix code style
* add missing dependence
* fix broken import path
* move some data file to cloud
* MRC upper case to lower case
Co-authored-by: Zeyu Chen <chenzeyu01@baidu.com>
Co-authored-by: binlinquge <xxx>
Co-authored-by: Guo Sheng <guosheng@baidu.com>
PR types
upload of new module
PR changes
New module
Description
This module is used for interpreting NLP models. Please see the README for details.