Add NLP model interpretation #1752

binlinquge · 2022-03-10T09:19:18Z

PR types

upload of new module

PR changes

New module

Description

This module is used for interpreting NLP models. Please see the README for details.

CLAassistant · 2022-03-10T09:31:32Z

All committers have signed the CLA.

ZeyuChen

Delete all empty placeholder files such as [.gitkeep].

ZeyuChen · 2022-03-15T16:05:14Z

examples/model_interpretation/README.md

+[//]: shenyaozong(shenyaozong@baidu.com)
+
+
+讨论


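Please delete the following section. This README is public-facing, so there is no need to expose internal details such as icode and the Hi group.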

ZeyuChen · 2022-03-15T16:06:26Z

examples/model_interpretation/model_interpretation/evaluation/faithfulness/NewP_Analysis.py

+    return args
+
+
+def dataLoad(args):


dataLoad -> data_load
The function naming style needs to be consistent.

ZeyuChen

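PaddleNLP will ship a minor release that fixes loading of the English RoBERTa model. Once it is out, the usage here can be simplified so that users no longer have to copy a large chunk of code and models into their development directory.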

ZeyuChen · 2022-03-15T16:15:51Z

examples/model_interpretation/requirements.txt

@@ -0,0 +1,33 @@
+backports.entry-points-selectable==1.1.1


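For users who arrive at examples, you can assume that paddlenlp and the CPU or GPU version of paddle are already installed.
So I want to confirm: once paddlepaddle and paddlenlp are installed, which additional dependencies does a user still need?
BTW, pinning exact versions in requirements will very likely cause version conflicts and compatibility problems with the user's environment. It is not good practice; unless there is no alternative, use >= version specifiers.

Removed the paddle-related dependencies and changed == to >=.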
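
The change described in this reply can be sketched as a small script. This is only an illustration, not the script used in the PR: the helper name `relax_pins`, the `skip_prefixes` parameter, and the `paddlenlp==2.2.4` sample line are hypothetical, and only simple `name==version` lines are handled.

```python
import re

# Matches simple "name==version" requirement lines (no extras, markers, or URLs).
_PIN = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*==\s*([^\s#;]+)\s*$")


def relax_pins(lines, skip_prefixes=("paddle",)):
    """Turn exact pins into lower bounds and drop packages assumed preinstalled."""
    relaxed = []
    for line in lines:
        match = _PIN.match(line)
        if not match:
            # Leave comments, blanks, and more complex specifiers untouched.
            relaxed.append(line.rstrip("\n"))
            continue
        name, version = match.groups()
        if name.lower().startswith(skip_prefixes):
            continue  # e.g. paddlepaddle, paddlenlp (assumed already installed)
        relaxed.append(f"{name}>={version}")
    return relaxed


if __name__ == "__main__":
    # The second line is a made-up sample, not taken from the PR.
    sample = ["backports.entry-points-selectable==1.1.1", "paddlenlp==2.2.4"]
    print("\n".join(relax_pins(sample)))
```

A lower bound still records the version the code was developed against while letting pip resolve against whatever the user already has installed.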

ZeyuChen · 2022-03-15T16:16:15Z

examples/model_interpretation/requirements.txt

@@ -0,0 +1,33 @@
+backports.entry-points-selectable==1.1.1


Where is the backport dependency needed?

It is needed when loading lru_cache in tokenizer_util.py under each task.

ZeyuChen · 2022-03-15T16:17:59Z

...es/model_interpretation/model_interpretation/task/similarity/simnet/interpreter_attention.py

+        query_att = attention[0]
+        title_att = attention[1]
+
+        model.clear_gradients()


Since paddle 2.0, the recommended default interface is model.clear_grad(), consistent with torch. Please replace this call globally.

Replaced globally.

ZeyuChen · 2022-03-15T16:18:07Z

...es/model_interpretation/model_interpretation/task/similarity/simnet/interpreter_attention.py

+        print('query_att: %s' % query_att.shape)
+        print('title_att: %s' % title_att.shape)
+
+        # print([query_att, query, title_att, title])


Delete the meaningless commented-out code.

ZeyuChen · 2022-03-15T16:19:03Z

...es/model_interpretation/model_interpretation/task/similarity/simnet/interpreter_attention.py

+# yapf: enable
+
+
+def interpreter(model,


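Function names are usually verb-noun phrases or plain verbs. A plain noun is used here, so the name does not convey what the function does very precisely.

Changed to the plain verb interpret.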

ZeyuChen · 2022-03-15T16:27:33Z

examples/model_interpretation/model_interpretation/rationale_extraction/Generate.sh

@@ -0,0 +1,57 @@
+TASK=similarity


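Script file names need to be uniformly lowercase, according to the Baidu code style requirements.

Script file names are now all lowercase.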

ZeyuChen · 2022-03-15T16:30:09Z

examples/model_interpretation/model_interpretation/rationale_extraction/code/sentiment_pred.py

+
+    dev_ds = Senti_data().read(
+        os.path.join(args.data_dir, 'dev'), args.language)
+    dev_ds.map(map_fn, batched=True)


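Generally, Python code is organized by module. In my view, the code directory does not form a meaningful module from a code-hosting perspective, so I suggest removing this directory level.

The code directory has been removed.

The README has been updated accordingly.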

ZeyuChen · 2022-03-15T16:30:56Z

examples/model_interpretation/model_interpretation/rationale_extraction/code/similarity_pred.py

+
+sys.path.append('../task/similarity')
+from LIME.lime_text import LimeTextExplainer
+from roberta.tokenizer import RobertaTokenizer


Later on, this can be called directly from PaddleNLP.

ZeyuChen · 2022-03-15T16:31:33Z

examples/model_interpretation/model_interpretation/rationale_extraction/code/similarity_pred.py

+    return args
+
+
+class Similarity_data(DatasetBuilder):


Following the coding style for classes, this should be

class SimilarityData

ZeyuChen · 2022-03-15T16:34:16Z

examples/model_interpretation/model_interpretation/rationale_extraction/code/similarity_pred.py

+import logging
+import argparse
+
+import paddle as P


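The official paddle code does not recommend this form for external use:

import paddle as P

This point is open to discussion. import paddle.nn.functional as F is fine.
Mainly from an API-design and user-experience standpoint, import torch as T is rarely seen. This comment is open to discussion.

Changed to a plain import paddle.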

ZeyuChen · 2022-03-16T15:17:29Z

examples/model_interpretation/rationale_extraction/sentiment_pred.py

+    return args
+
+
+class Senti_data(DatasetBuilder):


Per the Google coding style, Python class names use upper camel case:
Senti_data -> SentiData

ZeyuChen · 2022-03-16T15:28:25Z

examples/model_interpretation/requirements.txt

@@ -0,0 +1,30 @@
+backports.entry-points-selectable>=1.1.1


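Most of these dependencies are pulled in automatically when paddlenlp is installed, so users do not need to install them separately. Please create a clean Python 3.8 environment with conda and then confirm whether the code needs any additional dependencies.
As I understand it, visualdl, LAC, and spacy are genuinely new additional dependencies,
but packages such as numpy and pandas are mostly pulled in automatically when paddle and paddlenlp are installed correctly, so they do not need to be listed.
Take six as an example: everything targets py3 by default now, so libraries like six are not needed. Please review the dependency list and simplify it. List only what must be installed on top of paddlenlp and paddle to run this module, rather than every dependency in your current environment.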
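
One way to approach the check requested here is to list what the example code actually imports and subtract what is assumed to be installed already. The sketch below is a rough aid under that assumption, not part of the PR: `extra_imports` and its `provided` parameter are hypothetical names, and the result still includes standard-library and project-local modules unless they are passed in `provided`.

```python
import ast


def extra_imports(source, provided=frozenset({"paddle", "paddlenlp"})):
    """Return top-level module names imported by `source`, minus `provided`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b" counts as a dependency on the top-level package "a".
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Skip relative imports (level > 0); they refer to local modules.
            found.add(node.module.split(".")[0])
    return sorted(found - set(provided))


if __name__ == "__main__":
    code = "import os\nimport paddle\nfrom paddlenlp.data import Pad\nimport spacy\n"
    print(extra_imports(code, provided={"paddle", "paddlenlp", "os"}))
```

Because the source is only parsed, never executed, the scan works even in an environment where the imported packages are missing.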

ZeyuChen · 2022-03-16T15:29:13Z

examples/model_interpretation/task/mrc/roberta/bert_tokenizer.py

@@ -0,0 +1,608 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.


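Let me confirm: did you do any additional development on, or insert anything into, this tokenizer file? For example, if this code is accessible from within the paddlenlp library, do you still need to copy it over? It is rather redundant.

All local copies of the roberta tokenizer have been replaced with calls to the version shipped in paddlenlp.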

ZeyuChen

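Next week's minor release, 2.2.5, fixes the missing Roberta en model. You can try installing the latest paddlenlp develop version locally via setup, which would cut a lot of code and documentation.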

ZeyuChen

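Regarding modeling.py, generation_utils.py, and similar files under the roberta directory in every task directory: can they all reuse PaddleNLP's internal files, rather than committing a duplicate copy of these very large files for each task?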

ZeyuChen · 2022-03-18T16:18:03Z

examples/model_interpretation/task/similarity/roberta/modeling.py

@@ -0,0 +1,630 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.


Can this modeling code likewise reuse PaddleNLP's?

ZeyuChen · 2022-03-18T16:19:06Z

examples/model_interpretation/task/senti/roberta/generation_utils.py

@@ -0,0 +1,1023 @@
+# !/usr/bin/env python3


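I am wondering whether the generation_utils.py and modeling.py files here can be deleted as well. As I understand it, are they all duplicates of files in the paddlenlp library?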

guoshengCS · 2022-03-21T02:56:10Z

examples/model_interpretation/rationale_extraction/available_gpu.py

+# -*- coding:utf-8 -*-
+##########################################################
+# Copyright (c) 2019 Baidu.com, Inc. All Rights Reserved #
+##########################################################


Please also keep the copyright headers consistent.

guoshengCS · 2022-03-21T06:16:43Z

examples/model_interpretation/task/mrc/roberta/generation_utils.py

+        eos_token_id = eos_token_id if eos_token_id is not None else getattr(
+            self, 'eos_token_id', None)
+        pad_token_id = pad_token_id if pad_token_id is not None else getattr(
+            self, 'pad_token_id', None)


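Please also confirm whether paddlenlp.transformers.generation_utils can be used directly; this looks like a slightly earlier version of it.

Removed generation_utils.py, model_utils.py, utils.py, and vocab.py from the roberta directory; everything now calls the paddlenlp interfaces.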

guoshengCS · 2022-03-21T06:19:03Z

examples/model_interpretation/task/mrc/roberta/model_utils.py

+        return None  # Overwrite for models with output embeddings
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):


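Please also confirm whether this file can be replaced by using paddlenlp.transformers.model_utils directly.

Removed generation_utils.py, model_utils.py, utils.py, and vocab.py from the roberta directory; everything now calls the paddlenlp interfaces.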

guoshengCS · 2022-03-22T05:19:12Z

examples/model_interpretation/rationale_extraction/mrc_pred.py

+        questions,
+        contexts,
+        stride=args.doc_stride,
+        max_seq_len=args.max_seq_len)


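The tokenizer in paddlenlp has been updated. To align with HF behavior there are some breaking changes: the tokenizer used to return a list of dicts and now returns a dict of lists. See https://github.com/PaddlePaddle/PaddleNLP/pull/1713/files#diff-86a1b461121c41dca0e85147910f19e6018e3e23aa374b28ddc3f60751c0fd3e

If you want to avoid changing other code, you can simply add return_dict=False here so that the rest of the code keeps working. A little adaptation is still needed.

Added a type check on the tokenizer result and manually convert it to the format of the old version.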
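
The type check and conversion described in this reply might look roughly like the following. This is a sketch under assumptions, not the code from the PR: `to_list_of_dicts` is a hypothetical helper, and it assumes the newer tokenizer output behaves like a mapping from field name to a list with one entry per example.

```python
def to_list_of_dicts(encoded):
    """Normalize tokenizer output to the older list-of-dicts layout."""
    if isinstance(encoded, list):
        return encoded  # already in the old format, nothing to do
    keys = list(encoded.keys())
    if not keys:
        return []
    # Every field is assumed to hold one entry per example.
    size = len(encoded[keys[0]])
    return [{key: encoded[key][i] for key in keys} for i in range(size)]


if __name__ == "__main__":
    new_style = {"input_ids": [[1, 2], [3]], "token_type_ids": [[0, 0], [0]]}
    for example in to_list_of_dicts(new_style):
        print(example)
```

Normalizing once, right after the tokenizer call, lets the downstream code keep indexing per example regardless of which paddlenlp version is installed.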

guoshengCS · 2022-03-23T09:38:38Z

examples/model_interpretation/task/similarity/roberta/modeling.py

+                 max_position_embeddings=512,
+                 type_vocab_size=16,
+                 initializer_range=0.02,
+                 pad_token_id=0):


Note the differences from paddlenlp/transformers/roberta/modeling.py; you could add layer_norm_eps as well.

guoshengCS · 2022-03-24T01:43:10Z

examples/model_interpretation/task/mrc/saliency_map/rc_finetune.py

+                    loss = scaler.scale(loss)
+                    loss.backward()
+                    scaler.minimize(opt, loss)
+                    model.clear_gradients()


If an optimizer is used, I still suggest using optimizer.clear_grad().

Replaced model.clear_gradients() -> opt.clear_grad().

guoshengCS · 2022-03-24T02:05:12Z

examples/model_interpretation/task/mrc/saliency_map/rc_interpretable.py

+        # start_logits: (bsz, seq); end_logits: (bsz, seq); cls_logits: (bsz, 2)
+        # attention: list((bsz, head, seq, seq) * 12); embedded: (bsz, seq, emb)
+        _, start_logits, end_logits, cls_logits, attentions, embedded = model.forward_interpret(
+            *fwd_args, **fwd_kwargs)


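What we had in mind earlier was for users to obtain whatever intermediate results they want via register_forward_post_hook. For example, the hook can put the needed values into a global list. This turns the various needs for intermediate results into plug-ins and also makes better reuse of the code in the current model.

We have a similar need in our distillation work and put together a rough plug-in style version: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/distill_utils.py#L189

We will also look at how to expose this kind of functionality as an API so that it is more convenient for everyone to use.

We have built a similar interface for the second open-source release; the current version does not have it yet.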
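
The hook-based idea discussed above can be illustrated without any framework. The sketch below is a toy, not Paddle code: `TinyLayer`, `save_output`, and the `captured` list are hypothetical stand-ins that only mirror the `hook(layer, inputs, output)` calling convention of the method named in the comment.

```python
captured = []  # global list that the hook fills with intermediate results


class TinyLayer:
    """Hypothetical stand-in for a framework layer with post-forward hooks."""

    def __init__(self, scale):
        self.scale = scale
        self._post_hooks = []

    def register_forward_post_hook(self, hook):
        self._post_hooks.append(hook)

    def forward(self, x):
        return [self.scale * v for v in x]

    def __call__(self, x):
        output = self.forward(x)
        # Run each hook after forward; a hook may return a replacement output.
        for hook in self._post_hooks:
            result = hook(self, x, output)
            if result is not None:
                output = result
        return output


def save_output(layer, inputs, output):
    """Record the layer output without changing it."""
    captured.append(output)


layer = TinyLayer(scale=2)
layer.register_forward_post_hook(save_output)
result = layer([1, 2, 3])
```

The point of the pattern is that the model's own forward stays untouched, whereas a dedicated method such as forward_interpret has to duplicate it.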

guoshengCS · 2022-03-24T02:44:35Z

examples/model_interpretation/task/mrc/saliency_map/squad.py

+                                 version_2_with_negative: bool=False,
+                                 n_best_size: int=20,
+                                 max_answer_length: int=30,
+                                 cls_threshold: float=0.5):


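Does this differ from what is currently in paddlenlp.metric? If it is a commonly used metric, we could later look at adding it under paddlenlp so that it ships with the package.

Added some classes specific to this project, and modified two existing classes from paddlenlp.metric.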

guoshengCS · 2022-03-24T02:47:31Z

examples/model_interpretation/task/mrc/saliency_map/squad.py

+                answerable_probs[1]
+            ])
+
+# Only keep the best `n_best_size` predictions.


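Please also mind the formatting, and please double-check whether pre-commit is being used.

I usually run pre-commit once at the end of each CR round. The code style issues have now been fixed with pre-commit.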

guoshengCS · 2022-03-24T02:54:07Z

examples/model_interpretation/task/senti/rnn/vocab.txt

@@ -0,0 +1,12089 @@
+[PAD]


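For somewhat larger data and vocabulary files like this, could they be hosted on BOS with a link provided instead?

Providing a link means users have to download the file separately and place it in the right directory, which is cumbersome, so the vocabulary was not stored separately. Could this be improved in the next version? Time is tight at the moment.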

guoshengCS · 2022-03-24T02:56:06Z

examples/model_interpretation/task/senti/roberta/modeling.py

+from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model
+from roberta.transformer import TransformerEncoderLayer, TransformerEncoder
+
+__all__ = [


I also see multiple copies of roberta/modeling.py and roberta/transformer.py. Are these identical?

The transformer.py files are identical; the modeling.py files are not.
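
A question like this can be settled mechanically by hashing the copies. The following is a minimal sketch, assuming the copies sit under a common root directory; `group_identical_files` is a hypothetical helper and not part of the PR.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_identical_files(root, filename):
    """Group every `filename` under `root` by the SHA-256 of its bytes."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob(filename)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path)
    # Each inner list holds paths whose contents are byte-for-byte identical.
    return list(groups.values())


if __name__ == "__main__":
    for group in group_identical_files(".", "modeling.py"):
        print(len(group), [str(p) for p in group])
```

A single group means all copies are identical and can be collapsed into one shared file; several groups mean the variants need to be diffed before merging.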

* upload NLP interpretation
* fix problems and relocate project
* remove abandoned picture
* remove abandoned picture
* fix dead link in README
* fix dead link in README
* fix code style problems
* fix CR round 1
* remove .gitkeep files
* fix code style
* fix file encoding problem
* fix code style
* delete duplicated files due to directory rebuild
* fix CR round 2
* fix code style
* fix ernie tokenizer
* fix code style
* fix problem from CR round 1
* fix bugs
* fix README
* remove duplicated files
* deal with diff of old and new tokenizer results
* fix CR round 4
* fix code style
* add missing dependence
* fix broken import path
* move some data file to cloud
* MRC upper case to lower case

Co-authored-by: Zeyu Chen <chenzeyu01@baidu.com>
Co-authored-by: binlinquge <xxx>
Co-authored-by: Guo Sheng <guosheng@baidu.com>

upload NLP interpretation

e786867

yingyibiao assigned ZeyuChen Mar 11, 2022

ZeyuChen self-requested a review March 11, 2022 13:42

ZeyuChen added the enhancement New feature or request label Mar 11, 2022

binlinquge added 5 commits March 14, 2022 15:42

fix problems and relocate project

63afacb

remove abandoned picture

4d715fc

remove abandoned picture

ef6d8f6

fix dead link in README

d4e9e68

fix dead link in README

04b7cb5

ZeyuChen changed the title ~~upload NLP interpretation~~ Add NLP interpretation Mar 14, 2022

ZeyuChen changed the title ~~Add NLP interpretation~~ Add NLP model interpretation Mar 14, 2022

binlinquge and others added 3 commits March 15, 2022 19:27

Merge branch 'PaddlePaddle:develop' into develop

3b0e081

fix code style problems

b3337c3

Merge branch 'develop' of https://github.com/binlinquge/PaddleNLP int…

4a54362

…o develop

ZeyuChen added this to the PaddleNLP v2.3 milestone Mar 15, 2022

ZeyuChen reviewed Mar 15, 2022

ZeyuChen and others added 7 commits March 16, 2022 00:36

Merge branch 'develop' into develop

18bf999

fix CR round 1

d766d94

remove .gitkeep files

734911b

fix code style

bd2f2d2

fix file encoding problem

916cff3

fix code style

5111301

delete duplicated files due to directory rebuild

0145072

ZeyuChen reviewed Mar 16, 2022

ZeyuChen reviewed Mar 17, 2022

fix CR round 2

6724d26

binlinquge force-pushed the develop branch from 180884c to 6724d26 Compare March 18, 2022 12:04

fix code style

1ed3581

ZeyuChen reviewed Mar 18, 2022

binlinquge and others added 3 commits March 21, 2022 21:41

fix ernie tokenizer

8493aec

fix code style

3555ff9

fix problem from CR round 1

1ff7953

guoshengCS reviewed Mar 22, 2022

binlinquge added 2 commits March 23, 2022 14:49

fix bugs

f3112b9

fix README

7bf71fc

guoshengCS reviewed Mar 23, 2022

remove duplicated files

36357ed

guoshengCS reviewed Mar 24, 2022

binlinquge added 7 commits March 24, 2022 14:07

deal with diff of old and new tokenizer results

ee07f9d

fix CR round 4

b180c59

fix code style

c75353b

add missing dependence

ff71607

fix broken import path

3270158

move some data file to cloud

f06d261

MRC upper case to lower case

ef88b5e

guoshengCS approved these changes Mar 25, 2022

Merge branch 'develop' into develop

7bd929e

guoshengCS merged commit 93cae49 into PaddlePaddle:develop Mar 25, 2022

guoshengCS mentioned this pull request Mar 27, 2022

Upgrade Roberta tokenizer #1821

Merged

guoshengCS mentioned this pull request Apr 29, 2022

PaddleNLP v2.3rc Release Note Candidate #2031

Closed
