
Add unittest for bert modeling and tokenizer #2624

Merged
merged 29 commits into PaddlePaddle:develop on Jul 21, 2022

Conversation

Contributor

@yingyibiao yingyibiao commented Jun 23, 2022

PR types

Test

PR changes

Models

Description

Add unittest for bert modeling and tokenizer

@yingyibiao yingyibiao changed the title from "Add unittest for bert modeling" to "Add unittest for bert modeling and tokenizer" on Jul 12, 2022
@yingyibiao yingyibiao marked this pull request as ready for review July 12, 2022 12:17
"uer/chinese-roberta-6l-768h": 512,
"uer/chinese-roberta-small": 512,
"uer/chinese-roberta-mini": 512,
"uer/chinese-roberta-tiny": 512,
Contributor

Do all the ecosystem (community-contributed) models need to be listed here as well? What is the rule for deciding what gets added?

Contributor Author

As far as I can tell, HF does not configure these for ecosystem models either; only the models in the codebase are configured.

@@ -496,7 +496,7 @@ class BertModel(BertPretrainedModel):
"""

def __init__(self,
vocab_size,
vocab_size=30522,
Contributor

Should the docstring also be updated to document this default value?

Contributor Author

Done

padding_side = 'right'

def __init__(self,
vocab_file,
do_lower_case=True,
do_basic_tokenize=True,
never_split=None,
Contributor

Should the documentation be updated here as well?

Contributor Author

Done

unk_token="[UNK]",
sep_token="[SEP]",
pad_token="[PAD]",
cls_token="[CLS]",
mask_token="[MASK]",
tokenize_chinese_chars=True,
strip_accents=None,
Contributor

Same as above.

Contributor Author

Done

global_rng = random.Random()


def ids_tensor(shape, vocab_size, rng=None):
Contributor

Could this method simply be

return paddle.randint(low=0, high=vocab_size, dtype="int32", shape=shape)

instead? Is paddle not used directly because the current approach avoids some problem?

Contributor Author

There should be no problem; changed as suggested.
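
For reference, a minimal sketch of the helper after adopting this suggestion (the rng argument is kept only so existing call sites keep working; it is unused):

import paddle

def ids_tensor(shape, vocab_size, rng=None):
    # random token ids drawn uniformly from [0, vocab_size)
    return paddle.randint(low=0, high=vocab_size, dtype="int32", shape=shape)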

return attn_mask


def floats_tensor(shape, scale=1.0, rng=None):
Contributor

Same as above: can this be simplified too? Would simplifying it cause any problems?

Contributor Author

There should be no problem; changed as suggested.
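
Likewise, a minimal sketch of a simplified floats_tensor, assuming the helper only needs uniform floats in [0, scale):

import paddle

def floats_tensor(shape, scale=1.0, rng=None):
    # rng is unused; kept only for signature compatibility with existing callers
    return paddle.rand(shape, dtype="float32") * scale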

@yingyibiao yingyibiao requested a review from FrostML July 19, 2022 08:10
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
Contributor

Is there a known example of a model that becomes incompatible because of this? At the very least we should verify that nothing in our own codebase is broken.

Contributor Author

No new parameters are added here; the parameter order is only adjusted to match HF, and no breakage has been found.

Collaborator

Regression testing against this PR found one model that breaks:
(screenshot attached)

Contributor

Is there a known example of a model that becomes incompatible because of this? At the very least we should verify that nothing in our own codebase is broken.

The point is that, at a minimum, we should search for .encode( in the IDE ourselves to confirm there are no incompatible call sites. Not adding new parameters and only reordering them to match HF does not by itself guarantee nothing breaks.

Contributor Author

Fixed.

return [
self._token_to_idx[token] if token in self._token_to_idx else
self._token_to_idx[self.unk_token] for token in tokens
]
Contributor

It looks like, when self.unk_token is not None, Vocab's _token_to_idx will already return the unk_token — is this change really necessary here?

Contributor Author

I don't quite follow this comment — what potential problem does this change introduce?

Contributor

The question is why this needs to be changed at all. Is there a bug in how Vocab and _token_to_idx handle unk? From the code, Vocab and _token_to_idx can already handle unk.
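
To illustrate the point with a hypothetical example (this is not the actual Vocab code): if the token-to-index mapping itself falls back to the unk index for unknown tokens, the explicit membership check in the list comprehension above is redundant:

from collections import defaultdict

unk_index = 0
token_to_idx = defaultdict(lambda: unk_index, {"[UNK]": 0, "hello": 1, "world": 2})

def convert_tokens_to_ids(tokens):
    # unknown tokens resolve to unk_index via the mapping's default factory,
    # so no "if token in token_to_idx" guard is needed
    return [token_to_idx[token] for token in tokens]

print(convert_tokens_to_ids(["hello", "unseen"]))  # [1, 0]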


if token in self.all_special_tokens:
token = token.lower() if hasattr(
self, "do_lower_case") and self.do_lower_case else token
Contributor

Good catch noticing this here. Should this later be adjusted so that lower_case is applied before split_tokens is produced above (https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/tokenizer_utils.py#L738)? Would there be any difference?

Contributor Author

special_token should not be affected by do_lower_case. The special tokens inside split_tokens should not be lowercased.

Contributor

https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/tokenizer_utils.py#L738

Applying lower_case before producing split_tokens and keeping special_token unaffected by do_lower_case are two different things; the code linked above does both at once, and it lives in tokenize. In principle get_offset_mapping should behave exactly like tokenize, but the handling here actually differs from tokenize, which is why I am asking whether there is any difference or impact.
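
A hypothetical helper (not the PaddleNLP implementation) that shows the behaviour being discussed: lower-casing the text before it is split while leaving the special tokens themselves untouched:

import re

def lowercase_except_special(text, special_tokens, do_lower_case=True):
    # lower-case everything except the special tokens themselves
    if not do_lower_case:
        return text
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    pieces = re.split(pattern, text)
    return "".join(p if p in special_tokens else p.lower() for p in pieces)

print(lowercase_except_special("Hello [SEP] World", ["[SEP]", "[CLS]"]))  # hello [SEP] world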

Contributor

If the regression tests show no problems, let's merge this as soon as possible.

self.type_vocab_size)

config = self.get_config()
return config, input_ids, token_type_ids, input_mask
Contributor

For cases like this it would be best to add a TODO above noting that more should be added later.

Contributor Author

I don't quite understand what the TODO here would refer to specifically.

Contributor

The TODO means that more input content should be added later on.

self.expected_pooled_shape = (self.config['batch_size'], 2)
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
# print(config_and_inputs)
Contributor

Remember to remove this.

Contributor Author

Done

if model_class == self.base_model_class:
model = model_class(**config)
else:
model = model_class(self.base_model_class(**config))
Contributor

Why do we need to introduce and distinguish base_model_class here? Is it because the resize_token_embeddings implementation is not good enough?

Contributor Author

This check has nothing to do with resize_token_embeddings. It is mainly about the difference in model initialization: base_model_class can be initialized from a config, while the other classes have to be initialized with a base model instance.
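
A concrete sketch of the two construction paths described above (the hyperparameters are illustrative):

from paddlenlp.transformers import BertModel, BertForSequenceClassification

# the base model class is built directly from config-style keyword arguments
base = BertModel(vocab_size=30522, hidden_size=128, num_hidden_layers=2,
                 num_attention_heads=2, intermediate_size=256)
# head classes wrap an existing base model instance instead of taking config kwargs
head = BertForSequenceClassification(base)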

model_vocab_size = config["vocab_size"]
# Retrieve the embeddings and clone them
model_embed = model.resize_token_embeddings(model_vocab_size)
print(model_embed)
Contributor

Remember to remove this.

Contributor Author

Done

"models in 100+ languages and deep interoperability between Jax, PyTorch and TensorFlow.",
"BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly "
"conditioning on both left and right context in all layers.",
"The quick brown fox jumps over the lazy dog.",
Contributor

Wouldn't it be better to change this test case?


token_type_padding_idx = tokenizer.pad_token_type_id

encoded_sequence = tokenizer.encode(
sequence, return_special_tokens_mask=True)
Contributor

We could also add a test for encode_plus.
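
A hedged sketch of what such an additional check might look like, assuming the tokenizer exposes an HF-style encode_plus that accepts return_special_tokens_mask and returns a dict (sequence and tokenizer follow the surrounding test code; if only encode or __call__ are available, the same check applies to those):

encoded_plus = tokenizer.encode_plus(sequence, return_special_tokens_mask=True)
input_ids = encoded_plus["input_ids"]
special_tokens_mask = encoded_plus["special_tokens_mask"]
# the mask must align with the ids and flag the added [CLS]/[SEP] positions
assert len(special_tokens_mask) == len(input_ids)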

# return_tensors="pd")
# self.assertEqual(batch_encoder_only.input_ids.shape[1], 3)
# self.assertEqual(batch_encoder_only.attention_mask.shape[1], 3)
# self.assertNotIn("decoder_input_ids", batch_encoder_only)
Contributor

Please clean up the lines above as well.


Contributor

@guoshengCS guoshengCS left a comment

LGTM


from tests.testing_utils import slow
from tests.transformers.test_tokenizer_common import TokenizerTesterMixin, filter_non_english
Collaborator

There is a path problem here that causes the code coverage calculation to be wrong. Suggest changing it to:
from testing_utils import slow
from ..test_tokenizer_common import TokenizerTesterMixin, filter_non_english

from paddlenlp.transformers import BertModel, BertForQuestionAnswering, BertForSequenceClassification,\
BertForTokenClassification, BertForPretraining, BertForMultipleChoice, BertForMaskedLM, BertPretrainedModel
from tests.transformers.test_modeling_common import ids_tensor, floats_tensor, random_attention_mask, ModelTesterMixin
from tests.testing_utils import slow
Collaborator

Change to:
from ..test_modeling_common import ids_tensor, floats_tensor, random_attention_mask, ModelTesterMixin
from testing_utils import slow

BertTokenizer, PretrainedTokenizer)
from paddlenlp.transformers.tokenizer_utils_base import PretrainedTokenizerBase
from paddlenlp.transformers.tokenizer_utils import AddedToken, Trie
from tests.testing_utils import get_tests_dir, slow
Collaborator

Same as above:
from testing_utils import get_tests_dir, slow


class TestBertFromPretrain(CommonTest):
@slow
def test_inference_no_attention(self):
Collaborator

The test_inference_no_attention case here is identical to the test_inference_with_attention case.
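
For illustration, a hedged sketch of how the two cases would be expected to differ: the with-attention variant passes an explicit attention_mask while the other relies on the default (checkpoint name and inputs are illustrative, and the 2D padding mask is assumed to be accepted by the model):

import paddle
from paddlenlp.transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
model.eval()
input_ids = paddle.to_tensor([[101, 7592, 2088, 102]])

with paddle.no_grad():
    # test_inference_no_attention: no attention_mask argument
    seq_no_attn, _ = model(input_ids)
    # test_inference_with_attention: an explicit all-ones padding mask is supplied
    attention_mask = paddle.ones_like(input_ids)
    seq_with_attn, _ = model(input_ids, attention_mask=attention_mask)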

@yingyibiao yingyibiao merged commit 8e2f1dc into PaddlePaddle:develop Jul 21, 2022
@yingyibiao yingyibiao deleted the tests branch July 21, 2022 02:36