
Add unittest for bert modeling and tokenizer #2624

Merged
merged 29 commits into PaddlePaddle:develop on Jul 21, 2022

Conversation

Contributor

@yingyibiao yingyibiao commented Jun 23, 2022

PR types

Test

PR changes

Models

Description

Add unittest for bert modeling and tokenizer

@yingyibiao yingyibiao changed the title from "Add unittest for bert modeling" to "Add unittest for bert modeling and tokenizer" on Jul 12, 2022
@yingyibiao yingyibiao marked this pull request as ready for review July 12, 2022 12:17
"uer/chinese-roberta-6l-768h": 512,
"uer/chinese-roberta-small": 512,
"uer/chinese-roberta-mini": 512,
"uer/chinese-roberta-tiny": 512,
Contributor

Do all the ecosystem (community-contributed) models need to be listed here as well? What is the rule for deciding what gets added?

Contributor Author

As far as I can tell, HF does not configure these for ecosystem models either; only the models in the codebase are configured.

@@ -496,7 +496,7 @@ class BertModel(BertPretrainedModel):
"""

def __init__(self,
vocab_size,
vocab_size=30522,
Contributor

Should the docstring also be updated to document this default value?

Contributor Author

Done

padding_side = 'right'

def __init__(self,
vocab_file,
do_lower_case=True,
do_basic_tokenize=True,
never_split=None,
Contributor

Should the documentation be updated here as well?

Contributor Author

Done

unk_token="[UNK]",
sep_token="[SEP]",
pad_token="[PAD]",
cls_token="[CLS]",
mask_token="[MASK]",
tokenize_chinese_chars=True,
strip_accents=None,
Contributor

Same as above.

Contributor Author

Done

global_rng = random.Random()


def ids_tensor(shape, vocab_size, rng=None):
Contributor

Could this method simply be

return paddle.randint(low=0, high=vocab_size, dtype="int32", shape=shape)

instead? Is paddle not used directly because the current approach avoids some problem?

Contributor Author

There should be no problem; changed as suggested.
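
For reference, a minimal sketch of the helper after adopting this suggestion (the rng argument is kept only so existing call sites keep working; it is unused):

import paddle

def ids_tensor(shape, vocab_size, rng=None):
    # random token ids drawn uniformly from [0, vocab_size)
    return paddle.randint(low=0, high=vocab_size, dtype="int32", shape=shape)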

return attn_mask


def floats_tensor(shape, scale=1.0, rng=None):
Contributor

Same as above: can this be simplified too? Would simplifying it cause any problems?

Contributor Author

There should be no problem; changed as suggested.
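
Likewise, a minimal sketch of a simplified floats_tensor, assuming the helper only needs uniform floats in [0, scale):

import paddle

def floats_tensor(shape, scale=1.0, rng=None):
    # rng is unused; kept only for signature compatibility with existing callers
    return paddle.rand(shape, dtype="float32") * scale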

@yingyibiao yingyibiao requested a review from FrostML July 19, 2022 08:10
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
Contributor

Is there a known example of a model that becomes incompatible because of this? At the very least we should verify that nothing in our own codebase is broken.

Contributor Author

No new parameters are added here; the parameter order is only adjusted to match HF, and no breakage has been found.

Collaborator

Regression testing against this PR found one model that breaks:
(screenshot attached)

Contributor

Is there a known example of a model that becomes incompatible because of this? At the very least we should verify that nothing in our own codebase is broken.

The point is that, at a minimum, we should search for .encode( in the IDE ourselves to confirm there are no incompatible call sites. Not adding new parameters and only reordering them to match HF does not by itself guarantee nothing breaks.

Contributor Author

Fixed.

return [
self._token_to_idx[token] if token in self._token_to_idx else
self._token_to_idx[self.unk_token] for token in tokens
]
Contributor

It looks like, when self.unk_token is not None, Vocab's _token_to_idx will already return the unk_token — is this change really necessary here?

Contributor Author

I don't quite follow this comment — what potential problem does this change introduce?

Contributor

The question is why this needs to be changed at all. Is there a bug in how Vocab and _token_to_idx handle unk? From the code, Vocab and _token_to_idx can already handle unk.
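
To illustrate the point with a hypothetical example (this is not the actual Vocab code): if the token-to-index mapping itself falls back to the unk index for unknown tokens, the explicit membership check in the list comprehension above is redundant:

from collections import defaultdict

unk_index = 0
token_to_idx = defaultdict(lambda: unk_index, {"[UNK]": 0, "hello": 1, "world": 2})

def convert_tokens_to_ids(tokens):
    # unknown tokens resolve to unk_index via the mapping's default factory,
    # so no "if token in token_to_idx" guard is needed
    return [token_to_idx[token] for token in tokens]

print(convert_tokens_to_ids(["hello", "unseen"]))  # [1, 0]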


if token in self.all_special_tokens:
token = token.lower() if hasattr(
self, "do_lower_case") and self.do_lower_case else token
Contributor

Good catch noticing this here. Should this later be adjusted so that lower_case is applied before split_tokens is produced above (https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/tokenizer_utils.py#L738)? Would there be any difference?

Contributor Author

special_token should not be affected by do_lower_case. The special tokens inside split_tokens should not be lowercased.

Contributor

https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/tokenizer_utils.py#L738

Applying lower_case before producing split_tokens and keeping special_token unaffected by do_lower_case are two different things; the code linked above does both at once, and it lives in tokenize. In principle get_offset_mapping should behave exactly like tokenize, but the handling here actually differs from tokenize, which is why I am asking whether there is any difference or impact.
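
A hypothetical helper (not the PaddleNLP implementation) that shows the behaviour being discussed: lower-casing the text before it is split while leaving the special tokens themselves untouched:

import re

def lowercase_except_special(text, special_tokens, do_lower_case=True):
    # lower-case everything except the special tokens themselves
    if not do_lower_case:
        return text
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    pieces = re.split(pattern, text)
    return "".join(p if p in special_tokens else p.lower() for p in pieces)

print(lowercase_except_special("Hello [SEP] World", ["[SEP]", "[CLS]"]))  # hello [SEP] world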

Contributor

If the regression tests show no problems, let's merge this as soon as possible.

self.type_vocab_size)

config = self.get_config()
return config, input_ids, token_type_ids, input_mask
Contributor

For cases like this it would be best to add a TODO above noting that more should be added later.

Contributor Author

I don't quite understand what the TODO here would refer to specifically.

Contributor

The TODO means that more input content should be added later on.

self.expected_pooled_shape = (self.config['batch_size'], 2)
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
# print(config_and_inputs)
Contributor

Remember to remove this.

Contributor Author

Done

if model_class == self.base_model_class:
model = model_class(**config)
else:
model = model_class(self.base_model_class(**config))
Contributor

Why do we need to introduce and distinguish base_model_class here? Is it because the resize_token_embeddings implementation is not good enough?

Contributor Author

This check has nothing to do with resize_token_embeddings. It is mainly about the difference in model initialization: base_model_class can be initialized from a config, while the other classes have to be initialized with a base model instance.
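
A concrete sketch of the two construction paths described above (the hyperparameters are illustrative):

from paddlenlp.transformers import BertModel, BertForSequenceClassification

# the base model class is built directly from config-style keyword arguments
base = BertModel(vocab_size=30522, hidden_size=128, num_hidden_layers=2,
                 num_attention_heads=2, intermediate_size=256)
# head classes wrap an existing base model instance instead of taking config kwargs
head = BertForSequenceClassification(base)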

model_vocab_size = config["vocab_size"]
# Retrieve the embeddings and clone them
model_embed = model.resize_token_embeddings(model_vocab_size)
print(model_embed)
Contributor

Remember to remove this.

Contributor Author

Done

"models in 100+ languages and deep interoperability between Jax, PyTorch and TensorFlow.",
"BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly "
"conditioning on both left and right context in all layers.",
"The quick brown fox jumps over the lazy dog.",
Contributor

Wouldn't it be better to change this test case?


token_type_padding_idx = tokenizer.pad_token_type_id

encoded_sequence = tokenizer.encode(
sequence, return_special_tokens_mask=True)
Contributor

We could also add a test for encode_plus.
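
A hedged sketch of what such an additional check might look like, assuming the tokenizer exposes an HF-style encode_plus that accepts return_special_tokens_mask and returns a dict (sequence and tokenizer follow the surrounding test code; if only encode or __call__ are available, the same check applies to those):

encoded_plus = tokenizer.encode_plus(sequence, return_special_tokens_mask=True)
input_ids = encoded_plus["input_ids"]
special_tokens_mask = encoded_plus["special_tokens_mask"]
# the mask must align with the ids and flag the added [CLS]/[SEP] positions
assert len(special_tokens_mask) == len(input_ids)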

# return_tensors="pd")
# self.assertEqual(batch_encoder_only.input_ids.shape[1], 3)
# self.assertEqual(batch_encoder_only.attention_mask.shape[1], 3)
# self.assertNotIn("decoder_input_ids", batch_encoder_only)
Contributor

Please clean up the lines above as well.


Contributor

@guoshengCS guoshengCS left a comment

LGTM


from tests.testing_utils import slow
from tests.transformers.test_tokenizer_common import TokenizerTesterMixin, filter_non_english
Collaborator

There is a path problem here that causes the code coverage calculation to be wrong. Suggest changing it to:
from testing_utils import slow
from ..test_tokenizer_common import TokenizerTesterMixin, filter_non_english

from paddlenlp.transformers import BertModel, BertForQuestionAnswering, BertForSequenceClassification,\
BertForTokenClassification, BertForPretraining, BertForMultipleChoice, BertForMaskedLM, BertPretrainedModel
from tests.transformers.test_modeling_common import ids_tensor, floats_tensor, random_attention_mask, ModelTesterMixin
from tests.testing_utils import slow
Collaborator

Change to:
from ..test_modeling_common import ids_tensor, floats_tensor, random_attention_mask, ModelTesterMixin
from testing_utils import slow

BertTokenizer, PretrainedTokenizer)
from paddlenlp.transformers.tokenizer_utils_base import PretrainedTokenizerBase
from paddlenlp.transformers.tokenizer_utils import AddedToken, Trie
from tests.testing_utils import get_tests_dir, slow
Collaborator

Same as above:
from testing_utils import get_tests_dir, slow


class TestBertFromPretrain(CommonTest):
@slow
def test_inference_no_attention(self):
Collaborator

The test_inference_no_attention case here is identical to the test_inference_with_attention case.
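
For illustration, a hedged sketch of how the two cases would be expected to differ: the with-attention variant passes an explicit attention_mask while the other relies on the default (checkpoint name and inputs are illustrative, and the 2D padding mask is assumed to be accepted by the model):

import paddle
from paddlenlp.transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
model.eval()
input_ids = paddle.to_tensor([[101, 7592, 2088, 102]])

with paddle.no_grad():
    # test_inference_no_attention: no attention_mask argument
    seq_no_attn, _ = model(input_ids)
    # test_inference_with_attention: an explicit all-ones padding mask is supplied
    attention_mask = paddle.ones_like(input_ids)
    seq_with_attn, _ = model(input_ids, attention_mask=attention_mask)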

@yingyibiao yingyibiao merged commit 8e2f1dc into PaddlePaddle:develop Jul 21, 2022
@yingyibiao yingyibiao deleted the tests branch July 21, 2022 02:36