Add unittest for bert modeling and tokenizer #2624
Conversation
"uer/chinese-roberta-6l-768h": 512, | ||
"uer/chinese-roberta-small": 512, | ||
"uer/chinese-roberta-mini": 512, | ||
"uer/chinese-roberta-tiny": 512, |
Do all of the ecosystem (community) models need to be added here as well? What is the rule for adding entries?
As far as I can see, HF does not configure the ecosystem models here either; it only sets this up for the models that are in the codebase.
@@ -496,7 +496,7 @@ class BertModel(BertPretrainedModel):
    """

    def __init__(self,
-                vocab_size,
+                vocab_size=30522,
Should the docstring also be updated to mention the default value?
Done
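For reference, a minimal sketch of how the default might be documented (a simplified stand-in class, not the actual BertModel docstring):

```python
class BertModel:
    """Simplified stand-in, used only to illustrate documenting the default.

    Args:
        vocab_size (int, optional):
            Vocabulary size of the model. Defaults to `30522`.
    """

    def __init__(self, vocab_size=30522):
        self.vocab_size = vocab_size
```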
    padding_side = 'right'

    def __init__(self,
                 vocab_file,
                 do_lower_case=True,
                 do_basic_tokenize=True,
                 never_split=None,
Should the docstring be updated here as well?
Done
unk_token="[UNK]", | ||
sep_token="[SEP]", | ||
pad_token="[PAD]", | ||
cls_token="[CLS]", | ||
mask_token="[MASK]", | ||
tokenize_chinese_chars=True, | ||
strip_accents=None, |
Same as above.
Done
global_rng = random.Random()


def ids_tensor(shape, vocab_size, rng=None):
Could this method simply be
return paddle.randint(low=0, high=vocab_size, dtype="int32", shape=shape)
instead? Is paddle not used directly because that would cause some problem compared with the current approach?
There should be no problem; changed as suggested.
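For reference, a minimal sketch of the simplified helper based on the suggestion above (the exact signature kept in the test file may differ):

```python
import paddle


def ids_tensor(shape, vocab_size, dtype="int32"):
    # Random token ids in [0, vocab_size), replacing the manual
    # random.Random()-based loop with a single paddle call.
    return paddle.randint(low=0, high=vocab_size, dtype=dtype, shape=shape)
```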
    return attn_mask


def floats_tensor(shape, scale=1.0, rng=None):
Same as above: can this also be simplified, and would the simplification cause any problems?
There should be no problem; changed as suggested.
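Likewise, a possible paddle-based version of the float helper, assuming a uniform tensor in [0, scale) is what the tests need (a sketch, not necessarily the exact code that was merged):

```python
import paddle


def floats_tensor(shape, scale=1.0, dtype="float32"):
    # Uniform random floats in [0, scale), replacing the manual rng loop.
    return paddle.rand(shape, dtype=dtype) * scale
```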
            return_overflowing_tokens: bool = False,
            return_special_tokens_mask: bool = False,
            return_offsets_mapping: bool = False,
            return_length: bool = False,
Is there any known example of a model that becomes incompatible because of this? At a minimum we need to check that nothing in our own codebase is broken by it.
No new parameters are added here; the parameter order is only adjusted to be consistent with HF, and I have not found anything that breaks.
> Is there any known example of a model that becomes incompatible because of this? At a minimum we need to check that nothing in our own codebase is broken by it.

The point is that we should at least search our own codebase in the IDE for .encode( and confirm there are no incompatible usages. "No new parameters were added, only the parameter order was adjusted to match HF" does not by itself guarantee that nothing breaks.
Fixed.
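To illustrate the concern with a self-contained toy example (not PaddleNLP code): reordering keyword parameters is invisible to keyword callers, but it silently re-binds positional arguments, which is why checking .encode( call sites matters.

```python
# Toy functions mirroring the parameter reordering in this PR.
def encode_v1(text, return_overflowing_tokens=False, return_special_tokens_mask=False):
    return {"overflow": return_overflowing_tokens, "mask": return_special_tokens_mask}


def encode_v2(text, return_special_tokens_mask=False, return_overflowing_tokens=False):
    # Same parameters, order swapped.
    return {"overflow": return_overflowing_tokens, "mask": return_special_tokens_mask}


print(encode_v1("hi", True))                             # overflow=True
print(encode_v2("hi", True))                             # now mask=True: positional call broke
print(encode_v2("hi", return_overflowing_tokens=True))   # keyword call is unaffected
```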
        return [
            self._token_to_idx[token] if tokens in self._token_to_idx else
            self._token_to_idx[self.unk_token] for token in tokens
        ]
It looks like when self.unk_token is not None, Vocab's _token_to_idx already returns the unk_token id. Is this change really necessary?
I don't quite follow this comment. What potential problem does this change introduce?
The question is why this needs to be changed at all. Is there a bug in how Vocab / _token_to_idx handles unk? From the code, Vocab / _token_to_idx can already handle unk.
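A toy sketch of the behaviour the reviewer is describing (not the PaddleNLP Vocab implementation): when unk_token is set, the lookup itself already falls back to the unk id, so the explicit fallback in the list comprehension above would be redundant.

```python
class ToyVocab:
    def __init__(self, tokens, unk_token="[UNK]"):
        self.unk_token = unk_token
        self._token_to_idx = {tok: i for i, tok in enumerate([unk_token] + tokens)}

    def to_indices(self, tokens):
        # Unknown tokens map to the unk id without any extra guard at the call site.
        unk_id = self._token_to_idx[self.unk_token]
        return [self._token_to_idx.get(tok, unk_id) for tok in tokens]


vocab = ToyVocab(["hello", "world"])
print(vocab.to_indices(["hello", "never-seen"]))  # -> [1, 0]
```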
        if token in self.all_special_tokens:
            token = token.lower() if hasattr(
                self, "do_lower_case") and self.do_lower_case else token
Good that this was noticed. Should this later be adjusted so that lower_case is applied before split_tokens is produced above (https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/tokenizer_utils.py#L738)? Would that make a difference?
special_token should not be affected by do_lower_case; the special tokens inside split_tokens should not be lower-cased.
"Applying lower_case before producing split_tokens" and "special_token should not be affected by do_lower_case" are two different things; the code linked above does both at once. Also, that is the code inside tokenize, and in principle get_offset_mapping should be fully consistent with tokenize, yet the handling here actually differs from the one in tokenize. That is why I am asking whether there is any difference or impact here.
If the regression tests show no problems, let's merge this as soon as possible.
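A toy sketch of the principle under discussion (not the PaddleNLP tokenize/get_offset_mapping code): split on special tokens first, then lower-case only the ordinary spans, so tokens like [CLS] or [MASK] are never affected by do_lower_case.

```python
import re


def lowercase_around_special_tokens(text, special_tokens, do_lower_case=True):
    # Keep special tokens intact; lower-case everything else.
    if not do_lower_case:
        return text
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    pieces = re.split(pattern, text)
    return "".join(p if p in special_tokens else p.lower() for p in pieces)


print(lowercase_around_special_tokens("[CLS] Hello [MASK] World [SEP]",
                                       ["[CLS]", "[SEP]", "[MASK]"]))
# -> "[CLS] hello [MASK] world [SEP]"
```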
                                    self.type_vocab_size)

        config = self.get_config()
        return config, input_ids, token_type_ids, input_mask
It would be best to add a TODO above this noting what should be added later.
I don't quite understand what the TODO here refers to specifically.
The TODO means that more input content should be added later.
        self.expected_pooled_shape = (self.config['batch_size'], 2)

    def test_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        # print(config_and_inputs)
Remember to remove this.
Done
            if model_class == self.base_model_class:
                model = model_class(**config)
            else:
                model = model_class(self.base_model_class(**config))
Why do we need to introduce and distinguish base_model_class here? Is it because resize_token_embeddings is not implemented well enough?
This check is unrelated to resize_token_embeddings. It mainly reflects a difference in model initialization: base_model_class can be initialized from the config, while the other classes need to be initialized with an instance of the base model.
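A small self-contained sketch of the pattern described above (the class names are illustrative, not the actual test code, which iterates over BertModel and its task-specific heads):

```python
class BaseModel:
    # Stands in for the base model: constructed directly from config keywords.
    def __init__(self, vocab_size=100, hidden_size=8):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size


class ModelWithHead:
    # Stands in for the task-specific classes: wraps an already-built base model.
    def __init__(self, base_model):
        self.base_model = base_model


config = {"vocab_size": 100, "hidden_size": 8}
for model_class in (BaseModel, ModelWithHead):
    if model_class is BaseModel:
        model = model_class(**config)
    else:
        model = model_class(BaseModel(**config))
```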
model_vocab_size = config["vocab_size"] | ||
# Retrieve the embeddings and clone theme | ||
model_embed = model.resize_token_embeddings(model_vocab_size) | ||
print(model_embed) |
Remember to remove this.
Done
"models in 100+ languages and deep interoperability between Jax, PyTorch and TensorFlow.", | ||
"BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly " | ||
"conditioning on both left and right context in all layers.", | ||
"The quick brown fox jumps over the lazy dog.", |
Wouldn't it be better to change this test case?
        token_type_padding_idx = tokenizer.pad_token_type_id

        encoded_sequence = tokenizer.encode(
            sequence, return_special_tokens_mask=True)
We could also add an encode_plus case here.
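A possible addition along those lines, assuming encode_plus accepts the same return_special_tokens_mask keyword as encode and that tokenizer and sequence come from the surrounding test (hypothetical, not the merged test code):

```python
# Hypothetical test addition; assumes encode_plus mirrors encode's keyword arguments.
encoded_sequence_plus = tokenizer.encode_plus(
    sequence, return_special_tokens_mask=True)
special_tokens_mask = encoded_sequence_plus["special_tokens_mask"]
```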
# return_tensors="pd") | ||
# self.assertEqual(batch_encoder_only.input_ids.shape[1], 3) | ||
# self.assertEqual(batch_encoder_only.attention_mask.shape[1], 3) | ||
# self.assertNotIn("decoder_input_ids", batch_encoder_only) |
Please clean up the lines above as well.
LGTM
from tests.testing_utils import slow
from tests.transformers.test_tokenizer_common import TokenizerTesterMixin, filter_non_english
There is a path issue here that will cause code coverage to be computed incorrectly. Suggest changing it to:
from testing_utils import slow
from ..test_tokenizer_common import TokenizerTesterMixin, filter_non_english
from paddlenlp.transformers import BertModel, BertForQuestionAnswering, BertForSequenceClassification,\
    BertForTokenClassification, BertForPretraining, BertForMultipleChoice, BertForMaskedLM, BertPretrainedModel
from tests.transformers.test_modeling_common import ids_tensor, floats_tensor, random_attention_mask, ModelTesterMixin
from tests.testing_utils import slow
Change it to:
from ..test_modeling_common import ids_tensor, floats_tensor, random_attention_mask, ModelTesterMixin
from testing_utils import slow
                                     BertTokenizer, PretrainedTokenizer)
from paddlenlp.transformers.tokenizer_utils_base import PretrainedTokenizerBase
from paddlenlp.transformers.tokenizer_utils import AddedToken, Trie
from tests.testing_utils import get_tests_dir, slow
Same as above:
from testing_utils import get_tests_dir, slow
class TestBertFromPretrain(CommonTest):
    @slow
    def test_inference_no_attention(self):
The test_inference_no_attention case here is identical to the test_inference_with_attention case.
PR types
Test
PR changes
Models
Description
Add unittest for bert modeling and tokenizer