add mobilebert model #1160
Conversation
Please add docstrings here~
Please flesh this out following the docstrings of other models in PaddleNLP. Adding an example would also help.
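For reference, a minimal sketch of the docstring style used by other PaddleNLP models (summary, `Args:` section, runnable `Example`). The class name, argument names, and the `mobilebert-uncased` weight name are assumptions based on this PR, not a final signature:

```python
class MobileBertModel(MobileBertPretrainedModel):
    """
    The bare MobileBERT Model outputting raw hidden states.

    Args:
        vocab_size (int):
            Vocabulary size of the inputs. Also the dimension of the token embeddings.
        hidden_size (int, optional):
            Dimensionality of the encoder layers. Defaults to `512`.

    Example:
        .. code-block::

            import paddle
            from paddlenlp.transformers import MobileBertModel, MobileBertTokenizer

            tokenizer = MobileBertTokenizer.from_pretrained('mobilebert-uncased')
            model = MobileBertModel.from_pretrained('mobilebert-uncased')

            inputs = tokenizer("Hello, my dog is cute")
            inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
            outputs = model(**inputs)
    """
```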
```python
def gelu_new(x):
```
https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/functional/gelu_cn.html#gelu
Per the documentation above, there is no need to implement a separate gelu_new here; PaddlePaddle already provides a corresponding implementation.
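To illustrate, `paddle.nn.functional.gelu` takes an `approximate` flag, so the hand-rolled tanh approximation can be checked against the built-in directly (a quick sketch; the tolerance is chosen arbitrarily):

```python
import math

import paddle
import paddle.nn.functional as F


def gelu_new(x):
    # tanh approximation of GELU, as hand-implemented in this PR
    return 0.5 * x * (1.0 + paddle.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * paddle.pow(x, 3.0))))


x = paddle.randn([4, 8])
# F.gelu(..., approximate=True) computes the same tanh approximation natively
print(paddle.allclose(F.gelu(x, approximate=True), gelu_new(x), atol=1e-6))  # True
```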
```python
MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST = ["google/mobilebert-uncased"]


class NoNorm(nn.Layer):  #paddle
```
Please remove this comment; it has no clear meaning.
```python
pretrained_resource_files_map = {
    "vocab_file": {
        "mobilebert-uncased":
        "https://huggingface.co/google/mobilebert-uncased/resolve/main/vocab.txt"
```
@yingyibiao could you provide the PaddleNLP-hosted link? This should then be updated accordingly.
"tanh": F.tanh, | ||
} | ||
|
||
MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST = ["google/mobilebert-uncased"] |
Is this actually used anywhere? If not, it can be removed.
>>> input_ids = paddle.to_tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1 | ||
>>> outputs = model(input_ids) | ||
>>> prediction_logits = outputs[0] | ||
>>> seq_relationship_logits = outputs[1] |
Please refer to the BERT documentation here and adjust the docstring format to match.
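Concretely, following the BERT docstring convention in PaddleNLP, the example could look roughly like this (a sketch; the `MobileBertForPretraining` class name and the `mobilebert-uncased` weight name are assumptions based on this PR):

```python
import paddle
from paddlenlp.transformers import MobileBertForPretraining, MobileBertTokenizer

tokenizer = MobileBertTokenizer.from_pretrained('mobilebert-uncased')
model = MobileBertForPretraining.from_pretrained('mobilebert-uncased')

# PaddleNLP tokenizers return a dict of python lists; batch them manually
inputs = tokenizer("Hello, my dog is cute")
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}  # batch size 1

prediction_logits, seq_relationship_logits = model(**inputs)
```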
The weights have been uploaded to the corresponding BOS link~
```python
ACT2FN = {
    "relu": F.relu,
    "gelu": F.gelu,
    "gelu_new": gelu_new,
    "tanh": F.tanh,
```
What is the reason for no longer supporting the approximate gelu and tanh here?
The model doesn't use them.
encoded_inputs["input_ids"] = sequence | ||
if return_token_type_ids: | ||
encoded_inputs["token_type_ids"] = token_type_ids | ||
if return_special_tokens_mask: | ||
encoded_inputs[ | ||
"special_tokens_mask"] = self.get_special_tokens_mask( | ||
ids, pair_ids) | ||
if return_length: | ||
encoded_inputs["seq_len"] = len(encoded_inputs[ | ||
"input_ids"]) | ||
|
||
# Check lengths | ||
assert max_seq_len is None or len(encoded_inputs[ | ||
"input_ids"]) <= max_seq_len | ||
|
||
# Padding | ||
needs_to_be_padded = pad_to_max_seq_len and \ | ||
max_seq_len and len(encoded_inputs["input_ids"]) < max_seq_len | ||
|
||
encoded_inputs['offset_mapping'] = offset_mapping | ||
|
||
if needs_to_be_padded: | ||
difference = max_seq_len - len(encoded_inputs[ | ||
"input_ids"]) | ||
if self.padding_side == 'right': | ||
if return_attention_mask: | ||
encoded_inputs["attention_mask"] = [1] * len( | ||
encoded_inputs[ | ||
"input_ids"]) + [0] * difference | ||
if return_token_type_ids: | ||
# 0 for padding token mask | ||
encoded_inputs["token_type_ids"] = ( | ||
encoded_inputs["token_type_ids"] + | ||
[self.pad_token_type_id] * difference) | ||
if return_special_tokens_mask: | ||
encoded_inputs[ | ||
"special_tokens_mask"] = encoded_inputs[ | ||
"special_tokens_mask"] + [1 | ||
] * difference | ||
encoded_inputs["input_ids"] = encoded_inputs[ | ||
"input_ids"] + [self.pad_token_id] * difference | ||
encoded_inputs['offset_mapping'] = encoded_inputs[ | ||
'offset_mapping'] + [(0, 0)] * difference | ||
elif self.padding_side == 'left': | ||
if return_attention_mask: | ||
encoded_inputs["attention_mask"] = [ | ||
0 | ||
] * difference + [1] * len(encoded_inputs[ | ||
"input_ids"]) | ||
if return_token_type_ids: | ||
# 0 for padding token mask | ||
encoded_inputs["token_type_ids"] = ( | ||
[self.pad_token_type_id] * difference + | ||
encoded_inputs["token_type_ids"]) | ||
if return_special_tokens_mask: | ||
encoded_inputs["special_tokens_mask"] = [ | ||
1 | ||
] * difference + encoded_inputs[ | ||
"special_tokens_mask"] | ||
encoded_inputs["input_ids"] = [ | ||
self.pad_token_id | ||
] * difference + encoded_inputs["input_ids"] | ||
encoded_inputs['offset_mapping'] = [ | ||
(0, 0) | ||
] * difference + encoded_inputs['offset_mapping'] | ||
else: | ||
if return_attention_mask: | ||
encoded_inputs["attention_mask"] = [1] * len( | ||
encoded_inputs["input_ids"]) | ||
|
||
if return_position_ids: | ||
encoded_inputs["position_ids"] = list( | ||
range(len(encoded_inputs["input_ids"]))) | ||
|
||
encoded_inputs['overflow_to_sample'] = example_id | ||
batch_encode_inputs.append(encoded_inputs) | ||
|
||
if len(second_ids) <= max_len_for_pair: | ||
break | ||
else: | ||
second_ids = second_ids[max_len_for_pair - stride:] | ||
token_pair_offset_mapping = token_pair_offset_mapping[ | ||
max_len_for_pair - stride:] | ||
|
||
else: | ||
batch_encode_inputs.append( | ||
self.encode( | ||
first_ids, | ||
second_ids, | ||
max_seq_len=max_seq_len, | ||
pad_to_max_seq_len=pad_to_max_seq_len, | ||
truncation_strategy=truncation_strategy, | ||
return_position_ids=return_position_ids, | ||
return_token_type_ids=return_token_type_ids, | ||
return_attention_mask=return_attention_mask, | ||
return_length=return_length, | ||
return_overflowing_tokens=return_overflowing_tokens, | ||
return_special_tokens_mask=return_special_tokens_mask)) | ||
|
||
return batch_encode_inputs |
Just to confirm: is MobileBertTokenizer exactly the same as BertTokenizer? If so, why does the batch_encode method need to be overridden here?
PaddleNLP's BertTokenizer output on the SQuAD task is not aligned with Hugging Face's. When reproducing the results on the SQuAD v1 and SQuAD v2 datasets, the model was consistently about one point worse with BertTokenizer. batch_encode is overridden to align with Hugging Face on the SQuAD task.
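For context, the heart of the override is the sliding-window ("doc stride") split of the second sequence, which is what must match Hugging Face for SQuAD. A standalone sketch of that windowing logic (a hypothetical helper, not the actual method):

```python
def split_with_stride(second_ids, max_len_for_pair, stride):
    """Yield overlapping windows of second_ids, mirroring the loop in batch_encode.

    Each window holds at most max_len_for_pair ids; consecutive windows
    overlap by stride ids, so answer spans crossing a window boundary survive.
    """
    while True:
        yield second_ids[:max_len_for_pair]
        if len(second_ids) <= max_len_for_pair:
            break
        second_ids = second_ids[max_len_for_pair - stride:]


# e.g. 10 ids, window of 4, stride 2 -> windows start at offsets 0, 2, 4, 6
print(list(split_with_stride(list(range(10)), 4, 2)))
```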
LGTM
PR types
New features
PR changes
Models
Description
Add the MobileBERT model: the model itself plus the tokenizer.
Verification and reproduction: https://github.com/nosaydomore/MobileBert_paddle
Pretrained weights (Baidu Cloud): https://pan.baidu.com/s/1kQN6k29tRhO2I-5QEV2yHw (extraction code: imtf)