add mobilebert model #1160

Merged: 19 commits merged into PaddlePaddle:develop on Dec 8, 2021

Conversation

nosaydomore (Contributor):

PR types

New features

PR changes

Models

Description

Add the MobileBERT model, including the model and the tokenizer.
Validation and reproduction details: https://github.com/nosaydomore/MobileBert_paddle

Pretrained weights (Baidu Netdisk):
Link: https://pan.baidu.com/s/1kQN6k29tRhO2I-5QEV2yHw  Extraction code: imtf

@yingyibiao yingyibiao self-requested a review October 13, 2021 11:07
@yingyibiao yingyibiao self-assigned this Oct 13, 2021
@yingyibiao yingyibiao removed their request for review October 13, 2021 11:10
@yingyibiao (Contributor):

Please add docstrings.

@FrostML (Contributor) left a comment:

Please flesh out the docstrings by following those of the other models in PaddleNLP, and add usage examples as well.

]


def gelu_new(x):
Contributor:

https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/functional/gelu_cn.html#gelu
See the documentation above: there is no need to implement an extra gelu_new here, since PaddlePaddle already provides a corresponding implementation of gelu.
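
For context, gelu_new in other BERT-style codebases is usually the tanh approximation of GELU, which Paddle exposes through the approximate flag of paddle.nn.functional.gelu. A minimal sketch of the equivalence, assuming that is what gelu_new is meant to compute here:

import math
import paddle
import paddle.nn.functional as F

def gelu_new(x):
    # tanh approximation of GELU used by several BERT variants (assumed definition)
    return 0.5 * x * (1.0 + paddle.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * paddle.pow(x, 3.0))))

x = paddle.randn([2, 8])
# F.gelu(x, approximate=True) computes the same approximation, so the
# hand-written helper can be replaced by the built-in.
print(paddle.allclose(gelu_new(x), F.gelu(x, approximate=True)))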

MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST = ["google/mobilebert-uncased"]


class NoNorm(nn.Layer): #paddle
Contributor:

Please remove comments that have no clear meaning (such as the trailing "#paddle").

pretrained_resource_files_map = {
    "vocab_file": {
        "mobilebert-uncased":
        "https://huggingface.co/google/mobilebert-uncased/resolve/main/vocab.txt"
Contributor:

@yingyibiao could you provide the PaddleNLP-hosted link? Then this entry can be updated accordingly.

"tanh": F.tanh,
}

MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST = ["google/mobilebert-uncased"]
Contributor:

Is this constant actually used anywhere? If not, it can be removed.

>>> input_ids = paddle.to_tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
>>> outputs = model(input_ids)
>>> prediction_logits = outputs[0]
>>> seq_relationship_logits = outputs[1]
Contributor:

Please refer to the BERT documentation in PaddleNLP and adjust the format of this docstring accordingly.
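
For reference, a docstring example in the usual PaddleNLP style could look like the sketch below. The class names MobileBertModel/MobileBertTokenizer and the pretrained weight name 'mobilebert-uncased' are assumptions based on this PR and may differ from the merged code.

    Example:
        .. code-block::

            import paddle
            from paddlenlp.transformers import MobileBertModel, MobileBertTokenizer

            # Assumed names; adjust to the merged API.
            tokenizer = MobileBertTokenizer.from_pretrained('mobilebert-uncased')
            model = MobileBertModel.from_pretrained('mobilebert-uncased')

            inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!")
            inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
            outputs = model(**inputs)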

@ZeyuChen ZeyuChen added contributions transformers Transformer-based pre-trained models labels Dec 4, 2021
@yingyibiao (Contributor):

The weights have been uploaded to the corresponding BOS link.

ACT2FN = {
    "relu": F.relu,
    "gelu": F.gelu,
    "gelu_new": gelu_new,
    "tanh": F.tanh,
@FrostML (Contributor), Dec 6, 2021:

What is the reason that the approximate gelu and tanh are no longer supported here?

Contributor (Author):

The model does not use them.
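
In that case ACT2FN can be reduced to the activations the model actually references. A minimal sketch, assuming F is paddle.nn.functional and that only relu and gelu are needed (the exact set is an assumption based on the discussion above):

import paddle.nn.functional as F

# Keep only the activations MobileBERT actually calls (assumed set).
ACT2FN = {
    "relu": F.relu,
    "gelu": F.gelu,
}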

paddlenlp/transformers/mobilebert/modeling.py (thread resolved)
Comment on lines 6 to 288
encoded_inputs["input_ids"] = sequence
if return_token_type_ids:
encoded_inputs["token_type_ids"] = token_type_ids
if return_special_tokens_mask:
encoded_inputs[
"special_tokens_mask"] = self.get_special_tokens_mask(
ids, pair_ids)
if return_length:
encoded_inputs["seq_len"] = len(encoded_inputs[
"input_ids"])

# Check lengths
assert max_seq_len is None or len(encoded_inputs[
"input_ids"]) <= max_seq_len

# Padding
needs_to_be_padded = pad_to_max_seq_len and \
max_seq_len and len(encoded_inputs["input_ids"]) < max_seq_len

encoded_inputs['offset_mapping'] = offset_mapping

if needs_to_be_padded:
difference = max_seq_len - len(encoded_inputs[
"input_ids"])
if self.padding_side == 'right':
if return_attention_mask:
encoded_inputs["attention_mask"] = [1] * len(
encoded_inputs[
"input_ids"]) + [0] * difference
if return_token_type_ids:
# 0 for padding token mask
encoded_inputs["token_type_ids"] = (
encoded_inputs["token_type_ids"] +
[self.pad_token_type_id] * difference)
if return_special_tokens_mask:
encoded_inputs[
"special_tokens_mask"] = encoded_inputs[
"special_tokens_mask"] + [1
] * difference
encoded_inputs["input_ids"] = encoded_inputs[
"input_ids"] + [self.pad_token_id] * difference
encoded_inputs['offset_mapping'] = encoded_inputs[
'offset_mapping'] + [(0, 0)] * difference
elif self.padding_side == 'left':
if return_attention_mask:
encoded_inputs["attention_mask"] = [
0
] * difference + [1] * len(encoded_inputs[
"input_ids"])
if return_token_type_ids:
# 0 for padding token mask
encoded_inputs["token_type_ids"] = (
[self.pad_token_type_id] * difference +
encoded_inputs["token_type_ids"])
if return_special_tokens_mask:
encoded_inputs["special_tokens_mask"] = [
1
] * difference + encoded_inputs[
"special_tokens_mask"]
encoded_inputs["input_ids"] = [
self.pad_token_id
] * difference + encoded_inputs["input_ids"]
encoded_inputs['offset_mapping'] = [
(0, 0)
] * difference + encoded_inputs['offset_mapping']
else:
if return_attention_mask:
encoded_inputs["attention_mask"] = [1] * len(
encoded_inputs["input_ids"])

if return_position_ids:
encoded_inputs["position_ids"] = list(
range(len(encoded_inputs["input_ids"])))

encoded_inputs['overflow_to_sample'] = example_id
batch_encode_inputs.append(encoded_inputs)

if len(second_ids) <= max_len_for_pair:
break
else:
second_ids = second_ids[max_len_for_pair - stride:]
token_pair_offset_mapping = token_pair_offset_mapping[
max_len_for_pair - stride:]

else:
batch_encode_inputs.append(
self.encode(
first_ids,
second_ids,
max_seq_len=max_seq_len,
pad_to_max_seq_len=pad_to_max_seq_len,
truncation_strategy=truncation_strategy,
return_position_ids=return_position_ids,
return_token_type_ids=return_token_type_ids,
return_attention_mask=return_attention_mask,
return_length=return_length,
return_overflowing_tokens=return_overflowing_tokens,
return_special_tokens_mask=return_special_tokens_mask))

return batch_encode_inputs
Contributor:

Just to confirm: is MobileBertTokenizer exactly the same as BertTokenizer? If so, why does the batch_encode method need to be re-implemented here?

Contributor (Author):

On the SQuAD task the output of PaddleNLP's BertTokenizer is not aligned with Hugging Face's. When reproducing the results on the SQuAD v1 and SQuAD v2 datasets, the results obtained with BertTokenizer were consistently about one point lower, so batch_encode was re-implemented to align the SQuAD pipeline.
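
As an illustration of what that alignment involves, here is a hedged usage sketch of the overridden batch_encode for SQuAD-style question/context pairs. The tokenizer class name, pretrained weight name, and argument values are assumptions based on this PR's diff, not the final API.

from paddlenlp.transformers import MobileBertTokenizer

# Assumed names and values, for illustration only.
tokenizer = MobileBertTokenizer.from_pretrained("mobilebert-uncased")

question = "What does MobileBERT compress?"
context = "MobileBERT is a compact task-agnostic version of BERT. " * 50  # long context

features = tokenizer.batch_encode(
    [(question, context)],
    max_seq_len=384,
    stride=128,
    pad_to_max_seq_len=True,
    return_attention_mask=True)

# With stride > 0 the long context is split into overlapping windows. Each
# feature keeps 'offset_mapping' (token-to-character spans) and
# 'overflow_to_sample' (index of the originating example), which SQuAD
# post-processing uses to map predicted start/end tokens back to answer text
# in the original context.
for feature in features:
    print(len(feature["input_ids"]), feature["overflow_to_sample"])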

@yingyibiao (Contributor) left a comment:

LGTM

@yingyibiao yingyibiao merged commit ae02c31 into PaddlePaddle:develop Dec 8, 2021