add mobilebert model #1160
Conversation
Please add docstrings here~
Please flesh this out following the docstrings of other models in PaddleNLP. Adding an example would also help.
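For reference, a minimal sketch of the docstring style used by other PaddleNLP models (summary, `Args:` section, runnable `Example`). The class name, argument names, and the `mobilebert-uncased` weight name are assumptions based on this PR, not a final signature:

```python
class MobileBertModel(MobileBertPretrainedModel):
    """
    The bare MobileBERT Model outputting raw hidden states.

    Args:
        vocab_size (int):
            Vocabulary size of the inputs. Also the dimension of the token embeddings.
        hidden_size (int, optional):
            Dimensionality of the encoder layers. Defaults to `512`.

    Example:
        .. code-block::

            import paddle
            from paddlenlp.transformers import MobileBertModel, MobileBertTokenizer

            tokenizer = MobileBertTokenizer.from_pretrained('mobilebert-uncased')
            model = MobileBertModel.from_pretrained('mobilebert-uncased')

            inputs = tokenizer("Hello, my dog is cute")
            inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
            outputs = model(**inputs)
    """
```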
```python
def gelu_new(x):
```
https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/functional/gelu_cn.html#gelu
Per the documentation above, there is no need to implement a separate gelu_new here; PaddlePaddle already provides a corresponding implementation.
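To illustrate, `paddle.nn.functional.gelu` takes an `approximate` flag, so the hand-rolled tanh approximation can be checked against the built-in directly (a quick sketch; the tolerance is chosen arbitrarily):

```python
import math

import paddle
import paddle.nn.functional as F


def gelu_new(x):
    # tanh approximation of GELU, as hand-implemented in this PR
    return 0.5 * x * (1.0 + paddle.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * paddle.pow(x, 3.0))))


x = paddle.randn([4, 8])
# F.gelu(..., approximate=True) computes the same tanh approximation natively
print(paddle.allclose(F.gelu(x, approximate=True), gelu_new(x), atol=1e-6))  # True
```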
```python
MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST = ["google/mobilebert-uncased"]


class NoNorm(nn.Layer):  #paddle
```
Please remove this comment; it has no clear meaning.
```python
pretrained_resource_files_map = {
    "vocab_file": {
        "mobilebert-uncased":
        "https://huggingface.co/google/mobilebert-uncased/resolve/main/vocab.txt"
```
@yingyibiao could you provide the PaddleNLP-hosted link? This should then be updated accordingly.
"tanh": F.tanh, | ||
} | ||
|
||
MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST = ["google/mobilebert-uncased"] |
Is this actually used anywhere? If not, it can be removed.
>>> input_ids = paddle.to_tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1 | ||
>>> outputs = model(input_ids) | ||
>>> prediction_logits = outputs[0] | ||
>>> seq_relationship_logits = outputs[1] |
Please refer to the BERT documentation here and adjust the docstring format to match.
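Concretely, following the BERT docstring convention in PaddleNLP, the example could look roughly like this (a sketch; the `MobileBertForPretraining` class name and the `mobilebert-uncased` weight name are assumptions based on this PR):

```python
import paddle
from paddlenlp.transformers import MobileBertForPretraining, MobileBertTokenizer

tokenizer = MobileBertTokenizer.from_pretrained('mobilebert-uncased')
model = MobileBertForPretraining.from_pretrained('mobilebert-uncased')

# PaddleNLP tokenizers return a dict of python lists; batch them manually
inputs = tokenizer("Hello, my dog is cute")
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}  # batch size 1

prediction_logits, seq_relationship_logits = model(**inputs)
```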
The weights have been uploaded to the corresponding BOS link~
```python
ACT2FN = {
    "relu": F.relu,
    "gelu": F.gelu,
    "gelu_new": gelu_new,
    "tanh": F.tanh,
```
What is the reason for no longer supporting the approximate gelu and tanh here?
The model doesn't use them.
encoded_inputs["input_ids"] = sequence | ||
if return_token_type_ids: | ||
encoded_inputs["token_type_ids"] = token_type_ids | ||
if return_special_tokens_mask: | ||
encoded_inputs[ | ||
"special_tokens_mask"] = self.get_special_tokens_mask( | ||
ids, pair_ids) | ||
if return_length: | ||
encoded_inputs["seq_len"] = len(encoded_inputs[ | ||
"input_ids"]) | ||
|
||
# Check lengths | ||
assert max_seq_len is None or len(encoded_inputs[ | ||
"input_ids"]) <= max_seq_len | ||
|
||
# Padding | ||
needs_to_be_padded = pad_to_max_seq_len and \ | ||
max_seq_len and len(encoded_inputs["input_ids"]) < max_seq_len | ||
|
||
encoded_inputs['offset_mapping'] = offset_mapping | ||
|
||
if needs_to_be_padded: | ||
difference = max_seq_len - len(encoded_inputs[ | ||
"input_ids"]) | ||
if self.padding_side == 'right': | ||
if return_attention_mask: | ||
encoded_inputs["attention_mask"] = [1] * len( | ||
encoded_inputs[ | ||
"input_ids"]) + [0] * difference | ||
if return_token_type_ids: | ||
# 0 for padding token mask | ||
encoded_inputs["token_type_ids"] = ( | ||
encoded_inputs["token_type_ids"] + | ||
[self.pad_token_type_id] * difference) | ||
if return_special_tokens_mask: | ||
encoded_inputs[ | ||
"special_tokens_mask"] = encoded_inputs[ | ||
"special_tokens_mask"] + [1 | ||
] * difference | ||
encoded_inputs["input_ids"] = encoded_inputs[ | ||
"input_ids"] + [self.pad_token_id] * difference | ||
encoded_inputs['offset_mapping'] = encoded_inputs[ | ||
'offset_mapping'] + [(0, 0)] * difference | ||
elif self.padding_side == 'left': | ||
if return_attention_mask: | ||
encoded_inputs["attention_mask"] = [ | ||
0 | ||
] * difference + [1] * len(encoded_inputs[ | ||
"input_ids"]) | ||
if return_token_type_ids: | ||
# 0 for padding token mask | ||
encoded_inputs["token_type_ids"] = ( | ||
[self.pad_token_type_id] * difference + | ||
encoded_inputs["token_type_ids"]) | ||
if return_special_tokens_mask: | ||
encoded_inputs["special_tokens_mask"] = [ | ||
1 | ||
] * difference + encoded_inputs[ | ||
"special_tokens_mask"] | ||
encoded_inputs["input_ids"] = [ | ||
self.pad_token_id | ||
] * difference + encoded_inputs["input_ids"] | ||
encoded_inputs['offset_mapping'] = [ | ||
(0, 0) | ||
] * difference + encoded_inputs['offset_mapping'] | ||
else: | ||
if return_attention_mask: | ||
encoded_inputs["attention_mask"] = [1] * len( | ||
encoded_inputs["input_ids"]) | ||
|
||
if return_position_ids: | ||
encoded_inputs["position_ids"] = list( | ||
range(len(encoded_inputs["input_ids"]))) | ||
|
||
encoded_inputs['overflow_to_sample'] = example_id | ||
batch_encode_inputs.append(encoded_inputs) | ||
|
||
if len(second_ids) <= max_len_for_pair: | ||
break | ||
else: | ||
second_ids = second_ids[max_len_for_pair - stride:] | ||
token_pair_offset_mapping = token_pair_offset_mapping[ | ||
max_len_for_pair - stride:] | ||
|
||
else: | ||
batch_encode_inputs.append( | ||
self.encode( | ||
first_ids, | ||
second_ids, | ||
max_seq_len=max_seq_len, | ||
pad_to_max_seq_len=pad_to_max_seq_len, | ||
truncation_strategy=truncation_strategy, | ||
return_position_ids=return_position_ids, | ||
return_token_type_ids=return_token_type_ids, | ||
return_attention_mask=return_attention_mask, | ||
return_length=return_length, | ||
return_overflowing_tokens=return_overflowing_tokens, | ||
return_special_tokens_mask=return_special_tokens_mask)) | ||
|
||
return batch_encode_inputs |
Just to confirm: is MobileBertTokenizer exactly the same as BertTokenizer? If so, why does the batch_encode method need to be overridden here?
PaddleNLP's BertTokenizer output on the SQuAD task is not aligned with Hugging Face's. When reproducing the results on the SQuAD v1 and SQuAD v2 datasets, the model was consistently about one point worse with BertTokenizer. batch_encode is overridden to align with Hugging Face on the SQuAD task.
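For context, the heart of the override is the sliding-window ("doc stride") split of the second sequence, which is what must match Hugging Face for SQuAD. A standalone sketch of that windowing logic (a hypothetical helper, not the actual method):

```python
def split_with_stride(second_ids, max_len_for_pair, stride):
    """Yield overlapping windows of second_ids, mirroring the loop in batch_encode.

    Each window holds at most max_len_for_pair ids; consecutive windows
    overlap by stride ids, so answer spans crossing a window boundary survive.
    """
    while True:
        yield second_ids[:max_len_for_pair]
        if len(second_ids) <= max_len_for_pair:
            break
        second_ids = second_ids[max_len_for_pair - stride:]


# e.g. 10 ids, window of 4, stride 2 -> windows start at offsets 0, 2, 4, 6
print(list(split_with_stride(list(range(10)), 4, 2)))
```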
LGTM
PR types
New features
PR changes
Models
Description
Add the MobileBERT model: the model itself plus the tokenizer.
Verification and reproduction: https://github.com/nosaydomore/MobileBert_paddle
Pretrained weights (Baidu Cloud): https://pan.baidu.com/s/1kQN6k29tRhO2I-5QEV2yHw (extraction code: imtf)