fix resize token embedding method #2763
Conversation
ping @yingyibiao
if not new_num_tokens or new_num_tokens == len(old_embeddings):
    return old_embeddings
old_embeddings is of type Embedding, which should not have a len() method.
OK, I'll replace it with weight.shape[0].
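As a minimal sketch of the agreed change (assuming old_embeddings is a paddle.nn.Embedding whose weight has shape [vocab_size, embedding_dim]), the early-return check could read:

# Read the current vocabulary size from the embedding weight instead of calling len()
old_num_tokens = old_embeddings.weight.shape[0]
if new_num_tokens is None or new_num_tokens == old_num_tokens:
    return old_embeddings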
try:
    old_embeddings = self.get_input_embeddings()
except NotImplementedError:
    raise NotImplementedError(
It is not a good choice to catch a NotImplementedError and then raise another NotImplementedError; maybe what we need to do is just add an error message to get_input_embeddings.
Yes, we should add an error message to the get_input_embedding method.
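A rough sketch of that direction (illustrative only, not the merged code): the base-class getter raises the descriptive error itself, so resize_token_embeddings no longer needs to catch and re-raise it.

def get_input_embeddings(self):
    # Illustrative: report the missing override here instead of inside resize_token_embeddings
    raise NotImplementedError(
        f'model of {type(self)} has not implemented the `get_input_embeddings` '
        'or `set_input_embeddings` method; please implement them in your model '
        'before calling `resize_token_embeddings`')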
except NotImplementedError:
    raise NotImplementedError(
        f'model of {type(self)} has not implemented the `get_input_embedding` or `set_input_embedding` '
        'method, please use the another model to call `resize_token_embeddings` method'
"use the another model" is not quite what is expected here; maybe users can implement the method themselves instead.
I have addressed all of the above comments, please review, thanks.
I will share the unit test code for you:

import numpy as np
import os
from typing import Type, List, Tuple
import shutil
import unittest
from multiprocessing import Process
from tempfile import TemporaryDirectory
from parameterized import parameterized
from paddle import nn
from paddlenlp.transformers.model_utils import PretrainedModel, MODEL_HOME
from paddlenlp.transformers.tokenizer_utils import PretrainedTokenizer
from paddlenlp.transformers import BertTokenizer, BasicTokenizer, WordpieceTokenizer
from paddlenlp.transformers.bert.modeling import BertForPretraining
from paddlenlp.transformers.gpt.modeling import GPTForPretraining
from paddlenlp.transformers.tinybert.modeling import TinyBertForPretraining
from paddlenlp.transformers.bert.tokenizer import BertTokenizer
from paddlenlp.transformers.gpt.tokenizer import GPTTokenizer, GPTChineseTokenizer
from paddlenlp.transformers.tinybert.tokenizer import TinyBertTokenizer
from tests.common_test import CpuCommonTest, CommonTest
from tests.util import slow, assert_raises


def get_pretrained_models_params() -> List[Tuple[str, Type[PretrainedModel]]]:
    """get all of the pretrained model names in some PretrainedModels

    Returns:
        List[Tuple[str, Type[PretrainedModel]]]: the parameters of the unit test method
    """
    # from paddlenlp.transformers.electra.modeling import ElectraForTotalPretraining
    from paddlenlp.transformers.ctrl.modeling import CTRLModel
    model_types: List[Type[PretrainedModel]] = [
        BertForPretraining, GPTForPretraining, TinyBertForPretraining, CTRLModel
    ]
    name_class_tuples: List[Tuple[str, Type[PretrainedModel]]] = []
    for ModelType in model_types:
        for model_name in ModelType.pretrained_resource_files_map.get(
                'model_state', {}).keys():
            name_class_tuples.append([model_name, ModelType])
    return name_class_tuples


class TestPretrainedFromPretrained(CpuCommonTest):
    """module for testing pretrained models"""

    @parameterized.expand(get_pretrained_models_params())
    def test_resize_token_embedding(self, model_name: str,
                                    PretrainedModelClass: Type[PretrainedModel]):
        cache_dir = os.path.join(MODEL_HOME, model_name)
        model: PretrainedModelClass = PretrainedModelClass.from_pretrained(
            model_name)
        vocab_size = model.base_model.config['vocab_size']
        model.resize_token_embeddings(vocab_size + 10)
        assert model.base_model.config["vocab_size"] == vocab_size + 10

You may need to modify it a bit to make it run in your code context. Hope that helps.
# Update vocab_size
self.vocab_size = new_num_tokens
Let's keep this as well, to stay consistent with HF.
Done, and the conflicts have been resolved as well.
This conflicts with the latest code and still needs to be resolved.
Done @guoshengCS
…eNLP into fix-resize-token-embedding
ping @yingyibiao
LGTM
PR types
Bug fixes
PR changes
Models
Description
There are some latent bugs in the resize_token_embedding method, as shown by the unit test.
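For context, a minimal sketch of the fixed flow discussed in this PR (helper names such as _get_resized_embeddings follow the Hugging Face convention and are assumptions here, not the merged implementation):

def resize_token_embeddings(self, new_num_tokens=None):
    # get_input_embeddings is expected to raise a descriptive NotImplementedError itself,
    # so there is no catch-and-re-raise here.
    old_embeddings = self.get_input_embeddings()
    # Use weight.shape[0] instead of len(), since nn.Embedding has no __len__
    if new_num_tokens is None or new_num_tokens == old_embeddings.weight.shape[0]:
        return old_embeddings

    new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)
    self.set_input_embeddings(new_embeddings)

    # Update vocab_size, consistent with HF
    self.vocab_size = new_num_tokens
    return new_embeddings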