[prompt] add MaskedLMVerbalizer #3889

Merged · 4 commits · Nov 28, 2022

Changes from 3 commits
18 changes: 17 additions & 1 deletion docs/advanced_guide/prompt.md
@@ -355,7 +355,7 @@ template = AutoTemplate.create_from(prompt="{'prefix': None, 'encoder': 'mlp', '

### Discrete Label Word Mapping

``ManualVerbalizer`` supports building the label word mapping for ``{'mask'}``, supports multiple ``{'mask'}`` tokens, and acts directly on the ``AutoMaskedLM`` model structure. When the predicted word for a label is longer than ``1`` token, the mean of its token scores is taken by default.
``ManualVerbalizer`` supports building the label word mapping for ``{'mask'}``; the same label may map to multiple words of different lengths, and it acts directly on the ``AutoMaskedLM`` model structure. When the predicted word for a label is longer than ``1`` token, the mean of its token scores is taken by default; when a label corresponds to multiple `{'mask'}` tokens, the default behavior is equivalent to the single `{'mask'}` case.
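
For intuition, a tiny numeric sketch (not part of this PR) of the default mean aggregation for a multi-token label word; the vocabulary ids and scores below are made up:

```python
import paddle

# Hypothetical scores over a 4-token vocabulary at one {'mask'} position.
scores = paddle.to_tensor([0.1, 0.6, 0.1, 0.2])

# Suppose a label word is tokenized into the made-up ids [1, 3]; the
# default strategy averages the scores of its sub-tokens.
label_score = (scores[1] + scores[3]) / 2  # (0.6 + 0.2) / 2 = 0.4
print(float(label_score))
```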

**Calling the API**

@@ -373,6 +373,22 @@ verbalizer = ManualVerbalizer(tokenizer=tokenizer,
- ``label_words`` : The dictionary mapping original labels to predicted words.
- ``tokenizer`` : The tokenizer of the pretrained model, used to encode the predicted words.

``MaskedLMVerbalizer`` also supports building the label word mapping for ``{'mask'}``. The mapped word corresponds token by token to the `{'mask'}` tokens in the template, so the length of the mapped word must equal the number of `{'mask'}` tokens. If the defined mapping gives the same label multiple words, only the first word takes effect. In a custom `compute_metric` function, first call `verbalizer.aggregate_multiple_mask` to merge the multiple `{'mask'}` predictions before computing the evaluation metric; the product strategy is used by default (see the sketch after the parameter list below).

**Calling the API**

```python
from paddlenlp.prompt import MaskedLMVerbalizer
from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
verbalizer = MaskedLMVerbalizer(tokenizer=tokenizer,
                                label_words={'负向': '不', '正向': '很'})
```

The initialization parameters are defined as follows:

- ``label_words`` : The dictionary mapping original labels to predicted words.
- ``tokenizer`` : The tokenizer of the pretrained model, used to encode the predicted words.
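
As a usage illustration, a minimal sketch of a custom `compute_metric` that merges the multi-`{'mask'}` predictions before evaluation. The `EvalPrediction`-style fields `.predictions` and `.label_ids` are assumptions borrowed from the Trainer convention, not something this PR defines:

```python
import paddle
from paddlenlp.prompt import MaskedLMVerbalizer
from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
verbalizer = MaskedLMVerbalizer(tokenizer=tokenizer,
                                label_words={'负向': '不', '正向': '很'})

def compute_metric(eval_preds):
    # eval_preds.predictions: mask logits of shape [batch, num_masks, vocab].
    logits = paddle.to_tensor(eval_preds.predictions)
    # Merge the per-{'mask'} distributions into one score per label;
    # the product strategy is the default.
    label_scores = verbalizer.aggregate_multiple_mask(logits)
    preds = paddle.argmax(label_scores, axis=-1)
    labels = paddle.to_tensor(eval_preds.label_ids).reshape([-1])
    accuracy = float((preds == labels).astype("float32").mean())
    return {"accuracy": accuracy}
```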

### Continuous Label Word Mapping

The label word mapping classifier ``SoftVerbalizer`` modifies the structure of ``AutoMaskedLM``, replacing the pretrained model's final "hidden-to-vocabulary" layer with a "hidden-to-label" mapping. The parameters of this layer are initialized from the word embeddings of the predicted words in the label word mapping; if a predicted word is longer than ``1`` token, the mean of its token embeddings is used for initialization. Currently supported pretrained models include ``ErnieForMaskedLM``, ``BertForMaskedLM``, ``AlbertForMaskedLM`` and ``RobertaForMaskedLM``. It can be used to implement the WARP algorithm.
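
A minimal usage sketch, mirroring the ``SoftVerbalizer`` signature shown in the verbalizer.py diff below; the two-character label words here are illustrative only:

```python
from paddlenlp.prompt import SoftVerbalizer
from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh")
tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
# The vocabulary head of `model` is replaced by a hidden-to-label layer
# initialized from the embeddings of the label words below.
verbalizer = SoftVerbalizer(label_words={'负向': '不好', '正向': '很好'},
                            tokenizer=tokenizer,
                            model=model)
```
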
4 changes: 3 additions & 1 deletion paddlenlp/prompt/prompt_utils.py
@@ -96,8 +96,10 @@ def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
                    length = len(value)
                    new_values[index][0, :length, :length] = value
                values = new_values
            elif key != "labels":
            elif key in ("soft_token_ids", "encoder_ids"):
                for index, value in enumerate(values):
                    values[index] = value + [0] * (max_length - len(value))
            elif key != "labels":
                continue
            batch[key] = self._convert_to_tensors(values)
        return batch
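
For reference, a self-contained sketch (values invented) of the padding rule this branch adds for `soft_token_ids` and `encoder_ids`:

```python
# Pad each feature with zeros up to the longest sequence in the batch.
values = [[1, 2, 3], [4, 5]]
max_length = max(len(value) for value in values)
for index, value in enumerate(values):
    values[index] = value + [0] * (max_length - len(value))
print(values)  # [[1, 2, 3], [4, 5, 0]]
```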
80 changes: 71 additions & 9 deletions paddlenlp/prompt/verbalizer.py
@@ -27,7 +27,9 @@
from paddlenlp.transformers import PretrainedTokenizer, PretrainedModel
from paddlenlp.utils.log import logger

__all__ = ["Verbalizer", "ManualVerbalizer", "SoftVerbalizer"]
__all__ = [
    "Verbalizer", "ManualVerbalizer", "SoftVerbalizer", "MaskedLMVerbalizer"
]

# Verbalizer used to be saved in a file.
VERBALIZER_CONFIG_FILE = "verbalizer_config.json"
@@ -263,9 +265,11 @@ class ManualVerbalizer(Verbalizer):
        An instance of PretrainedTokenizer for label word tokenization.
    """

    def __init__(self, label_words: Dict, tokenizer: PretrainedTokenizer):
    def __init__(self, label_words: Dict, tokenizer: PretrainedTokenizer,
                 **kwargs):
        super(ManualVerbalizer, self).__init__(label_words=label_words,
                                               tokenizer=tokenizer)
                                               tokenizer=tokenizer,
                                               **kwargs)

    def create_parameters(self):
        return None
@@ -292,10 +296,7 @@ def aggregate_multiple_mask(self, outputs: Tensor, atype: str = None):
"tokens.".format(atype))
return outputs

def process_outputs(self,
outputs: Tensor,
masked_positions: Tensor = None,
**kwargs):
def process_outputs(self, outputs: Tensor, masked_positions: Tensor = None):
"""
Process outputs over the vocabulary, including the following steps:

@@ -364,10 +365,11 @@ class SoftVerbalizer(Verbalizer):
    LAST_LINEAR = ["AlbertForMaskedLM", "RobertaForMaskedLM"]

    def __init__(self, label_words: Dict, tokenizer: PretrainedTokenizer,
                 model: PretrainedModel):
                 model: PretrainedModel, **kwargs):
        super(SoftVerbalizer, self).__init__(label_words=label_words,
                                             tokenizer=tokenizer,
                                             model=model)
                                             model=model,
                                             **kwargs)
        del self.model
        setattr(model, self.head_name[0], MaskedLMIdentity())

@@ -472,3 +474,63 @@ def _create_init_weight(self, weight, is_bias=False):
                                 axis=1).reshape(word_shape)
        weight = self.aggregate(weight, token_mask, aggr_type)
        return weight


class MaskedLMVerbalizer(Verbalizer):
    """
    MaskedLMVerbalizer defines the mapping from labels to words manually and
    supports multiple masks corresponding to multi-token words.

    Args:
        label_words (`dict`):
            Define the mapping from labels to a single word. Only the first
            word is used if multiple words are defined.
        tokenizer (`PretrainedTokenizer`):
            An instance of PretrainedTokenizer for label word tokenization.
    """

    def __init__(self, label_words: Dict, tokenizer: PretrainedTokenizer,
                 **kwargs):
        super(MaskedLMVerbalizer, self).__init__(label_words=label_words,
                                                 tokenizer=tokenizer,
                                                 **kwargs)

    def create_parameters(self):
        return None

    def aggregate_multiple_mask(self, outputs: Tensor, atype: str = "product"):
        assert outputs.ndim == 3
        # Keep only the first word of every label; after transposing,
        # token_ids has shape [num_mask_tokens, num_labels].
        token_ids = self.token_ids[:, 0, :].T
        batch_size, num_token, num_pred = outputs.shape
        # Scores of every label's first token at the first mask position.
        results = paddle.index_select(outputs[:, 0, :], token_ids[0], axis=1)
        if atype == "first":
            return results

        for index in range(1, num_token):
            sub_results = paddle.index_select(outputs[:, index, :],
                                              token_ids[index],
                                              axis=1)
            if atype in ("mean", "sum"):
                results += sub_results
            elif atype == "product":
                results *= sub_results
            elif atype == "max":
                results = paddle.stack([results, sub_results], axis=-1)
                results = results.max(axis=-1)
            else:
                raise ValueError(
                    "Strategy {} is not supported to aggregate multiple "
                    "tokens.".format(atype))
        if atype == "mean":
            results = results / num_token
        return results

    def process_outputs(self, outputs: Tensor, masked_positions: Tensor = None):
        if masked_positions is None:
            return outputs

        # Keep only the vocabulary distributions at the masked positions.
        batch_size, _, num_pred = outputs.shape
        outputs = outputs.reshape([-1, num_pred])
        outputs = paddle.gather(outputs, masked_positions)
        outputs = outputs.reshape([batch_size, -1, num_pred])
        return outputs
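
To make the strategies above concrete, a toy check (not part of the PR) of what each aggregation computes once per-position label scores have been gathered:

```python
import paddle

# Per-{'mask'} label scores for a batch of one example and two labels.
first = paddle.to_tensor([[0.2, 0.5]])   # scores at the first mask
second = paddle.to_tensor([[0.4, 0.1]])  # scores at the second mask

product = first * second                 # default "product": [[0.08, 0.05]]
mean = (first + second) / 2              # "mean": [[0.30, 0.30]]
maximum = paddle.maximum(first, second)  # "max":  [[0.40, 0.50]]
```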