
Simplify Encoder: Special Tokens, OOB, Batch Encoding #106

Open
PetrochukM opened this issue Oct 11, 2020 · 1 comment

@PetrochukM
Owner

The Encoder is unnecessarily complex:

  • It has OOB handling that's hard to reason about
  • The Encoder implementation is coupled with special tokens
  • The batch_encoding isn't helpful

We could break up the Encoder into a Tokenizer, TokenEnum, and torch.tensor.

For example:

pad_token = "<pad>"
unk_token = "<unk>"
data = ["abc", "def", "ghi"]

tokenizer = CharacterTokenizer()
token_enum = TokenEnum([pad_token, unk_token] + flatten([tokenizer.tokenize(t) for t in data]))

text = "abcd1"
encoded = torch.tensor([token_enum.get(t, unk_token) for t in tokenizer.tokenize(text)])
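
None of these pieces exist yet; here's a minimal sketch of what CharacterTokenizer, TokenEnum, and flatten could look like (the names and signatures are just illustrative):

from itertools import chain
from typing import Dict, Iterable, List


def flatten(lists: Iterable[List[str]]) -> List[str]:
    return list(chain.from_iterable(lists))


class CharacterTokenizer:
    def tokenize(self, text: str) -> List[str]:
        return list(text)


class TokenEnum:
    """Assigns each unique token a stable integer index."""

    def __init__(self, tokens: Iterable[str]):
        self.index: Dict[str, int] = {}
        for token in tokens:
            self.index.setdefault(token, len(self.index))

    def get(self, token: str, default_token: str) -> int:
        # Fall back to the index of the default token (e.g. "<unk>").
        return self.index.get(token, self.index[default_token])

With these, the snippet above runs end to end: the out-of-vocabulary character "1" falls back to the "<unk>" index.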

Notes:

  • Special tokens like "SEP", "CLS", "EOS", "SOS", "UNK", "PAD" can be easily added without overriding the current implementation.
  • This should be easy to understand. It's just a tokenizer and an index.
  • This idea can also be used without the tokenizer for labels (see the sketch after this list).
  • The tokenizer can either be learned from data, or rule-based. It doesn't matter.
  • It still works with an "unknown token".
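
For example, assuming the TokenEnum sketch above, labels could be indexed directly without a tokenizer:

labels = ["positive", "negative", "neutral"]
label_enum = TokenEnum(labels)
batch = torch.tensor([label_enum.index[l] for l in ["negative", "positive"]])  # tensor([1, 0])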
@PetrochukM
Owner Author

PetrochukM commented Oct 22, 2020

Furthermore, it'd be helpful if the vocab could be generated on demand during training. This means we'd need to create a related embedding table that grows during training, as needed.

With such a system, we wouldn't need to initially load and tokenize the data.

To accomplish this, either the Embedding or the TokenEnum would also need to handle updating the optimizer and any other object that depends on the model parameters.
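
A minimal sketch of how that could work, assuming a hypothetical GrowableEmbedding wrapper that copies the old rows into a larger table and swaps the parameter into the optimizer's param groups:

import torch
import torch.nn as nn


class GrowableEmbedding(nn.Module):
    """Embedding table that can add rows for tokens discovered during training."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)

    def grow(self, new_vocab_size: int, optimizer: torch.optim.Optimizer):
        old_weight = self.embedding.weight
        new_embedding = nn.Embedding(new_vocab_size, old_weight.shape[1])
        with torch.no_grad():
            # Copy the existing rows; new rows keep their random initialization.
            new_embedding.weight[: old_weight.shape[0]] = old_weight
        self.embedding = new_embedding
        # Swap the parameter in the optimizer so the new rows get gradient updates.
        # (Any optimizer state attached to the old weight is dropped here.)
        for group in optimizer.param_groups:
            group["params"] = [p for p in group["params"] if p is not old_weight]
            group["params"].append(self.embedding.weight)

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        return self.embedding(indices)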
