
Simplify Encoder: Special Tokens, OOB, Batch Encoding #106

Open
PetrochukM opened this issue Oct 11, 2020 · 1 comment

@PetrochukM
Owner

The Encoder is unnecessarily complex:

  • It has OOB handling that's hard to reason about
  • The Encoder implementation is coupled with special tokens
  • The batch_encoding isn't helpful

We could break up the Encoder into a Tokenizer, TokenEnum, and torch.tensor.

For example:

pad_token = "<pad>"
unk_token = "<unk>"
data = ["abc", "def", "ghi"]

tokenizer = CharacterTokenizer()
token_enum = TokenEnum([pad_token, unk_token] + flatten([tokenizer.tokenize(t) for t in data]))

text = "abcd1"
encoded = torch.tensor([token_enum.get(t, unk_token) for t in tokenizer.tokenize(text)])
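
None of these pieces exist yet; here's a minimal sketch of what CharacterTokenizer, TokenEnum, and flatten could look like (the names and signatures are just illustrative):

from itertools import chain
from typing import Dict, Iterable, List


def flatten(lists: Iterable[List[str]]) -> List[str]:
    return list(chain.from_iterable(lists))


class CharacterTokenizer:
    def tokenize(self, text: str) -> List[str]:
        return list(text)


class TokenEnum:
    """Assigns each unique token a stable integer index."""

    def __init__(self, tokens: Iterable[str]):
        self.index: Dict[str, int] = {}
        for token in tokens:
            self.index.setdefault(token, len(self.index))

    def get(self, token: str, default_token: str) -> int:
        # Fall back to the index of the default token (e.g. "<unk>").
        return self.index.get(token, self.index[default_token])

With these, the snippet above runs end to end: the out-of-vocabulary character "1" falls back to the "<unk>" index.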

Notes:

  • Special tokens like "SEP", "CLS", "EOS", "SOS", "UNK", "PAD" can be easily added without overriding the current implementation.
  • This should be easy to understand. It's just a tokenizer and an index.
  • This idea can also be used without the tokenizer for labels (see the sketch after this list).
  • The tokenizer can either be learned from data, or rule-based. It doesn't matter.
  • It still works with an "unknown token".
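
For example, assuming the TokenEnum sketch above, labels could be indexed directly without a tokenizer:

labels = ["positive", "negative", "neutral"]
label_enum = TokenEnum(labels)
batch = torch.tensor([label_enum.index[l] for l in ["negative", "positive"]])  # tensor([1, 0])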
@PetrochukM
Owner Author

PetrochukM commented Oct 22, 2020

Furthermore, it'd be helpful if the vocab could be generated on demand during training. This means we'd need to create a related embedding table that grows during training, as needed.

With such a system, we wouldn't need to initially load and tokenize the data.

To accomplish this, either the Embedding or the TokenEnum would also need to handle updating the optimizer and any other object that depends on the model parameters.
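
A minimal sketch of how that could work, assuming a hypothetical GrowableEmbedding wrapper that copies the old rows into a larger table and swaps the parameter into the optimizer's param groups:

import torch
import torch.nn as nn


class GrowableEmbedding(nn.Module):
    """Embedding table that can add rows for tokens discovered during training."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)

    def grow(self, new_vocab_size: int, optimizer: torch.optim.Optimizer):
        old_weight = self.embedding.weight
        new_embedding = nn.Embedding(new_vocab_size, old_weight.shape[1])
        with torch.no_grad():
            # Copy the existing rows; new rows keep their random initialization.
            new_embedding.weight[: old_weight.shape[0]] = old_weight
        self.embedding = new_embedding
        # Swap the parameter in the optimizer so the new rows get gradient updates.
        # (Any optimizer state attached to the old weight is dropped here.)
        for group in optimizer.param_groups:
            group["params"] = [p for p in group["params"] if p is not old_weight]
            group["params"].append(self.embedding.weight)

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        return self.embedding(indices)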
