Tokenization and vocabulary
TencentPretrain supports multiple tokenization strategies. The most commonly used strategy is BertTokenizer (which is also the default). There are two ways of using BertTokenizer: the first is to specify the vocabulary path through --vocab_path, in which case BERT's original tokenization strategy segments sentences according to the vocabulary; the second is to specify a sentencepiece model path through --spm_model_path, in which case the sentencepiece model is loaded and used to segment sentences. In other words, if the user specifies --spm_model_path, sentencepiece is used for tokenization; otherwise, the user must specify --vocab_path, and BERT's original tokenization strategy is used.
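The selection logic can be pictured with a short sketch. This is a hypothetical construction: the real preprocessing and training scripts build the argument namespace from the command line, and the exact constructor and import path of BertTokenizer are assumptions here, not documented usage.

```python
# Hypothetical sketch of the --spm_model_path / --vocab_path choice; the
# BertTokenizer import path and constructor are assumptions.
from argparse import Namespace

args = Namespace(vocab_path="models/google_zh_vocab.txt",  # path assumed
                 spm_model_path=None)

if args.spm_model_path:
    # sentencepiece loads the model and segments the sentence with it.
    import sentencepiece as spm
    sp = spm.SentencePieceProcessor()
    sp.Load(args.spm_model_path)
    tokens = sp.EncodeAsPieces("今天天气很好")
else:
    # BERT's original strategy segments the sentence according to the vocabulary.
    from tencentpretrain.utils.tokenizers import BertTokenizer  # import path assumed
    tokenizer = BertTokenizer(args)
    tokens = tokenizer.tokenize("今天天气很好")
print(tokens)
```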
In addition, the project provides CharTokenizer and SpaceTokenizer. CharTokenizer tokenizes the text character by character. If the text consists entirely of Chinese characters, CharTokenizer and BertTokenizer are equivalent, but CharTokenizer is simpler and faster than BertTokenizer. SpaceTokenizer splits the text on spaces, so one can preprocess the text in advance (e.g. perform word segmentation), separate the tokens with spaces, and then use SpaceTokenizer. For CharTokenizer and SpaceTokenizer, if the user specifies --spm_model_path, the vocabulary inside the sentencepiece model is used; otherwise, the user must provide a vocabulary through --vocab_path.
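The difference between these two simple strategies can be shown with plain Python, without any project code:

```python
text = "今天天气很好"
char_tokens = list(text)                   # CharTokenizer: one token per character
# ['今', '天', '天', '气', '很', '好']

pre_segmented = "今天 天气 很 好"           # word segmentation done in advance
space_tokens = pre_segmented.split(" ")    # SpaceTokenizer: split on spaces
# ['今天', '天气', '很', '好']
```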
To support the English RoBERTa and GPT-2 pre-trained models, the project includes BPETokenizer: --vocab_path specifies the vocabulary file and --merges_path specifies the merges file.
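A hedged sketch of constructing BPETokenizer directly is shown below; the file names, the argument namespace, and the import path are assumptions rather than documented usage.

```python
from argparse import Namespace
from tencentpretrain.utils.tokenizers import BPETokenizer  # import path assumed

args = Namespace(vocab_path="models/gpt2_vocab.json",    # hypothetical paths
                 merges_path="models/gpt2_merges.txt",
                 spm_model_path=None)
tokenizer = BPETokenizer(args)
print(tokenizer.tokenize("TencentPretrain supports BPE tokenization."))
```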
The project also supports XLMRobertaTokenizer (identical to the original implementation). XLMRobertaTokenizer uses a sentencepiece model whose path is given by --spm_model_path, and it adds special tokens to the vocabulary. Since XLMRobertaTokenizer uses different special tokens from the default case, they should be changed with the method described in the next paragraph.
The pre-processing, pre-training, fine-tuning, and inference stages all need to specify the vocabulary (provided through --vocab_path or --spm_model_path) and the tokenizer (provided through --tokenizer). If users use their own vocabularies, then in the default case the padding, start, separator, and mask tokens are "[PAD]", "[CLS]", "[SEP]", and "[MASK]" (the project reads special token information from models/special_tokens_map.json by default). If the user's vocabulary has different special tokens, the user should provide a special tokens mapping file, e.g. models/xlmroberta_special_tokens_map.json, and then change the path of the special tokens mapping file in tencentpretrain/utils/constants.py.
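As an illustration, a custom mapping file for XLM-RoBERTa-style special tokens could look like the sketch below. The key names mirror the usual special_tokens_map.json layout and are assumptions, not guaranteed to match the project's file exactly.

```python
import json

special_tokens = {          # key names assumed to follow special_tokens_map.json
    "pad_token": "<pad>",
    "cls_token": "<s>",
    "sep_token": "</s>",
    "mask_token": "<mask>",
    "unk_token": "<unk>",
}
with open("models/my_special_tokens_map.json", "w", encoding="utf-8") as f:
    json.dump(special_tokens, f, ensure_ascii=False, indent=2)
# Then point the special tokens mapping path in
# tencentpretrain/utils/constants.py at models/my_special_tokens_map.json.
```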
TencentPretrain also supports vision-related tokenizers. ImageTokenizer is designed for image quantization models such as VQGAN and VQVAE, and its vocabulary size is determined by the model's codebook; the code for image quantization is in image_tokenizer.py. TextImageTokenizer supports text-image pre-training models such as DALL-E, ERNIE-ViLG, and Talk2Face. It maps text and image tokens into a shared vocabulary, using BertTokenizer for the text part and ImageTokenizer for the image part.
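The idea of a shared vocabulary can be illustrated with some made-up arithmetic. The sizes below are invented, and the offset scheme is one common way to merge the two token spaces; it is not necessarily the exact layout used by TextImageTokenizer.

```python
text_vocab_size = 21128        # e.g. the size of a Chinese BERT vocabulary (assumed)
image_codebook_size = 8192     # e.g. the size of a VQGAN codebook (assumed)

def image_code_to_shared_id(code):
    """Place image codes after the text ids so both live in one id space."""
    return text_vocab_size + code

shared_vocab_size = text_vocab_size + image_codebook_size
print(shared_vocab_size)            # 29320
print(image_code_to_shared_id(0))   # 21128
```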