Handling token redefinitions during model conversion #7144
-
Regarding added tokens, I think your pt.3 is the correct thing to do. This should also fix the issue that we have with DeepSeek models: #7036 (comment)

I'm not sure about redefined tokens - I will need some time to understand the issue better. But off the top of my head, if this is just a single instance of a model that is redefining tokens for some reason, isn't it better to drop support for that model instead of special-casing the inference libraries? For added tokens it seems too late, and there is probably some argument for supporting those, but maybe for redefined tokens it's not too late?
-
@ggerganov I found yet another model that redefines some tokens - InternLM2ForCausalLM. It's already supported in llama.cpp, but it looks like the problem with redefined tokens in the chat fine-tune was simply ignored: the only handling is that the model conversion script looks for the id of the EOS token to know when to stop generation, while people used the [UNUSED_TOKEN_X] tokens from tokenizer.model instead of the correctly redefined tokens - at least that's what I see in the screenshots in the bug reports. @SolenoidWGT, can you confirm this?

I checked tokenizer.model and tokenizer_config.json from internlm2-chat-7b and this is the same situation as in ArcticForCausalLM. In tokenizer.model you have:
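(the ids and exact placeholder numbers shown here are illustrative)
```
92542  [UNUSED_TOKEN_X]
92543  [UNUSED_TOKEN_Y]
```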
But in tokenizer_config.json there is an added_tokens_decoder field with:
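(again, simplified to the relevant fields; ids illustrative as above)
```json
"added_tokens_decoder": {
  "92542": { "content": "<|im_end|>", ... },
  "92543": { "content": "<|im_start|>", ... }
}
```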
I think this situation may be more common than you think - in the vocabularies of base models there are ranges of unused tokens left over that are later repurposed as the special tokens used in instruct or chat fine-tunes. The only difference in ArcticForCausalLM is that they redefined "normal" tokens instead of unused ones. To sum up, not supporting this at all seems like a bad idea to me.
-
Interesting that the GGUF spec mentions an added_tokens field; any reason it hasn't been used yet?
-
I would like to start a discussion about the proper way of handling token redefinitions when doing model conversion to GGUF.
Problem introduction
In the snowflake-arctic-instruct model, two tokens from the sentencepiece tokenizer model were reused as the special BOS and EOS tokens, namely tokens 31998 (弘) and 31999 (给). They used the added_tokens_decoder field of the tokenizer_config.json file to do this and left the original sentencepiece tokenizer.model file from the snowflake-arctic base model unmodified.
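The relevant part of tokenizer_config.json looks roughly like this (simplified to the fields that matter here):
```json
"added_tokens_decoder": {
  "31998": { "content": "<|im_start|>", ... },
  "31999": { "content": "<|im_end|>", ... }
}
```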
I was wondering whether token redefinition is even a correct use of the added_tokens_decoder field, but apparently it is, as confirmed by Arthur Zucker in this comment: huggingface/transformers#27974 (comment). He wrote that both the added_tokens_decoder field from tokenizer_config.json and added_tokens from tokenizer.json shall be used for this purpose.
Problem implications
Since _set_vocab_sentencepiece() from convert-hf-to-gguf.py reads the vocabulary from tokenizer.model, it stores "弘" as token number 31998 and "给" as token number 31999 instead of "<|im_start|>" and "<|im_end|>", respectively.
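This is easy to see with the sentencepiece Python API (a minimal check; the file path is assumed to point at the unmodified tokenizer.model shipped with the model):
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")  # the unmodified file from the snowflake-arctic base model

# The sentencepiece model knows nothing about the redefinitions in
# tokenizer_config.json, so it still reports the original pieces:
print(sp.IdToPiece(31998))  # 弘   (the instruct model expects <|im_start|> here)
print(sp.IdToPiece(31999))  # 给   (the instruct model expects <|im_end|> here)
```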
Possible solutions
I had several ideas evolving over time about how to handle it.
1. Read the vocabulary from a tokenizer created with the HuggingFace transformers library.
This was my first solution: I simply used _set_vocab_llama_hf() for this. It worked fine after some minor tweaks (ArcticTokenizer is a "slow" tokenizer, and _set_vocab_llama_hf() contains
assert self.tokenizer.is_fast
), but this method doesn't preserve the token types and scores, as @cebtenzzre pointed out when reviewing my code. He also noted that _set_vocab_llama_hf() is not intended for "slow" tokenizers, so I gave up on the idea.
2. Read the vocabulary from tokenizer.model with the sentencepiece library and modify it based on the added_tokens_decoder field from tokenizer_config.json.
This is my current solution. I guess in the future _set_vocab_sentencepiece() could be modified to handle the added_tokens_decoder field from tokenizer_config.json or added_tokens from tokenizer.json (do we need both?) as a general solution, in addition to the existing legacy added_tokens.json support.
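This is roughly the shape of the override step; the function and variable names are illustrative rather than the exact convert-hf-to-gguf.py code, and the tokens/scores/toktypes lists are assumed to have already been filled from tokenizer.model:
```python
import json
from pathlib import Path


def apply_added_tokens_decoder(dir_model: Path, tokens: list[bytes],
                               scores: list[float], toktypes: list[int]) -> None:
    """Overwrite vocab entries read from tokenizer.model with the redefinitions
    found in tokenizer_config.json (illustrative sketch, not the actual script)."""
    config_path = dir_model / "tokenizer_config.json"
    if not config_path.is_file():
        return
    with open(config_path, encoding="utf-8") as f:
        added = json.load(f).get("added_tokens_decoder", {})
    for token_id_str, token_data in added.items():
        token_id = int(token_id_str)
        if token_id < len(tokens):
            # Redefined token: replace the piece that came from tokenizer.model
            tokens[token_id] = token_data["content"].encode("utf-8")
            scores[token_id] = -1000.0  # score convention for added tokens (assumption)
            toktypes[token_id] = 4      # gguf.TokenType.USER_DEFINED
        # Ids beyond the sentencepiece vocab would be appended as genuinely
        # new added tokens instead (not shown here).
```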
3. Handle added/redefined tokens separately.
The transformers library handles tokens from added_tokens_decoder separately - it first chops up the text with a trie into pieces that are either tokens from added_tokens_decoder or other text fragments, and only the remaining text fragments are then passed to the underlying tokenizer. To do the same in llama.cpp I guess we would have to store the added token definitions separately in GGUF and process them independently. I noticed that there is a tokenizer.ggml.added_tokens field in the GGUF file format specification. Any idea what the intended use of this field is? But if we want to handle not only added but also redefined tokens, then I guess some additional fields would be needed (at least the token ids).
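Something along these lines, with the trie replaced by a regex just to illustrate the idea (all names here are made up for the example):
```python
import re
from typing import Callable


def tokenize_with_added_tokens(text: str, added_tokens: dict[str, int],
                               base_tokenize: Callable[[str], list[int]]) -> list[int]:
    """Split the text on added/redefined tokens first, then run the base
    (e.g. sentencepiece) tokenizer only on the remaining fragments."""
    # Longest strings first, so overlapping token strings match sensibly.
    pattern = "|".join(re.escape(t) for t in sorted(added_tokens, key=len, reverse=True))
    ids: list[int] = []
    for piece in re.split(f"({pattern})", text):
        if not piece:
            continue
        if piece in added_tokens:
            ids.append(added_tokens[piece])   # added/redefined token: use its id directly
        else:
            ids.extend(base_tokenize(piece))  # ordinary text: defer to the base tokenizer
    return ids
```
With added_tokens = {"<|im_start|>": 31998, "<|im_end|>": 31999} the special token strings never reach the sentencepiece model at all, which is why the redefinition works on the transformers side.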
I also noticed that the way transformers handles added_tokens_decoder leads to weird behavior. Since the ArcticTokenizer class is based on the LlamaTokenizer, which uses sentencepiece internally, both "<|im_start|>" and "弘" are tokenized to id 31998 (the first as a token from added_tokens_decoder, the second as a token from the sentencepiece tokenizer model), and both "<|im_end|>" and "给" are tokenized to id 31999.
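For example (assuming the Hugging Face repo id below and that the custom ArcticTokenizer loads with trust_remote_code):
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-instruct",
                                    trust_remote_code=True)

# Two different strings map to the same id:
print(tok.convert_tokens_to_ids("<|im_start|>"))  # 31998 (from added_tokens_decoder)
print(tok.convert_tokens_to_ids("弘"))             # 31998 (from the sentencepiece model)
print(tok.convert_tokens_to_ids("<|im_end|>"))    # 31999 (from added_tokens_decoder)
print(tok.convert_tokens_to_ids("给"))             # 31999 (from the sentencepiece model)
```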
Epilogue
This was quite a trip down the tokenization rabbit hole for me. I'd be grateful for any ideas about how to handle this properly.