Qwen-72B-Chat conversion script does not treat <|im_start|> and <|im_end|> correctly. #4331
Comments
You should link the exact model you used and make sure it's what you have locally and not something else. Make sure you have ALL files from the HF dump in your local directory and that nothing is cut or missing. What you describe sounds like an error in the tokenizer import: either a wrong tokenizer is used or the special tokens were not read in. Just as you noticed, if that tokenization is not working then the finetune is broken, so that really has to be one token.
@cmp-nct I used this model: https://huggingface.co/Qwen/Qwen-72B-Chat (I edited to include that in the original post). I used this command to convert it into a `.gguf` file.

The model works fine. I just think it may not be working optimally because the tokens are not treated specially.
Okay, that looks bad at first glance.
So I am guessing here; I do not know if we have special handling code for Qwen implemented, but it would be required.
Once that is done you'd have correct ids for the 3 special tokens; however, there will be more issues. If you decide to continue and hack those 3 tokens into the file, make sure you run the Python demo implementation, then add a print to the tokenizer output (that's what is fed into the generation, one of the main lines).
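For instance, a minimal sketch of that kind of check, assuming the reference tokenizer loads through transformers with `trust_remote_code` (the exact demo script and expected ids may differ):

```python
# Minimal sketch: print what the reference Qwen tokenizer feeds into generation.
# Assumption: the Hugging Face tokenizer for Qwen loads with trust_remote_code=True.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)

prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
ids = tok.encode(prompt)
print(ids)
# If <|im_start|> and <|im_end|> each appear as a single id here, but the converted
# .gguf splits them into '<', '|', 'im', ... pieces, the conversion is at fault.
```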
Also, for the record, I am on commit
https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee
Try that; it might convert their tokenizer format to HF, if that was not already done. So that script is worth a check. If that doesn't help, try hacking the 3 special tokens into the file with a text editor, then convert the model again and it might be good.
Regarding the
Funny, that's pretty much exactly what I did in the aforementioned pull-request.

I can reproduce the original problem (described by @Noeda) when converting Qwen-1.8B-Chat with `b1941`:

```
$ # Convert the model (well, only the vocab, since that's the only thing used in the following test)
$ python3 llama.cpp/convert-hf-to-gguf.py --vocab-only --outfile qwen-1.8b-chat-b1941-vocab.gguf Qwen-1_8B-Chat/
...
$ # Test the tokenizer
$ ./llama.cpp/result/bin/tokenize qwen-1.8b-chat-b1941-vocab.gguf "<|im_start|>system"
...
llm_load_print_meta: BOS token = 151643 '[PAD151643]'
llm_load_print_meta: EOS token = 151643 '[PAD151643]'
llm_load_print_meta: UNK token = 151643 '[PAD151643]'
...
27 -> '<'
91 -> '|'
318 -> 'im'
4906 -> '_start'
91 -> '|'
29 -> '>'
8948 -> 'system'
```

When I instead convert it with the more recent d6bd4d4 (tag `b1942`):

```
$ # Convert the model's vocab
$ python3 llama.cpp/convert-hf-to-gguf.py --vocab-only --outfile qwen-1.8b-chat-b1942-vocab.gguf Qwen-1_8B-Chat/
...
$ # Test the tokenizer
$ ./llama.cpp/result/bin/tokenize qwen-1.8b-chat-b1942-vocab.gguf "<|im_start|>system"
...
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: UNK token = 151643 '<|endoftext|>'
...
151644 -> ''
8948 -> 'system'
```

So, `<|im_start|>` is now tokenized as the single token 151644, and the BOS/EOS/UNK token resolves to `<|endoftext|>` instead of a padding placeholder. In both cases, I used the same original model files; only the version used for conversion differs.

I don't know to what extent this affects the existing GGUF-converted Qwen models on HuggingFace, but I think most of them will need to be reconverted with a recent version of the conversion script.

If anyone reading this is unsure whether their Qwen GGUF model(s) need re-conversion, run the `tokenize` tool on them and check whether `<|im_start|>` comes out as a single token.
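For reference, a compact version of that check could look like the sketch below (the model path is a placeholder; the token ids are taken from the outputs above):

```
# Quick re-conversion check: tokenize <|im_start|> against your existing GGUF.
./tokenize my-qwen-model.gguf "<|im_start|>"
# Broken conversion: several ids, e.g. 27 '<', 91 '|', 318 'im', 4906 '_start', 91 '|', 29 '>'
# Fixed conversion:  a single id, 151644
```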
Hey, thanks for the fix @compilade. I originally opened this issue not so much because I had a big interest in getting it fixed for myself, but because I thought the community might want to be aware that Qwen models seem to have issues with llama.cpp, and to have a placeholder to discuss it (I guess that worked out 😄). I still seem to have the original Qwen-72B files around.

Small edit to clarify: specifically I mean Qwen-72B-Chat for all of that. I can't remember off the top of my head if the base Qwen-72B also had these tokens marked special (but I would expect yes).
My llama.cpp commit:

I can confirm that the fixes @compilade mentions are doing their job. I replicated the same steps as in my original post on this issue and created a new `.gguf`. Also did a quick and dirty test that the model itself is not broken.
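For anyone wanting to run a similar quick test, here is a sketch of one way to do it (model path, prompt, and flags are placeholders and may differ between llama.cpp builds):

```
# Rough smoke test: feed a ChatML-style prompt to the main example and eyeball the reply.
# -e makes main interpret the \n escape sequences in the prompt string.
./main -m ./qwen-72b-chat-f16.gguf -e -n 64 \
  -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nSay hello.<|im_end|>\n<|im_start|>assistant\n"
```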
(The command line tool for generating text doesn't seem to print out the special tokens.)

Thank you for the fixes!
You did the right thing, apparently, since it was a real problem :)
Just a quick note: you can use the
It seems intended. The
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Description
(This is specifically for the latest 72B models. I have never tried the smaller ones).
I'm using this model: https://huggingface.co/Qwen/Qwen-72B-Chat
Commit: `33e171d1e9fc4903f9314b490d77fb8d58331b63`
I think the current `convert-hf-to-gguf.py` does not produce a `.gguf` file that treats the two tokens `<|im_start|>` and `<|im_end|>` correctly.

The prompt I used is "<|im_start|>system" for the examples below.
Following the steps in #4281 to produce some `.gguf` files (I personally used the Q6_K on a Mac Studio), I tried the `tokenize` tool on them and compared the output with a Yi model using the exact same prompt: the Qwen conversion splits `<|im_start|>` into several ordinary tokens instead of treating it as one special token.
I saw the Qwen model code (https://huggingface.co/Qwen/Qwen-72B/blob/main/tokenization_qwen.py#L37) and I think these are intended to be single tokens. But the current script does not handle it properly.
Steps to Reproduce
- Use the `convert-hf-to-gguf.py` script to convert the model into a `.gguf` file. (This is the exact command I found on my Mac Studio: `python3 convert-hf-to-gguf.py --outfile /Volumes/T9/qwen_72b_chat_v3_f16.gguf --outtype f16 ~/text-generation-webui/models/Qwen_Qwen-72B-Chat`)
- Run `tokenize` on the resulting file to see what tokens are interpreted (an example invocation is sketched at the end of this post).

If I'm honest, I'm not sure if this would be a bug for the `llama.cpp` repository or something the Qwen team might want to fix in their repo. But I'm submitting it here for awareness.

Also, the model seems to work fine despite this. But maybe it would work better if they were interpreted correctly? No idea.
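For illustration, the `tokenize` step mentioned above could look roughly like this, using the output file from the conversion command (the binary's location depends on how llama.cpp was built):

```
# Tokenize the test prompt against the converted model.
./tokenize /Volumes/T9/qwen_72b_chat_v3_f16.gguf "<|im_start|>system"
# A correct conversion should show <|im_start|> as one token rather than
# the pieces '<', '|', 'im', '_start', '|', '>'.
```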