remove bug in convert.py permute function #3364
Conversation
@TheBloke did you ever have issues with GQA 70B and hf models? |
No, I haven't had issues with those for ages - not since the issues were fixed shortly after GGUF was released. I've done loads in the last few weeks, and all work fine. @jzhang38 it's not true to say that all 70B models come from Meta PTH weights. 99% of 70B conversions now are done from HF weights in pytorch_model.bin or model.safetensors format, because they're fine-tuned models. Do you want me to test this updated script with a 70B HF model? I have one to convert in a minute, actually |
yea please do, it's kind of hard to "just" convert one of those for me 😅 |
@jzhang38 I can indeed confirm that this fixes the converted tinyllama models 👍 |
@TheBloke Yeah, the actual reason would be that Llama 2 70B uses 64 heads and 8 key-value heads, so dividing the head count by the key-value head count (64 / 8 = 8) gives the same result as the key-value head count itself (8). So the bug is not triggered. |
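As a rough illustration of that arithmetic (a minimal sketch, not the actual convert.py code; the smaller model's head counts below are assumed for the example):

```python
# Buggy behaviour (as described above): divide the head count by the KV head count.
def buggy_heads(n_head: int, n_head_kv: int) -> int:
    return n_head // n_head_kv

# Fixed behaviour: use the KV head count directly.
def fixed_heads(n_head: int, n_head_kv: int) -> int:
    return n_head_kv

# Llama 2 70B: 64 heads, 8 KV heads -> both paths give 8, so the bug is masked.
assert buggy_heads(64, 8) == fixed_heads(64, 8) == 8

# A smaller GQA model with, say, 32 heads and 4 KV heads -> 8 vs 4, so the
# permute would reshape the weights with the wrong head count and break the output.
assert buggy_heads(32, 4) == 8
assert fixed_heads(32, 4) == 4
```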
ran perplexity on the first 300 chunks (batch 512) of wikitext on the f32 and q8_0 (GPU) models: it is safe to say that it works with this PR |
are there other GQA/MQA models we can test? |
The 70B Llama 2 model worked fine BTW |
Mistral 7B (#3362) seems to be GQA, but I don't know if there is an HF conversion already. |
I've noticed https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1/discussions/4#651432a05d12b3abdd5d16bd |
this one has a different kind of context management ("sliding window context"), so the trained context of 32768 is going to result in a wrong user experience. The window size should be 4096. |
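For reference, a minimal sketch of what a sliding-window attention mask looks like (illustrative only, not Mistral's or llama.cpp's implementation): each token can attend only to the previous `window` tokens, so positions further back are masked out even though the trained context length is much larger.

```python
import numpy as np

def sliding_window_mask(n_ctx: int, window: int) -> np.ndarray:
    """Causal mask where token i may attend only to tokens in (i - window, i]."""
    i = np.arange(n_ctx)[:, None]
    j = np.arange(n_ctx)[None, :]
    return (j <= i) & (j > i - window)

# Toy example: with a window of 4, token 7 cannot see tokens 0-3 directly,
# analogous to a 4096 window inside a 32768 trained context.
print(sliding_window_mask(n_ctx=8, window=4).astype(int))
```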
@TheBloke how did you convert Mistral-7B-v0.1 ? |
I applied the GQA fix from this PR, and then I deleted added_tokens.json. Then I just ran convert.py as normal. Same with Mistral-7B-Instruct-v0.1, except I didn't need to delete added_tokens.json there, so I guess they realised it wasn't meant to be there. |
Oh, that easy... can you add a note that llama.cpp does not currently perform sliding window context handling, and that the max context should be set to 4096? |
OK sure. Someone on the other thread said it seemed to work at 8192? But I'll say it's not yet supported |
this might be just like llama2 where, contrary to llama1, it does not immediately deteriorate when going past the trained size. |
from #3362
so you used this pr? |
Yes, changing `//=` to `=`. Before I applied that, the GGUFs produced gibberish after a few words |
…example

* 'master' of github.com:ggerganov/llama.cpp:
  convert : remove bug in convert.py permute function (ggerganov#3364)
  make-ggml.py : compatibility with more models and GGUF (ggerganov#3290)
  gguf : fix a few general keys (ggerganov#3341)
  metal : reusing llama.cpp logging (ggerganov#3152)
  build : add ACCELERATE_NEW_LAPACK to fix warning on macOS Sonoma (ggerganov#3342)
  readme : add some recent perplexity and bpw measurements to READMES, link for k-quants (ggerganov#3340)
  cmake : fix build-info.h on MSVC (ggerganov#3309)
  docs: Fix typo CLBlast_DIR var. (ggerganov#3330)
  nix : add cuda, use a symlinked toolkit for cmake (ggerganov#3202)
This bug will only be triggered by HuggingFace GQA models. Nobody realized it because:
- we never used convert.py to convert the HF Llama 2 70B model;
- Llama 2 70B has 64 heads and 8 num_key_value_heads, and 64 / 8 = 8, so the buggy computation coincidentally matches the correct one.

This bug has caused models from the [TinyLlama](https://github.com/jzhang38/TinyLlama) project to fail to convert correctly. (TinyLlama is a 1.1B model that uses GQA.)
jzhang38/TinyLlama#24
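For anyone following along, here is a hedged sketch of the kind of permute fix this PR describes (reconstructed from the discussion above, not copied from the actual diff; the reshape details and signature are assumptions): the head count used for the reshape should be the key-value head count itself, not the query head count divided by it.

```python
import numpy as np

def permute(weights: np.ndarray, n_head: int, n_head_kv: int) -> np.ndarray:
    # Sketch of a GQA-aware permute in the spirit of convert.py (exact code assumed).
    if n_head_kv is not None and n_head != n_head_kv:
        # buggy: n_head //= n_head_kv   (coincidentally right for Llama 2 70B: 64 // 8 == 8)
        n_head = n_head_kv              # fixed: use the KV head count directly
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))

# Toy K-projection for a GQA model: 4 KV heads, head_dim 8, hidden size 16.
w_k = np.arange(32 * 16, dtype=np.float32).reshape(32, 16)
print(permute(w_k, n_head=16, n_head_kv=4).shape)  # (32, 16): same shape, rows reordered per head
```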