Add support for GGUF Phi-3 #31844
Conversation
cc @SunMarc
Hi @SunMarc @amyeroberts - while testing my changes I ran into a failure. I checked Qwen2's test and it has the same failure. I am attaching the error below. I am running this on a Colab T4 GPU; any ideas why Qwen2's GGUF loading is also failing for me? Command to repro the failure:
Hi @a8nova! The latest version of ggml that was released a few days ago is not compatible with the integration. We still need to fix this! #31725 (comment). I will try to find some time to fix this but feel free to have a look! In the meantime, you can also install the previous version of ggml and continue the PR from there!
Got it, I will try with the previous ggml version, thank you @SunMarc!
Hi @SunMarc - I am getting gibberish output for my test. I think it has to do with phi3 having a slightly different attention class where q, k, and v are one variable. I am seeing the output below while loading GGUF weights, and since the model output differs across runs it looks like some weights are not being correctly initialized. Right now I am mapping
For the tokenizer, I am using Llama's tokenizer since phi3 uses the Llama tokenizer (that is what I read in the research paper). I have added the special tokens for phi3, and when I dump the tokenizer from GGUF, it looks similar to the tokenizer loaded by
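(Illustrative aside: a minimal sketch of the kind of GGUF-to-transformers tensor-name mapping involved here. The names on both sides are assumptions based on llama.cpp's Phi-3 layout and the transformers Phi3 module names, not the mapping actually merged in this PR.)

```python
# Hedged sketch only: a possible GGUF -> transformers name mapping for Phi-3.
# Note: insertion order matters because some keys are substrings of others
# (e.g. "attn_output" must be handled before "output").
GGUF_TO_HF_PHI3 = {
    "token_embd": "model.embed_tokens",
    "blk": "model.layers",
    "attn_qkv": "self_attn.qkv_proj",      # Phi-3 keeps q, k and v fused in one projection
    "attn_output": "self_attn.o_proj",
    "ffn_up": "mlp.gate_up_proj",          # gate and up projections are fused as well
    "ffn_down": "mlp.down_proj",
    "attn_norm": "input_layernorm",
    "ffn_norm": "post_attention_layernorm",
    "output_norm": "model.norm",
    "output": "lm_head",
}

def rename_gguf_tensor(name: str) -> str:
    """e.g. 'blk.0.attn_qkv.weight' -> 'model.layers.0.self_attn.qkv_proj.weight'."""
    for gguf_key, hf_key in GGUF_TO_HF_PHI3.items():
        name = name.replace(gguf_key, hf_key)
    return name
```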
Hi! I’m interested in working on this issue. Is anyone currently working on it? If not, I’d love to take it on as my first one.
Hi @kibru9399 - I am actively working on this, it is actually almost finalized.
No worries, thanks for the update!
Hello @SunMarc @amyeroberts - This PR is ready for review! A few remaining TODOs on my end:
Thanks!
Nice, thanks for adding @a8nova! I left a comment.
The generated output from the gguf model microsoft/Phi-3-mini-4k-instruct-gguf is different from the non-gguf model microsoft/Phi-3-mini-4k-instruct. This is also true for Qwen2. This can be due to many different reasons, but I want to make sure it is not due to a bug in the conversion code in this PR.
Since the model is quantized, it is normal that we don't get the same output. I would advise you to check that we get approximately the same perplexity in transformers compared to llama.cpp!
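(A minimal sketch of such a perplexity check on the transformers side, assuming the GGUF checkpoint loads via the `gguf_file` argument; the repo and file names below are placeholders.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo/file names for illustration.
model_id = "microsoft/Phi-3-mini-4k-instruct-gguf"
gguf_file = "Phi-3-mini-4k-instruct-q4.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)

text = "..."  # same evaluation text that llama.cpp's perplexity tool is run on
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # passing labels makes the model return the mean cross-entropy over the sequence
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"perplexity: {torch.exp(loss).item():.2f}")
```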
There is a warning "Merges were not in checkpoint, building merges on the fly." for microsoft/Phi-3-mini-4k-instruct-gguf; do you know if I need to look into this warning?
Since Phi3 uses a BPE tokenizer, it needs to have a merges attribute. I'm not sure why we are not able to get it from the ggml file. For example, see that we indeed have this in the model hosted on HF.
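(For context, a simplified sketch of what "building merges on the fly" could look like, reconstructing BPE merges from a rank-ordered vocab; this is an assumed illustration, not the exact transformers implementation.)

```python
def build_merges_from_vocab(vocab):
    """Reconstruct BPE merges from a rank-ordered vocab (token -> rank).

    For every token, consider each split into two pieces that are themselves
    in the vocab; ordering the candidates by the ranks of their pieces
    approximates the original merge order (frequent pairs first).
    """
    candidates = []
    for token in vocab:
        for i in range(1, len(token)):
            left, right = token[:i], token[i:]
            if left in vocab and right in vocab:
                candidates.append((vocab[left], vocab[right], left, right))
    candidates.sort()
    return [(left, right) for _, _, left, right in candidates]

# merges = build_merges_from_vocab({tok: i for i, tok in enumerate(tokens)})
```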
```python
class GGUFPhi3Converter(LlamaConverter):
    def __init__(self, tokenizer_dict):
        self.original_tokenizer = GGUFTokenizerSkeleton(tokenizer_dict)
        self.additional_kwargs = {}

    def converted(self) -> Tokenizer:
        vocab = {word: i for i, word in enumerate(self.original_tokenizer.tokens)}
        merges = self.original_tokenizer.merges
        tokenizer = Tokenizer(BPE(vocab, merges, unk_token="<unk>", fuse_unk=True, byte_fallback=True))
        tokenizer.decoder = decoders.Sequence(
            [
                decoders.ByteFallback(),
                decoders.Fuse(),
                decoders.Replace("▁", " "),
            ]
        )
        # add the special tokens from phi3 tokenizer config
        tokenizer.add_special_tokens(
            [
                AddedToken("</s>", rstrip=True, lstrip=False, normalized=False, special=True),
                AddedToken("<|endoftext|>", normalized=False, special=True),
                AddedToken("<|assistant|>", rstrip=True, normalized=False, special=True),
                AddedToken("<|placeholder1|>", rstrip=True, normalized=False, special=True),
                AddedToken("<|placeholder2|>", rstrip=True, normalized=False, special=True),
                AddedToken("<|placeholder3|>", rstrip=True, normalized=False, special=True),
                AddedToken("<|placeholder4|>", rstrip=True, normalized=False, special=True),
                AddedToken("<|system|>", rstrip=True, normalized=False, special=True),
                AddedToken("<|end|>", rstrip=True, normalized=False, special=True),
```
Can you follow the GGUFLlamaConverter structure? (e.g. adding vocab, tokenizer, merges methods...)
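(A rough sketch of what that structure could look like, mirroring GGUFLlamaConverter's split into vocab/merges/tokenizer/decoder methods; the exact signatures are assumptions for illustration, not the final code merged in this PR.)

```python
from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
# Assumes GGUFTokenizerSkeleton and LlamaConverter from the surrounding transformers modules.

class GGUFPhi3Converter(LlamaConverter):
    def __init__(self, tokenizer_dict):
        self.proto = GGUFTokenizerSkeleton(tokenizer_dict)
        self.original_tokenizer = self.proto
        self.additional_kwargs = {}

    def vocab(self, proto):
        return {word: i for i, word in enumerate(proto.tokens)}

    def merges(self, proto):
        return proto.merges

    def tokenizer(self, proto):
        return Tokenizer(
            BPE(self.vocab(proto), self.merges(proto), unk_token="<unk>", fuse_unk=True, byte_fallback=True)
        )

    def decoder(self, replacement, add_prefix_space):
        return decoders.Sequence(
            [decoders.ByteFallback(), decoders.Fuse(), decoders.Replace(replacement, " ")]
        )

    def converted(self) -> Tokenizer:
        tokenizer = self.tokenizer(self.proto)
        tokenizer.decoder = self.decoder("▁", True)
        # special tokens from the phi3 tokenizer config are added here, as in the diff above
        return tokenizer
```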
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
LMK when you've finished your TODOs, so that I can ask for a final review from a core maintainer!
Hi @SunMarc - For trying phi3.5 weights: there seems to be something wrong with the phi3.5 template because the model gives weird output. For example, with input:
And with input "Hello", the model gives (HTML) output:
My understanding was that phi3.5 is identical to phi3 in its architecture, but there seem to be some differences, and I don't see them documented anywhere. I won't worry about this issue in the current scope. For the missing-merges issue: I don't see merges in the ggml file even though they exist in the original checkpoints. I see the merges are built if they don't exist in the ggml checkpoint, which slows things down. Is the missing merges a bug in the safetensors -> ggml conversion? How do you suggest we move forward? (I also tried converting a few phi3 variants via gguf-my-repo, but they are still missing the merges in the converted ggml files.) If the merges missing from the checkpoint is not a blocker, then I think this PR is ready for a final review; if it is a blocker, we can get to the bottom of it first. Thanks!
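(One way to sanity-check the template side, sketched with `tokenizer.apply_chat_template`; the model id is a placeholder and this assumes the checkpoint ships a chat template in its tokenizer config.)

```python
from transformers import AutoTokenizer

# Placeholder id for illustration.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

messages = [{"role": "user", "content": "Hello"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # inspect whether the rendered prompt matches the expected phi3 format
```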
Could you quickly check if we have the issue on a llama gguf model? If not, I think that we should probably fix this issue with a PR in the llama.cpp repo. We would have to fix this conversion script: https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py
Hi @SunMarc! When I try an already-converted gguf llama like TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF, which is also used inside test_ggml.py, I don't see the missing merges. But a llama gguf I just converted using gguf-my-repo, a8nova/TinyLlama-1.1B-Chat-v1.0-Q4_0-GGUF, is missing merges. So it looks like a recent change (in the llama.cpp repo?) broke the conversion for all models...
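(One quick way to compare the two files, sketched under the assumption that the `gguf` Python package published from llama.cpp is installed; the local paths are placeholders.)

```python
from gguf import GGUFReader  # the `gguf` package published from llama.cpp

def has_merges(path: str) -> bool:
    """Return True if the GGUF file carries a tokenizer.ggml.merges field."""
    reader = GGUFReader(path)
    return "tokenizer.ggml.merges" in reader.fields

# Placeholder local paths for the two checkpoints being compared.
print(has_merges("tinyllama-q4_0-from-thebloke.gguf"))      # older conversion
print(has_merges("tinyllama-q4_0-from-gguf-my-repo.gguf"))  # fresh conversion
```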
Thanks for the investigation! We should definitely try to fix this, as it can be quite troublesome if we have to recreate the merges for all recent llama models.
I am able to check out llama.cpp and repro the missing-merges bug locally, and I also have a fix. The bug shows up for the llama-family models only. I have narrowed it down to the
It looks like we will only go inside
If this fix makes sense to you, I'll create a PR in the llama.cpp repo!
Update: I think the bug can also happen for any model class that calls
I would first open an issue on llama.cpp to see if anyone there has more insights around that + explain the potential fix! Thanks for investigating! I really appreciate that.
Done! ggerganov/llama.cpp#9309
Well, it will just lead to increased loading time.
Not really. Can you check that phi3.5 on transformers works correctly? If it does, you can try to gguf it and then load it in transformers.
Hi @SunMarc! I would like to finalize this PR if possible; are there any blockers, or can we get final reviews and get this merged in? Thanks! Regarding phi3.5, the non-gguf checkpoints have the same behavior; something is off about the template/prompt.
- I missed one place when resolving the conflict
- I also made a mistake with tests_ggml.py, which has now been fixed to reflect its master version.
No blockers @a8nova. I really appreciate your work! As for the merges issue, this is something we need to tackle separately from this PR.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Boom, awesome! Thanks for the contribution @a8nova
Thank you, @SunMarc! Before I file an issue for the phi3.5 bug, let me first confirm - I am suspicious of what I saw; maybe I did something wrong.
transformers/tests/quantization/ggml/test_ggml.py, lines 33 to 36 in 7f112ca
I guess it is simply to test the model faster. Maybe you can try on GPU to see if this fixes anything?
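(A minimal sketch of running the GGUF-loaded model on GPU for that check; the repo and file names are placeholders.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct-gguf"  # placeholder
gguf_file = "Phi-3-mini-4k-instruct-q4.gguf"        # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(
    model_id, gguf_file=gguf_file, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```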
* Update docs for GGUF supported models
* Add tensor mappings and define class GGUFPhi3Converter
* Fix tokenizer
* Working version
* Attempt to fix some CI failures
* Run ruff format
* Add vocab, merges, decoder methods like LlamaConverter
* Resolve conflicts since Qwen2Moe was added to gguf
  - I missed one place when resolving the conflict
  - I also made a mistake with tests_ggml.py, which has now been fixed to reflect its master version
What does this PR do?
Add support for GGUF Phi-3
Use #31175 as a guide for adding gguf support for phi3
Fixes #31826
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.