
Add gguf support for bloom #33473

Merged

Conversation

VladOS95-cyber (Contributor)

What does this PR do?

Add Bloom GGUF loading support


Who can review?

Regarding the task: @SunMarc @LysandreJik @ArthurZucker.

Review threads (resolved): src/transformers/convert_slow_tokenizer.py (3, outdated), src/transformers/integrations/ggml.py
@VladOS95-cyber force-pushed the add-GGUF-support-for-Bloom branch from 6f3e643 to c23788b on September 17, 2024
VladOS95-cyber (Contributor Author) commented Sep 17, 2024

Hi @SunMarc @LysandreJik @ArthurZucker! This PR is ready for review. One thing looks odd to me: after dequantization and loading the model, it generates a wrong sequence, not the one the normal pretrained model produces. Instead of tensor([[59414, 15, 473, 3370, 4026, 427, 5894, 861, 473, 912, 5636]]), it generates something like [[59414, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]]. I cannot find the root cause of this problem; I've already checked the mapping and so on several times, and it should be correct. It looks like the weights are not correct, but I am not sure...

SunMarc (Member) commented Sep 17, 2024

> After dequantization and loading the model, it generates a wrong sequence, not the one the normal pretrained model produces. [...] It looks like the weights are not correct, but I am not sure...

Since the model was quantized, it is normal that it doesn't behave the same as the original pretrained model: dequantization doesn't recover the precision of the original model. Could you check that it behaves similarly to the original model converted to GGUF in fp16 or full precision? That way we have a baseline to compare the model loaded from the GGUF file against.
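A minimal sketch of that side-by-side check, assuming a hypothetical fp16 GGUF export of bigscience/bloom-560m (the GGUF repo and file names below are illustrative, not the ones used in the PR):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical fp16 GGUF export of the original checkpoint -- adjust to your own files.
GGUF_REPO = "some-user/bloom-560m-gguf"
GGUF_FILE = "bloom-560m-fp16.gguf"

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
prompt = tokenizer("Hello, I", return_tensors="pt")

# Model reconstructed from the fp16 GGUF file.
gguf_model = AutoModelForCausalLM.from_pretrained(GGUF_REPO, gguf_file=GGUF_FILE, torch_dtype=torch.float16)
# Original fp16 checkpoint as the reference.
ref_model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", torch_dtype=torch.float16)

gguf_ids = gguf_model.generate(**prompt, max_new_tokens=10)
ref_ids = ref_model.generate(**prompt, max_new_tokens=10)

# In fp16 the conversion should be lossless, so the token ids should match exactly;
# a degenerate run of repeated tokens points at a weight-layout bug, not quantization noise.
print(gguf_ids)
print(ref_ids)
```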

@VladOS95-cyber force-pushed the add-GGUF-support-for-Bloom branch from c23788b to d49beb4 on September 18, 2024
VladOS95-cyber (Contributor Author) commented Sep 18, 2024

> After dequantization and loading the model, it generates a wrong sequence, not the one the normal pretrained model produces. [...]

> Since the model was quantized, it is normal that it doesn't behave the same as the original pretrained model: dequantization doesn't recover the precision of the original model. Could you check that it behaves similarly to the original model converted to GGUF in fp16 or full precision? [...]

Hi @SunMarc. Still odd behaviour, no matter which format I take. I debugged the model-state loading and parsing logic several times and compared it with a working model (like Llama). Everything looks good from this perspective, and it seems that the weights, params, config and so on are loaded correctly into the model. I'm just worried about the quantized data itself.

@VladOS95-cyber force-pushed the add-GGUF-support-for-Bloom branch from d49beb4 to 1a61d07 on September 23, 2024
VladOS95-cyber (Contributor Author) commented Sep 23, 2024

Hi @SunMarc! I finally found the issue and the reason for the Bloom model's strange behaviour. It comes from https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py (L972-998): that conversion script reshapes the fused qkv data, which is why we ended up with completely different weights. I implemented a reverse reshaping algorithm in modeling_gguf_pytorch_utils.py to put them back; after that, everything is OK and the model outputs the expected values. Please take a look at my changes; everything should be correct now.
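For context: llama.cpp's converter splits each fused query_key_value tensor from Bloom's per-head interleaved layout ([q0, k0, v0, q1, k1, v1, ...]) into three contiguous Q, K, V blocks. A minimal sketch of the reverse reshape, assuming the (3 * n_embed, n_embed) weight layout described above (the exact helper names in modeling_gguf_pytorch_utils.py may differ):

```python
import numpy as np

def reverse_reshape_qkv_weight(weights: np.ndarray, n_head: int, n_embed: int) -> np.ndarray:
    """Undo llama.cpp's Bloom qkv reshape on a (3 * n_embed, n_embed) weight.

    GGUF stores the fused tensor as three contiguous blocks [Q; K; V], while
    transformers' Bloom attention expects q/k/v rows interleaved per head.
    """
    head_dim = n_embed // n_head
    # Split the contiguous Q, K, V blocks back apart.
    q, k, v = np.array_split(weights, 3, axis=0)
    # Recover the per-head structure of each block.
    q = q.reshape(n_head, head_dim, n_embed)
    k = k.reshape(n_head, head_dim, n_embed)
    v = v.reshape(n_head, head_dim, n_embed)
    # Re-interleave so each head's q, k, v rows sit next to each other again.
    qkv = np.stack([q, k, v], axis=1)  # (n_head, 3, head_dim, n_embed)
    return qkv.reshape(n_head * 3 * head_dim, n_embed)
```

The fused bias follows the same pattern with one fewer dimension.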

SunMarc (Member) left a comment


Nice job finding the issue! Could you add a final test to check that the fp16 model from transformers and the fp16 model from GGUF share the same weights? That would be a nice check that the conversion was done correctly! Can you also check that the quantized model doesn't return gibberish?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

VladOS95-cyber (Contributor Author) commented Sep 23, 2024

> Nice job finding the issue! Could you add a final test to check that the fp16 model from transformers and the fp16 model from GGUF share the same weights? [...] Can you also check that the quantized model doesn't return gibberish?

Yes, sure, I added one more test to compare weights. But what about the gibberish check? Should I implement an additional test that checks a particular generation?
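A rough sketch of such a weight-comparison test, assuming the same hypothetical fp16 GGUF export as above (the PR's actual test lives in the GGUF test suite and may be structured differently):

```python
import torch
from transformers import AutoModelForCausalLM

def test_bloom_fp16_weights_match():
    gguf_model = AutoModelForCausalLM.from_pretrained(
        "some-user/bloom-560m-gguf",  # hypothetical GGUF repo
        gguf_file="bloom-560m-fp16.gguf",
        torch_dtype=torch.float16,
    )
    ref_model = AutoModelForCausalLM.from_pretrained(
        "bigscience/bloom-560m", torch_dtype=torch.float16
    )

    gguf_sd = gguf_model.state_dict()
    ref_sd = ref_model.state_dict()
    assert gguf_sd.keys() == ref_sd.keys()
    for name, ref_param in ref_sd.items():
        # The fused query_key_value tensors are the ones touched by the reverse
        # reshape, so a mismatch there flags a layout bug rather than precision loss.
        assert torch.equal(ref_param, gguf_sd[name]), f"mismatch in {name}"
```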

SunMarc (Member) commented Sep 24, 2024

> Yes, sure, I added one more test to compare weights. But what about the gibberish check? Should I implement an additional test that checks a particular generation?

Just add a simple generation check for a q4 quant, for example, just like what was done for other models!

VladOS95-cyber (Contributor Author)

> Should I implement an additional test that checks a particular generation?

> Just add a simple generation check for a q4 quant, for example, just like what was done for other models!

But this kind of test is already added; it is called test_bloom_f16. Or is it not enough?

SunMarc (Member) commented Sep 25, 2024

> But this kind of test is already added; it is called test_bloom_f16. Or is it not enough?

But we need one for a quantized model, since most users use those models. With fp16 alone, we can't be sure that the dequantize step was done correctly.

@VladOS95-cyber force-pushed the add-GGUF-support-for-Bloom branch from 173b862 to e69b87b on September 25, 2024
VladOS95-cyber (Contributor Author)

> But we need one for a quantized model, since most users use those models. With fp16 alone, we can't be sure that the dequantize step was done correctly.

Hello! OK, got it, I just added a q8_0 test as well. Would that be sufficient?
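A sketch of what such a q8_0 generation check can look like (repo, file name, and the pinned output below are illustrative; the real test hard-codes the exact expected string):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_bloom_q8_0():
    repo = "some-user/bloom-560m-gguf"   # hypothetical GGUF repo
    gguf_file = "bloom-560m-q8_0.gguf"   # hypothetical q8_0 export
    tokenizer = AutoTokenizer.from_pretrained(repo, gguf_file=gguf_file)
    model = AutoModelForCausalLM.from_pretrained(repo, gguf_file=gguf_file)

    inputs = tokenizer("Hello, I", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10)
    text = tokenizer.decode(out[0], skip_special_tokens=True)

    # Pin the continuation once observed for this quant; a layout bug would
    # instead show up as a degenerate repeated-token sequence.
    EXPECTED_TEXT = "Hello, I just want to say that I am"  # illustrative value
    assert text == EXPECTED_TEXT
```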

@VladOS95-cyber force-pushed the add-GGUF-support-for-Bloom branch from e69b87b to 894d1a1 on September 26, 2024
@SunMarc requested a review from LysandreJik on September 26, 2024
LysandreJik (Member) left a comment


Awesome, thank you @VladOS95-cyber!

@LysandreJik merged commit 9d200cf into huggingface:main on Sep 27, 2024
22 of 24 checks passed
BenjaminBossan pushed a commit to BenjaminBossan/transformers that referenced this pull request Sep 30, 2024
* add bloom arch support for gguf

* apply format

* small refactoring, bug fix in GGUF_TENSOR_MAPPING naming

* optimize bloom GGUF_TENSOR_MAPPING

* implement reverse reshaping for bloom gguf

* add qkv weights test

* add q_8 test for bloom
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024