add exllamav2 arg #26437
Conversation
The documentation is not available anymore as the PR was closed or merged.
Looks really great, thanks a lot!
What about changing `disable_exllamav2` to `disable_exllama_v2`?
Can you also add a few lines in the documentation with some expected speedups and how to use this argument? You can copy-paste the table you shared in the auto-gptq PR.
I've named it that way in auto-gptq, so let's keep it as is.
I added a link to the optimum benchmark and I will update the benchmark in a follow-up PR. It will be easier to maintain this way.
Thanks. A bit curious as to why we are naming this `disable_` when it's disabled by default for v1 and enabled for v2. A bit counterintuitive, no?
With the release of the exllamav2 kernel, you can get faster inference speed compared to the exllama kernels. You just need to
pass `disable_exllamav2` in [`GPTQConfig`]:
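For context, here is a minimal usage sketch of the documented flow. The checkpoint name is only an example, and the keyword follows the argument name settled on later in this review (`use_exllama_v2`), so treat it as an assumption rather than the final public API:

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Hypothetical example: enable the exllamav2 kernels when loading an
# already-quantized GPTQ checkpoint.
gptq_config = GPTQConfig(bits=4, use_exllama_v2=True)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # any 4-bit GPTQ checkpoint; used here only as an example
    quantization_config=gptq_config,
    device_map="auto",
)
```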
It doesn't make a lot of sense to have a `disable_exllamav2` instead of an `enable_` or `use_` prefix, which we have for most of the additional features like flash attention or `use_cache`. No?
It is named this way because the goal was to enable it by default (`False` value). Since v2 came out, I wanted to disable v1 and enable v2 by default, as v2 is just a faster version of v1. cc @fxmarty as you did the v1 integration.
Yes I agree the name is not very scalable. It could make sense to rename them.
Renamed!
@@ -367,8 +369,9 @@ def __init__(
        module_name_preceding_first_block: Optional[List[str]] = None,
        batch_size: int = 1,
        pad_token_id: Optional[int] = None,
-       disable_exllama: bool = False,
+       disable_exllama: bool = True,
Is this not breaking?
Yeah, I was worried about that too. It would be great to force users to use exllamav2 instead of exllama. Otherwise, we can keep it at `False` and either:

- set `disable_exllamav2` to `False` and check whether the user actually has access to the exllamav2 kernel, falling back to the exllama kernel otherwise, or
- set `disable_exllamav2` to `True` and require the user to enable it manually.

Compared to the second option, the first option makes it easy for current and new users to discover and use exllamav2. However, it is more verbose (fallback to exllama); see the sketch just below.
LMK what you think =) cc @younesbelkada
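A minimal sketch of what the first option's fallback check could look like. This is purely illustrative: the function name and the module probe are assumptions, not code from this PR.

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


def pick_exllama_kernel(disable_exllamav2: bool = False) -> str:
    """Return which exllama kernel generation to use: 'v2' or 'v1'."""
    if not disable_exllamav2:
        # Assumption: probe for the compiled v2 extension shipped with auto-gptq.
        if importlib.util.find_spec("exllamav2_kernels") is not None:
            return "v2"
        logger.warning(
            "exllamav2 kernels are not available, falling back to the exllama (v1) kernel."
        )
    return "v1"
```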
I decided to go with the second option and rename the variable
Looks good. I really think we should make things clearer (deprecate `disable_exllama`) / think ahead about exllamav3 and how not to have to do the same PR again! But good to go otherwise.
        self.post_init()

    def get_loading_attributes(self):
        attibutes_dict = copy.deepcopy(self.__dict__)
-       loading_attibutes = ["disable_exllama", "use_cuda_fp16", "max_input_length"]
+       loading_attibutes = ["disable_exllama", "use_exllama_v2", "use_cuda_fp16", "max_input_length"]
It would be great if we could do a deprecation cycle for `disable_exllama` 👿
By "great" I meant: can you do a deprecation cycle for this?
Thanks a mile! Just left one comment - LGTM
@@ -388,11 +391,14 @@ def __init__(
        self.pad_token_id = pad_token_id
        self.disable_exllama = disable_exllama
        self.max_input_length = max_input_length
+       self.use_exllama_v2 = use_exllama_v2
+       # needed for compatibility with optimum gptq config
+       self.disable_exllamav2 = not use_exllama_v2
Yeah, let's just warn users that `disable_exllama` will be deprecated in future versions here (or inside `post_init`) and rethink a better API in the future, but for now I think it's all good!
I've added it in `post_init`.
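A hedged sketch of what such a deprecation warning inside `post_init` could look like, assuming (for the sketch only) that `disable_exllama` stays `None` unless the user explicitly passes it; the exact condition and wording in the merged code may differ:

```python
import warnings


def post_init(self):
    # ... existing validation of the other GPTQ fields ...
    if self.disable_exllama is not None:
        # Illustrative only: nudge users toward the newer argument while the
        # old one keeps working during the deprecation cycle.
        warnings.warn(
            "`disable_exllama` is deprecated and will be removed in a future version. "
            "Please use `use_exllama_v2` instead to control the kernel choice.",
            FutureWarning,
        )
```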
@@ -178,6 +178,7 @@ def test_quantized_layers_class(self):
        group_size=self.group_size,
        bits=self.bits,
        disable_exllama=self.disable_exllama,
+       disable_exllamav2=True,
Ouch, hurts my eyes 🤣. If you use one or the other, the other one will be disabled, right?
Oh, for these tests we don't use exllamav2 at all. It's just that `dynamically_import_QuantLinear` has `disable_exllamav2 = False` by default, so I need to set it to `True` so that I don't import the wrong Linear class.
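For illustration, the kind of call the tests make could look roughly like the sketch below. Only the keywords visible in the quoted diff (`group_size`, `bits`, `disable_exllama`, `disable_exllamav2`) are confirmed by this PR; `use_triton` and `desc_act` are assumptions about auto-gptq's helper signature.

```python
from auto_gptq.utils.import_utils import dynamically_import_QuantLinear

# Sketch of the test setup described above: explicitly disable both exllama
# kernel generations so the plain CUDA QuantLinear is imported.
QuantLinear = dynamically_import_QuantLinear(
    use_triton=False,   # assumption: the test does not exercise the triton path
    desc_act=False,     # assumption: matches the quantization config under test
    group_size=128,
    bits=4,
    disable_exllama=True,
    disable_exllamav2=True,  # without this, the v2 QuantLinear is picked by default
)
```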
Got it, thanks!
This reverts commit 8214d6e.
* add exllamav2 arg
* add test
* style
* add check
* add doc
* replace by use_exllama_v2
* fix tests
* fix doc
* style
* better condition
* fix logic
* add deprecate msg
Revert "add exllamav2 arg (huggingface#26437)" This reverts commit 8214d6e.
What does this PR do?
This PR adds the possibility to choose the exllamav2 kernels for GPTQ models. It follows the integration of these kernels in auto-gptq and in optimum.