
add exllamav2 arg #26437

Merged (13 commits) on Oct 26, 2023
Conversation

@SunMarc (Member) commented Sep 27, 2023

What does this PR do?

This PR adds the possibility to choose the exllamav2 kernels for GPTQ models. It follows the integration of the kernels in auto-gptq and their integration in optimum (a usage sketch follows the checklist below).

  • Merge after the optimum PR is merged
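For context, here is a minimal usage sketch of the new option. It assumes the `use_exllama_v2` argument name that this PR settles on later in the thread, and the checkpoint name is only an example:

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Opt into the exllamav2 kernels when loading an already-quantized GPTQ checkpoint.
gptq_config = GPTQConfig(bits=4, use_exllama_v2=True)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # example GPTQ checkpoint
    device_map="auto",
    quantization_config=gptq_config,
)
```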

@HuggingFaceDocBuilderDev commented Sep 27, 2023

The documentation is not available anymore as the PR was closed or merged.

@younesbelkada (Contributor) left a comment

Looks really great, thanks a lot!
What about changing `disable_exllamav2` to `disable_exllama_v2`?
Can you also add a few lines to the documentation describing the expected speedups and how to use this argument? You can copy-paste the table you shared in the auto-gptq PR.

@SunMarc (Member, Author) commented Sep 27, 2023

> What about changing disable_exllamav2 to disable_exllama_v2?

I've named it that way in auto-gptq, so let's keep disable_exllamav2.

> Can you also add a few lines to the documentation describing the expected speedups and how to use this argument? You can copy-paste the table you shared in the auto-gptq PR.

I added a link to the optimum benchmark and will update the benchmark in a follow-up PR. It will be easier to maintain this way.

@ArthurZucker (Collaborator) left a comment

Thanks. A bit curious as to why we are naming this disable_ when v1 is disabled by default and v2 is enabled by default. A bit counter-intuitive, no?

Comment on lines 134 to 135
With the release of the exllamav2 kernel, you can get faster inference speed compared to the exllama kernels. You just need to
pass `disable_exllamav2` in [`GPTQConfig`]:
Collaborator:

It doesn't make a lot of sense to have a disable_exllamav2 instead of an enable_ or use_ prefix, which we have for most additional features like flash attention or use_cache. No?

@SunMarc (Member, Author) Sep 28, 2023:

It is named this way because the goal was to enable it by default (False value). Since v2 came out, I wanted to disable v1 and enable v2 by default, as v2 is just a faster version of v1. cc @fxmarty as you did the v1 integration.

Contributor:

Yes I agree the name is not very scalable. It could make sense to rename them.

Member Author:

Renamed!

@@ -367,8 +369,9 @@ def __init__(
    module_name_preceding_first_block: Optional[List[str]] = None,
    batch_size: int = 1,
    pad_token_id: Optional[int] = None,
-   disable_exllama: bool = False,
+   disable_exllama: bool = True,
Collaborator:

Is this not breaking?

@SunMarc (Member, Author) Sep 28, 2023:

Yeah, I was worried about that too. It would be great to force users to use exllamav2 instead of exllama. Otherwise, we can leave it at False, and we can either (both options are sketched below):

  • set disable_exllamav2 to False and check whether they actually have access to the exllamav2 kernel; otherwise the user will fall back to the exllama kernel.
  • set disable_exllamav2 to True, so the user has to enable it manually.

Compared to the second option, the first will make it easier for current and new users to discover and use exllamav2. However, it will be more verbose (fallback to exllama).
LMK what you think =) cc @younesbelkada
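To make the trade-off concrete, here is a rough, purely illustrative sketch of the two options. The availability probe and the extension module it imports are assumptions, not code from this PR:

```python
def exllamav2_available() -> bool:
    # Assumed probe: recent auto-gptq builds ship a compiled extension for the
    # v2 kernels; the exact module name may differ between versions.
    try:
        import exllamav2_kernels  # noqa: F401
        return True
    except ImportError:
        return False


def pick_kernel_option_1(disable_exllamav2: bool = False) -> str:
    # Option 1: v2 enabled by default, with a (verbose) fallback to exllama.
    if not disable_exllamav2:
        if exllamav2_available():
            return "exllamav2"
        print("exllamav2 kernels not found, falling back to exllama")
    return "exllama"


def pick_kernel_option_2(disable_exllamav2: bool = True) -> str:
    # Option 2: v2 disabled by default; the user has to opt in explicitly.
    return "exllama" if disable_exllamav2 else "exllamav2"
```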

Member Author:

I decided to go with the second option and rename the variable.

@ArthurZucker (Collaborator) left a comment

Looks good. I really think we should make things clearer (deprecate disable_exllama) and think about exllamav3 and how to avoid having to do the same PR again! But good to go otherwise.

src/transformers/utils/quantization_config.py
self.post_init()

def get_loading_attributes(self):
    attibutes_dict = copy.deepcopy(self.__dict__)
-   loading_attibutes = ["disable_exllama", "use_cuda_fp16", "max_input_length"]
+   loading_attibutes = ["disable_exllama", "use_exllama_v2", "use_cuda_fp16", "max_input_length"]
Collaborator:

Would be great if we could do a deprecation cycle for disable_exllama 👿

Collaborator:

By "great" I meant: can you do a deprecation cycle for this?

@younesbelkada (Contributor) left a comment

Thanks a mile! Just left one comment - LGTM

@@ -388,11 +391,14 @@ def __init__(
    self.pad_token_id = pad_token_id
    self.disable_exllama = disable_exllama
    self.max_input_length = max_input_length
+   self.use_exllama_v2 = use_exllama_v2
+   # needed for compatibility with optimum gptq config
+   self.disable_exllamav2 = not use_exllama_v2
Contributor:

Yeah, let's just warn users here (or inside post_init) that disable_exllama will be deprecated in future versions, and rethink a better API later. For now I think it's all good!

Member Author:

I've added it in post_init.
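For illustration, here is a standalone sketch of what such a deprecation warning in a post_init-style hook could look like; the class below is a toy stand-in, not the exact code merged in this PR:

```python
import warnings
from typing import Optional


class GPTQConfigSketch:
    """Toy stand-in for GPTQConfig, only to illustrate the deprecation warning."""

    def __init__(self, disable_exllama: Optional[bool] = None, use_exllama_v2: bool = False):
        self.disable_exllama = disable_exllama
        self.use_exllama_v2 = use_exllama_v2
        # Mirror the optimum-compatibility flag shown in the diff above.
        self.disable_exllamav2 = not use_exllama_v2
        self.post_init()

    def post_init(self):
        # Only warn when the caller explicitly set the legacy flag.
        if self.disable_exllama is not None:
            warnings.warn(
                "`disable_exllama` is deprecated and will be removed in a future "
                "version; use `use_exllama_v2` to opt into the exllamav2 kernels.",
                FutureWarning,
            )
        else:
            self.disable_exllama = False  # keep the old default behaviour


config = GPTQConfigSketch(disable_exllama=True)  # emits the FutureWarning
```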

@SunMarc merged commit 8214d6e into huggingface:main on Oct 26, 2023
3 checks passed

@@ -178,6 +178,7 @@ def test_quantized_layers_class(self):
    group_size=self.group_size,
    bits=self.bits,
    disable_exllama=self.disable_exllama,
+   disable_exllamav2=True,
@ArthurZucker (Collaborator) Oct 27, 2023:

Ouch, hurts my eyes 🤣. If you use one or the other, the other will be disabled.

Member Author:

Oh, for these tests we don't use exllamav2 at all. It's just that dynamically_import_QuantLinear has disable_exllamav2 = False by default, so I need to set it to True so that I don't import the wrong QuantLinear class.

Collaborator:

Got it, thanks!

ArthurZucker added a commit that referenced this pull request Oct 27, 2023
Revert "add exllamav2 arg (#26437)"

This reverts commit 8214d6e.
@SunMarc mentioned this pull request Oct 27, 2023
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* add_ xllamav2 arg

* add test

* style

* add check

* add doc

* replace by use_exllama_v2

* fix tests

* fix doc

* style

* better condition

* fix logic

* add deprecate msg
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
Revert "add exllamav2 arg (huggingface#26437)"

This reverts commit 8214d6e.