add exllamav2 arg #26437
Conversation
The documentation is not available anymore as the PR was closed or merged.
Looks really great, thanks a lot!
What about changing `disable_exllamav2` to `disable_exllama_v2`?
Can you also add a few lines in the documentation with some expected speedups and how to use this argument? You can copy-paste the table you shared in the auto-gptq PR.
I've named it that way in auto-gptq, so let's keep it as is.
I added a link to the optimum benchmark and I will update the benchmark in a follow-up PR. It will be easier to maintain this way.
Thanks. A bit curious as to why we are naming this `disable_` when it's disabled by default for v1 and enabled for v2. A bit counterintuitive, no?
With the release of the exllamav2 kernel, you can get faster inference speed compared to the exllama kernels. You just need to
pass `disable_exllamav2` in [`GPTQConfig`]:
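For context, here is a minimal usage sketch of the documented flow. The checkpoint name is only an example, and the keyword follows the argument name settled on later in this review (`use_exllama_v2`), so treat it as an assumption rather than the final public API:

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Hypothetical example: enable the exllamav2 kernels when loading an
# already-quantized GPTQ checkpoint.
gptq_config = GPTQConfig(bits=4, use_exllama_v2=True)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # any 4-bit GPTQ checkpoint; used here only as an example
    quantization_config=gptq_config,
    device_map="auto",
)
```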
It doesn't make a lot of sense to have a `disable_exllamav2` instead of an `enable_` or `use_` prefix, which we have for most of the additional features like flash attention or `use_cache`. No?
It is named this way because the goal was to enable it by default (`False` value). Since v2 came out, I wanted to disable v1 and enable v2 by default, as v2 is just a faster version of v1. cc @fxmarty as you did the v1 integration.
Yes I agree the name is not very scalable. It could make sense to rename them.
Renamed!
@@ -367,8 +369,9 @@ def __init__(
        module_name_preceding_first_block: Optional[List[str]] = None,
        batch_size: int = 1,
        pad_token_id: Optional[int] = None,
-       disable_exllama: bool = False,
+       disable_exllama: bool = True,
Is this not breaking?
Yeah, I was worried about that too. It would be great to force users to use exllamav2 instead of exllama. Otherwise, we can keep it at `False` and either:

- set `disable_exllamav2` to `False` and check whether the user actually has access to the exllamav2 kernel, falling back to the exllama kernel otherwise, or
- set `disable_exllamav2` to `True` and require the user to enable it manually.

Compared to the second option, the first option makes it easy for current and new users to discover and use exllamav2. However, it is more verbose (fallback to exllama); see the sketch just below.
LMK what you think =) cc @younesbelkada
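A minimal sketch of what the first option's fallback check could look like. This is purely illustrative: the function name and the module probe are assumptions, not code from this PR.

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


def pick_exllama_kernel(disable_exllamav2: bool = False) -> str:
    """Return which exllama kernel generation to use: 'v2' or 'v1'."""
    if not disable_exllamav2:
        # Assumption: probe for the compiled v2 extension shipped with auto-gptq.
        if importlib.util.find_spec("exllamav2_kernels") is not None:
            return "v2"
        logger.warning(
            "exllamav2 kernels are not available, falling back to the exllama (v1) kernel."
        )
    return "v1"
```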
I decided to go with the second option and rename the variable
Looks good. I really think we should make things clearer (deprecate `disable_exllama`) / think ahead about exllamav3 and how not to have to do the same PR again! But good to go otherwise.
        self.post_init()

    def get_loading_attributes(self):
        attibutes_dict = copy.deepcopy(self.__dict__)
-       loading_attibutes = ["disable_exllama", "use_cuda_fp16", "max_input_length"]
+       loading_attibutes = ["disable_exllama", "use_exllama_v2", "use_cuda_fp16", "max_input_length"]
It would be great if we could do a deprecation cycle for `disable_exllama` 👿
By "great" I meant: can you do a deprecation cycle for this?
Thanks a mile! Just left one comment - LGTM
@@ -388,11 +391,14 @@ def __init__(
        self.pad_token_id = pad_token_id
        self.disable_exllama = disable_exllama
        self.max_input_length = max_input_length
+       self.use_exllama_v2 = use_exllama_v2
+       # needed for compatibility with optimum gptq config
+       self.disable_exllamav2 = not use_exllama_v2
Yeah, let's just warn users that `disable_exllama` will be deprecated in future versions here (or inside `post_init`) and rethink a better API in the future, but for now I think it's all good!
I've added it in `post_init`.
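A hedged sketch of what such a deprecation warning inside `post_init` could look like, assuming (for the sketch only) that `disable_exllama` stays `None` unless the user explicitly passes it; the exact condition and wording in the merged code may differ:

```python
import warnings


def post_init(self):
    # ... existing validation of the other GPTQ fields ...
    if self.disable_exllama is not None:
        # Illustrative only: nudge users toward the newer argument while the
        # old one keeps working during the deprecation cycle.
        warnings.warn(
            "`disable_exllama` is deprecated and will be removed in a future version. "
            "Please use `use_exllama_v2` instead to control the kernel choice.",
            FutureWarning,
        )
```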
@@ -178,6 +178,7 @@ def test_quantized_layers_class(self):
        group_size=self.group_size,
        bits=self.bits,
        disable_exllama=self.disable_exllama,
+       disable_exllamav2=True,
Ouch, hurts my eyes 🤣. If you use one or the other, the other one will be disabled, right?
Oh, for these tests we don't use exllamav2 at all. It's just that `dynamically_import_QuantLinear` has `disable_exllamav2 = False` by default, so I need to set it to `True` so that I don't import the wrong Linear class.
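For illustration, the kind of call the tests make could look roughly like the sketch below. Only the keywords visible in the quoted diff (`group_size`, `bits`, `disable_exllama`, `disable_exllamav2`) are confirmed by this PR; `use_triton` and `desc_act` are assumptions about auto-gptq's helper signature.

```python
from auto_gptq.utils.import_utils import dynamically_import_QuantLinear

# Sketch of the test setup described above: explicitly disable both exllama
# kernel generations so the plain CUDA QuantLinear is imported.
QuantLinear = dynamically_import_QuantLinear(
    use_triton=False,   # assumption: the test does not exercise the triton path
    desc_act=False,     # assumption: matches the quantization config under test
    group_size=128,
    bits=4,
    disable_exllama=True,
    disable_exllamav2=True,  # without this, the v2 QuantLinear is picked by default
)
```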
Got it, thanks!
This reverts commit 8214d6e.
* add exllamav2 arg
* add test
* style
* add check
* add doc
* replace by use_exllama_v2
* fix tests
* fix doc
* style
* better condition
* fix logic
* add deprecate msg
Revert "add exllamav2 arg (huggingface#26437)" This reverts commit 8214d6e.
What does this PR do?
This PR adds the possibility to choose the exllamav2 kernels for GPTQ models. It follows the integration of these kernels in auto-gptq and in optimum.