
[core / Quantization ] AWQ integration #27045

Merged (72 commits into huggingface:main on Nov 1, 2023)

Conversation

@younesbelkada (Contributor) commented Oct 24, 2023

What does this PR do?

As per the title, this PR adds AWQ inference support to transformers.


AWQ is a new and popular quantization scheme, already used in various libraries such as TGI and vLLM, and known to be faster than GPTQ according to some benchmarks.

Contrary to GPTQ, this integration targets inference only. Since the ecosystem is quite mature with respect to quantizing a model, we will point users to different routes for that purpose, such as AutoAWQ, the original llm-awq repository, or Optimum Neural Compressor.

For now I have pushed a 'test' model to this repository: https://huggingface.co/ybelkada/test-mistral-7b-v0.1-awq, but we plan to support all AWQ weights from TheBloke. To run experiments with this PR, first `pip install autoawq`, then run:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ybelkada/test-mistral-7b-v0.1-awq"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True).to(0)

print(model)

text = ["Hello my name is", "hi"]
# Batched prompts need padding; reuse the EOS token as pad token if none is defined,
# and pad on the left since decoder-only models generate from the end of the prompt.
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
tok.padding_side = "left"
inputs = tok(text, return_tensors="pt", padding=True).to(0)

output = model.generate(**inputs, max_new_tokens=40)
print(tok.batch_decode(output, skip_special_tokens=True))

TODO:

  • Benchmarks
  • Documentation
  • Support fused modules
  • Colab inference demo
  • Write tests
  • Support weights that have been quantized with optimum NC
  • Support weights that have been quantized with llm-awq

cc @fxmarty @SunMarc @casper-hansen @TheBloke @IlyasMoutawwakil

@HuggingFaceDocBuilderDev commented Oct 24, 2023

The documentation is not available anymore as the PR was closed or merged.

@casper-hansen

This is super exciting to see! The original repository does not support TheBloke's quants; they were made with AutoAWQ - perhaps an argument to route to AutoAWQ for compatibility.

in_features = module.in_features
out_features = module.out_features

model._modules[name] = WQLinear_GEMM(


If the version in config is GEMV, this will fail to load that version. Would it be appropriate to get the WQLinear based on the version in the config?

younesbelkada (author):

Makes sense yes!

younesbelkada (author):

Let me know what you think of 13abcf2!


This looks good, and should work as intended. If you want to test GEMV without quantizing yourself, I have quantized Vicuna 7B v1.5 in GEMV:
https://huggingface.co/casperhansen/vicuna-7b-v1.5-awq-gemv

quantization_config = AWQConfig.from_dict(config.quantization_config)

model, _ = replace_with_awq_linear(
    model, quantization_config=quantization_config, modules_to_not_convert=["lm_head"]


The modules_to_not_convert=["lm_head"] may eventually cause issues if the head is not named exactly "lm_head". The way we avoid this in AutoAWQ is by only looking at the decoder layers of the model, by calling the model's get_model_layers() function (e.g. for llama).

younesbelkada (author):

Nice! For bnb we have the same issue and use this method: https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/bitsandbytes.py#L243, which should be quite generic for most transformers models. I plan to use that instead.
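As a rough illustration of that idea (not the final integration code), here is a minimal sketch that derives the modules to skip with that generic helper instead of hard-coding "lm_head"; the model name is just an example and autoawq is not needed for this step:

from transformers import AutoConfig, AutoModelForCausalLM
from transformers.integrations import get_keys_to_not_convert

# Build a small model from its config only (no trained weights needed for this illustration).
config = AutoConfig.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_config(config)

# Generic helper from the bitsandbytes integration: returns the module names that
# should stay in full precision (typically the tied output head).
modules_to_not_convert = get_keys_to_not_convert(model)
print(modules_to_not_convert)  # e.g. ["lm_head"] for OPT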

@SunMarc (Member) left a comment:

LGTM! Left a few comments.

Review threads (resolved) on: docs/source/en/main_classes/quantization.md, src/transformers/modeling_utils.py, src/transformers/integrations/awq.py, src/transformers/utils/quantization_config.py, tests/quantization/autoawq/test_awq.py
@amyeroberts (Collaborator) left a comment:

Very nice! 🔥

Main comment is the need for tests to check that layers are replaced as expected. Once those are added it'll be good to go!

Review threads (resolved) on src/transformers/utils/quantization_config.py
Comment on lines +489 to +490
if major < 8:
    raise ValueError("LLM-AWQ backend is only supported on GPUs with compute capability >= 8.0")
Collaborator:

Nice :)
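For reference, a minimal sketch of how the `major` value in the quoted check is typically obtained with torch; the exact wiring in this PR may differ:

import torch

# (major, minor) compute capability of the current CUDA device.
major, minor = torch.cuda.get_device_capability()
if major < 8:
    raise ValueError("LLM-AWQ backend is only supported on GPUs with compute capability >= 8.0")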

Collaborator:

Very nice set of tests :)

The only thing that should be added is a test which takes a dummy or real model and checks that the recursive logic of replace_with_awq_linear acts as expected, as this is the most critical part. In particular, we should check that the expected layers are converted and that modules_to_not_convert is respected.

younesbelkada (author):

Makes sense, this is possible yes!

younesbelkada (author):

Added the conversion test here: 79cbbd3

Collaborator:

Thank you! Just one last thing. The test will pass as soon as the first converted layer is found. However, what is important is to properly test the recursion, i.e. are layers within a module converted? And are the ones listed in modules_to_not_convert left unconverted?

younesbelkada (author):

Ah, nice catch! OK, I will elaborate the test a bit more and let you know.

younesbelkada (author):

Added a better test in df9e691 - let me know what you think!
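For illustration only, here is a sketch of the kind of conversion check being discussed, reusing the replace_with_awq_linear call quoted earlier; it is not the literal test added in df9e691, and it assumes autoawq and accelerate are installed:

import torch.nn as nn
from accelerate import init_empty_weights
from transformers import AutoConfig, AwqConfig, OPTForCausalLM
from transformers.integrations import replace_with_awq_linear

config = AutoConfig.from_pretrained("facebook/opt-350m")
with init_empty_weights():
    model = OPTForCausalLM(config)

model, _ = replace_with_awq_linear(
    model, quantization_config=AwqConfig(bits=4), modules_to_not_convert=["lm_head"]
)

# The skipped module must remain a plain nn.Linear ...
assert isinstance(model.lm_head, nn.Linear)
# ... while the linear layers inside every decoder block should have been converted.
for layer in model.model.decoder.layers:
    assert not isinstance(layer.self_attn.q_proj, nn.Linear)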

@@ -3224,6 +3262,17 @@ def from_pretrained(
if quantization_method_from_config == QuantizationMethod.GPTQ:
model = quantizer.convert_model(model)
model._is_quantized_training_enabled = True
elif quantization_method_from_config == QuantizationMethod.AWQ:
from .integrations import get_keys_to_not_convert, replace_with_awq_linear
Collaborator:

+1 on @younesbelkada's comment here

Review threads (resolved) on src/transformers/integrations/awq.py
Comment on lines 77 to 82
if backend == AwqBackendPackingMethod.AUTOAWQ:
    target_cls = (
        WQLinear_GEMM if quantization_config.version == AWQLinearVersion.GEMM else WQLinear_GEMV
    )
else:
    target_cls = WQLinear
Collaborator:

This should be out of the for-loop in the lines above - we don't need to keep redefining it

younesbelkada (author):

Moved in 597ab7f
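For clarity, a sketch of the shape of that suggestion (not the literal diff of 597ab7f); the autoawq import path and the use of the config's backend/version enums are assumptions here:

from awq.modules.linear import WQLinear_GEMM, WQLinear_GEMV  # assumed autoawq import path

from transformers.utils.quantization_config import AWQLinearVersion, AwqBackendPackingMethod

def resolve_target_cls(quantization_config, backend):
    """Pick the quantized linear class once, before the module-replacement loop."""
    if backend == AwqBackendPackingMethod.AUTOAWQ:
        if quantization_config.version == AWQLinearVersion.GEMM:
            return WQLinear_GEMM
        return WQLinear_GEMV
    # The llm-awq backend would return its own WQLinear class here instead.
    raise NotImplementedError("this sketch only covers the AutoAWQ backend")

# target_cls = resolve_target_cls(quantization_config, backend)  # computed once
# for name, module in model.named_children():
#     ...  # replace nn.Linear modules with target_cls, reusing the single target_cls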

Comment on lines +2809 to +2812
logger.warning(
    "You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set "
    "your model on a GPU device in order to run your model."
)
Collaborator:

It's OK to not throw an error in this case. However, we should still properly handle the case where the user tries to do inference on CPU. Is this caught with an exception somewhere?

younesbelkada and others added 7 commits October 30, 2023 18:23
@amyeroberts (Collaborator) left a comment:

Nice - thanks for iterating with the test ❤️

There's a final iteration before we can merge to avoid using a large model in the test.

quantization_config = AwqConfig(bits=4)

with init_empty_weights():
    model = OPTForCausalLM(config)
Collaborator:

Do we need to use OPTForCausalLM specifically here? Why not AutoModelForCausalLM?

@younesbelkada (author) commented Oct 31, 2023:

Yes, we have to use OPTForCausalLM because initializing a model from a config by calling AutoModelForCausalLM(config) directly is not supported:

>>> AutoModelForCausalLM(config)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/younes_huggingface_co/code/transformers/src/transformers/models/auto/auto_factory.py", line 411, in __init__
    raise EnvironmentError(
OSError: AutoModelForCausalLM is designed to be instantiated using the `AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path)` or `AutoModelForCausalLM.from_config(config)` methods.
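As a side note, the error message itself points at from_config as the supported route; a minimal sketch, in case it is useful (model name illustrative), which also works under init_empty_weights:

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-350m")
with init_empty_weights():
    # Calling the auto class directly raises the OSError above; from_config(config) is supported.
    model = AutoModelForCausalLM.from_config(config)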

config = AutoConfig.from_pretrained(model_id, revision="cb32f77e905cccbca1d970436fb0f5e6b58ee3c5")
quantization_config = AwqConfig(bits=4)

with init_empty_weights():
Collaborator:

This test will need a @require_accelerate decorator if this is used

younesbelkada (author):

The require_accelerate decorator is already set on the AwqTest class.

self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
from transformers.integrations.awq import replace_with_awq_linear

model_id = "facebook/opt-350m"
Collaborator:

Why a checkpoint (we don't need trained weights here), and why this checkpoint when a smaller one exists?

Even though init_empty_weights is used here - this only handles loading into RAM - it still requires the checkpoint to be downloaded, so larger models make things slower.

If you replace this with an hf-internal-testing tiny model OR a very small model defined with a model config, this will be more lightweight. In this case we might even be able to remove the accelerate dependency.


For tiny model testing I use this - a 68M Llama model: https://huggingface.co/JackFram/llama-68m

@younesbelkada (author) commented Oct 31, 2023:

@amyeroberts thanks!
The init_empty_weights context manager creates the model on the meta device, so it costs 0 RAM regardless of the architecture. During the entire test I am only dealing with the model loaded on the meta device, and I am not downloading any checkpoint.


> it still requires the checkpoint to be downloaded --> larger models make things slower.

The only thing I download is the config file; no model weights are fetched here. The accelerate dependency is fine since we use it for all quantization integrations, and it is always installed in our docker images for slow tests.
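A tiny sketch illustrating that claim, assuming accelerate is installed: everything created under init_empty_weights lives on the meta device and allocates no real memory, and only the config file is fetched.

from accelerate import init_empty_weights
from transformers import AutoConfig, OPTForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-350m")  # only the config JSON is downloaded
with init_empty_weights():
    model = OPTForCausalLM(config)

# All parameters are on the meta device, so no RAM is used for weights.
assert all(p.device.type == "meta" for p in model.parameters())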

Collaborator:

My bad - I misread and thought from_pretrained was being used in the model creation i.e. model = XxxForCausalLM.from_pretrained

@younesbelkada (author):

Thanks @amyeroberts for your review! I have replied to your questions above; I believe all the points have already been addressed. Let me know if I missed anything.

@amyeroberts (Collaborator) left a comment:

Thanks for adding and iterating on this!

@younesbelkada (author):

Thanks for all your reviews @amyeroberts @ArthurZucker @SunMarc @casper-hansen !

@younesbelkada younesbelkada merged commit ae093ee into huggingface:main Nov 1, 2023
22 checks passed
@younesbelkada younesbelkada deleted the add-awq branch November 1, 2023 08:06
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* working v1

* oops

* Update src/transformers/modeling_utils.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* fixup

* oops

* push

* more changes

* add docs

* some fixes

* fix copies

* add v1 doc

* added installation guide

* relax constraints

* revert

* attempt llm-awq

* oops

* oops

* fixup

* raise error when incorrect cuda compute capability

* nit

* add instructions for llm-awq

* fixup

* fix copies

* fixup and docs

* change

* few changes + add demo

* add v1 tests

* add autoawq in dockerfile

* finalize

* Update tests/quantization/autoawq/test_awq.py

* fix test

* fix

* fix issue

* Update src/transformers/integrations/awq.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update docs/source/en/main_classes/quantization.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update docs/source/en/main_classes/quantization.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/integrations/awq.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/integrations/awq.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* add link to example script

* Update docs/source/en/main_classes/quantization.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* add more content

* add more details

* add link to quantization docs

* camel case + change backend class name

* change to string

* fixup

* raise errors if libs not installed

* change to `bits` and `group_size`

* nit

* nit

* Apply suggestions from code review

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* disable training

* address some comments and fix nits

* fix

* final nits and fix tests

* adapt to our new runners

* make fix-copies

* Update src/transformers/utils/quantization_config.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/utils/quantization_config.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/integrations/awq.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/integrations/awq.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* move to top

* add conversion test

* final nit

* add more elaborated test

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>