Efficient Inference Kernel for SpQR #34976

Open · wants to merge 30 commits into main
Conversation


elvircrn commented Nov 27, 2024

What does this PR do?

Adds support for efficient single-batch inference for SpQR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SunMarc @MekkCyber

elvircrn marked this pull request as draft on November 27, 2024
elvircrn changed the title from Spqr quantizer to Efficient Inference Kernel for SpQR on November 27, 2024
SunMarc requested a review from MekkCyber on November 28, 2024
@MekkCyber (Contributor):

Hey @elvircrn, thanks for adding this quantization method! A very smooth integration 🔥! Just left a few comments.

src/transformers/integrations/spqr.py (resolved)
raise ValueError(
f"SpQR requires Accelerate to be installed: `pip install 'accelerate>={ACCELERATE_MIN_VERSION}'`"
)

Contributor:

No need to add the checks here, validate_environment in the quantizer will take care of that

Author:

Resolved by removing the checks.


requires_calibration = True
required_packages = ["spqr_quant"]
optimum_quantizer = None
Contributor:

Not used, to be removed

Author:

Resolved by removing

raise ImportError("Using `spqr` quantization requires SpQR: `pip install spqr_quant`")

def update_torch_dtype(self, torch_dtype: "torch.dtype") -> "torch.dtype":
return torch.float16
Contributor:

let's raise a warning here if you need to override the torch_dtype like in other quantizers

Author:

Since SpQR really only supports float16, I raise an error if it's None or anything other than float16. Is this the desired behavior?

src/transformers/utils/quantization_config.py (resolved)
@require_spqr
@require_accelerate
class SpQRTest(unittest.TestCase):
model_name = "BlackSamorez/Llama-2-7b-SPQR-2Bit-1x16-hf"
Contributor:

Sorry I can't find this model on the hub 🤔

Author:

Ah, didn't expect a review so soon. Thank you!

I've uploaded the model and updated this part of the test.

Contributor:

No worries 😊 take your time and ping me when you finish the integration

@unittest.skipUnless(
is_spqr_available(),
"test requires `spqr_quant`",
)
Contributor:

No need to add the requirement here, it's already verified with the decorator require_spqr

Author:

Resolved by removing.

archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Contributor:

We can also add a small section about how to load a model quantized using SpQR using transformers

Author:

Resolved by updating the README.
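
For reference, a minimal sketch of the kind of loading snippet that README section could contain; the checkpoint id below is a hypothetical placeholder, not necessarily the one uploaded for this PR:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint id -- substitute the actual SpQR checkpoint from the Hub.
model_id = "elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf"

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # SpQR inference only supports float16
    device_map="auto",          # SpQR kernels require a CUDA device
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(quantized_model.device)
outputs = quantized_model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```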

raise ImportError("Using `spqr` quantization requires Accelerate: `pip install accelerate`")

if not is_spqr_available():
raise ImportError("Using `spqr` quantization requires SpQR: `pip install spqr_quant`")
Contributor:

if the quantization method requires a GPU, we need to check that one is available here

Author:

Resolved by adding an is_cuda check.
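
Roughly, the resulting validate_environment could look like the sketch below (combining the existing import errors with a CUDA check; a sketch, not the exact diff):

```python
import torch

# is_spqr_available is the availability helper added by this PR; the import path is assumed.
from transformers.utils import is_accelerate_available, is_spqr_available


def validate_environment(self, *args, **kwargs):
    # SpQR kernels only run on CUDA devices.
    if not torch.cuda.is_available():
        raise RuntimeError("GPU is required to run SpQR quantized model.")

    if not is_accelerate_available():
        raise ImportError("Using `spqr` quantization requires Accelerate: `pip install accelerate`")

    if not is_spqr_available():
        raise ImportError("Using `spqr` quantization requires SpQR: `pip install spqr_quant`")
```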



@slow
@require_torch_gpu
Contributor:

Is it possible to add tests for multi_gpu setups?

Author:

Resolved by bringing this back from AQLM. Will the CI be able to pick this up? I don't have the means of testing this.
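
For context, the AQLM-style multi-GPU test being brought over looks roughly like this sketch (it is a method inside the SpQRTest class; attribute names such as self.model_name, self.input_text and self.EXPECTED_OUTPUT mirror the single-GPU test and are placeholders here):

```python
from transformers import AutoModelForCausalLM
from transformers.testing_utils import require_torch_multi_gpu, slow


@slow
@require_torch_multi_gpu
def test_quantized_model_multi_gpu(self):
    """Check that the quantized model works when sharded across multiple GPUs."""
    quantized_model = AutoModelForCausalLM.from_pretrained(self.model_name, device_map="auto")
    self.assertTrue(set(quantized_model.hf_device_map.values()) == {0, 1})

    input_ids = self.tokenizer(self.input_text, return_tensors="pt").to("cuda")
    output = quantized_model.generate(**input_ids, max_new_tokens=self.max_new_tokens)
    self.assertEqual(self.tokenizer.decode(output[0], skip_special_tokens=True), self.EXPECTED_OUTPUT)
```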

Contributor:

Yes it does 😄

Member:

+1 on that, you can add the test and we will check to make sure it passes with the right expected output.


elvircrn commented Dec 2, 2024

@MekkCyber I've removed the Draft tag from the PR.

It seems that I am unable to pass the test_raise_if_non_quantized. Could you give me some pointers on this?

elvircrn force-pushed the spqr-quantizer branch 4 times, most recently from 56c016d to e1a91fa on December 5, 2024
@MekkCyber (Contributor):

Could you rebase @elvircrn to update the branch :)


elvircrn commented Dec 6, 2024

@MekkCyber Rebase done.

@MekkCyber (Contributor):

Hey @elvircrn, thanks for iterating! It looks great, I just left some minor comments.

Comment on lines 49 to 56
if modules_to_not_convert is None:
modules_to_not_convert = []

from accelerate import init_empty_weights
from spqr_quant import QuantizedLinear

Contributor:

We need to check here whether the accelerate and spqr_quant packages are available before doing the import; what we don't need to do is raise errors, because the packages are assumed to be installed when this function is executed.

Author:

Not sure if I understood this, please double-check my newest change.

Comment on lines 53 to 57
if torch_dtype is None:
torch_dtype = torch.float16
logger.info(
"Assuming CUDA is available. Assuming SpQR inference on GPU and loading the model in `torch.float16`."
)
Contributor:

We are not assuming here because we already validated the environment 😉

Comment on lines 58 to 61
elif torch_dtype != torch.float16:
raise ValueError(
"You cannot any type other than torch.float16 for SpQR. Please either leave it None or set it to"
"torch.float16 explicitly."
)
return torch_dtype
Contributor:

Suggested change ("You cannot use any type" instead of "You cannot any type"):

elif torch_dtype != torch.float16:
raise ValueError(
"You cannot use any type other than torch.float16 for SpQR. Please either leave it None or set it to"
"torch.float16 explicitly."
)
return torch_dtype

Comment on lines +73 to +105
model._modules[name] = QuantizedLinear.create_placehodler(
rows=out_features,
cols=in_features,
bits=quantization_config.bits,
beta1=quantization_config.beta1,
beta2=quantization_config.beta2,
dense_weights_shape=dense_weights_shape,
row_offsets_shape=row_offsets_shape,
col_vals_shape=col_vals_shape,
in_perm_shape=in_perm_shape,
)
has_been_replaced = True

Contributor:

Just FMI, are the beta and shapes parameters used solely for loading, or do they also play a role in quantization? In other words, if a model is quantized using specific beta or shapes values, can it only be loaded with those same parameters?

Author:

SpQR quantized weights are the product of tile-based bi-level quantization, where beta1 and beta2 are the tile width and height respectively. You always specify the tile dimensions before quantization starts.

Each SpQR weight also comes with a sparse tensor holding the outlier weights. As this tensor is unstructured, its size has to be tracked explicitly so it can be loaded.
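
A rough illustration of why only the outlier shapes need to be recorded, assuming a CSR-like layout for the unstructured outliers (a sketch, not the kernel's actual packing):

```python
# Only the number of non-zeros depends on the data, which is why col_vals.shape must be
# recorded per tensor while the other shapes are derivable from the weight dimensions.
rows, cols = 4096, 4096          # dense weight dimensions (m, n)
nnz = 123_456                    # number of outlier entries: data dependent, unknown a priori

row_offsets_shape = (rows + 1,)  # CSR row pointers, derivable from the matrix height
col_vals_shape = (nnz,)          # packed (column index, value) entries, must be stored explicitly
print(row_offsets_shape, col_vals_shape)
```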

Author:

The following is a visualization of the compression format from the original publication (https://arxiv.org/pdf/2306.03078):

[figure: visualization of the SpQR compression format from the paper]

Contributor:

Thanks for the detailed explanation

Comment on lines 64 to 90
tensor_name = ".".join(current_key_name)
dense_weights_shape = quantization_config.shapes[f"{tensor_name}.dense_weights.shape"]
row_offsets_shape = quantization_config.shapes[f"{tensor_name}.row_offsets.shape"]
col_vals_shape = quantization_config.shapes[f"{tensor_name}.col_vals.shape"]
in_perm_shape = quantization_config.shapes[f"{tensor_name}.in_perm.shape"]

in_features = module.in_features
Contributor:

in the shapes attribute do we need to specify the shape for all the tensors ?

Author:

The col_vals size needs to be specified ahead of time for all weights. The rest can be computed from m, n and bits. Can we keep it the way it is now?

Author:

I can do it both ways easily, not sure what is preferred here.

Contributor:

I'm just worried about the quantization config: I'm not sure we can verify that shapes contains all the necessary elements in post_init. Maybe before trying to access the key in replace_with_spqr_linear we should verify that it's there, and raise an error otherwise to inform the user that something is wrong with the config.json they are using. wdyt @SunMarc

Author:

In my estimate, these configurations are directly tied to the quantized weights: if the shapes don't cover the full set of quantized weights, that is a fatal error in quantization serialization. Perhaps this check belongs on the SpQR side (https://pypi.org/project/spqr-quant/)?

If we were to do it on huggingface/transformers-side:

a) during replace_with_spqr_linear, we raise a fatal error (please let me know which one would be appropriate here) if a quantized weight doesn't have a corresponding key in the shapes config.
b) we do a full check before conducting the replacement in order to conserve memory/device cycles.

Let me know what best suits transformers here.
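
A minimal sketch of option (a), with a hypothetical helper name and placeholder error wording:

```python
# Hypothetical helper for replace_with_spqr_linear; exception type and message are placeholders.
def _get_shape(quantization_config, tensor_name: str, suffix: str):
    key = f"{tensor_name}.{suffix}.shape"
    if key not in quantization_config.shapes:
        raise ValueError(
            f"`quantization_config.shapes` has no entry for `{key}`; the checkpoint's config.json "
            "does not match its SpQR-quantized weights."
        )
    return quantization_config.shapes[key]


# Usage inside the replacement loop:
# dense_weights_shape = _get_shape(quantization_config, tensor_name, "dense_weights")
# col_vals_shape = _get_shape(quantization_config, tensor_name, "col_vals")
```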

Author:

I didn't see the tag to @SunMarc - I thought the wdyt referred to me.




tests/quantization/spqr_integration/test_spqr.py (outdated; resolved)
Comment on lines +113 to +116
model_id = "meta-llama/Llama-2-7b-hf"
config = AutoConfig.from_pretrained(model_id)
quantization_config = AutoConfig.from_pretrained(self.model_name, return_dict=False).quantization_config
quantization_config = SpQRConfig.from_dict(quantization_config)
Contributor:

If possible it's better to use small models for the CI like Qwen 0.5, Bloom 560M, SmolLM 135M...

Author:

Will have to get back to you on this (might require a significant time investment).

Author:

Is this a blocker for the merge?

Contributor:

Not really a blocker for now


elvircrn commented Dec 9, 2024

@MekkCyber The latest batch of comments are addressed.


MekkCyber commented Dec 9, 2024

Thanks for the quick iteration @elvircrn 🔥! Can you re-rebase please 😅



Comment on lines +52 to +55
if is_accelerate_available():
from accelerate import init_empty_weights
if is_spqr_available():
from spqr_quant import QuantizedLinear
Contributor:

Yes that's exactly what I meant, thank you for fixing it !



elvircrn commented Dec 9, 2024

@MekkCyber rebased.

MekkCyber requested a review from SunMarc on December 9, 2024

elvircrn commented Dec 9, 2024

@MekkCyber Resolved the latest batch of comments.


elvircrn commented Dec 9, 2024

@SunMarc @MekkCyber
I've pushed a check for keys() in shapes in replace_with_spqr_linear. Let me know if something in this ballpark works.
