
Q4 quantization support #197

Closed
wants to merge 2 commits

Conversation

@Narsil (Collaborator) commented Mar 17, 2023

Temporary PR; I need to figure out a way to make sure this is usable in practice.

- Either make the format work for llama.cpp & co. (but the models over there include tokenization, so...)
- Or make something like smelt work with quantized data.

@Narsil Narsil requested review from McPatate and NouamaneTazi March 17, 2023 08:50
@Narsil Narsil marked this pull request as draft March 17, 2023 14:55
@Narsil (Collaborator, Author) commented Mar 17, 2023

Converted to draft; I will only merge this after it has been showcased in a real model example.

@philpax commented Apr 14, 2023

I've been thinking about this some more from llama-rs's side - I think it would be quite nice for us to use safetensors as a first-class format that could support LLaMA/RWKV/BLOOM/etc in q4 format.

We'd need to store the hyperparameters and vocabulary ((string, f32)[]) - I assume that would be possible in the header?
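For what it's worth, the safetensors header does allow a free-form `__metadata__` map of string keys to string values, so a vocabulary could in principle be carried there as a JSON string. A minimal stdlib sketch (the tensor name and offsets below are made up for illustration):

```python
import json
import struct

# Toy (string, f32)[] vocabulary; __metadata__ values must be strings,
# so we serialize the pairs with json.dumps first.
vocab = [("hello", -1.5), ("world", -2.0)]

header = {
    "__metadata__": {"vocab": json.dumps(vocab)},
    "tok_embeddings.weight": {      # hypothetical tensor entry
        "dtype": "F32",
        "shape": [2, 4],
        "data_offsets": [0, 32],    # 2*4 f32 values = 32 bytes
    },
}
header_bytes = json.dumps(header).encode("utf-8")
# A safetensors file starts with the header length as a little-endian u64.
file_prefix = struct.pack("<Q", len(header_bytes)) + header_bytes

# Reading it back:
(n,) = struct.unpack("<Q", file_prefix[:8])
parsed = json.loads(file_prefix[8 : 8 + n])
recovered = json.loads(parsed["__metadata__"]["vocab"])
print(recovered)  # [['hello', -1.5], ['world', -2.0]]
```

Note the round trip turns tuples into JSON arrays (Python lists), which is fine for a (token, score) table.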

@Narsil (Collaborator, Author) commented Apr 17, 2023

hyperparameters and vocabulary

Ideally the vocabulary would live in tokenizers (https://github.com/huggingface/tokenizers/), which supports all of llama (the file being here: https://huggingface.co/hf-internal-testing/llama-tokenizer/blob/main/tokenizer.json).

What do you mean by hyperparameters?

Now, supporting Q4 will require a bit more work. I've taken a deep dive into it, and it's not exactly n bits per parameter; it's more like n bits per group of 32 q4 values. And the byte count is not the same for q4_0 and q4_1 (which I think would more correctly be named q4_0_32 and q4_1_32, since the packing size is quite critical: ggerganov/llama.cpp#1004).
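To make the "n bits per group of 32" point concrete, here is a toy sketch of a symmetric q4_0-style block (one f32 scale shared by 32 values); ggml's exact rounding rules differ, this only illustrates the layout and byte count:

```python
import numpy as np

QK = 32  # elements per block: the "packing size" under discussion

def quantize_q4_0(x):
    # One shared scale per block of 32 values (symmetric, no offset).
    # The `or 1.0` guards the all-zero block.
    d = float(np.abs(x).max()) / 7.0 or 1.0
    q = (np.clip(np.round(x / d), -8, 7) + 8).astype(np.uint8)  # 0..15
    # Two 4-bit values per byte: 32 values -> 16 bytes (+ one f32 scale).
    packed = q[0::2] | (q[1::2] << 4)
    return d, packed  # 4 + 16 = 20 bytes per 32 weights, i.e. 5 bits/weight

def dequantize_q4_0(d, packed):
    q = np.empty(QK, dtype=np.uint8)
    q[0::2] = packed & 0x0F
    q[1::2] = packed >> 4
    return d * (q.astype(np.float32) - 8)

x = np.linspace(-1.0, 1.0, QK).astype(np.float32)
d, packed = quantize_q4_0(x)
x_hat = dequantize_q4_0(d, packed)
print(packed.nbytes)  # 16
```

The 20 bytes per 32 weights is why "4-bit quantization" is really 5 bits per weight once the scale is amortized.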

@philpax commented Apr 19, 2023

Ideally the vocabulary would be in tokenizers https://github.com/huggingface/tokenizers/ which supports all of llama

Hmm, fair enough. We prefer single-file deployments for the convenience, but it makes sense to have a standard here.

What do you mean by hyperparameters?

Vocabulary size, dimensions, heads, layers. The usual. I imagine that's part of HF's config.json?

Now, supporting Q4 will require a bit more work. I've taken a deep dive into it, and it's not exactly n bits per parameter; it's more like n bits per group of 32 q4 values. And the byte count is not the same for q4_0 and q4_1 (which I think would more correctly be named q4_0_32 and q4_1_32, since the packing size is quite critical: ggerganov/llama.cpp#1004).

Yeah, that makes sense. No rush on this, we'll support it when it's ready :)

@Narsil (Collaborator, Author) commented Apr 20, 2023

Vocabulary size, dimensions, heads, layers. The usual. I imagine that's part of HF's config.json?

Currently yes.

We prefer single-file deployments for the convenience,

Afaik, you always need the graph of computation too, which is included neither in config.json nor in model.safetensors but in the program itself (in transformers or ggml, for instance).

Single-file deployment is currently not a goal for that reason (and because you can write a different/better program with the same weights, which we happen to do quite regularly).

Please let me know when you have q4 support in whatever format, and I'll take a look at how to enable it here.
Also let me know if there are specific alignments required (I don't think there's anything more than regular byte alignment, but I may have misread that).

@iacore (Contributor) commented May 2, 2023

ggml has added 2 new packing formats, which are better:

- q4_0, q4_2: $\vec{x}w$
- q4_1, q4_3: $\vec{x}w + b$

Block sizes:

- q4_0: 32 ints packed
- q4_1: 32 ints
- q4_2: 16 ints
- q4_3: 16 ints

Maybe a more descriptive name is better?
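For contrast with the symmetric $\vec{x}w$ family, a toy sketch of the affine $\vec{x}w + b$ family (q4_1-style: one scale and one offset per block of 32); again, ggml's actual code differs in the details:

```python
import numpy as np

QK = 32  # elements per block

def quantize_q4_1(x):
    # Affine scheme: x ~= d*q + m with q in 0..15, so all-positive
    # (or all-negative) blocks keep full resolution, unlike q4_0.
    m = float(x.min())
    d = (float(x.max()) - m) / 15.0 or 1.0
    q = np.clip(np.round((x - m) / d), 0, 15).astype(np.uint8)
    packed = q[0::2] | (q[1::2] << 4)
    return d, m, packed  # 4 + 4 + 16 = 24 bytes per 32 weights (6 bits/weight)

def dequantize_q4_1(d, m, packed):
    q = np.empty(QK, dtype=np.uint8)
    q[0::2] = packed & 0x0F
    q[1::2] = packed >> 4
    return d * q.astype(np.float32) + m

x = np.linspace(0.5, 2.0, QK).astype(np.float32)  # an all-positive block
x_hat = dequantize_q4_1(*quantize_q4_1(x))
```

The extra per-block offset is what makes q4_1 blocks 24 bytes instead of 20, another reason a bare "q4" label under-specifies the format.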

@Narsil (Collaborator, Author) commented May 3, 2023

Maybe a more descriptive name is better?

I had in mind q4_0_32 and q4_0_16.

The thing is that this format packs the scale + zero point alongside the weights. GPTQ splits those into different tensors (https://github.com/qwopqwop200/GPTQ-for-LLaMa), which makes the current safetensors format already valid for it.

I'm not sure how much the locality helps performance there.

Also, adding new formats (especially a matrix of them, since currently there are 3-, 4-, and 5-bit quantization schemes, crossed with 16/32 packing (128 in GPTQ, or full row) and with (scale) vs (scale + zero)) adds a lot of complexity to the types, and none of them would be loadable in torch, tf, or numpy.

It's not at all a problem to add specific types, but since we have to maintain them until the end of time, I think it would be nice to do it once the community settles on common ground.

My current understanding is that ggml is recommending q5_1_32, while GPTQ recommends q4_1_128 (GPTQ uses a different packing scheme which works better than the naive ggml one, hence the reduced bit size, iiuc).

@iacore (Contributor) commented May 4, 2023

safetensors currently supports bfloat16, but bfloat16 is only supported by torch/tf, not numpy.

The problem is that the official loader is too restrictive: on meeting unknown types it just gives up. The upper-case dtype naming due to serde is weird too (F16 instead of f16).

See here: https://github.com/huggingface/safetensors/blob/752c1ab3b52463f4c4efda056e4c6a41e81a7ff3/safetensors/src/tensor.rs#LL594C1-L594C1

Maybe we should have a place to document custom types, something like an IANA registry for types.

Features:

  • Alignment
  • Bytes per unit
  • Elements per unit (important for quantized types, since values are packed)

The loader code is simple enough that applications can write their own.
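The registry entries described above could be as small as three numbers per dtype. A hypothetical sketch (the q4 sizes assume the 32-wide ggml-style blocks discussed earlier: 20 and 24 bytes per 32 elements):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DTypeInfo:
    """Hypothetical registry entry along the lines suggested above."""
    alignment: int          # required byte alignment of the tensor data
    bytes_per_unit: int     # size of one packed unit in bytes
    elements_per_unit: int  # logical elements per packed unit

# Illustrative registry, not an official one.
DTYPES = {
    "F32":  DTypeInfo(alignment=4, bytes_per_unit=4,  elements_per_unit=1),
    "F16":  DTypeInfo(alignment=2, bytes_per_unit=2,  elements_per_unit=1),
    "Q4_0": DTypeInfo(alignment=1, bytes_per_unit=20, elements_per_unit=32),
    "Q4_1": DTypeInfo(alignment=1, bytes_per_unit=24, elements_per_unit=32),
}

def byte_size(dtype, n_elements):
    """Expected byte span of a tensor, for validating data_offsets."""
    info = DTYPES[dtype]
    assert n_elements % info.elements_per_unit == 0
    return n_elements // info.elements_per_unit * info.bytes_per_unit

print(byte_size("Q4_0", 4096))  # 2560
```

With this much information a loader can validate offsets and mmap the data without understanding the packing itself.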

I already made a tool to quantize safetensors models to every quantized format ggml supports: https://github.com/iacore/model-conversions/tree/main/quantize-wizard

@Narsil (Collaborator, Author) commented May 4, 2023

safetensors currently supports bfloat16, but bfloat16 is only supported by torch/tf, not numpy.

I know, which is why I said it's not a blocker to add custom types (just still something to think about when adding them).
torch and tf (and jax, sort of, since it's mostly on par with tf for types) are the primary targets currently; I would definitely love to see non-Python alternatives.

Here q4_{0,1} would be clearly made for llama.cpp (and friends)

Features:

Alignment
bytes per unit
elements per unit (important for quantized types, since it's packed)

I like the idea, but it wouldn't work for GPTQ, for instance, since GPTQ splits the packed quantized values and the scales and zeros into different tensors; the "unit" isn't even contained in a single tensor there.

Maybe this splitting is just a bad idea; I haven't formally checked that yet. (Meaning we could stop worrying about GPTQ, and the idea you suggest would work.)

@iacore (Contributor) commented May 4, 2023

Here q4_{0,1} would be clearly made for llama.cpp (and friends)

No. Quantization is also useful for RWKV (not a transformer). Maybe it's useful for other ANNs as well.

I like the idea, but it wouldn't work for GPTQ for instance, since GPTQ splits the packed quantized unit and the scales and zeros into different tensors. Because the "unit" is not even in a single tensor there.

How does it work? What's the quantized struct in C?

@Narsil (Collaborator, Author) commented May 4, 2023

Complete story: https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/triton/quant/quant_linear.py#L73

Simple story: they store the zeros and scales in tensors altogether separate from the 4-bit packed "weights" tensor. There is no C struct.

It allows for a non-linear mapping of the packings, which is an important aspect of the method: they order the quantization with respect to activations, which supposedly handles outliers better (and hence gives less variance in degradation when quantizing).
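To illustrate the split-tensor layout being described, here is a toy dequantizer over three separate arrays: a packed `qweight`, per-group `scales`, and per-group `zeros`. The packing convention (8 nibbles per int32, lowest nibble first) and the group size are assumptions for the sketch; real GPTQ kernels differ in detail.

```python
import numpy as np

def dequantize_gptq_like(qweight, scales, zeros, group_size=128):
    """Toy dequantization of a GPTQ-style layout: the 4-bit values,
    the scales, and the zero points live in three separate tensors
    rather than in interleaved blocks."""
    rows, cols = qweight.shape
    # Unpack 8 4-bit values from each int32 along the row axis.
    shifts = np.arange(8, dtype=np.uint32) * 4
    q = (qweight[:, None, :].astype(np.uint32) >> shifts[None, :, None]) & 0xF
    q = q.reshape(rows * 8, cols).astype(np.float32)
    # One (scale, zero) pair per group of `group_size` rows, per column.
    g = np.arange(rows * 8) // group_size
    return scales[g] * (q - zeros[g])

# Toy shapes: 256 logical rows packed into 32 int32 rows, 4 columns.
rng = np.random.default_rng(0)
qweight = rng.integers(0, 2**32, size=(32, 4), dtype=np.uint32)
scales = rng.random((2, 4)).astype(np.float32)  # 256 / 128 = 2 groups
zeros = np.full((2, 4), 8.0, dtype=np.float32)
w = dequantize_gptq_like(qweight, scales, zeros)
print(w.shape)  # (256, 4)
```

Because `scales` and `zeros` are ordinary f32/f16 tensors, this layout already round-trips through safetensors today, which is the point made above.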

@iacore (Contributor) commented May 5, 2023

That seems easier. Just store them as different tensors inside .safetensors.

@Narsil Narsil mentioned this pull request May 25, 2023
@cztomsik commented

FYI GGML just got the ability to load/export graphs. It's not exactly what was discussed here but it might be usable for inference.
ggerganov/ggml#108
