Need help to understand q4_0, q4_1, q4_2, q4_3 quantization #1121

santapo · 2023-04-22T02:12:50Z

santapo
Apr 22, 2023

Is there any source that provides the detail of these q4_0, q4_1, q4_2, q4_3 method? I tried to read the C++ code but it's hard for me to understand how they work and difference between them.

Folko-Ven · 2023-04-22T04:42:19Z

Folko-Ven
Apr 22, 2023

Hi. You can see more about the different types of quantization here - #406. But in short, q4_0 - worse accuracy but higher speed, q4_1 - more accurate but slower. q4_2 and q4_3 are like new generations of q4_0 and q4_1. q4_2 should be more accurate q4_0 and just as fast, and q4_3 should be similarly more accurate than q4_1.

1 reply

LukeLIN-web Jul 16, 2024

Is there any definition of q4_0, q4_1, q4_2, q4_3 ? Thank you for any guide or link.

warren-lei · 2024-07-18T08:47:11Z

warren-lei
Jul 18, 2024

You can refer to this link: A Guide to Quantization in LLMs

0 replies

Vaibhavs10 · 2024-07-18T08:56:55Z

Vaibhavs10
Jul 18, 2024
Collaborator

Hi, I'm VB from Hugging Face, we put together a page explaining what different schemes in GGML mean: https://huggingface.co/docs/hub/en/gguf#quantization-types

Do let us know if we can add more details.

1 reply

hafezmg48 Nov 8, 2024

finally someone gave a list on that.

misutoneko · 2024-08-14T12:33:23Z

misutoneko
Aug 14, 2024

I found the piece of code at Q4_0 dequantize_row_q4_0() to be very helpful.
But what actually made it click for me was to take a look at the actual tensor data bytes and compare them between F32, F16 and Q4_0.
I also requantized from Q4_0 back to F16 and compared that to the original F16 version.

The data structure of the quantized data might not be that obvious from just by looking at the code.
The first two bytes contain the scaling factor as FP16.
The rest of it is organized so that the first half of the data is in the lower nibbles , and the latter half of the data is in the high nibbles.

Here's an example (little-endian). Assume Q4_0 quantized tensor data that starts with
38 b1 90 5b 6d b8 ...

So the scaling factor here is 0xb138, that's -0.163086 in F16.
The nibbles to process are 0x0, 0xb, 0xd, 0x8 and so on.
The first nibble gives us -8 * -0.163086 = 1.304688
Where the first -8 is actually 0-8, as the nibble value is 0x0. The -8 comes from code.

The second one -0.489258, third -0.815430, fourth -0.000031 etc.

2 replies

LukeLIN-web Aug 16, 2024

90 5b 6d b8 ... where is 9 5 6 b ? what are their usages?

misutoneko Aug 16, 2024

Yes, good question.
Those are the high nibbles, their turn comes after all the lower nibbles of the block have been processed first.
One block in Q4_0 consists of the scaling factor and 16 bytes (so we will end up with 32 floats from a single block).

Do note that Q4_0 is considered legacy nowadays. But it's fine for learning purposes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Need help to understand q4_0, q4_1, q4_2, q4_3 quantization #1121

{{title}}

Replies: 4 comments 4 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Need help to understand q4_0, q4_1, q4_2, q4_3 quantization #1121

Replies: 4 comments · 4 replies

Vaibhavs10 Jul 18, 2024 Collaborator

Replies: 4 comments 4 replies

Vaibhavs10
Jul 18, 2024
Collaborator