Replies: 4 comments 4 replies
-
Hi. You can see more about the different types of quantization here: #406. In short: q4_0 has worse accuracy but higher speed, while q4_1 is more accurate but slower. q4_2 and q4_3 are newer generations of q4_0 and q4_1: q4_2 should be more accurate than q4_0 while staying just as fast, and q4_3 should likewise be more accurate than q4_1.
-
You can refer to this link: A Guide to Quantization in LLMs
-
Hi, I'm VB from Hugging Face. We put together a page explaining what the different schemes in GGML mean: https://huggingface.co/docs/hub/en/gguf#quantization-types Do let us know if we can add more details.
-
I found the piece of code at Q4_0 dequantize_row_q4_0() to be very helpful. The layout of the quantized data might not be obvious just from reading the code, so here's an example (little-endian). Assume Q4_0 quantized tensor data: each block starts with its scaling factor, here 0xb138, which is -0.163086 in F16. The second block's scale is -0.489258, the third's -0.815430, the fourth's -0.000031, etc.
-
Is there any source that provides the details of the q4_0, q4_1, q4_2, and q4_3 methods? I tried to read the C++ code, but it's hard for me to understand how they work and what the differences between them are.