
Use different bit arrangement for quants (nibbles) #1241

Closed
ikawrakow opened this issue Apr 29, 2023 · 3 comments · Fixed by #1405
Labels: performance (Speed related topics)

Comments
@ikawrakow
Contributor

In the existing llama.cpp implementation, the quantization bits of consecutive model weights are packed together one after the other. E.g., for 4-bit quantization, the two 4-bit quants of consecutive weights are stored in a single uint8_t. The disadvantage of this approach is that when the data is used in dot products, or is de-quantized for matrix multiplications done via BLAS, and the operations are performed with SIMD instructions, one needs to shuffle the de-quantized bytes to get them into the correct order. These shuffle operations can be avoided by arranging the bits differently. For instance, for 4-bit quantization in blocks of 32 weights (Q4_0), one can store the quants of the first 16 weights in the low 4 bits of the 16 uint8_t's, and the quants of the second 16 weights of the block in the high 4 bits. The same or a similar strategy can also be applied to other block sizes or when using 2 bits per weight.

The performance gain is not earth-shattering: in a synthetic benchmark performing Q4_0_Q8_0 dot products I measured about a 10% speedup from avoiding the shuffle. Still, it is a trivial change, so why leave this low-hanging fruit hanging?
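To make the proposal concrete, here is a minimal C sketch of the two layouts for a Q4_0-style block of 32 weights; the helper names are hypothetical and not the actual llama.cpp functions:

```c
#include <stdint.h>

// Current layout: byte i holds the quants of weights 2*i (low nibble)
// and 2*i+1 (high nibble), i.e. consecutive weights are interleaved.
static void pack_q4_interleaved(const uint8_t q[32], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i) {
        out[i] = (uint8_t)((q[2*i] & 0x0F) | ((q[2*i + 1] & 0x0F) << 4));
    }
}

// Proposed layout: byte i holds weight i in its low nibble and weight i+16
// in its high nibble, so a SIMD "and 0x0F" / "shift right by 4" yields the
// first and second 16 weights already in order, with no byte shuffle.
static void pack_q4_split(const uint8_t q[32], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i) {
        out[i] = (uint8_t)((q[i] & 0x0F) | ((q[i + 16] & 0x0F) << 4));
    }
}
```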

@sw
Contributor

sw commented Apr 30, 2023

This could be done without breaking Q4_0/Q4_1 file compatibility, right? Just ensure you're doing Q8 in the right order.

Edit: Actually I'm not sure what you're referring to. The AVX implementation of quantize_row_q8_1 does some shuffling with _mm256_permutevar8x32_epi32; I guess that could be avoided if you kept Q4 in the same order. Or do you mean vzip1q_s8/vzip2q_s8 on ARM?
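For illustration only, a sketch of how an AVX2 unpack could look if the quants were stored in the split layout (this is not the existing quantize_row_q8_1 or dot-product code): the 16 packed bytes are loaded once, and the low/high nibbles come out already in weight order, so no permute is needed.

```c
#include <immintrin.h>
#include <stdint.h>

// Sketch only: unpack 32 4-bit quants stored in the "split" layout into one
// __m256i of bytes, with weights 0..31 already in order across the two lanes.
static __m256i unpack_q4_split_avx2(const uint8_t packed[16]) {
    const __m128i v    = _mm_loadu_si128((const __m128i *)packed);
    const __m256i mask = _mm256_set1_epi8(0x0F);
    // low lane: low nibbles = weights 0..15; high lane: high nibbles = weights 16..31
    const __m256i both = _mm256_set_m128i(_mm_srli_epi16(v, 4), v);
    return _mm256_and_si256(both, mask);
}
```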

@ggerganov
Owner

The vzip calls on ARM should go away - they're an unfortunate artifact of the non-optimal bit arrangement that we started with. I guess we can just update the Q8 quantize calls to achieve that and keep the Q4 formats intact, but then we will be shuffling during Q8 quantization. Not as bad as it is now, but still not perfect. Better not to shuffle anything at all.

I realize it will be a complete mess to make all models incompatible, but at the same time I think getting the best performance is always the highest priority.
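As a hedged NEON sketch (assuming the split layout described in the issue, not the actual ggml kernels), unpacking then needs only a mask and a shift, with no vzip:

```c
#include <arm_neon.h>
#include <stdint.h>

// Sketch only: with the split layout, one load plus mask/shift gives the
// 32 quants in order, so no vzip1q_s8/vzip2q_s8 interleaving is required.
static void unpack_q4_split_neon(const uint8_t packed[16], uint8_t out[32]) {
    const uint8x16_t v    = vld1q_u8(packed);
    const uint8x16_t mask = vdupq_n_u8(0x0F);
    vst1q_u8(out,      vandq_u8(v, mask));  // weights 0..15 (low nibbles)
    vst1q_u8(out + 16, vshrq_n_u8(v, 4));   // weights 16..31 (high nibbles)
}
```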

@ggerganov ggerganov self-assigned this May 3, 2023
@ggerganov ggerganov moved this from Todo to In Progress in ggml : improve integer quantization May 3, 2023
@unbounded
Contributor

👍 agreed that vzip should be unnecessary.
I'll note here that #1073 does away with this shuffling.
It also uses slightly larger blocks, so shuffling can be avoided with vectors up to 512 bits wide.
