
Use different bit arrangement for quants (nibbles) #1241

Closed
ikawrakow opened this issue Apr 29, 2023 · 3 comments · Fixed by #1405
Labels: performance (Speed related topics)

Comments
@ikawrakow
Contributor

In the existing llama.cpp implementation, the quantization bits of consecutive model weights are packed together one after the other. E.g., for 4-bit quantization, the two 4-bit quants of consecutive weights are stored in a single uint8_t. The disadvantage of this approach is that when the data is used in dot products, or is de-quantized for matrix multiplications done via BLAS, and the operations are performed with SIMD instructions, one needs to shuffle the de-quantized bytes to get them into the correct order. These shuffle operations can be avoided by arranging the bits differently. For instance, for 4-bit quantization in blocks of 32 weights (Q4_0), one can store the quants of the first 16 weights in the low 4 bits of the 16 uint8_t's, and the quants of the second 16 weights of the block in the high 4 bits. The same or a similar strategy can also be applied to other block sizes or when using 2 bits per weight.

The performance gain is not earth-shattering: in a synthetic benchmark performing Q4_0_Q8_0 dot products I measured about a 10% speedup from avoiding the shuffle. Still, it is a trivial change, so why leave this low-hanging fruit hanging?
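To make the proposal concrete, here is a minimal C sketch of the two layouts for a Q4_0-style block of 32 weights; the helper names are hypothetical and not the actual llama.cpp functions:

```c
#include <stdint.h>

// Current layout: byte i holds the quants of weights 2*i (low nibble)
// and 2*i+1 (high nibble), i.e. consecutive weights are interleaved.
static void pack_q4_interleaved(const uint8_t q[32], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i) {
        out[i] = (uint8_t)((q[2*i] & 0x0F) | ((q[2*i + 1] & 0x0F) << 4));
    }
}

// Proposed layout: byte i holds weight i in its low nibble and weight i+16
// in its high nibble, so a SIMD "and 0x0F" / "shift right by 4" yields the
// first and second 16 weights already in order, with no byte shuffle.
static void pack_q4_split(const uint8_t q[32], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i) {
        out[i] = (uint8_t)((q[i] & 0x0F) | ((q[i + 16] & 0x0F) << 4));
    }
}
```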

@sw
Contributor

sw commented Apr 30, 2023

This could be done without breaking Q4_0/Q4_1 file compatibility, right? Just ensure you're doing Q8 in the right order.

Edit: Actually I'm not sure what you're referring to. The AVX implementation of quantize_row_q8_1 does some shuffling with _mm256_permutevar8x32_epi32; I guess that could be avoided if you kept Q4 in the same order. Or do you mean vzip1q_s8/vzip2q_s8 on ARM?
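For illustration only, a sketch of how an AVX2 unpack could look if the quants were stored in the split layout (this is not the existing quantize_row_q8_1 or dot-product code): the 16 packed bytes are loaded once, and the low/high nibbles come out already in weight order, so no permute is needed.

```c
#include <immintrin.h>
#include <stdint.h>

// Sketch only: unpack 32 4-bit quants stored in the "split" layout into one
// __m256i of bytes, with weights 0..31 already in order across the two lanes.
static __m256i unpack_q4_split_avx2(const uint8_t packed[16]) {
    const __m128i v    = _mm_loadu_si128((const __m128i *)packed);
    const __m256i mask = _mm256_set1_epi8(0x0F);
    // low lane: low nibbles = weights 0..15; high lane: high nibbles = weights 16..31
    const __m256i both = _mm256_set_m128i(_mm_srli_epi16(v, 4), v);
    return _mm256_and_si256(both, mask);
}
```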

@ggerganov
Owner

The vzip calls on ARM should go away - they're an unfortunate artifact of the non-optimal bit arrangement that we started with. I guess we can just update the Q8 quantize calls to achieve that and keep the Q4 formats intact, but then we will be shuffling during Q8 quantization. Not as bad as it is now, but still not perfect. Better not to shuffle anything at all.

I realize it will be a complete mess to make all models incompatible, but at the same time I think getting the best performance is always the highest priority.
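As a hedged NEON sketch (assuming the split layout described in the issue, not the actual ggml kernels), unpacking then needs only a mask and a shift, with no vzip:

```c
#include <arm_neon.h>
#include <stdint.h>

// Sketch only: with the split layout, one load plus mask/shift gives the
// 32 quants in order, so no vzip1q_s8/vzip2q_s8 interleaving is required.
static void unpack_q4_split_neon(const uint8_t packed[16], uint8_t out[32]) {
    const uint8x16_t v    = vld1q_u8(packed);
    const uint8x16_t mask = vdupq_n_u8(0x0F);
    vst1q_u8(out,      vandq_u8(v, mask));  // weights 0..15 (low nibbles)
    vst1q_u8(out + 16, vshrq_n_u8(v, 4));   // weights 16..31 (high nibbles)
}
```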

@ggerganov ggerganov self-assigned this May 3, 2023
@ggerganov ggerganov moved this from Todo to In Progress in ggml : improve integer quantization May 3, 2023
@unbounded
Contributor

👍 agreed that vzip should be unnecessary.
I'll note here that #1073 does away with this shuffling.
It also uses slightly larger blocks, so shuffling can be avoided with vectors up to 512 bits wide.
