The current Q4_0 uses a single F32 floating-point scaling factor.
An idea was proposed by @ikawrakow to change this to use 2x F16 factors instead of 1x F32: 679e1cb
Initial results indicate that this might be as accurate as Q4_1 and hopefully as fast as the current Q4_0.
The goal of this task is to implement this data format efficiently (quantization, dequantization, and dot product), measure the speed and perplexity, and decide whether it is viable. Depending on the results, we can consider updating the current Q4_0 data format and potentially dropping support for Q4_1.
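To make the proposal concrete, here is a scalar (non-SIMD) reference sketch of one plausible layout: a 32-element block split into two 16-element halves, each with its own F16 scale. The exact layout and rounding used in 679e1cb may differ; the function names and the `amax / 7` mapping here are illustrative assumptions, not the actual kernels.

```python
import numpy as np

QK = 32  # elements per quantization block, as in the existing Q4_0

def quantize_block(x):
    """Quantize one 32-element block to 4-bit values (0..15) with
    two F16 scales, one per 16-element half (illustrative layout)."""
    scales = np.empty(2, dtype=np.float16)
    q = np.empty(QK, dtype=np.uint8)
    for h in range(2):
        half = x[16 * h : 16 * (h + 1)]
        amax = float(np.abs(half).max())
        scales[h] = np.float16(amax / 7.0 if amax > 0 else 1.0)
        d = float(scales[h])  # use the stored F16 value for quantizing
        q[16 * h : 16 * (h + 1)] = np.clip(
            np.round(half / d) + 8, 0, 15
        ).astype(np.uint8)
    return scales, q

def dequantize_block(scales, q):
    """Reconstruct 32 F32 values from the 4-bit codes and two F16 scales."""
    y = np.empty(QK, dtype=np.float32)
    for h in range(2):
        y[16 * h : 16 * (h + 1)] = (
            q[16 * h : 16 * (h + 1)].astype(np.float32) - 8.0
        ) * np.float32(scales[h])
    return y

def dot_block(scales_a, qa, scales_b, qb):
    """Scalar dot product of two quantized blocks: integer products per
    half, each half weighted by the product of its two F16 scales."""
    acc = 0.0
    for h in range(2):
        ia = qa[16 * h : 16 * (h + 1)].astype(np.int32) - 8
        ib = qb[16 * h : 16 * (h + 1)].astype(np.int32) - 8
        acc += float(scales_a[h]) * float(scales_b[h]) * float(np.dot(ia, ib))
    return acc
```

The point of the second scale is visible in `dequantize_block`: an outlier in one half of the block no longer inflates the quantization step of the other half, while the storage cost (2x F16 vs 1x F32) stays the same.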
SIMD implementation progress

- [ ] ARM NEON
- [ ] AVX
- [ ] WASM
I plan to work on the ARM NEON implementation.
If you want to help with any of the implementations, open a PR with your implementation and results, summarizing the inference speed and the perplexity obtained.
This approach resulted in the new Q4_2 and Q4_3, which improve the perplexity results while maintaining inference speeds similar to the original Q4_0 and Q4_1 approaches.
The remaining bits and pieces needed to complete this task will be summarized, together with other things, in a separate issue.
F16_KV appears to have been removed here: ggerganov@af99c6f
This addresses two issues:
- ggerganov#995, which requests adding the KV cache offloading param
- ggerganov#1006, a NULL ptr exception when using the embeddings (introduced by leaving f16_kv in the fields struct)