The current Q4_0 uses a single F32 floating-point scaling factor.
An idea was proposed by @ikawrakow to change this to use 2x F16 factors instead of 1x F32: 679e1cb
Initial results indicate that this might be as accurate as Q4_1 and hopefully as fast as the current Q4_0.
The goal of this task is to implement this data format efficiently (quantization, dequantization, and dot product), measure the speed and perplexity, and decide whether it is viable. Depending on the results, we can consider updating the current Q4_0 data format and potentially dropping support for Q4_1.
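To make the proposal concrete, here is a scalar (non-SIMD) reference sketch of one plausible layout: a 32-element block split into two 16-element halves, each with its own F16 scale. The exact layout and rounding used in 679e1cb may differ; the function names and the `amax / 7` mapping here are illustrative assumptions, not the actual kernels.

```python
import numpy as np

QK = 32  # elements per quantization block, as in the existing Q4_0

def quantize_block(x):
    """Quantize one 32-element block to 4-bit values (0..15) with
    two F16 scales, one per 16-element half (illustrative layout)."""
    scales = np.empty(2, dtype=np.float16)
    q = np.empty(QK, dtype=np.uint8)
    for h in range(2):
        half = x[16 * h : 16 * (h + 1)]
        amax = float(np.abs(half).max())
        scales[h] = np.float16(amax / 7.0 if amax > 0 else 1.0)
        d = float(scales[h])  # use the stored F16 value for quantizing
        q[16 * h : 16 * (h + 1)] = np.clip(
            np.round(half / d) + 8, 0, 15
        ).astype(np.uint8)
    return scales, q

def dequantize_block(scales, q):
    """Reconstruct 32 F32 values from the 4-bit codes and two F16 scales."""
    y = np.empty(QK, dtype=np.float32)
    for h in range(2):
        y[16 * h : 16 * (h + 1)] = (
            q[16 * h : 16 * (h + 1)].astype(np.float32) - 8.0
        ) * np.float32(scales[h])
    return y

def dot_block(scales_a, qa, scales_b, qb):
    """Scalar dot product of two quantized blocks: integer products per
    half, each half weighted by the product of its two F16 scales."""
    acc = 0.0
    for h in range(2):
        ia = qa[16 * h : 16 * (h + 1)].astype(np.int32) - 8
        ib = qb[16 * h : 16 * (h + 1)].astype(np.int32) - 8
        acc += float(scales_a[h]) * float(scales_b[h]) * float(np.dot(ia, ib))
    return acc
```

The point of the second scale is visible in `dequantize_block`: an outlier in one half of the block no longer inflates the quantization step of the other half, while the storage cost (2x F16 vs 1x F32) stays the same.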
SIMD implementation progress

- [ ] ARM NEON
- [ ] AVX
- [ ] WASM
I plan to work on the ARM NEON implementation.
If you want to help with any of the implementations, open a PR with your implementation and results, summarizing the inference speed and the perplexity obtained.
This approach resulted in the new Q4_2 and Q4_3, which improve the perplexity results while maintaining inference speeds similar to the original Q4_0 and Q4_1 approaches.
The remaining bits and pieces needed to complete this task will be summarized, together with other things, in a separate issue.
F16_KV appears to have been removed here: ggerganov@af99c6f
This addresses two issues:
- ggerganov#995, which requests adding the KV cache offloading param
- ggerganov#1006, a NULL ptr exception when using the embeddings (introduced by leaving f16_kv in the fields struct)