
A faster version for Q4_1 x Q8_0 dot products #1083

Merged
ggerganov merged 2 commits into master from faster_q41_q80_dot_product
Apr 21, 2023

Conversation

ikawrakow
Contributor

The idea behind this change is that Q8_0 quantized values get used many times in the matrix multiplications where they are involved. In the current implementation, when we are evaluating the dot products, we need to compute the sum of the quants in the Q8_0 vector, so the same operation is repeated many times. This makes the Q4_1 * Q8_0 dot product significantly slower (by ~80%) than Q4_0 * Q8_0.

In this PR, the sum of the Q8_0 quants is computed during quantization and stored in the
now modified block_q8_0 struct. It is then reused in the subsequent dot products.
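
In block terms (a sketch of the algebra: d and m are the Q4_1 scale and min, d_y the Q8_0 scale, q_j and y_j the respective quants), a single-block dot product is

    sum_j (d*q_j + m) * (d_y*y_j) = d*d_y * sum_j q_j*y_j + m*d_y * sum_j y_j

The second term only depends on sum_j y_j, which does not change between dot products, so it can be computed once per Q8_0 block during quantization instead of on every call.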

In a synthetic benchmark (just compute a bunch of dot products, see q8dot.cpp), this change speeds up the Q4_1 * Q8_0 dot product by 80%, making the performance identical to Q4_0 * Q8_0.

In practical application, I see a ~15% gain in speed for token prediction on M2, and ~5% gain on Ryzen 7950X. The speed gain in the prompt evaluation is much bigger (around 50%).

I have only made the change for the scalar version, ARM_NEON, and AVX2, so we still need an AVX implementation.
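
For illustration, here is a minimal scalar sketch of the idea. The struct layouts and field names below (in particular the new s field holding the precomputed quant sum, and the nibble packing order) are assumptions chosen for readability, not the exact definitions used in this PR:

#include <stdint.h>

#define QK 32

typedef struct {
    float   d;          // scale
    float   m;          // min
    uint8_t qs[QK/2];   // 32 4-bit quants packed two per byte
} block_q4_1;

typedef struct {
    float  d;           // scale
    float  s;           // assumed: precomputed sum of the 32 quants (the new field)
    int8_t qs[QK];      // quants
} block_q8_0;

// Scalar sketch: a dequantized Q4_1 value is d*q + m, so per block
//   dot = d*d_y*sum(q*y) + m*d_y*sum(y), with sum(y) taken from y[i].s.
static float vec_dot_q4_1_q8_0_sketch(int n, const block_q4_1 * x, const block_q8_0 * y) {
    const int nb = n / QK;
    float sumf = 0.0f;
    for (int i = 0; i < nb; ++i) {
        int sumi = 0;
        for (int j = 0; j < QK/2; ++j) {
            const int v0 = x[i].qs[j] & 0x0F;   // low nibble  -> element 2*j
            const int v1 = x[i].qs[j] >> 4;     // high nibble -> element 2*j + 1
            sumi += v0*y[i].qs[2*j] + v1*y[i].qs[2*j + 1];
        }
        sumf += x[i].d*y[i].d*sumi + x[i].m*y[i].d*y[i].s;
    }
    return sumf;
}

Without the stored y[i].s, the inner loop would also have to accumulate the Q8_0 quant sum on every call, which is exactly the redundant work being removed.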

@ikawrakow ikawrakow requested a review from ggerganov April 20, 2023 17:01
@ikawrakow ikawrakow added the performance Speed related topics label Apr 20, 2023
@ggerganov
Owner

ggerganov commented Apr 20, 2023

Here are some results on M1 Pro:

Using 4 threads:

# command
make -j && ./main -m ./models/7B/ggml-model-q4_1.bin -p "I believe the meaning of life is" -c 2048 -n 512 -t 8 --ignore-eos -s 3 -n 64 -t 4

# master
llama_print_timings:      sample time =    56.76 ms /    64 runs   (    0.89 ms per run)
llama_print_timings: prompt eval time =   844.74 ms /     8 tokens (  105.59 ms per token)
llama_print_timings:        eval time =  5959.12 ms /    63 runs   (   94.59 ms per run)
llama_print_timings:       total time =  6870.81 ms

# faster_q41_q80_dot_product
llama_print_timings:      sample time =    46.55 ms /    64 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time =   547.04 ms /     8 tokens (   68.38 ms per token)
llama_print_timings:        eval time =  3842.15 ms /    63 runs   (   60.99 ms per run)
llama_print_timings:       total time =  4445.57 ms

Using 8 threads:

# command
make -j && ./main -m ./models/7B/ggml-model-q4_1.bin -p "I believe the meaning of life is" -c 2048 -n 512 -t 8 --ignore-eos -s 3 -n 64 -t 8

# master
llama_print_timings:      sample time =    56.65 ms /    64 runs   (    0.89 ms per run)
llama_print_timings: prompt eval time =   521.86 ms /     8 tokens (   65.23 ms per token)
llama_print_timings:        eval time =  3471.47 ms /    63 runs   (   55.10 ms per run)
llama_print_timings:       total time =  4060.30 ms

# faster_q41_q80_dot_product
llama_print_timings:      sample time =    46.56 ms /    64 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time =   362.70 ms /     8 tokens (   45.34 ms per token)
llama_print_timings:        eval time =  3416.20 ms /    63 runs   (   54.23 ms per run)
llama_print_timings:       total time =  3835.39 ms

At 4 threads the performance gain is much more pronounced, even for token eval time (94.59 → 60.99 ms per run, vs 55.10 → 54.23 ms per run at 8 threads).
The prompt eval time is indeed significantly faster; this is best measured with a large prompt and LLAMA_NO_ACCELERATE=1 to avoid offloading to the AMX coprocessor.

@ggerganov ggerganov added the high priority Very important issue label Apr 20, 2023
@sw
Contributor

sw commented Apr 20, 2023

Here on AVX2 / 4 cores this is looking good: master 232ms/token, your PR 223ms/token. Prompt eval seems to improve more, as you said, but I haven't looked at that closely.

But please clean up the commented-out code.

// There is not better way of doing this???

Horizontal sums really aren't what AVX is good at; I couldn't think of anything better. To anyone looking into this, here are two decent Stack Overflow threads:

@pubby

pubby commented Apr 21, 2023

Horizontal sums

This is slightly faster on my Alder Lake CPU. Dunno if it's faster in general.

// Horizontal sum of the 8 int32 lanes of an AVX2 vector.
static inline float horizontalSum(__m256i a) {
    // add each odd lane into the even lane below it (via a float-domain shuffle)
    __m256i b = _mm256_castps_si256(_mm256_movehdup_ps(_mm256_castsi256_ps(a)));
    __m256i sum = _mm256_add_epi32(a, b);
    // fold the upper 64 bits of each 128-bit half onto the lower 64 bits
    __m256i hi = _mm256_unpackhi_epi64(sum, sum);
    sum = _mm256_add_epi32(sum, hi);
    // lane 0 now holds the sum of the low half, lane 4 the sum of the high half
    return _mm256_cvtsi256_si32(sum) + _mm256_extract_epi32(sum, 4);
}
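
In case someone wants to evaluate it, a quick standalone sanity check (hypothetical, not part of the PR) is to feed known lanes and compare against the expected total; compile with -mavx2:

#include <immintrin.h>
#include <stdio.h>

// ... paste horizontalSum() from above ...

int main(void) {
    __m256i v = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8); // lanes sum to 36
    printf("hsum = %f (expected 36)\n", horizontalSum(v));
    return 0;
}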

@ikawrakow ikawrakow force-pushed the faster_q41_q80_dot_product branch from b51101a to c542d5a on April 21, 2023 12:45
Owner

@ggerganov ggerganov left a comment

I'll merge it like this.
If somebody evaluates the horizontal sum suggested in the TODO, please send a PR that keeps just the better approach and removes the alternative.

@ggerganov ggerganov merged commit 1bfc153 into master Apr 21, 2023
@ggerganov ggerganov deleted the faster_q41_q80_dot_product branch April 21, 2023 15:18