
A faster version for Q4_1 x Q8_0 dot products #1083

Merged
ggerganov merged 2 commits into master from faster_q41_q80_dot_product
Apr 21, 2023

Conversation

ikawrakow
Contributor

The idea behind this change is that Q8_0 quantized values get used many times in the matrix multiplications where they are involved. In the current implementation, when we are evaluating the dot products, we need to compute the sum of the quants in the Q8_0 vector, so the same operation is repeated many times. This makes the Q4_1 * Q8_0 dot product significantly slower (by ~80%) than Q4_0 * Q8_0.

In this PR, the sum of the Q8_0 quants is computed during quantization and stored in the
now modified block_q8_0 struct. It is then reused in the subsequent dot products.
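
In block terms (a sketch of the algebra: d and m are the Q4_1 scale and min, d_y the Q8_0 scale, q_j and y_j the respective quants), a single-block dot product is

    sum_j (d*q_j + m) * (d_y*y_j) = d*d_y * sum_j q_j*y_j + m*d_y * sum_j y_j

The second term only depends on sum_j y_j, which does not change between dot products, so it can be computed once per Q8_0 block during quantization instead of on every call.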

In a synthetic benchmark (just compute a bunch of dot products, see q8dot.cpp), this change speeds up the Q4_1 * Q8_0 dot product by 80%, making the performance identical to Q4_0 * Q8_0.

In practical application, I see a ~15% gain in speed for token prediction on M2, and ~5% gain on Ryzen 7950X. The speed gain in the prompt evaluation is much bigger (around 50%).

I have only made the change for the scalar version, ARM_NEON, and AVX2, so we still need an AVX implementation.
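
For illustration, here is a minimal scalar sketch of the idea. The struct layouts and field names below (in particular the new s field holding the precomputed quant sum, and the nibble packing order) are assumptions chosen for readability, not the exact definitions used in this PR:

#include <stdint.h>

#define QK 32

typedef struct {
    float   d;          // scale
    float   m;          // min
    uint8_t qs[QK/2];   // 32 4-bit quants packed two per byte
} block_q4_1;

typedef struct {
    float  d;           // scale
    float  s;           // assumed: precomputed sum of the 32 quants (the new field)
    int8_t qs[QK];      // quants
} block_q8_0;

// Scalar sketch: a dequantized Q4_1 value is d*q + m, so per block
//   dot = d*d_y*sum(q*y) + m*d_y*sum(y), with sum(y) taken from y[i].s.
static float vec_dot_q4_1_q8_0_sketch(int n, const block_q4_1 * x, const block_q8_0 * y) {
    const int nb = n / QK;
    float sumf = 0.0f;
    for (int i = 0; i < nb; ++i) {
        int sumi = 0;
        for (int j = 0; j < QK/2; ++j) {
            const int v0 = x[i].qs[j] & 0x0F;   // low nibble  -> element 2*j
            const int v1 = x[i].qs[j] >> 4;     // high nibble -> element 2*j + 1
            sumi += v0*y[i].qs[2*j] + v1*y[i].qs[2*j + 1];
        }
        sumf += x[i].d*y[i].d*sumi + x[i].m*y[i].d*y[i].s;
    }
    return sumf;
}

Without the stored y[i].s, the inner loop would also have to accumulate the Q8_0 quant sum on every call, which is exactly the redundant work being removed.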

@ikawrakow ikawrakow requested a review from ggerganov April 20, 2023 17:01
@ikawrakow ikawrakow added the performance Speed related topics label Apr 20, 2023
@ggerganov
Owner

ggerganov commented Apr 20, 2023

Here are some results on M1 Pro:

Using 4 threads:

# command
make -j && ./main -m ./models/7B/ggml-model-q4_1.bin -p "I believe the meaning of life is" -c 2048 -n 512 -t 8 --ignore-eos -s 3 -n 64 -t 4

# master
llama_print_timings:      sample time =    56.76 ms /    64 runs   (    0.89 ms per run)
llama_print_timings: prompt eval time =   844.74 ms /     8 tokens (  105.59 ms per token)
llama_print_timings:        eval time =  5959.12 ms /    63 runs   (   94.59 ms per run)
llama_print_timings:       total time =  6870.81 ms

# faster_q41_q80_dot_product
llama_print_timings:      sample time =    46.55 ms /    64 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time =   547.04 ms /     8 tokens (   68.38 ms per token)
llama_print_timings:        eval time =  3842.15 ms /    63 runs   (   60.99 ms per run)
llama_print_timings:       total time =  4445.57 ms

Using 8 threads:

# command
make -j && ./main -m ./models/7B/ggml-model-q4_1.bin -p "I believe the meaning of life is" -c 2048 -n 512 -t 8 --ignore-eos -s 3 -n 64 -t 8

# master
llama_print_timings:      sample time =    56.65 ms /    64 runs   (    0.89 ms per run)
llama_print_timings: prompt eval time =   521.86 ms /     8 tokens (   65.23 ms per token)
llama_print_timings:        eval time =  3471.47 ms /    63 runs   (   55.10 ms per run)
llama_print_timings:       total time =  4060.30 ms

# faster_q41_q80_dot_product
llama_print_timings:      sample time =    46.56 ms /    64 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time =   362.70 ms /     8 tokens (   45.34 ms per token)
llama_print_timings:        eval time =  3416.20 ms /    63 runs   (   54.23 ms per run)
llama_print_timings:       total time =  3835.39 ms

At 4 threads the performance gain is much more pronounced, even for token eval time (94.59 → 60.99 ms per run, vs 55.10 → 54.23 ms per run at 8 threads).
The prompt eval time is indeed significantly faster; this is best measured with a large prompt and LLAMA_NO_ACCELERATE=1 to avoid offloading to the AMX coprocessor.

@ggerganov ggerganov added the high priority Very important issue label Apr 20, 2023
@sw
Contributor

sw commented Apr 20, 2023

Here on AVX2 / 4 cores this is looking good: master 232ms/token, your PR 223ms/token. Prompt eval seems to improve more, as you said, but I haven't looked at that closely.

But please clean up the commented-out code.

// There is not better way of doing this???

Horizontal sums really aren't what AVX is good at; I couldn't think of anything better. To anyone looking into this, here are two decent Stack Overflow threads:

@pubby

pubby commented Apr 21, 2023

Horizontal sums

This is slightly faster on my Alder Lake CPU. Dunno if it's faster in general.

// Horizontal sum of the 8 int32 lanes of an AVX2 vector.
static inline float horizontalSum(__m256i a) {
    // add each odd lane into the even lane below it (via a float-domain shuffle)
    __m256i b = _mm256_castps_si256(_mm256_movehdup_ps(_mm256_castsi256_ps(a)));
    __m256i sum = _mm256_add_epi32(a, b);
    // fold the upper 64 bits of each 128-bit half onto the lower 64 bits
    __m256i hi = _mm256_unpackhi_epi64(sum, sum);
    sum = _mm256_add_epi32(sum, hi);
    // lane 0 now holds the sum of the low half, lane 4 the sum of the high half
    return _mm256_cvtsi256_si32(sum) + _mm256_extract_epi32(sum, 4);
}
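
In case someone wants to evaluate it, a quick standalone sanity check (hypothetical, not part of the PR) is to feed known lanes and compare against the expected total; compile with -mavx2:

#include <immintrin.h>
#include <stdio.h>

// ... paste horizontalSum() from above ...

int main(void) {
    __m256i v = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8); // lanes sum to 36
    printf("hsum = %f (expected 36)\n", horizontalSum(v));
    return 0;
}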

@ikawrakow ikawrakow force-pushed the faster_q41_q80_dot_product branch from b51101a to c542d5a on April 21, 2023 12:45
Owner

@ggerganov ggerganov left a comment

I'll merge it like this.
If somebody evaluates the horizontal sum suggested in the TODO, please send a PR that keeps just the better approach and removes the alternative.

@ggerganov ggerganov merged commit 1bfc153 into master Apr 21, 2023
@ggerganov ggerganov deleted the faster_q41_q80_dot_product branch April 21, 2023 15:18