A faster version for Q4_1 x Q8_0 dot products #1083
Conversation
Here are some results on M1 Pro:

Using 4 threads:

# command
make -j && ./main -m ./models/7B/ggml-model-q4_1.bin -p "I believe the meaning of life is" -c 2048 -n 512 -t 8 --ignore-eos -s 3 -n 64 -t 4
# master
llama_print_timings: sample time = 56.76 ms / 64 runs ( 0.89 ms per run)
llama_print_timings: prompt eval time = 844.74 ms / 8 tokens ( 105.59 ms per token)
llama_print_timings: eval time = 5959.12 ms / 63 runs ( 94.59 ms per run)
llama_print_timings: total time = 6870.81 ms
# faster_q41_q80_dot_product
llama_print_timings: sample time = 46.55 ms / 64 runs ( 0.73 ms per run)
llama_print_timings: prompt eval time = 547.04 ms / 8 tokens ( 68.38 ms per token)
llama_print_timings: eval time = 3842.15 ms / 63 runs ( 60.99 ms per run)
llama_print_timings: total time =       4445.57 ms

Using 8 threads:

# command
make -j && ./main -m ./models/7B/ggml-model-q4_1.bin -p "I believe the meaning of life is" -c 2048 -n 512 -t 8 --ignore-eos -s 3 -n 64 -t 8
# master
llama_print_timings: sample time = 56.65 ms / 64 runs ( 0.89 ms per run)
llama_print_timings: prompt eval time = 521.86 ms / 8 tokens ( 65.23 ms per token)
llama_print_timings: eval time = 3471.47 ms / 63 runs ( 55.10 ms per run)
llama_print_timings: total time = 4060.30 ms
# faster_q41_q80_dot_product
llama_print_timings: sample time = 46.56 ms / 64 runs ( 0.73 ms per run)
llama_print_timings: prompt eval time = 362.70 ms / 8 tokens ( 45.34 ms per token)
llama_print_timings: eval time = 3416.20 ms / 63 runs ( 54.23 ms per run)
llama_print_timings: total time =       3835.39 ms

At 4 threads the performance gain is much more pronounced, even for token eval time.
Here on AVX2 / 4 cores this is looking good: master 232ms/token, your PR 223ms/token. Prompt eval seems to improve more, as you said, but I haven't looked at that closely. But please clean up the commented-out code.
Horizontal sums really aren't what AVX is good at; I couldn't think of anything better. To anyone looking into this, here are the two decent stackoverflow threads:
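For anyone picking this up, the usual trick is to reduce within the 128-bit halves and cross lanes only once. A minimal sketch of one common pattern follows; the helper name is made up and this is not necessarily the exact code in the PR:

```c
#include <immintrin.h>

// Horizontal sum of the 8 floats in a __m256 (sketch; hypothetical helper name).
static inline float hsum_float_8(__m256 v) {
    __m128 lo  = _mm256_castps256_ps128(v);                  // lower 4 floats
    __m128 hi  = _mm256_extractf128_ps(v, 1);                // upper 4 floats
    __m128 sum = _mm_add_ps(lo, hi);                         // 4 partial sums
    sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum));          // 2 partial sums
    sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 0x55));   // final sum in lane 0
    return _mm_cvtss_f32(sum);
}
```

`_mm_hadd_ps` can fuse the last two steps, but on most cores the explicit shuffle+add sequence is at least as fast, which is roughly what those stackoverflow discussions conclude.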
This is slightly faster on my Alder Lake CPU. Dunno if it's faster in general.
The idea behind this is that `Q8_0` quantized values get used many times in the matrix multiplications where they are involved. In the current implementations, when we are evaluating the dot products, we need to compute the sum of the quants in the `Q8_0` vector, so the same operation is repeated many times.

Here we pre-compute the sum during `Q8_0` quantization, store it in the now modified `block_q8_0` struct, and then reuse this result in the subsequent dot products.

In a synthetic benchmark (just compute a bunch of dot products), this change speeds up the `Q4_1 * Q8_0` dot product by 80%, making the performance identical to `Q4_0 * Q8_0`. In practical application, I see a ~15% gain in speed for token prediction on M2, and a ~5% gain on Ryzen 7950X. The speed gain in the prompt evaluation is much bigger (around 50%).

I have only done the change for the scalar version, ARM_NEON, and AVX2, so we still need an AVX implementation.
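To make the arithmetic concrete, here is a minimal scalar sketch of the idea. The struct layouts, the field name `s`, and `QK = 32` are assumptions for illustration only; the real definitions are in `ggml.c`:

```c
#include <stdint.h>

#define QK 32  // elements per block (assumed; matches the 32-wide blocks used here)

// Q4_1 block: scale d, minimum m, 32 unsigned 4-bit quants packed two per byte,
// so each element is x_i = d*q_i + m.
typedef struct {
    float   d;
    float   m;
    uint8_t qs[QK / 2];
} block_q4_1;

// Q8_0 block extended with a pre-computed value s = d * sum(qs[i]).
// (Sketch only: the extra field and its name are assumptions, not the PR's literal struct.)
typedef struct {
    float  d;
    float  s;        // filled in once during quantization
    int8_t qs[QK];
} block_q8_0;

// One Q4_1 x Q8_0 block dot product, scalar sketch.
// sum_i (d4*q4_i + m4) * (d8*q8_i) = d4*d8*sum(q4_i*q8_i) + m4*(d8*sum(q8_i)).
// The last factor is exactly the pre-computed y->s, so it no longer has to be
// re-summed for every weight row that multiplies this activation block.
static float dot_q4_1_q8_0(const block_q4_1 * x, const block_q8_0 * y) {
    int sumi = 0;
    for (int i = 0; i < QK / 2; ++i) {
        const int v0 = x->qs[i] & 0x0F;  // low nibble
        const int v1 = x->qs[i] >> 4;    // high nibble
        sumi += v0 * y->qs[2*i] + v1 * y->qs[2*i + 1];
    }
    return x->d * y->d * sumi + x->m * y->s;
}
```

The `m4 * sum(q8_i)` term is what made `Q4_1` slower than `Q4_0`, and it is exactly the part the pre-computed sum removes from the inner loop.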
b51101a to c542d5a
I'll merge it like this.
If somebody evaluates the horizontal sum suggested in the TODO, please send a PR that keeps just the better approach and removes the alternative.
The idea behind this is that `Q8_0` quantized values get used many times in the matrix multiplications where they are involved. In the current implementations, when we are evaluating the dot products, we need to compute the sum of the quants in the `Q8_0` vector, so the same operation is repeated many times. This makes the `Q4_1 * Q8_0` dot product significantly slower than `Q4_0 * Q8_0` (~80%).

In the PR the sum of `Q8_0` quants is computed during quantization and stored in the now modified `block_q8_0` struct. It is then reused in the subsequent dot products.

In a synthetic benchmark (just compute a bunch of dot products, see `q8dot.cpp`), this change speeds up the `Q4_1 * Q8_0` dot product by 80%, making the performance identical to `Q4_0 * Q8_0`.

In practical application, I see a ~15% gain in speed for token prediction on M2, and a ~5% gain on Ryzen 7950X. The speed gain in the prompt evaluation is much bigger (around 50%).
I have only done the change for the scalar version, ARM_NEON, and AVX2, so we still need an AVX implementation.
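For completeness, here is a rough sketch of the producer side of that sum, continuing the struct sketch above. The function name is hypothetical; the real routine is `quantize_row_q8_0` in `ggml.c`:

```c
#include <math.h>

// Quantize k floats into k/QK blocks, computing each block's quant sum once.
// Sketch only, under the same assumed block_q8_0 layout as above.
static void quantize_row_q8_0_sketch(const float * x, block_q8_0 * y, int k) {
    const int nb = k / QK;                        // number of blocks
    for (int b = 0; b < nb; ++b) {
        float amax = 0.0f;                        // max |x| in this block
        for (int i = 0; i < QK; ++i) {
            const float v = fabsf(x[b*QK + i]);
            if (v > amax) amax = v;
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        int sum = 0;
        for (int i = 0; i < QK; ++i) {
            const int8_t q = (int8_t) roundf(x[b*QK + i] * id);
            y[b].qs[i] = q;
            sum += q;                             // the sum every dot product used to recompute
        }
        y[b].d = d;
        y[b].s = d * sum;                         // stored once, reused by every dot product
    }
}
```

Quantization gets marginally more expensive, but it runs once per activation row, while the dot product runs once per weight row, which is where the ~80% synthetic-benchmark speedup comes from.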