ggml : use 8-bit precision for Q4_1 intermediate results #1047
Conversation
As demonstrated by …, I think that AVX-512-wise it might be better to focus on the "default" quantized dot product function (e.g., the one used in the quantization method recommended in the README file for converting models), so that most users get the speedup and the maintenance burden is not too bad. Which quantization method do you think is more likely to become the new default? I think we now have …
Do you think that maybe this one can be dropped right now, in a separate PR, without waiting for SIMD implementations of …?
Is the suggestion to keep …? Because, at the moment, AVX-512 is not used, since …
I think it is likely that we will end up using …
Uh, no, the exact opposite actually. :) The suggestion is to remove …. After the removal of …
Thanks! So a …
Sorry - had to rebase on top of …. So, shall we merge it like this and proceed with …?
This is the AVX version, if you trust ChatGPT 😄 (need to add that into an #if defined of course)

```c
static void ggml_vec_dot_q4_1_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int nb = n / QK8_0;

    //assert(n % QK8_0 == 0);

    const block_q4_1 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    __m256 sum = _mm256_setzero_ps();

    for (int i = 0; i < nb; i++) {
        const __m256 d0 = _mm256_set1_ps(x[i].d);
        const __m256 m0 = _mm256_set1_ps(x[i].m);
        const __m256 d1 = _mm256_set1_ps(y[i].d);

        const uint8_t * restrict p0 = x[i].qs; // 16 bytes, two 4-bit quants per byte
        const  int8_t * restrict p1 = y[i].qs; // 32 signed 8-bit quants

        // process 8 bytes of p0 (16 quants) and 16 bytes of p1 per iteration
        for (int j = 0; j < QK8_0/16; j++) {
            const __m128i v0 = _mm_loadl_epi64((const __m128i *)(p0 + j*8));

            // split each byte into its low and high nibble, then interleave them back
            // into the original element order (low nibble = even element, high = odd)
            const __m128i vlo = _mm_and_si128(v0, _mm_set1_epi8(0x0f));
            const __m128i vhi = _mm_and_si128(_mm_srli_epi16(v0, 4), _mm_set1_epi8(0x0f));
            const __m128i vq  = _mm_unpacklo_epi8(vlo, vhi);

            // widen to 32-bit and convert to float (the 256-bit integer conversions need AVX2)
            const __m256 f0 = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(vq));                    // elements 16*j + 0..7
            const __m256 f1 = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(_mm_srli_si128(vq, 8))); // elements 16*j + 8..15

            const __m256 f2 = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(_mm_loadl_epi64((const __m128i *)(p1 + j*16 + 0))));
            const __m256 f3 = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(_mm_loadl_epi64((const __m128i *)(p1 + j*16 + 8))));

            // sum += (d0*q4 + m0) * (d1*q8)
            sum = _mm256_add_ps(sum, _mm256_mul_ps(_mm256_add_ps(_mm256_mul_ps(f0, d0), m0), _mm256_mul_ps(f2, d1)));
            sum = _mm256_add_ps(sum, _mm256_mul_ps(_mm256_add_ps(_mm256_mul_ps(f1, d0), m0), _mm256_mul_ps(f3, d1)));
        }
    }

    // horizontal sum of all 8 lanes (hadd only adds within 128-bit lanes)
    const __m128 t0 = _mm_add_ps(_mm256_castps256_ps128(sum), _mm256_extractf128_ps(sum, 1));
    const __m128 t1 = _mm_hadd_ps(t0, t0);
    const __m128 t2 = _mm_hadd_ps(t1, t1);

    *s = _mm_cvtss_f32(t2);
}
```

Godbolt for that: https://godbolt.org/z/TjK51b7vd
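For context, these are roughly the block layouts the snippet above operates on, as they were defined in ggml.c around the time of this PR (a sketch from memory; the exact constant names may differ):

```c
#include <stdint.h>

#define QK4_1 32   // quants per Q4_1 block
#define QK8_0 32   // quants per Q8_0 block

// Q4_1 weights: w = d*q + m, with 4-bit unsigned quants packed two per byte
typedef struct {
    float   d;              // scale
    float   m;              // minimum (offset)
    uint8_t qs[QK4_1 / 2];  // 32 x 4-bit quants
} block_q4_1;

// Q8_0 activations: a = d*q, with 8-bit signed quants
typedef struct {
    float  d;          // scale
    int8_t qs[QK8_0];  // 32 x 8-bit quants
} block_q8_0;
```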
This is the same as #951 but for Q4_1. Also, in this PR we will retire the old ggml_vec_dot_q4_0() and ggml_vec_dot_q4_1(), as they are no longer used. Please send PRs with AVX implementations into this branch.
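Ignoring SIMD, the new Q4_1 × Q8_0 path boils down to dequantizing the 4-bit weights with the block scale/offset and multiplying them by the 8-bit quantized activations. A minimal scalar sketch of that dot product, assuming the block layouts shown earlier and the interleaved low/high-nibble packing used at the time (illustrative, not the exact code from this PR):

```c
// q4_1 weight:     w = d0*q4 + m0   (q4 in 0..15)
// q8_0 activation: a = d1*q8        (q8 in -128..127)
static void vec_dot_q4_1_q8_0_scalar(const int n, float * s, const block_q4_1 * x, const block_q8_0 * y) {
    const int nb = n / QK8_0;

    float sumf = 0.0f;

    for (int i = 0; i < nb; i++) {
        const float d0 = x[i].d;
        const float m0 = x[i].m;
        const float d1 = y[i].d;

        const uint8_t * p0 = x[i].qs;
        const  int8_t * p1 = y[i].qs;

        for (int j = 0; j < QK8_0/2; j++) {
            const uint8_t v0 = p0[j];

            const float f0 = d0*(v0 & 0x0f) + m0;  // even element 2*j
            const float f1 = d0*(v0 >> 4)   + m0;  // odd  element 2*j + 1

            sumf += f0*(d1*p1[2*j + 0]) + f1*(d1*p1[2*j + 1]);
        }
    }

    *s = sumf;
}
```

The AVX discussion in the conversation above is about vectorizing exactly this loop.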
Will merge when we have:
Perplexity (655 iterations):
- Without BLAS: 6.1299
- With BLAS: 6.1286
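For reference, the perplexity reported by the llama.cpp perplexity tool is the exponentiated average negative log-likelihood over the evaluated tokens (each of the 655 iterations is one evaluation chunk of the test text):

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(t_i \mid t_{<i})\right)$$

The ~0.001 gap between the BLAS run (which dequantizes to F32 for the large matrix multiplications) and the non-BLAS run (which uses the quantized dot product) suggests the Q8_0 intermediate precision costs essentially nothing in quality.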