-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
simd support for fp16/bf16 #723
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cqy123456 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@cqy123456 🔍 Important: PR Classification Needed! For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:
For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”. Thanks for your efforts and contribution to the community!. |
3761010
to
9760f4e
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #723 +/- ##
=========================================
+ Coverage 0 71.97% +71.97%
=========================================
Files 0 70 +70
Lines 0 5160 +5160
=========================================
+ Hits 0 3714 +3714
- Misses 0 1446 +1446 |
any test to see the performance improvement before and after this code change ? |
/lgtm |
cmake/utils/compile_flags.cmake
Outdated
@@ -18,7 +18,7 @@ endif() | |||
set(CMAKE_CXX_FLAGS "-Wall -fPIC ${CMAKE_CXX_FLAGS}") | |||
|
|||
if(__X86_64) | |||
set(CMAKE_CXX_FLAGS "-msse4.2 ${CMAKE_CXX_FLAGS}") | |||
set(CMAKE_CXX_FLAGS "-mf16c -msse4.2 ${CMAKE_CXX_FLAGS}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
F16C
instruction set requires at least Intel Ivy Bridge CPU (more details at https://en.wikipedia.org/wiki/F16C). May it affect any clients? @liliu-z
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool cpu_support_f16c() { InstructionSet& instruction_set_inst = InstructionSet::GetInstance(); return (instruction_set_inst.F16C()); }
hook will check whether has f16c.
src/simd/distances_avx.cc
Outdated
@@ -55,6 +55,72 @@ fvec_inner_product_avx(const float* x, const float* y, size_t d) { | |||
} | |||
FAISS_PRAGMA_IMPRECISE_FUNCTION_END | |||
|
|||
float | |||
fp16_vec_inner_product_avx(const knowhere::fp16* x, const knowhere::fp16* y, size_t d) { | |||
__m256 m_res = _mm256_setzero_ps(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd call ones msum_0
and msum_1
for the consistency of the naming style, here and in all other functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/simd/distances_avx512.cc
Outdated
y += 16; | ||
d -= 16; | ||
} | ||
float sum = _mm512_reduce_add_ps(m512_res); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use masking.
while (d >= 32) {...}
if (d >= 16) {...}
if (d > 0) {
const __mmask16 mask = (1U << d) - 1U;
auto mx = _mm512_cvtph_ps(_mm256_maskz_load_epi16(mask, x));
auto my = _mm512_cvtph_ps(_mm256_maskz_load_epi16(mask, y));
mx = _mm512_sub_ps(mx, my);
m512_res = _mm512_fmadd_ps(mx, mx, m512_res);
}
return _mm512_reduce_add_ps(m512_res);
Please add -mavx512vl
to AVX512
CMake settings. It is safe to do, it won't affect our list of accepted CPU generations.
The comment applies to all proposed functions for AVX512.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/simd/distances_neon.cc
Outdated
float | ||
fp16_vec_inner_product_neon(const knowhere::fp16* x, const knowhere::fp16* y, size_t d) { | ||
float32x4x4_t res = { | ||
{{0.0f, 0.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 0.0f, 0.0f}}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{vdupq_n_f32(0.0f), vdupq_n_f32(0.0f), vdupq_n_f32(0.0f), vdupq_n_f32(0.0f)}
is shorter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/simd/distances_neon.cc
Outdated
@@ -81,6 +83,134 @@ fvec_inner_product_neon(const float* x, const float* y, size_t d) { | |||
return vaddvq_f32(sum_); | |||
} | |||
|
|||
float | |||
fp16_vec_inner_product_neon(const knowhere::fp16* x, const knowhere::fp16* y, size_t d) { | |||
float32x4x4_t res = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you have time, please check whether the following function is faster than yours. At least, I see that clang
produced a small and reliable code and gcc
produces some meaningful code as well.
FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
float
fp16_vec_inner_product_neon(const knowhere::fp16* x_in, const knowhere::fp16* y_in, size_t d) {
const __fp16* x = reinterpret_cast<const __fp16*>(x_in);
const __fp16* y = reinterpret_cast<const __fp16*>(y_in);
float sum = 0;
FAISS_PRAGMA_IMPRECISE_LOOP
for (size_t i = 0; i < d; i++) {
sum += (float)(x[i]) * (float)(y[i]);
}
return sum;
}
FAISS_PRAGMA_IMPRECISE_FUNCTION_END
If not, then the code is fine.
You may wish to try the same trust the compiler
approach for all other fp16
-based functions for NEON.
This won't work for bf16 unfortunately :( because of clang crashes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i use random data(dim = 768, nb = 100000, nq = 100) to search hnsw index:
main branch takes : 64ms;
this branch takes : 13ms;
trust the compiler takes: 18ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, maybe gcc is not that smart enough yet :)
|
||
TEST_CASE("Test fp16 distance", "[fp16]") { | ||
using Catch::Approx; | ||
auto dim = GENERATE(as<size_t>{}, 1, 2, 10, 69, 128, 141, 510, 1024); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd test 1,2,4,5,10,13,21,29
as well
Testing high dimensionalities, larger than 96 or 128 does not make sense to me, because the code contains dimensionality granularities and if-then-else / while constructions for like values 32, 16, 8, 4.
src/simd/distances_avx.h
Outdated
@@ -12,6 +12,8 @@ | |||
#ifndef DISTANCES_AVX_H | |||
#define DISTANCES_AVX_H | |||
|
|||
#include <knowhere/operands.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use #include ""
instead of #include <>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/simd/distances_neon.h
Outdated
@@ -12,6 +12,8 @@ | |||
#ifndef DISTANCES_NEON_H | |||
#define DISTANCES_NEON_H | |||
|
|||
#include <knowhere/operands.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
@@ -55,6 +55,72 @@ fvec_inner_product_avx(const float* x, const float* y, size_t d) { | |||
} | |||
FAISS_PRAGMA_IMPRECISE_FUNCTION_END | |||
|
|||
float | |||
fp16_vec_inner_product_avx(const knowhere::fp16* x, const knowhere::fp16* y, size_t d) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems no need to add knowhere::
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simd dir still in faiss namespace
/lgtm |
Please also change |
Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
/lgtm |
issue: #287
some test result with vdbench:(768d-1m,hnsw)
fp32:
fp16:
bf16:
fp16 before opt: