
simd support for fp16/bf16 #723

Merged · 1 commit · Jul 25, 2024
Conversation

cqy123456
Collaborator

@cqy123456 cqy123456 commented Jul 23, 2024

issue: #287
Some test results with vdbench (768d, 1M, HNSW):
fp32: [benchmark screenshot]
fp16: [benchmark screenshot]
bf16: [benchmark screenshot]
fp16 before optimization: [benchmark screenshot]

@sre-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cqy123456

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


mergify bot commented Jul 23, 2024

@cqy123456 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!


codecov bot commented Jul 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.97%. Comparing base (3c46f4c) to head (0f3d957).
Report is 128 commits behind head on main.

@@            Coverage Diff            @@
##           main     #723       +/-   ##
=========================================
+ Coverage      0   71.97%   +71.97%     
=========================================
  Files         0       70       +70     
  Lines         0     5160     +5160     
=========================================
+ Hits          0     3714     +3714     
- Misses        0     1446     +1446     

see 70 files with indirect coverage changes

@cydrain
Collaborator

cydrain commented Jul 23, 2024

Any test to show the performance improvement before and after this code change?

@foxspy
Collaborator

foxspy commented Jul 23, 2024

/lgtm

@@ -18,7 +18,7 @@ endif()
set(CMAKE_CXX_FLAGS "-Wall -fPIC ${CMAKE_CXX_FLAGS}")

if(__X86_64)
-set(CMAKE_CXX_FLAGS "-msse4.2 ${CMAKE_CXX_FLAGS}")
+set(CMAKE_CXX_FLAGS "-mf16c -msse4.2 ${CMAKE_CXX_FLAGS}")
Collaborator

The F16C instruction set requires at least an Intel Ivy Bridge CPU (more details at https://en.wikipedia.org/wiki/F16C). Could it affect any clients? @liliu-z

Collaborator Author

bool cpu_support_f16c() {
    InstructionSet& instruction_set_inst = InstructionSet::GetInstance();
    return (instruction_set_inst.F16C());
}

The hook will check whether the CPU has F16C.
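As a rough sketch of how such a hook selects an implementation at runtime (names and the stubbed predicate below are hypothetical; the real code queries CPUID through knowhere's InstructionSet singleton):

```cpp
#include <cstddef>

// Hypothetical sketch of runtime dispatch. The real hook queries the CPUID
// F16C bit via knowhere's InstructionSet; here the predicate is stubbed out.
static bool cpu_support_f16c() {
    return false;  // stand-in; real code checks the CPUID F16C feature bit
}

// Portable scalar fallback.
static float ip_ref(const float* x, const float* y, size_t d) {
    float sum = 0.0f;
    for (size_t i = 0; i < d; ++i) sum += x[i] * y[i];
    return sum;
}

// Placeholder for the F16C-accelerated path (would use _mm256_cvtph_ps etc.).
static float ip_f16c(const float* x, const float* y, size_t d) {
    return ip_ref(x, y, d);
}

using ip_fn = float (*)(const float*, const float*, size_t);

// Pick an implementation once, at hook-initialization time, so the per-call
// cost is a single indirect call with no repeated feature checks.
static ip_fn select_inner_product() {
    return cpu_support_f16c() ? ip_f16c : ip_ref;
}
```

The point of the pattern is that the feature check runs once at startup, not per distance computation.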

@@ -55,6 +55,72 @@ fvec_inner_product_avx(const float* x, const float* y, size_t d) {
}
FAISS_PRAGMA_IMPRECISE_FUNCTION_END

float
fp16_vec_inner_product_avx(const knowhere::fp16* x, const knowhere::fp16* y, size_t d) {
__m256 m_res = _mm256_setzero_ps();
Collaborator

I'd call these msum_0 and msum_1 for consistency of the naming style, here and in all other functions.

Collaborator Author

updated

y += 16;
d -= 16;
}
float sum = _mm512_reduce_add_ps(m512_res);
Collaborator

Use masking.

while (d >= 32) {...}

if (d >= 16) {...}

if (d > 0) {
    const __mmask16 mask = (1U << d) - 1U;
    auto mx = _mm512_cvtph_ps(_mm256_maskz_loadu_epi16(mask, x));
    auto my = _mm512_cvtph_ps(_mm256_maskz_loadu_epi16(mask, y));
    mx = _mm512_sub_ps(mx, my);
    m512_res = _mm512_fmadd_ps(mx, mx, m512_res);
}

return _mm512_reduce_add_ps(m512_res);

Please add -mavx512vl to the AVX512 CMake settings. It is safe to do; it won't affect our list of accepted CPU generations.

The comment applies to all proposed functions for AVX512.
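As a plain-C++ model (hypothetical, not the PR's code) of what the masked tail does: `(1U << d) - 1U` sets the low d bits, and a maskz load zeroes every lane whose mask bit is clear, so elements past d never enter the accumulator and no out-of-bounds lanes are read:

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of AVX512 masked-tail handling for d < 16 remaining elements.
// Build a mask with the low d bits set; d < 16 keeps the shift well-defined.
static uint16_t tail_mask(size_t d) {
    return static_cast<uint16_t>((1U << d) - 1U);
}

// Scalar emulation of a maskz load into a 16-lane register: lanes whose mask
// bit is clear read as 0.0f, so they do not perturb a subsequent reduction.
static void maskz_load(const float* src, size_t d, float out[16]) {
    const uint16_t mask = tail_mask(d);
    for (size_t lane = 0; lane < 16; ++lane) {
        out[lane] = ((mask >> lane) & 1U) ? src[lane] : 0.0f;
    }
}
```

Because the mask bit is set only for lanes below d, the hardware masked load touches only the first d elements, which is what makes the tail handling safe at the end of a buffer.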

Collaborator Author

updated

float
fp16_vec_inner_product_neon(const knowhere::fp16* x, const knowhere::fp16* y, size_t d) {
float32x4x4_t res = {
{{0.0f, 0.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 0.0f, 0.0f}}};
Collaborator

{vdupq_n_f32(0.0f), vdupq_n_f32(0.0f), vdupq_n_f32(0.0f), vdupq_n_f32(0.0f)} is shorter

Collaborator Author

updated

@@ -81,6 +83,134 @@ fvec_inner_product_neon(const float* x, const float* y, size_t d) {
return vaddvq_f32(sum_);
}

float
fp16_vec_inner_product_neon(const knowhere::fp16* x, const knowhere::fp16* y, size_t d) {
float32x4x4_t res = {
Collaborator

If you have time, please check whether the following function is faster than yours. At least, I see that clang produces small and reliable code, and gcc produces some meaningful code as well.

FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
float
fp16_vec_inner_product_neon(const knowhere::fp16* x_in, const knowhere::fp16* y_in, size_t d) {
    const __fp16* x = reinterpret_cast<const __fp16*>(x_in);
    const __fp16* y = reinterpret_cast<const __fp16*>(y_in);

    float sum = 0;
    FAISS_PRAGMA_IMPRECISE_LOOP
    for (size_t i = 0; i < d; i++) {
        sum += (float)(x[i]) * (float)(y[i]);
    }

    return sum;
} 
FAISS_PRAGMA_IMPRECISE_FUNCTION_END

If not, then the code is fine.
You may wish to try the same "trust the compiler" approach for all other fp16-based functions for NEON.

Unfortunately, this won't work for bf16 :( because clang crashes.
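For background on why bf16 is a different case (a scalar sketch, not code from this PR): bf16 is just the high 16 bits of an IEEE float32, so conversion is a shift plus rounding, with no exponent rebias as in fp16:

```cpp
#include <cstdint>
#include <cstring>

// Sketch: bf16 keeps the sign, the full 8-bit exponent, and the top 7
// mantissa bits of a float32, so round-to-nearest-even truncation suffices.
static uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u);  // round to nearest, ties to even
    return static_cast<uint16_t>(bits >> 16);
}

// Widening back is exact: append 16 zero mantissa bits.
static float bf16_to_float(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

This is why bf16 keeps the float32 dynamic range while fp16 trades range (5 exponent bits) for precision (10 mantissa bits).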

Collaborator Author

I used random data (dim = 768, nb = 100000, nq = 100) to search an HNSW index:
main branch takes 64 ms;
this branch takes 13 ms;
"trust the compiler" takes 18 ms.
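For reference, the semantics all of these variants must agree on can be pinned down with a portable software decode of IEEE binary16 (a sketch independent of __fp16 and NEON, not this PR's code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Portable decode of an IEEE binary16 bit pattern to float32.
// Handles normals, subnormals, signed zero, and inf/NaN.
static float fp16_to_float(uint16_t h) {
    const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;  // signed zero
        } else {
            // Subnormal: shift until the implicit leading 1 appears at bit 10.
            int shift = 0;
            while ((man & 0x400u) == 0) { man <<= 1; ++shift; }
            man &= 0x3FFu;  // drop the now-implicit leading 1
            // Biased float32 exponent: 127 - 15 - shift + 1 = 113 - shift.
            bits = sign | (static_cast<uint32_t>(113 - shift) << 23) | (man << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);  // inf / NaN
    } else {
        bits = sign | ((exp - 15u + 127u) << 23) | (man << 13);  // normal
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Scalar reference inner product over raw fp16 bit patterns.
static float fp16_ip_ref(const uint16_t* x, const uint16_t* y, size_t d) {
    float sum = 0.0f;
    for (size_t i = 0; i < d; ++i) {
        sum += fp16_to_float(x[i]) * fp16_to_float(y[i]);
    }
    return sum;
}
```

A scalar oracle like this is handy for unit-testing the NEON/AVX paths against exact expected values.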

Collaborator

Well, maybe gcc is not smart enough yet :)


TEST_CASE("Test fp16 distance", "[fp16]") {
using Catch::Approx;
auto dim = GENERATE(as<size_t>{}, 1, 2, 10, 69, 128, 141, 510, 1024);
Collaborator

I'd test 1, 2, 4, 5, 10, 13, 21, 29 as well.
Testing dimensionalities larger than 96 or 128 does not make sense to me, because the code branches on dimensionality granularities (if-then-else / while constructs for values like 32, 16, 8, 4).

@@ -12,6 +12,8 @@
#ifndef DISTANCES_AVX_H
#define DISTANCES_AVX_H

#include <knowhere/operands.h>
Collaborator

Use #include "" instead of #include <>.

Collaborator Author

updated

@@ -12,6 +12,8 @@
#ifndef DISTANCES_NEON_H
#define DISTANCES_NEON_H

#include <knowhere/operands.h>
Collaborator

same

Collaborator Author

updated

@@ -55,6 +55,72 @@ fvec_inner_product_avx(const float* x, const float* y, size_t d) {
}
FAISS_PRAGMA_IMPRECISE_FUNCTION_END

float
fp16_vec_inner_product_avx(const knowhere::fp16* x, const knowhere::fp16* y, size_t d) {
Collaborator

There seems to be no need to add knowhere:: here.

Collaborator Author

The simd dir is still in the faiss namespace.

@alexanderguzhva
Collaborator

/lgtm

@alexanderguzhva
Collaborator

Please also change _mm256_load_si256 into _mm256_loadu_si256
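Background on that suggestion (illustrative only, not the PR's code): `_mm256_load_si256` requires its pointer to be 32-byte aligned and may fault otherwise, while `_mm256_loadu_si256` accepts any address. The portable scalar idiom for an unaligned read is memcpy, which compilers lower to an unaligned load where the hardware allows it:

```cpp
#include <cstdint>
#include <cstring>

// Portable illustration of an unaligned load: read a 32-bit value from a
// possibly misaligned address via memcpy. A direct pointer cast and
// dereference (the moral equivalent of using _mm256_load_si256 on an
// unaligned address) would be undefined behavior.
static uint32_t load_u32_unaligned(const unsigned char* p) {
    uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}
```

Vector data handed to these distance kernels generally comes from arbitrary offsets inside larger buffers, so the unaligned-load variants are the safe default.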

@mergify mergify bot added the ci-passed label Jul 25, 2024
Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
@alexanderguzhva
Collaborator

/lgtm

6 participants