Translate x86_64 SSE to ppc64le VSX intrinsics #4807

JeremyRand · 2023-06-16T03:05:01Z

Yields a large speedup on POWER9 (see the benchmarks below). See this article for background: https://www.talospace.com/2019/07/easier-power-vectorizing-for-fun-and.html

Benchmarks (all done with -DNCNN_ENABLE_LTO=ON on a Talos II Workstation with 2x 18-core POWER9 CPUs):

Before this PR:

loop_count = 100
num_threads = 18
powersave = 0
gpu_device = -1
cooling_down = 1
          squeezenet  min =   61.09  max =  196.51  avg =   80.10
     squeezenet_int8  min =   43.01  max =  171.22  avg =   54.94
           mobilenet  min =  111.34  max =  280.33  avg =  123.60
      mobilenet_int8  min =   69.10  max =  172.20  avg =   78.12
        mobilenet_v2  min =   64.53  max =  228.19  avg =   79.75
        mobilenet_v3  min =   53.66  max =  196.57  avg =   65.48
          shufflenet  min =   30.34  max =  154.02  avg =   39.36
       shufflenet_v2  min =   31.82  max =  104.86  avg =   35.23
             mnasnet  min =   62.93  max =  159.05  avg =   70.84
     proxylessnasnet  min =   66.05  max =  173.80  avg =   76.22
     efficientnet_b0  min =   85.06  max =  260.51  avg =   96.11
   efficientnetv2_b0  min =  118.04  max =  337.97  avg =  138.86
        regnety_400m  min =   81.64  max =  280.86  avg =   94.12
           blazeface  min =    9.24  max =   51.09  avg =   11.35
           googlenet  min =  180.48  max =  411.27  avg =  209.57
      googlenet_int8  min =  134.68  max =  304.84  avg =  156.26
            resnet18  min =  159.92  max =  388.47  avg =  199.41
       resnet18_int8  min =  131.32  max =  329.29  avg =  175.57
             alexnet  min =   50.99  max =  147.26  avg =   63.51
               vgg16  min = 1567.90  max = 2049.77  avg = 1801.80
          vgg16_int8  min = 1139.54  max = 1904.75  avg = 1397.06
            resnet50  min =  555.13  max = 1108.02  avg =  644.54
       resnet50_int8  min =  373.25  max =  812.87  avg =  455.61
      squeezenet_ssd  min =  138.37  max =  390.43  avg =  228.91
 squeezenet_ssd_int8  min =  100.01  max =  266.46  avg =  150.76
       mobilenet_ssd  min =  226.49  max =  482.65  avg =  282.94
  mobilenet_ssd_int8  min =  142.00  max =  363.09  avg =  176.94
      mobilenet_yolo  min =  565.40  max =  923.64  avg =  623.31
  mobilenetv2_yolov3  min =  238.50  max =  578.40  avg =  331.47
         yolov4-tiny  min =  405.91  max =  664.19  avg =  478.14
           nanodet_m  min =   74.71  max =  242.68  avg =   85.00
    yolo-fastest-1.1  min =   39.97  max =  158.29  avg =   52.69
      yolo-fastestv2  min =   25.37  max =   67.04  avg =   31.72
  vision_transformer  min =  410.63  max =  630.04  avg =  510.37
          FastestDet  min =   29.12  max =  128.42  avg =   32.39

With this PR applied:

loop_count = 100
num_threads = 18
powersave = 0
gpu_device = -1
cooling_down = 1
          squeezenet  min =    6.09  max =   19.67  avg =    7.89
     squeezenet_int8  min =    6.26  max =    9.30  avg =    6.76
           mobilenet  min =   12.13  max =   30.03  avg =   13.74
      mobilenet_int8  min =    8.63  max =   21.62  avg =   10.80
        mobilenet_v2  min =    8.16  max =   95.63  avg =   11.02
        mobilenet_v3  min =    7.48  max =   11.15  avg =    7.68
          shufflenet  min =    8.26  max =   10.69  avg =    8.82
       shufflenet_v2  min =    6.55  max =    9.72  avg =    7.04
             mnasnet  min =    7.87  max =   68.94  avg =   10.80
     proxylessnasnet  min =    9.07  max =  113.80  avg =   11.84
     efficientnet_b0  min =   14.23  max =  106.71  avg =   19.11
   efficientnetv2_b0  min =   17.91  max =  123.81  avg =   20.61
        regnety_400m  min =   27.20  max =  134.10  avg =   33.29
           blazeface  min =    3.51  max =    5.31  avg =    3.80
           googlenet  min =   22.16  max =  121.97  avg =   25.90
      googlenet_int8  min =   20.46  max =   58.96  avg =   23.61
            resnet18  min =   17.63  max =   50.29  avg =   20.71
       resnet18_int8  min =   12.89  max =   36.11  avg =   13.93
             alexnet  min =   14.22  max =   39.14  avg =   16.91
               vgg16  min =  112.73  max =  221.44  avg =  160.76
          vgg16_int8  min =   44.35  max =  137.12  avg =   50.38
            resnet50  min =   47.34  max =  108.54  avg =   50.78
       resnet50_int8  min =   30.51  max =   44.48  avg =   31.25
      squeezenet_ssd  min =   19.22  max =  117.90  avg =   23.39
 squeezenet_ssd_int8  min =   18.22  max =   26.81  avg =   19.09
       mobilenet_ssd  min =   24.32  max =  136.99  avg =   29.31
  mobilenet_ssd_int8  min =   18.72  max =   53.61  avg =   21.54
      mobilenet_yolo  min =   78.38  max =  214.07  avg =   93.26
  mobilenetv2_yolov3  min =   29.38  max =  138.79  avg =   42.11
         yolov4-tiny  min =   45.67  max =  137.23  avg =   62.10
           nanodet_m  min =   14.41  max =   29.52  avg =   15.16
    yolo-fastest-1.1  min =   10.85  max =   13.64  avg =   11.00
      yolo-fastestv2  min =    9.55  max =   14.39  avg =   10.05
  vision_transformer  min =  396.60  max =  598.17  avg =  446.76
          FastestDet  min =    9.57  max =   14.16  avg =   10.11

(I think this definitely takes the cake for "most speedup per line of code" of any patch I've written. :) )
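
To read the two tables side by side, the speedup for a model is its "before" average divided by its "after" average. The sketch below does that for a handful of rows copied from the output above; `BenchRow`, `speedup` and `print_speedups` are names made up for this illustration and are not part of ncnn or benchncnn.

```cpp
#include <cstdio>

// One benchmark row: the "avg" column before and after the PR,
// copied from the benchncnn output above (same units as that output).
struct BenchRow
{
    const char* model;
    double avg_before;
    double avg_after;
};

// Speedup factor: how many times faster the "after" run is on average.
static double speedup(const BenchRow& row)
{
    return row.avg_before / row.avg_after;
}

// A few rows picked from the tables above; extend as needed.
static const BenchRow kRows[] = {
    {"squeezenet", 80.10, 7.89},
    {"vgg16_int8", 1397.06, 50.38},
    {"blazeface", 11.35, 3.80},
    {"vision_transformer", 510.37, 446.76},
};

// Prints one "model  N.NNx" line per row.
static void print_speedups()
{
    for (const BenchRow& row : kRows)
        std::printf("%20s  %.2fx\n", row.model, speedup(row));
}
```

For the rows chosen here, squeezenet comes out at roughly 10x and vgg16_int8 at roughly 28x, while vision_transformer barely moves (about 1.1x).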

tencent-adm · 2023-06-16T03:05:15Z

Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Jeremy Rand seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

codecov-commenter · 2023-06-16T05:00:14Z

Codecov Report

Merging #4807 (ad5bf0e) into master (4b97730) will decrease coverage by 5.15%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master    #4807       +/-   ##
===========================================
- Coverage   94.90%   89.75%    -5.15%     
===========================================
  Files         779      309      -470     
  Lines      223166    84266   -138900     
===========================================
- Hits       211795    75637   -136158     
+ Misses      11371     8629     -2742

see 648 files with indirect coverage changes

JeremyRand · 2023-06-16T07:43:00Z

The GCC build failure on CI is interesting. Maybe it's an artifact of an older GCC version than the one I tested with? I'm curious what you'd recommend I do to avoid this. I guess I could test whether that function is available as part of the CMake step, and only enable SSE-to-VSX translation if it is. Let me know if that's a good approach or if you'd prefer some other workaround.

JeremyRand · 2023-06-17T04:19:19Z

It looks like VSX translation of _mm_packus_epi32 was added in GCC v12.1.0. The CI job uses Ubuntu 20.04, which packages GCC 10, so that explains the failure. _mm_packus_epi32 is part of SSE4.1, so I think I can just disable SSE4.1 if the compiler is too old, and leave the other optimizations in place. I'll see if I can push a fix in the next few days.
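
The version condition described here can also be written down at the source level. The following is only a sketch of that condition: `version_at_least` and `kGccTranslatesPackusEpi32` are hypothetical names, not ncnn symbols, and the 12.1 threshold is taken from the comment above rather than verified independently. It does not show how the PR itself gates SSE4.1.

```cpp
// True when version (major, minor) is at least (want_major, want_minor).
constexpr bool version_at_least(int major, int minor, int want_major, int want_minor)
{
    return major > want_major || (major == want_major && minor >= want_minor);
}

#if defined(__GNUC__) && !defined(__clang__)
// Real GCC: compare the compiler's own version against 12.1, the release
// that (per the comment above) gained VSX translation of _mm_packus_epi32.
constexpr bool kGccTranslatesPackusEpi32 =
    version_at_least(__GNUC__, __GNUC_MINOR__, 12, 1);
#else
// Clang also defines __GNUC__, so it is excluded above. Clang and other
// compilers would need their own check, which this sketch does not cover.
constexpr bool kGccTranslatesPackusEpi32 = false;
#endif
```

With this helper, a GCC 10 toolchain such as the one on the CI job yields `false`, so the SSE4.1 path would be skipped while the remaining SSE translation stays enabled.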

(Feel free to review the rest of this PR in parallel though.)

JeremyRand · 2023-06-17T16:12:35Z

The linux-aarch64 CI failures look unrelated to this PR, if I'm not mistaken.

JeremyRand · 2023-06-17T17:13:54Z

 14/110 Test  #15: test_binaryop_3 ..................***Failed   12.28 sec
value not match  at c:7 d:2 h:3 w:1    expect 3.141593 but got -3.141593
output blob 0 not match
test_layer_cpu failed
test_layer BinaryOp failed use_packing_layout=0 use_fp16_packed=0 use_fp16_storage=0 use_fp16_arithmetic=0 use_shader_pack8=0 use_bf16_storage=0 use_image_storage=0 use_sgemm_convolution=1 use_winograd_convolution=1
test_binaryop failed a.dims=4 a=(2 7 3 31) b.dims=4 b=(2 7 3 31) op_type=11
CMake Error at /home/runner/work/ncnn/ncnn/cmake/run_test.cmake:4 (message):
  Test failed with return value '1'

Is this an actual bug, or just a quirk of VSX returning a different representation of the same value? I suspect the latter, but I'm not familiar enough with what that test is doing to be certain.

JeremyRand · 2023-06-19T09:23:44Z

> Is this an actual bug, or just a quirk of VSX returning a different representation of the same value? I suspect the latter, but I'm not familiar enough with what that test is doing to be certain.

op_type=11 is OPERATION_RATAN. The atan function returns an angle, and PI radians and -PI radians represent the same angle. So it sounds like the VSX behavior is fine, and the tests should be modified to allow this. Thoughts?
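
One way a test could tolerate this is to compare angles modulo a full turn instead of comparing the raw floats. The helper below is a sketch with a made-up name (`same_angle`); it is not how ncnn's test suite compares values, and the fix referenced later in this thread may have taken a different approach.

```cpp
#include <cmath>

// Hypothetical helper: true when two angles in radians denote the same
// direction, i.e. they differ by (approximately) a whole multiple of 2*pi.
static bool same_angle(float a, float b, float eps = 1e-4f)
{
    const float two_pi = 6.28318530717958647692f;
    // std::remainder folds the difference into [-pi, pi], so a difference
    // of one or more full turns ends up close to zero.
    const float d = std::remainder(a - b, two_pi);
    return std::fabs(d) <= eps;
}
```

In standard C++, `std::atan2(0.0f, -1.0f)` and `std::atan2(-0.0f, -1.0f)` already show this kind of sign flip, since the result on the negative x axis follows the sign of the zero `y` argument. Whether that is the cause of the mismatch in this test log is not established here.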

nihui · 2023-06-19T12:01:07Z

> Is this an actual bug, or just a quirk of VSX returning a different representation of the same value? I suspect the latter, but I'm not familiar enough with what that test is doing to be certain.
>
> op_type=11 is OPERATION_RATAN. The atan function returns an angle; PI radians and -PI radians are the same thing. So this sounds like the VSX behavior is fine, and the tests should be modified to allow this. Thoughts?

binaryop test fixed in 9022b71

nihui · 2023-06-21T07:25:46Z

Using x86-compatible intrinsics to get good performance on other architectures is also common practice in WebAssembly. It is great to see similarly exciting results on the Power architecture 👍

I noticed that you added quite a few hacks to the CMakeLists files, especially the modification to ncnn_add_layer.

I think a better way is to create a dedicated CMake toolchain file, such as powerpc64le-linux-gnu-vsx.toolchain.cmake, that declares CMAKE_SYSTEM_PROCESSOR as x86_64 to trick ncnn's architecture detection and adds the required global compile flags, such as -DNO_WARN_X86_INTRINSICS, -mcpu=xxx, -march=xxx, etc.

This is also how emsdk implements x86 intrinsics for WebAssembly.

The CMake build system will then automatically enter the x86 part of ncnn and use the x86-optimized code.
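
As a small illustration of what reusing the x86 code path means in practice: GCC on ppc64le ships its own `xmmintrin.h` that maps SSE intrinsics onto VSX, and that header refuses to compile unless `NO_WARN_X86_INTRINSICS` is defined, which is why the flag appears in the list above. The function below is a sketch only. `mul_add_sse` and `EXAMPLE_USE_SSE` are made-up names, not ncnn code, and the scalar fallback is there purely so the snippet also builds on targets with neither SSE nor the translation headers.

```cpp
#if defined(__SSE__) || defined(__powerpc64__)
#define EXAMPLE_USE_SSE 1
#if defined(__powerpc64__) && !defined(NO_WARN_X86_INTRINSICS)
// Normally supplied as -DNO_WARN_X86_INTRINSICS by the toolchain file.
#define NO_WARN_X86_INTRINSICS 1
#endif
#include <xmmintrin.h> // SSE intrinsics; translated to VSX by GCC on ppc64le
#else
#define EXAMPLE_USE_SSE 0
#endif

// Hypothetical example: out[i] = a[i] * b[i] + c[i] for n floats.
static void mul_add_sse(const float* a, const float* b, const float* c, float* out, int n)
{
    int i = 0;
#if EXAMPLE_USE_SSE
    // Process four floats per iteration with unaligned loads and stores.
    for (; i + 3 < n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }
#endif
    // Scalar tail (or the whole loop when no SSE path is available).
    for (; i < n; i++)
        out[i] = a[i] * b[i] + c[i];
}
```

The same source compiles unchanged for x86_64 and, with a suitable GCC, for ppc64le. The build system only has to be convinced to select the x86 sources and pass the right flags, which is what the toolchain-file suggestion is about.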

JeremyRand · 2023-06-23T06:29:36Z

Good feedback, thanks! I was not aware that similar approaches were used with WebAssembly. I'll see if I can refactor accordingly; may take me some days.

nihui · 2023-07-05 (review)

Please add some brief instructions about building ncnn on PowerPC to docs/how-to-build/how-to-build.md and the README.md HowTo section.

toolchains/power9le-linux-gnu-vsx.clang.toolchain.cmake

toolchains/power9le-linux-gnu-vsx.toolchain.cmake

CMakeLists.txt

src/CMakeLists.txt

.github/workflows/linux-ppc64-cpu-gcc.yml

JeremyRand · 2023-07-06T06:20:52Z

> please add some brief instruction about building ncnn on powerpc
>
> docs/how-to-build/how-to-build.md and README.md HowTo section

Added some docs; let me know if anything looks wrong.

nihui · 2023-07-06T08:02:11Z

Thanks for your contribution!

JeremyRand force-pushed the vsx branch from 1f96f60 to 7198a21 · June 20, 2023 01:00

Jeremy Rand added 4 commits July 3, 2023 01:21

Translate x86_64 SSE to ppc64le VSX intrinsics

9b7ac8a

Yields a quite large speedup on POWER9. See this article for background: https://www.talospace.com/2019/07/easier-power-vectorizing-for-fun-and.html

Only translate SSE4.1 to VSX if _mm_packus_epi32 available

e7398ae

_mm_packus_epi32 was added in GCC v12.1.

Revert "Only translate SSE4.1 to VSX if _mm_packus_epi32 available"

0af89b2

This reverts commit e7398ae.

Revert "Translate x86_64 SSE to ppc64le VSX intrinsics"

a7fa19c

This reverts commit 9b7ac8a.

JeremyRand force-pushed the vsx branch from 7198a21 to 880eb24 · July 3, 2023 06:50

Add POWER9 VSX toolchains

6adef41

Translating x86_64 SSE to ppc64le VSX intrinsics yields a quite large speedup on POWER9. See this article for background: https://www.talospace.com/2019/07/easier-power-vectorizing-for-fun-and.html

JeremyRand force-pushed the vsx branch from 880eb24 to 6adef41 · July 3, 2023 10:39

nihui reviewed Jul 5, 2023

Jeremy Rand added 5 commits July 6, 2023 05:00

power9le clang toolchain: Fix missing C++ include path

7335b6a

Add power9le docs

8c9feca

Rename NCNN_SSE4_1 to NCNN_SSE41

049b4cf

power9le toolchains: Remove redundant NCNN_TARGET_ARCH

659d71e

Remove linux-clang-power9le-vsx CI job

ad5bf0e

Not sure why it was failing, will investigate later and try to fix and re-enable it.

power9le clang toolchain: Document Clang 13+ requirement

46b7a2a

nihui approved these changes Jul 6, 2023

nihui merged commit 47e0daf into Tencent:master Jul 6, 2023

Porkepix mentioned this pull request Aug 16, 2023

ncnn 20230816 Homebrew/homebrew-core#139679

Merged
