Translate x86_64 SSE to ppc64le VSX intrinsics #4807
Jeremy Rand seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Codecov Report
@@ Coverage Diff @@
## master #4807 +/- ##
===========================================
- Coverage 94.90% 89.75% -5.15%
===========================================
Files 779 309 -470
Lines 223166 84266 -138900
===========================================
- Hits 211795 75637 -136158
+ Misses 11371 8629 -2742 |
The GCC build failure on CI is interesting. Maybe it is an artifact of an older GCC version than the one I tested with? I'm curious what you'd recommend to avoid this; I guess I could test whether that function is available as part of the cmake step, and only enable SSE to VSX translation if it is. Let me know if that's a good approach or if you prefer some other workaround. |
It looks like VSX translation of (Feel free to review the rest of this PR in parallel though.) |
The linux-aarch64 CI failures look unrelated to this PR, if I'm not mistaken. |
Is this an actual bug, or just a quirk of VSX returning a different representation of the same value? I suspect the latter, but I'm not familiar enough with what that test is doing to be certain. |
|
binaryop test fixed in 9022b71 |
Using x86-compatible intrinsics to get good performance on other architectures is also a practice in WebAssembly. It is great to see similarly exciting results on the POWER architecture 👍 I observed that you added quite a few hacks in the CMakeLists, especially the modification in I think a good way is to create a dedicated cmake toolchain file, such as This is also how emsdk implements x86 intrinsics for WebAssembly: the cmake build system will automatically enter the x86 part of ncnn and use the x86 optimized code. |
Good feedback, thanks! I was not aware that similar approaches were used with WebAssembly. I'll see if I can refactor accordingly; may take me some days. |
Yields a quite large speedup on POWER9. See this article for background: https://www.talospace.com/2019/07/easier-power-vectorizing-for-fun-and.html
_mm_packus_epi32 was added in GCC v12.1.
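For readers unfamiliar with the intrinsic named in this commit message, the following is a minimal scalar sketch of what `_mm_packus_epi32` computes on x86 (SSE4.1): eight signed 32-bit lanes from two inputs are narrowed to unsigned 16-bit lanes with saturation. The helper names `saturate_i32_to_u16` and `packus_epi32_ref` are hypothetical and are not part of ncnn or GCC; this is a reference model, not the code used in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: clamp one signed 32-bit value into [0, 65535]. */
static uint16_t saturate_i32_to_u16(int32_t v)
{
    if (v < 0)
        return 0;
    if (v > 65535)
        return 65535;
    return (uint16_t)v;
}

/* Hypothetical scalar model of _mm_packus_epi32(a, b):
 * output lanes 0..3 come from a, lanes 4..7 come from b,
 * each narrowed with unsigned saturation. */
static void packus_epi32_ref(const int32_t a[4], const int32_t b[4],
                             uint16_t out[8])
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < 4; i++) {
        out[i] = saturate_i32_to_u16(a[i]);
        out[i + 4] = saturate_i32_to_u16(b[i]);
    }
}
```

A scalar model like this could serve as a fallback or as a test oracle on compilers whose compatibility headers lack the intrinsic, though whether that is preferable to a configure-time check is a judgment call.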
This reverts commit e7398ae.
This reverts commit 9b7ac8a.
Translating x86_64 SSE to ppc64le VSX intrinsics yields a quite large speedup on POWER9. See this article for background: https://www.talospace.com/2019/07/easier-power-vectorizing-for-fun-and.html
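As a rough illustration of the idea, the sketch below shows ordinary SSE code that GCC can also build on ppc64le through its x86 compatibility headers, which map the `_mm_*` calls onto VSX. It assumes GCC's headers, which refuse to compile on POWER unless `NO_WARN_X86_INTRINSICS` is defined; the macro `DEMO_HAVE_SSE_API` and the function `add_f32` are hypothetical names for this example and do not come from ncnn.

```c
/* On ppc64le, GCC's x86 compatibility headers emit an #error unless
 * NO_WARN_X86_INTRINSICS is defined before they are included. */
#if defined(__powerpc64__) && !defined(NO_WARN_X86_INTRINSICS)
#define NO_WARN_X86_INTRINSICS 1
#endif

/* DEMO_HAVE_SSE_API is a hypothetical macro for this sketch only. */
#if defined(__SSE__) || defined(__powerpc64__)
#define DEMO_HAVE_SSE_API 1
#include <xmmintrin.h>
#endif

#include <stddef.h>

/* Element-wise float addition written once against the SSE API. */
static void add_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#ifdef DEMO_HAVE_SSE_API
    /* Process four floats per iteration with unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, and full fallback on targets without the SSE API. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The same source compiles unchanged on x86_64 and, under the assumptions above, on ppc64le, which is why this kind of translation needs so few lines of code.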
Please add some brief instructions on building ncnn on PowerPC to docs/how-to-build/how-to-build.md and the README.md HowTo section.
Not sure why it was failing, will investigate later and try to fix and re-enable it.
Added some docs; let me know if anything looks wrong. |
Thanks for your contribution! |
Yields a quite large speedup on POWER9. See this article for background.
Benchmarks (all done with `-DNCNN_ENABLE_LTO=ON` on a Talos II Workstation with 2x 18-core POWER9 CPUs):
Before this PR:
With this PR applied:
(I think this definitely takes the cake for "most speedup per lines of code" of any patch I've written. :) )