Consider enabling AVX2 or SSE2 features at runtime #143
There is a runtime check to disable AVX and SSE, but unfortunately, the way the C++ code is typically built on *nix systems, this does not prevent AVX/SSE optimizations done by the compiler. That is the problem, at least on *nix. Windows is OK, because I've built ugrep.exe without AVX/SSE optimizations but with specific code that uses AVX and SSE intrinsics to speed up searching, as intended. Unfortunately, it is a known problem that this approach does not work with GNU or clang C++ on *nix, because the AVX/SSE intrinsics require the AVX/SSE compiler options to be enabled, and those options are then applied globally, essentially to all code. We're not alone in facing this issue; it was discussed before in #103, where the resolution was to build ugrep with AVX disabled. I will talk to an old friend of mine at Google who used to work at Intel on the Intel C++ compiler's auto-vectorization algorithms and see what he recommends.
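For context, the kind of runtime check meant here can be sketched as follows (a minimal example using the GCC/clang builtins, not ugrep's actual code; the function name is made up). It also shows the limitation: the check only guards the branches the programmer wrote, it cannot stop the compiler from emitting AVX/SSE instructions elsewhere when -mavx2/-msse2 are applied to the whole build.

#include <cstdio>

static bool cpu_has_avx2()
{
#if defined(__GNUC__) || defined(__clang__)
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;                 // conservative default for other compilers
#endif
}

int main()
{
  if (cpu_has_avx2())
    std::puts("runtime check: AVX2-optimized code path would be taken");
  else
    std::puts("runtime check: falling back to non-AVX2 code path");
  // Note: this guard only protects the code inside the branch; code outside
  // the branch can still be auto-vectorized with AVX2 by a globally applied
  // -mavx2 option.
}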
Thank you for the explanation! That is nifty that you already handle it at runtime as well. But a bummer that the build tooling is the wrench in the gears here.
There are two possibilities, neither of which is ideal:
1. Duplicate the performance-critical code into separate versions, each compiled with its own instruction-set options, and select one at runtime.
2. Move each guarded SIMD branch into a separately compiled function, which adds call-return overhead to this performance-critical code.
IMHO option 1 is preferable. I suppose the code duplication can limit the call-return overhead by hoisting the runtime CPU check up and out of loops (not a bad idea anyway). The structure of the code right now uses both compile-time and runtime CPU-checking guards to execute the SIMD-optimized parts with intrinsics, roughly as in the sketch below.
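A rough illustration of that guard structure (a simplified, self-contained sketch with hypothetical macro and function names and a simplified task, not the actual ugrep source): the AVX2 intrinsics run only if both the compile-time macro and the runtime check allow it.

#include <cstddef>
#include <cstdint>
#if defined(HAVE_AVX2)          // hypothetical compile-time guard set by the build
# include <immintrin.h>
#endif

extern bool cpu_has_avx2();     // runtime CPU check, e.g. as sketched above

size_t count_newlines(const uint8_t *buf, size_t len)
{
  size_t n = 0, i = 0;
#if defined(HAVE_AVX2)          // compile-time guard
  if (cpu_has_avx2())           // runtime guard, hoisted out of the hot loop
  {
    const __m256i nl = _mm256_set1_epi8('\n');
    for (; i + 32 <= len; i += 32)
    {
      __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(buf + i));
      unsigned mask = static_cast<unsigned>(
          _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, nl)));
      n += static_cast<size_t>(__builtin_popcount(mask));
    }
  }
#endif
  // Unguarded fallback: the whole point is that the compiler must not be
  // allowed to vectorize this part with AVX/SSE behind our back.
  for (; i < len; ++i)
    n += (buf[i] == '\n');
  return n;
}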
The goal is to prevent the compiler from using SIMD instructions to optimize the unguarded fallback parts. The idea is to create four or five specialized versions of the entire source file, because I don't want to create new functions to call to execute a guarded branch; that would be easier, but it adds call-return overhead to this performance-critical part. Having multiple copies of essentially the same source code does not help maintainability. It is surprisingly dumb of Clang and GCC to require the -msse2/-mavx2 options to be enabled for the whole translation unit just to use the corresponding intrinsics.
Checking with Compiler Explorer's assembly output, clang already enables SSE2 instructions in plain x86 builds by default, since SSE2 is part of the x86-64 baseline. Building with SSE2 as the minimum target should therefore produce a clean x86 executable that is binary portable to all (modern) x86 systems.

I had some time today to work on refactoring the code to effectively enable AVX/SSE2 runtime optimizations only when the CPU supports them. All tests pass with different AVX/SSE configs. But I don't have machines to test whether the binary is indeed movable from an AVX machine to a non-AVX-capable machine (all my machines are AVX-capable). There is a cascade of small changes to the source code and the build scripts, but none should negatively impact performance.
It will help a lot if you could give this beta dev version a try, since I don't have non-AVX machines to test this against: https://www.genivia.com/files/ugrep-v3.3.5-beta.zip
More than happy to! You may have to coach me through any specific tests or information you need for debugging. Summary: following my usual steps, I built the beta in my Docker image on a modern AVX2-capable system and then ran the binary on a legacy system, where it still crashed with an illegal instruction.
The CPU flags from both machines confirm that the legacy CPU lacks AVX2.
Let me know what would be most useful for you: build logs, core dump, or running with different flags, etc.
Also, please let me know if I did the right build steps in my Dockerfile; I just did a plain, default build.

You also mentioned performance. If there are any cases you are worried about, let me know and I can run benchmarks.
I also tried building it with AVX disabled; that build no longer shows the illegal instruction error.
I think the illegal instruction is probably caused by a vpbroadcastb in the SSE2 code path. In the meantime I had ported this isolated SIMD part of the code, with a bit of effort, to a small stand-alone test program to compile with Compiler Explorer to verify the assembly: https://godbolt.org/z/Med3sWKav Lo and behold: there is an AVX2 instruction, vpbroadcastb, in the SSE2 version of the code! Sure, it is compiled with -mavx2, but the code with SSE2 intrinsics is so simple that I did not expect vpbroadcastb or any other AVX2 instruction to occur in the SSE2-specific code. Besides the SSE2 intrinsics, there are a few other calculations necessary; apparently those are optimized with AVX2? Alas, Compiler Explorer won't even tell me where the vpbroadcastb originates from in the source code. I was hoping that code duplication would not be necessary, so I had just isolated the SIMD-critical code in one new file (simd.cpp) instead of splitting it up into many files. The only way around this is to create a simd.cpp code version for SSE2 compiled with -msse2, a code version for AVX2 compiled with -mavx2, and a code version for AVX512BW, then select one of these three at runtime to execute. What a pain...
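To illustrate the effect, here is a simplified stand-alone sketch of the kind of SSE2-only code involved (a hypothetical function, not the actual simd.cpp). It uses nothing beyond SSE2 intrinsics, yet when its translation unit is compiled with -mavx2 the compiler may lower the intrinsics and the surrounding scalar code with AVX2/VEX encodings, so an instruction such as vpbroadcastb can appear in the output; compiled with only -msse2, the generated code stays SSE2.

#include <emmintrin.h>   // SSE2 intrinsics only
#include <cstddef>
#include <cstdint>

// Find the first occurrence of byte c in buf. With -msse2 this compiles to
// SSE2 instructions only; with -mavx2 the compiler is allowed to emit AVX2
// forms (e.g. vpbroadcastb for _mm_set1_epi8) even though no AVX2 intrinsic
// is used here.
const uint8_t *find_byte_sse2(const uint8_t *buf, size_t len, uint8_t c)
{
  __m128i needle = _mm_set1_epi8(static_cast<char>(c));
  size_t i = 0;
  for (; i + 16 <= len; i += 16)
  {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, needle));
    if (mask != 0)
      return buf + i + __builtin_ctz(static_cast<unsigned>(mask));
  }
  for (; i < len; ++i)    // scalar tail
    if (buf[i] == c)
      return buf + i;
  return nullptr;
}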
Second attempt. This time I've split up the SIMD-critical code into separate compilation units, one per instruction set, each compiled with only its own options, with the proper version selected at runtime (see the sketch below). This approach of splitting up the code should make the x86 binary robust to transfers from an x86 AVX-capable machine to an x86 non-AVX2-capable machine that supports at least SSE2: if the build steps detect SSE2 or greater (e.g. AVX2), then SSE2 will always be enabled in the generated x86 executable, while the AVX2 code is only executed after a runtime CPU check.
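A rough sketch of that kind of per-instruction-set dispatch (hypothetical file and function names, not the actual ugrep sources): each variant lives in its own translation unit compiled with only its own option, and the dispatcher picks one implementation once at startup, so the CPU check is hoisted out of all search loops.

#include <cstddef>
#include <cstdint>

// Hypothetical layout: each variant is compiled in its own translation unit,
// e.g. simd_sse2.cpp with -msse2, simd_avx2.cpp with -mavx2, and
// simd_avx512bw.cpp with -mavx512bw.
const uint8_t *find_byte_sse2(const uint8_t *buf, size_t len, uint8_t c);
const uint8_t *find_byte_avx2(const uint8_t *buf, size_t len, uint8_t c);
const uint8_t *find_byte_avx512bw(const uint8_t *buf, size_t len, uint8_t c);

using find_byte_fn = const uint8_t *(*)(const uint8_t *, size_t, uint8_t);

// Pick an implementation once, based on a runtime CPU feature check.
static find_byte_fn select_find_byte()
{
#if defined(__GNUC__) || defined(__clang__)
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx512bw"))
    return find_byte_avx512bw;
  if (__builtin_cpu_supports("avx2"))
    return find_byte_avx2;
#endif
  return find_byte_sse2;  // SSE2 baseline (or a scalar fallback on pre-SSE2 CPUs)
}

static const find_byte_fn find_byte = select_find_byte();

Since the dispatcher itself contains no intrinsics, it can be compiled without any -mavx*/-msse* options and remains safe to run on any x86 CPU.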
Perhaps you could give this update a spin? I don't have a non-AVX machine to test: https://www.genivia.com/files/ugrep-v3.3.5-beta2.zip
I went through the same steps as before, building ugrep in an image on my modern system and then running the same binary on the legacy system:

# modern
$ md5sum /usr/local/bin/ugrep
2e2c465ae10ce3bd3a2ab27e7506a6e3 /usr/local/bin/ugrep
$ /usr/local/bin/ugrep --version
ugrep 3.3.5 x86_64-pc-linux-gnu +avx2 +pcre2_jit +zlib
# legacy
$ md5sum /usr/local/bin/ugrep
2e2c465ae10ce3bd3a2ab27e7506a6e3 /usr/local/bin/ugrep
$ /usr/local/bin/ugrep --version
ugrep 3.3.5 x86_64-pc-linux-gnu +sse2 +pcre2_jit +zlib

Are there any other commands or tests you'd like me to run?
Thanks! This would be expected to work, since the code is now split up into separate compilation units, each built with only its own SSE2/AVX2/AVX512BW options, with the matching version selected at runtime. If an x86 ugrep is built on a non-SSE2/AVX2 legacy *nix machine, then no SSE2/AVX2 code is compiled in at all and the binary simply uses the portable non-SIMD code.
The tricky part was to force the build to compile each of these SIMD source files with only its own -msse2/-mavx2/-mavx512bw option, instead of letting those options apply globally to all of the code.

Easy does it, once you have a plan of attack!
The v3.3.5 update is available.

AVX2 is now checked at runtime: Genivia/ugrep#143
@ethack the ugrep v3.6.0 release casts a wider net for AVX optimizations applied to the code. The code base uses multi-versioning like before, but now the entire matcher engine is optimized rather than a small portion of it. This was done to restore the speed to what it was before 3.3.5 was released. The latest version should work just as well as the previous versions (after 3.3.5) with respect to binary portability to non-AVX systems, with runtime AVX checking and running the properly optimized version of the code. But one cannot be too cautious, so if you run into any trouble, please report your observations.
This might be a long-shot, but figured I'd suggest it.
How I got here:
I have a Docker image where I compile ugrep and then I end up running it on a variety of different systems. The other day ugrep crashed on a particular system and gave me a core dump. After some troubleshooting I concluded it was because ugrep was compiled with AVX2 support enabled, but the CPU on the system I was running it on didn't support it.
I know I could use the --disable-avx flag while building, but I'm assuming that comes with a performance hit when the CPU does actually support it. Or I could build ugrep each time on my target system, but that doesn't really work for my use case.

I often switch between ugrep and ripgrep for different reasons. When ugrep crashed on that no-AVX2 system, I switched to ripgrep and it worked fine. While investigating more later, I noticed that ripgrep does something interesting: it enables certain features dynamically when the CPU supports AVX2 or SSSE3.
I'm wondering if it would be possible for ugrep to do the same? I realize it's likely not a simple thing to implement, so feel free to close this if the cost to implement is too high.
I should also point out that my Docker image (and thus ugrep's compilation) is built on a GitHub runner in the cloud, where I don't really have control over the CPU instruction sets. So if I happen to get unlucky and compile ugrep on a legacy system without AVX2 support, then AVX2 would be disabled even if I ran the ugrep binary on systems that do support it. Runtime support would help in this scenario as well.