local laplacian performance regression #7917
Comments
I have found a Halide commit from about a year ago (102c059) where it's fast with LLVM 14 and slow with LLVM 16. So it's an LLVM change, but an old one. The salient difference seems to be the use of vgatherdps instructions, which are causing major stalls. If I target avx2 instead of avx-512, it's fast again and uses vinsertps instead of vgatherdps. This is on Skylake-X.
LLVM's SLP vectorizer is pattern matching the gather op. If I turn it off, the problem goes away.
Bug opened on LLVM: llvm/llvm-project#70259. In the meantime we may need to turn off SLP vectorization for x86. Vector gathers are not uncommon in Halide code, as they show up in boundary conditions and LUTs.
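For anyone following along, here is a rough sketch of the two access patterns mentioned above, written as plain C++ rather than Halide. The helper names `apply_lut` and `blur3_clamped` are made up for illustration and are not taken from the local laplacian app. In both loops the load address depends on a computed index, which is the kind of load a vectorizer may turn into a gather (or into scalar loads plus inserts). Whether a given compiler actually emits a gather here depends on the target and flags.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: look every input value up in a table.
// The index into `lut` is data-dependent, so consecutive lanes of a
// vectorized loop read from unrelated addresses.
std::vector<float> apply_lut(const std::vector<std::uint8_t>& in,
                             const std::vector<float>& lut) {
    assert(lut.size() >= 256);  // every uint8_t index must be in range
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = lut[in[i]];
    }
    return out;
}

// Hypothetical helper: 3-tap sum with a clamp-to-edge boundary condition.
// The clamped coordinate is no longer a simple linear function of x, so the
// load is not a plain contiguous vector load once vectorized.
std::vector<float> blur3_clamped(const std::vector<float>& in) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(in.size());
    std::vector<float> out(in.size());
    for (std::ptrdiff_t x = 0; x < n; ++x) {
        float sum = 0.0f;
        for (std::ptrdiff_t dx = -1; dx <= 1; ++dx) {
            // Clamp x + dx into [0, n - 1].
            const std::ptrdiff_t xi =
                std::max<std::ptrdiff_t>(0, std::min<std::ptrdiff_t>(x + dx, n - 1));
            sum += in[static_cast<std::size_t>(xi)];
        }
        out[static_cast<std::size_t>(x)] = sum;
    }
    return out;
}
```

Comparing the generated assembly for loops like these with SLP vectorization on and off might be one way to get a reduced test case, though I have not checked that these exact loops reproduce the stall.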
It was ~25ms about a year ago on my machine, and now it's 40ms. Will investigate now, but opening an issue to track in case I need to context switch.