[X86] Worse runtime performance on Zen CPU when optimizing for Zen #90985
Unrolling seems to have gone out of control - most likely due to the insane LoopMicroOpBufferSize value the znver3/znver4 scheduler models use.
@RKSimon It's a conscious decision to have some value for LoopMicroOpBufferSize. The value we use doesn't really represent the actual buffer size this parameter is intended to model. I would prefer to remove the dependency on this parameter altogether rather than keep incorrect values. Let me know your opinion.
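For context, here is a minimal, self-contained sketch (not LLVM's actual code) of how a scheduler model's LoopMicroOpBufferSize can drive the partial-unrolling budget: the larger the advertised buffer, the more copies of the loop body the unroller is willing to emit. The stub struct, the formula, and the 512 value are illustrative assumptions only.

```cpp
#include <cstdio>

// Simplified stand-in for the scheduler model field being discussed.
struct SchedModelStub {
  unsigned LoopMicroOpBufferSize; // 0 means "no information available"
};

// Rough replication factor a partial unroller might aim for: keep the
// unrolled body within the advertised loop micro-op buffer.
unsigned estimateUnrollCount(const SchedModelStub &SM, unsigned LoopMicroOps) {
  if (SM.LoopMicroOpBufferSize == 0 || LoopMicroOps == 0)
    return 1; // no buffer info published: keep the default behaviour
  unsigned Count = SM.LoopMicroOpBufferSize / LoopMicroOps;
  return Count ? Count : 1;
}

int main() {
  SchedModelStub generic{0};   // model that publishes no buffer size
  SchedModelStub zenLike{512}; // illustrative oversized value, not the real one
  std::printf("generic: x%u, zen-like: x%u\n",
              estimateUnrollCount(generic, 8),
              estimateUnrollCount(zenLike, 8));
  return 0;
}
```

With an oversized buffer value the sketch returns a much larger replication factor, which mirrors the out-of-control unrolling described above.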
The result is the same with `-march=znver1`:

> clang.exe -std=c11 -O3 -march=znver1 ./src/perf.c && ./a.exe
24.384000 seconds, 78501

The disassembly looks to be the same as well regardless of which `znver` target is used.
@llvm/issue-subscribers-backend-x86 Author: Chris (Systemcluster)
The following code compiled with `-O3 -march=znver4` (or any other `znver`) runs around 25% slower on Zen hardware than when compiled with `-O3 -march=x86-64-v4` or the baseline `x86-64`.
```c
bool check_prime(int64_t n) {
    if (n < 2) {
        return true;
    }
    int64_t lim = (int64_t)ceil((double)n / 2.0);
    for (int64_t i = 2; i < lim; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}
```

<details>
<summary>Full code</summary>

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

bool check_prime(int64_t n) {
    if (n < 2) {
        return true;
    }
    int64_t lim = (int64_t)ceil((double)n / 2.0);
    for (int64_t i = 2; i < lim; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}

int main() {
    clock_t now = clock();
    int sum = 0;
    for (int i = 0; i < 1000000; i++) {
        if (check_prime(i)) {
            sum += 1;
        }
    }
    printf("%f, %d\n", (double)(clock() - now) / CLOCKS_PER_SEC, sum);
    return 0;
}
```

</details>

Running on a Ryzen 7950X:

> clang.exe -std=c11 -O3 -march=znver4 ./src/perf.c && ./a.exe
24.225000 seconds, 78501
> clang.exe -std=c11 -O3 -march=x86-64-v4 ./src/perf.c && ./a.exe
20.866000 seconds, 78501
> clang.exe -std=c11 -O3 ./src/perf.c && ./a.exe
20.819000 seconds, 78501

> clang.exe --version
clang version 18.1.4
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\Program Files\LLVM\bin

Disassembly here: https://godbolt.org/z/orssnKP74

I originally noticed the issue with Rust: https://godbolt.org/z/Kh1v3G74K
Related patch: #67657
OK, I've got an idea of what's going on now. This is a combination of things - as well as the LoopMicroOpBufferSize issue making this a whole lot messier, Zen CPUs don't set the TuningSlowDivide64 flag (meaning there's no attempt to check whether the i64 division arguments can be represented as i32). The 25% regression on znver4 makes sense, as the r32 vs r64 division latency is 14 vs 19 cycles on znver3/4 according to uops.info. I'll create PRs for this shortly.
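To make the flag's effect concrete, here is a hedged source-level sketch of the kind of fast path that TuningSlowDivide64 (the "idivq-to-divl" attribute) allows the backend to emit for 64-bit divisions; the function name and the unsigned-only handling are illustrative assumptions, not LLVM's actual codegen.

```cpp
#include <cstdint>
#include <cstdio>

// If both operands of a 64-bit unsigned division have empty upper halves,
// a 32-bit division (divl, ~14 cycles on znver3/4 per uops.info) gives the
// same result as the full 64-bit one (divq, ~19 cycles).
uint64_t div_with_fast_path(uint64_t n, uint64_t d) {
  if (((n | d) >> 32) == 0) {
    // Both values fit in 32 bits: the cheaper divl-style division suffices.
    return static_cast<uint32_t>(n) / static_cast<uint32_t>(d);
  }
  // Otherwise fall back to the full-width divq-style division.
  return n / d;
}

int main() {
  // Both operands fit in 32 bits, so this exercises the fast path.
  std::printf("%llu\n", (unsigned long long)div_with_fast_path(1000003, 7));
  return 0;
}
```

In the benchmark above both `n` and `i` stay far below 2^32, so with the flag set the hot `n % i` computation can presumably take the cheap 32-bit path on every iteration.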
I'm confident TuningSlowDivide64 should be set, but less so about TuningSlow3OpsLEA - I'm mainly assuming so because most other Intel CPUs set it. These appear to have been missed because later CPUs don't inherit much from Nehalem tuning. Noticed while cleaning up for llvm#90985.
There's no noticeable runtime difference between optimization targets when using [...]. I found another example where optimizing for [...]
That second case might be due to excessive gather instructions in znver4 codegen.
@Systemcluster Please can you raise this as a separate issue?
@RKSimon I will pick that up after @Systemcluster reports that. Thanks.
…amilies (#91277) Despite most AMD CPUs having lower latency for i64 divisions that converge early, we are still better off testing for values representable as i32 and performing an i32 division if possible. All AMD CPUs appear to have been missed when we added the "idivq-to-divl" attribute - this patch now matches Intel CPU behaviour (and the x86-64/v2/3/4 levels). Unfortunately the difference in code scheduling means I've had to stop using the update_llc_test_checks script and just use old-fashioned CHECK-DAG checks for divl/divq pairs. Fixes #90985.