[X86] Worse runtime performance on Zen CPU when optimizing for Zen #90985

Closed
Systemcluster opened this issue May 3, 2024 · 10 comments · Fixed by #91277

@Systemcluster

The following code compiled with -O3 -march=znver4 (or any other znver) runs around 25% slower on Zen hardware than when compiled with -O3 -march=x86-64-v4 or the baseline x86-64.

bool check_prime(int64_t n) {
    if (n < 2) {
        return true;
    }
    int64_t lim = (int64_t)ceil((double)n / 2.0);
    for (int64_t i = 2; i < lim; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}
Full code
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

bool check_prime(int64_t n) {
    if (n < 2) {
        return true;
    }
    int64_t lim = (int64_t)ceil((double)n / 2.0);
    for (int64_t i = 2; i < lim; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}

int main() {
    clock_t now = clock();
    int sum = 0;
    for (int i = 0; i < 1000000; i++) {
        if (check_prime(i)) {
            sum += 1;
        }
    }
    printf("%f, %d\n", (double)(clock() - now) / CLOCKS_PER_SEC, sum);
    return 0;
}

Running on a Ryzen 7950X:

> clang.exe -std=c11 -O3 -march=znver4 ./src/perf.c && ./a.exe
24.225000 seconds, 78501

> clang.exe -std=c11 -O3 -march=x86-64-v4 ./src/perf.c && ./a.exe
20.866000 seconds, 78501

> clang.exe -std=c11 -O3 ./src/perf.c && ./a.exe                  
20.819000 seconds, 78501
> clang.exe --version
clang version 18.1.4
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\Program Files\LLVM\bin

Disassembly here: https://godbolt.org/z/orssnKP74

I originally noticed the issue with Rust: https://godbolt.org/z/Kh1v3G74K

@RKSimon
Collaborator

RKSimon commented May 3, 2024

Unrolling seems to have gone out of control - most likely due to the insane LoopMicroOpBufferSize value the znver3/4 scheduler model uses.
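
To illustrate what this means for the reproducer above, here is a rough C-level sketch of what runtime unrolling of the inner check_prime loop corresponds to; the 4x factor is arbitrary and not what LLVM actually emits, since the factor it picks is driven by the scheduler model's LoopMicroOpBufferSize:

/* Rough illustration only: a 4x runtime-unrolled form of the inner loop in
 * check_prime. The factor LLVM actually chooses depends on the scheduler
 * model's LoopMicroOpBufferSize, which is why the znver3/4 models unroll so
 * aggressively here. */
int64_t i = 2;
for (; i + 4 <= lim; i += 4) {
    if (n % i == 0)       return false;
    if (n % (i + 1) == 0) return false;
    if (n % (i + 2) == 0) return false;
    if (n % (i + 3) == 0) return false;
}
for (; i < lim; i++) {  /* remainder loop for the leftover iterations */
    if (n % i == 0) return false;
}
return true;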

@ganeshgit
Contributor

ganeshgit commented May 3, 2024

> Unrolling seems to have gone out of control - most likely due to the insane LoopMicroOpBufferSize value the znver3/4 scheduler model uses.

@RKSimon It's a conscious decision to have some value for LoopMicroOpBufferSize. The value we use doesn't really represent the actual buffer size this parameter is intended to model. I would prefer to remove the dependency on this parameter altogether rather than keep incorrect values. Let me know your opinion.

@Systemcluster
Author

The result is the same with znver1, and I don't see LoopMicroOpBufferSize being set in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ScheduleZnver1.td

> clang.exe -std=c11 -O3 -march=znver1 ./src/perf.c && ./a.exe
24.384000 seconds, 78501

The disassembly looks to be the same as well, regardless of which znver is targeted.

@llvmbot
Member

llvmbot commented May 4, 2024

@llvm/issue-subscribers-backend-x86

Author: Chris (Systemcluster)

[Automated repost of the issue description above.]

@RKSimon RKSimon self-assigned this May 4, 2024
@nikic
Contributor

nikic commented May 5, 2024

Related patch: #67657

@RKSimon
Collaborator

RKSimon commented May 5, 2024

OK, I've got an idea of what's going on now. This is a combination of things - as well as the LoopMicroOpBufferSize issue making this a whole lot messier, Zen CPUs don't include the TuningSlowDivide64 flag (meaning there's no attempt to check whether the i64 div arguments can be represented as i32) - the 25% regression on znver4 makes sense, as the r32 vs r64 division latency is 14 vs 19cy on znver3/4 according to uops.info.

I'll create PRs for this shortly.
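
For reference, the divq-to-divl fast path that TuningSlowDivide64 enables looks roughly like the following in C terms; the actual transform is applied by the backend during instruction selection, and the helper name here is purely illustrative:

#include <stdint.h>

/* Rough C-level sketch of the "idivq-to-divl" fast path enabled by
 * TuningSlowDivide64; the real transform happens in codegen, not in source,
 * and urem64_maybe_narrow is a made-up name for illustration. */
static inline uint64_t urem64_maybe_narrow(uint64_t n, uint64_t d) {
    if (((n | d) >> 32) == 0) {
        /* Both operands fit in 32 bits: use the cheaper 32-bit divl. */
        return (uint32_t)n % (uint32_t)d;
    }
    /* Otherwise fall back to the full 64-bit divq. */
    return n % d;
}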

RKSimon added a commit to RKSimon/llvm-project that referenced this issue May 5, 2024
I'm confident TuningSlowDivide64 should be set, but less so about TuningSlow3OpsLEA - I'm mainly assuming because most other Intel CPUs set it.

These appear to have been missed because later cpus don't inherit from Nehalem tuning much.

Noticed while cleaning up for llvm#90985
RKSimon added a commit to RKSimon/llvm-project that referenced this issue May 6, 2024
This appears to have been missed because later cpus don't inherit from Nehalem tuning much.

Noticed while cleaning up for llvm#90985
@Systemcluster
Author

There's no noticeable runtime difference between optimization targets when using i32 instead of i64 in the check_prime example, so it seems the 64-bit division indeed accounts for the majority of the regression there.
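
For reference, the i32 variant used for that comparison is just the reproducer with the narrower type (a minimal sketch assuming a direct type substitution; the function name is changed here only to distinguish it):

#include <stdbool.h>
#include <stdint.h>
#include <math.h>

/* 32-bit variant of the reproducer; with this version the znver and
 * x86-64-v4 builds reportedly show no noticeable runtime difference. */
bool check_prime32(int32_t n) {
    if (n < 2) {
        return true;
    }
    int32_t lim = (int32_t)ceil((double)n / 2.0);
    for (int32_t i = 2; i < lim; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}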

I found another example where optimizing for znver4 runs over 300% slower on Zen 4 than when optimizing for znver3; I assume that one is mainly caused by the aggressive unrolling? https://godbolt.org/z/zdMrP6aG7

@RKSimon
Collaborator

RKSimon commented May 6, 2024

That second case might be due to excessive gather instructions in the znver4 codegen.

RKSimon added a commit that referenced this issue May 6, 2024
This appears to have been missed because later cpus don't inherit from Nehalem tuning much.

Noticed while cleaning up for #90985
RKSimon added a commit to RKSimon/llvm-project that referenced this issue May 6, 2024
…amilies

Despite most AMD cpus having a lower latency for i64 divisions that converge early, we are still better off testing for values representable as i32 and performing an i32 division if possible.

All AMD cpus appear to have been missed when we added the "idivq-to-divl" attribute - this now matches most Intel cpu behaviour (and the x86-64/v2/3/4 levels).

Unfortunately the difference in code scheduling means I've had to stop using the update_llc_test_checks script and just use an old-fashioned CHECK-DAG check for divl/divq pairs.

Fixes llvm#90985
RKSimon added a commit to RKSimon/llvm-project that referenced this issue May 7, 2024
…amilies

Despite most AMD cpus having a lower latency for i64 divisions that converge early, we are still better off testing for values representable as i32 and performing an i32 division if possible.

All AMD cpus appear to have been missed when we added the "idivq-to-divl" attribute - this now matches most Intel cpu behaviour (and the x86-64/v2/3/4 levels).

Unfortunately the difference in code scheduling means I've had to stop using the update_llc_test_checks script and just use an old-fashioned CHECK-DAG check for divl/divq pairs.

Fixes llvm#90985
@RKSimon
Collaborator

RKSimon commented May 7, 2024

> I found another example where optimizing for znver4 runs over 300% slower on Zen 4 than when optimizing for znver3; I assume that one is mainly caused by the aggressive unrolling? https://godbolt.org/z/zdMrP6aG7

@Systemcluster Please can you raise this as a separate issue?

@ganeshgit
Contributor

@RKSimon I will pick that up after @Systemcluster reports that. Thanks

RKSimon added a commit that referenced this issue May 9, 2024
…amilies (#91277)

Despite most AMD cpus having a lower latency for i64 divisions that converge early, we are still better off testing for values representable as i32 and performing an i32 division if possible.

All AMD cpus appear to have been missed when we added the "idivq-to-divl" attribute - this patch now matches Intel cpu behaviour (and the x86-64/v2/3/4 levels).

Unfortunately the difference in code scheduling means I've had to stop using the update_llc_test_checks script and just use old-fashioned CHECK-DAG checks for divl/divq pairs.

Fixes #90985