[X86] Worse runtime performance on Zen CPU when optimizing for Zen #90985
Unrolling seems to have gone out of control - most likely due to the insane LoopMicroOpBufferSize value the znver3/znver4 scheduler models use.
@RKSimon It's a conscious decision to have some value for LoopMicroOpBufferSize. The value we use doesn't really represent the actual buffer size this parameter is intended to model. I would prefer to remove the dependency on this parameter altogether rather than keep incorrect values. Let me know your opinion.
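For context, here is a minimal, self-contained sketch (not LLVM's actual code) of how a scheduler model's LoopMicroOpBufferSize can drive the partial-unrolling budget: the larger the advertised buffer, the more copies of the loop body the unroller is willing to emit. The stub struct, the formula, and the 512 value are illustrative assumptions only.

```cpp
#include <cstdio>

// Simplified stand-in for the scheduler model field being discussed.
struct SchedModelStub {
  unsigned LoopMicroOpBufferSize; // 0 means "no information available"
};

// Rough replication factor a partial unroller might aim for: keep the
// unrolled body within the advertised loop micro-op buffer.
unsigned estimateUnrollCount(const SchedModelStub &SM, unsigned LoopMicroOps) {
  if (SM.LoopMicroOpBufferSize == 0 || LoopMicroOps == 0)
    return 1; // no buffer info published: keep the default behaviour
  unsigned Count = SM.LoopMicroOpBufferSize / LoopMicroOps;
  return Count ? Count : 1;
}

int main() {
  SchedModelStub generic{0};   // model that publishes no buffer size
  SchedModelStub zenLike{512}; // illustrative oversized value, not the real one
  std::printf("generic: x%u, zen-like: x%u\n",
              estimateUnrollCount(generic, 8),
              estimateUnrollCount(zenLike, 8));
  return 0;
}
```

With an oversized buffer value the sketch returns a much larger replication factor, which mirrors the out-of-control unrolling described above.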
The result is the same with `-march=znver1`:

> clang.exe -std=c11 -O3 -march=znver1 ./src/perf.c && ./a.exe
24.384000 seconds, 78501

The disassembly looks to be the same as well regardless of which `znver` target is used.
@llvm/issue-subscribers-backend-x86 Author: Chris (Systemcluster)
The following code compiled with `-O3 -march=znver4` (or any other `znver`) runs around 25% slower on Zen hardware than when compiled with `-O3 -march=x86-64-v4` or the baseline `x86-64`.
```c
bool check_prime(int64_t n) {
    if (n < 2) {
        return true;
    }
    int64_t lim = (int64_t)ceil((double)n / 2.0);
    for (int64_t i = 2; i < lim; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}
```

<details>
<summary>Full code</summary>

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

bool check_prime(int64_t n) {
    if (n < 2) {
        return true;
    }
    int64_t lim = (int64_t)ceil((double)n / 2.0);
    for (int64_t i = 2; i < lim; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}

int main() {
    clock_t now = clock();
    int sum = 0;
    for (int i = 0; i < 1000000; i++) {
        if (check_prime(i)) {
            sum += 1;
        }
    }
    printf("%f, %d\n", (double)(clock() - now) / CLOCKS_PER_SEC, sum);
    return 0;
}
```

</details>

Running on a Ryzen 7950X:

> clang.exe -std=c11 -O3 -march=znver4 ./src/perf.c && ./a.exe
24.225000 seconds, 78501
> clang.exe -std=c11 -O3 -march=x86-64-v4 ./src/perf.c && ./a.exe
20.866000 seconds, 78501
> clang.exe -std=c11 -O3 ./src/perf.c && ./a.exe
20.819000 seconds, 78501

> clang.exe --version
clang version 18.1.4
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\Program Files\LLVM\bin

Disassembly here: https://godbolt.org/z/orssnKP74

I originally noticed the issue with Rust: https://godbolt.org/z/Kh1v3G74K
Related patch: #67657
OK, I've got an idea of what's going on now. This is a combination of things - as well as the LoopMicroOpBufferSize issue making this a whole lot messier, Zen CPUs don't set the TuningSlowDivide64 flag (meaning there's no attempt to check whether the i64 division arguments can be represented as i32). The 25% regression on znver4 makes sense, as the r32 vs r64 division latency is 14 vs 19 cycles on znver3/4 according to uops.info. I'll create PRs for this shortly.
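To make the flag's effect concrete, here is a hedged source-level sketch of the kind of fast path that TuningSlowDivide64 (the "idivq-to-divl" attribute) allows the backend to emit for 64-bit divisions; the function name and the unsigned-only handling are illustrative assumptions, not LLVM's actual codegen.

```cpp
#include <cstdint>
#include <cstdio>

// If both operands of a 64-bit unsigned division have empty upper halves,
// a 32-bit division (divl, ~14 cycles on znver3/4 per uops.info) gives the
// same result as the full 64-bit one (divq, ~19 cycles).
uint64_t div_with_fast_path(uint64_t n, uint64_t d) {
  if (((n | d) >> 32) == 0) {
    // Both values fit in 32 bits: the cheaper divl-style division suffices.
    return static_cast<uint32_t>(n) / static_cast<uint32_t>(d);
  }
  // Otherwise fall back to the full-width divq-style division.
  return n / d;
}

int main() {
  // Both operands fit in 32 bits, so this exercises the fast path.
  std::printf("%llu\n", (unsigned long long)div_with_fast_path(1000003, 7));
  return 0;
}
```

In the benchmark above both `n` and `i` stay far below 2^32, so with the flag set the hot `n % i` computation can presumably take the cheap 32-bit path on every iteration.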
I'm confident TuningSlowDivide64 should be set, but less so about TuningSlow3OpsLEA - I'm mainly assuming so because most other Intel CPUs set it. These appear to have been missed because later CPUs don't inherit much from Nehalem tuning. Noticed while cleaning up for llvm#90985.
There's no noticeable runtime difference between optimization targets when using [...]. I found another example where optimizing for [...]
That second case might be due to excessive gather instructions in znver4 codegen.
@Systemcluster Please can you raise this as a separate issue?
@RKSimon I will pick that up after @Systemcluster reports that. Thanks.
…amilies (#91277) Despite most AMD CPUs having lower latency for i64 divisions that converge early, we are still better off testing for values representable as i32 and performing an i32 division if possible. All AMD CPUs appear to have been missed when we added the "idivq-to-divl" attribute - this patch now matches Intel CPU behaviour (and the x86-64/v2/3/4 levels). Unfortunately the difference in code scheduling means I've had to stop using the update_llc_test_checks script and just use old-fashioned CHECK-DAG checks for divl/divq pairs. Fixes #90985.