SLP vectorization causes 1.75x slowdown with march=skylake-avx512 due to vgatherdps #70259

Open
abadams opened this issue Oct 25, 2023 · 11 comments

Comments

abadams commented Oct 25, 2023

I think the SLP cost model might be wrong for vector gathers on skylake.

Consider the following code which repeatedly permutes an array:

void f(const float *__restrict__ src, const int *__restrict__ idx, float *__restrict__ dst, int n) {
    for (int i = 0; i < n; i += 8) {
        dst[i] = src[idx[i]];
        dst[i + 1] = src[idx[i + 1]];
        dst[i + 2] = src[idx[i + 2]];
        dst[i + 3] = src[idx[i + 3]];
        dst[i + 4] = src[idx[i + 4]];
        dst[i + 5] = src[idx[i + 5]];
        dst[i + 6] = src[idx[i + 6]];
        dst[i + 7] = src[idx[i + 7]];
    }
}

int main(int argc, char **argv) {
    const int n = 16 * 1024;
    float src[n], dst[n];
    int idx[n];
    for (int i = 0; i < n; i++) {
        // Some arbitrary permutation
        idx[i] = (i * 17 + 37) % n;
        src[i] = dst[i] = 0;
    }
    src[17] = 17.0f;

    for (int i = 0; i < 100000; i++) {
        f(&src[0], &idx[0], &dst[0], n);
        f(&dst[0], &idx[0], &src[0], n);
    }

    // Introduce a dependence on the output
    return dst[0] == 17.0f ? 1 : 0;
}

Compiled with top-of-tree clang at -march=skylake -O3, it takes about 4.2 seconds to run on my i9-9960X. Compiled with -march=skylake -O3 -fno-slp-vectorization, it takes 2.4 seconds.
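
For reference, the two configurations can be built and timed with something like this (a sketch; the file name gather.c is just illustrative):

clang -O3 -march=skylake gather.c -o gather_slp
clang -O3 -march=skylake -fno-slp-vectorization gather.c -o gather_noslp
time ./gather_slp
time ./gather_noslp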

The only salient difference in the assembly is that slp vectorization has packed the eight scalar loads in f into a gather intrinsic (and the eight stores into a single vector store). Here's the inner loop assembly with slp vectorization on (with unrolling off for brevity):

.LBB1_4:                                # %for.body.i
                                        #   Parent Loop BB1_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	vmovups	32(%rsp,%rcx,4), %ymm0
	vpxor	%xmm1, %xmm1, %xmm1
	vpcmpeqd	%ymm2, %ymm2, %ymm2
	vgatherdps	%ymm2, (%r14,%ymm0,4), %ymm1
	vmovups	%ymm1, 65568(%rsp,%rcx,4)
	addq	$8, %rcx
	cmpq	$16376, %rcx                    # imm = 0x3FF8
	jb	.LBB1_4

and here it is with slp vectorization off:

.LBB1_4:                                # %for.body.i
                                        #   Parent Loop BB1_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movslq	131104(%rsp,%rcx,4), %rdx
	vmovss	65536(%rsp,%rdx,4), %xmm0       # xmm0 = mem[0],zero,zero,zero
	vmovss	%xmm0, 32(%rsp,%rcx,4)
	movslq	131108(%rsp,%rcx,4), %rdx
	vmovss	65536(%rsp,%rdx,4), %xmm0       # xmm0 = mem[0],zero,zero,zero
	vmovss	%xmm0, 36(%rsp,%rcx,4)
	movslq	131112(%rsp,%rcx,4), %rdx
	vmovss	65536(%rsp,%rdx,4), %xmm0       # xmm0 = mem[0],zero,zero,zero
	vmovss	%xmm0, 40(%rsp,%rcx,4)
	movslq	131116(%rsp,%rcx,4), %rdx
	vmovss	65536(%rsp,%rdx,4), %xmm0       # xmm0 = mem[0],zero,zero,zero
	vmovss	%xmm0, 44(%rsp,%rcx,4)
	movslq	131120(%rsp,%rcx,4), %rdx
	vmovss	65536(%rsp,%rdx,4), %xmm0       # xmm0 = mem[0],zero,zero,zero
	vmovss	%xmm0, 48(%rsp,%rcx,4)
	movslq	131124(%rsp,%rcx,4), %rdx
	vmovss	65536(%rsp,%rdx,4), %xmm0       # xmm0 = mem[0],zero,zero,zero
	vmovss	%xmm0, 52(%rsp,%rcx,4)
	movslq	131128(%rsp,%rcx,4), %rdx
	vmovss	65536(%rsp,%rdx,4), %xmm0       # xmm0 = mem[0],zero,zero,zero
	vmovss	%xmm0, 56(%rsp,%rcx,4)
	movslq	131132(%rsp,%rcx,4), %rdx
	vmovd	65536(%rsp,%rdx,4), %xmm0       # xmm0 = mem[0],zero,zero,zero
	vmovd	%xmm0, 60(%rsp,%rcx,4)
	addq	$8, %rcx
	cmpq	$16376, %rcx                    # imm = 0x3FF8
	jb	.LBB1_4

Interestingly, llvm-mca has the right idea. It says the version with SLP vectorization on is 2310 cycles per 100 iterations, and the version with it off is 813 cycles per 100 iterations.
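
For anyone reproducing those numbers, a sketch of the llvm-mca invocation (assuming each inner loop has been saved to its own .s file):

llvm-mca -mcpu=skylake -iterations=100 slp_loop.s
llvm-mca -mcpu=skylake -iterations=100 noslp_loop.s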

RKSimon commented Oct 26, 2023

You mention skylake-avx512 (xeon) in the title (and refer to an i9-9960X) but use skylake (desktop) in the command-line args - which are you targeting? https://godbolt.org/z/7j8YWenfM

abadams commented Oct 26, 2023

Oops, I should have used -march=skylake-avx512 in my clang invocations, but it doesn't seem to matter much. Doing that reduces the runtime of the version with SLP vectorization on from 4.2s to 4.1s. With SLP vectorization off it's still 2.4s.

For reference here's the skylake-avx512 version of the inner loop (SLP vectorization on, unrolling off for brevity):

.LBB1_4:                                # %for.body.i
                                        #   Parent Loop BB1_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	vmovups	32(%rsp,%rcx,4), %ymm0
	kxnorw	%k0, %k0, %k1
	vpxor	%xmm1, %xmm1, %xmm1
	vgatherdps	(%r14,%ymm0,4), %ymm1 {%k1}
	vmovups	%ymm1, 65568(%rsp,%rcx,4)
	addq	$8, %rcx
	cmpq	$16376, %rcx                    # imm = 0x3FF8
	jb	.LBB1_4

abadams commented Oct 26, 2023

To update my llvm-mca comment: llvm-mca thinks the skylake-avx512 version is fast (481 cycles for 100 iterations).

abadams commented Oct 30, 2023

Another datapoint: The problem does not occur on Zen 4 (march=znver4). The asm is basically the same as above, but the vgatherdps version produced with slp vectorization on is indeed slightly faster (1.8s vs 2.0s).

RKSimon commented Nov 1, 2023

Another datapoint: The problem does not occur on Zen 4 (march=znver4). The asm is basically the same as above, but the vgatherdps version produced with slp vectorization on is indeed slightly faster (1.8s vs 2.0s)

Zen4 (and Zen3 for that matter) have much faster gather than Zen1/2 - we should probably enable the 'TuningFastGather' flag for them (currently this is only used on Intel Skylake and later).

Unfortunately I think we need to keep that flag on for Skylake, since for build_vector-style patterns a gather is still the better choice - but for this ticket it might be as simple as us not having good cost numbers for comparing the gather+vector-store sequence against the scalar-load+scalar-store sequence.
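
One way to see the numbers being compared (a sketch; the pass name and flags below are assumptions and may vary across LLVM versions) is to emit IR for the loop and run the cost-model printer over it - the llvm.masked.gather call in the already-vectorized IR is the interesting line:

clang -O3 -march=skylake-avx512 -S -emit-llvm gather.c -o gather.ll
opt -passes='print<cost-model>' -mcpu=skylake-avx512 -disable-output gather.ll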

TiborGY commented Nov 11, 2023

Assuming you are on Linux, can you please check the status of /sys/devices/system/cpu/vulnerabilities/gather_data_sampling?
A few months ago Intel found a silicon bug in their gather implementations that has security implications, affecting every Intel CPU that has AVX support and is newer than Broadwell but older than Alder Lake.

For this reason, if you are running an up-to-date kernel, it will by default load a microcode update at boot. That update turns gather instructions into slow-but-secure microcoded uops, which would probably explain the poor gather performance.
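
A quick way to check the mitigation status (assuming a reasonably recent kernel that exposes this sysfs entry):

cat /sys/devices/system/cpu/vulnerabilities/gather_data_sampling
# Typical values include "Not affected", "Vulnerable" and "Mitigation: Microcode";
# the last one means the slow microcoded gather is active.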

If you have root, you can benchmark this by temporarily setting the kernel boot parameter gather_data_sampling=off in the GRUB config and rebooting (see the sketch below). This tells the kernel to flip an MSR bit at boot, which restores the original gather implementation in the CPU. If you need the speed you can leave this mitigation off, but make sure you understand the security implications of doing so.
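
A minimal sketch of that on a Debian/Ubuntu-style system (file locations and the grub update command vary by distribution):

# /etc/default/grub - add the parameter to the default kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash gather_data_sampling=off"

# regenerate the grub config and reboot
sudo update-grub
sudo reboot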

If this is indeed the source of the slowdown, I am not sure what llvm could possibly do about it, other than split every affected target into two, e.g. skylake-avx512 and skylake-avx512-nogather, tigerlake and tigerlake-nogather, because whether using gathers is profitable depends on whether the system has the GDS silicon vulnerability mitigated.

nikic commented Nov 11, 2023

@TiborGY I believe the -mno-gather option was added for that purpose.
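
For the repro above that would be something like (gather.c again being the illustrative file name from earlier):

clang -O3 -march=skylake-avx512 -mno-gather gather.c -o gather_mitigated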

abadams commented Nov 11, 2023

Yes, it looks like I do have the microcode mitigation for gather! -mno-gather fixes it without me having to turn off slp vectorization. The reason I asked in the first place is a downstream compiler, rather than trying to optimize a specific piece of code.

I guess I'll add +prefer-no-gather as a function attribute whenever I don't know for sure that the target is not one of the affected processors. The slowdown from using gathers on affected processors is much larger than the speed-up from using them on unaffected processors.
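
In case it's useful to others doing the same from a frontend, here's a sketch of attaching that feature through the LLVM C++ API (the helper name and merging behaviour are assumptions, not an existing API):

#include "llvm/IR/Function.h"
#include <string>

// Append +prefer-no-gather to a function's target features so the X86 backend
// avoids selecting gather instructions for it.
void preferNoGather(llvm::Function *fn) {
    std::string features = "+prefer-no-gather";
    if (fn->hasFnAttribute("target-features")) {
        // Keep whatever features were already requested for this function.
        features = fn->getFnAttribute("target-features").getValueAsString().str() +
                   ",+prefer-no-gather";
        fn->removeFnAttr("target-features");
    }
    fn->addFnAttr("target-features", features);
}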

I will leave this issue open in case you want to figure out how to handle this differently in LLVM, but feel free to close if this is a wont-fix.

abadams commented Nov 11, 2023

Although fwiw, given that my command-line flags targeted skylake-avx512 specifically, I think llvm should assume by default that the microcode mitigation is in effect, and have an option to enable gather emission for people who like to live life on the edge.

RKSimon commented Nov 13, 2023

@phoebewang Should we disable TuningFastGather on pre-AlderLake Intel targets (and maybe x86-64-v4)?

@phoebewang

cc @williamweixiao @gpei-dev
