[SVE] Sink Extends when they feed masked.gather/masked.scatter intrinsics #62960

Closed · mcinally opened this issue May 26, 2023 · 3 comments
Labels: backend:AArch64, SVE (ARM Scalable Vector Extensions)

mcinally (Contributor) commented May 26, 2023

Masked gather/scatter intrinsics could generate better code if the index vector's type is sunk into the gather/scatter's block.

For example:

; llc test.llvm -O3 -mcpu=neoverse-v1 -aarch64-sve-vector-bits-min=256 -aarch64-sve-vector-bits-max=256 -o test.s
target triple = "aarch64-unknown-linux-gnu"

define void @vector_(ptr %a, ptr %b, ptr %c, ptr %j) {
L.entry:
  %0 = load i32, ptr %j, align 4
  %1 = insertelement <8 x i32> poison, i32 %0, i64 0
  %2 = shufflevector <8 x i32> %1, <8 x i32> poison, <8 x i32> zeroinitializer
  %3 = mul <8 x i32> %2, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %4 = sext i32 %0 to i64
  %5 = shl nsw i64 %4, 5
  %6 = getelementptr i8, ptr %b, i64 -4
  %7 = shl nsw i64 %4, 2
  %8 = getelementptr i8, ptr %a, i64 2
  %9 = sext <8 x i32> %3 to <8 x i64>
  br label %L.LB1_509

L.LB1_509:                                        ; preds = %L.LB1_509, %L.entry
  %x = phi ptr [ null, %L.entry ], [ %21, %L.LB1_509 ]
  %y = phi i32 [ 9, %L.entry ], [ %22, %L.LB1_509 ]
  %z = phi i32 [ 0, %L.entry ], [ %20, %L.LB1_509 ]
  %10 = zext i32 %z to i64
  %11 = shl nuw nsw i64 %10, 2
  %12 = getelementptr i8, ptr %c, i64 %11
  %13 = load <8 x float>, ptr %12, align 4
  %14 = ptrtoint ptr %x to i64
  %15 = getelementptr i8, ptr %8, i64 %14
  %16 = getelementptr float, ptr %15, <8 x i64> %9
  %17 = tail call <8 x float> @llvm.masked.gather.v8f32.v8p0(<8 x ptr> %16, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef)
  %18 = fadd fast <8 x float> %17, %13
  %19 = getelementptr i8, ptr %a, i64 %11
  store <8 x float> %18, ptr %19, align 1
  %20 = add nuw nsw i32 %z, 8
  %21 = getelementptr i8, ptr %x, i64 %5
  %22 = add nsw i32 %y, -8
  %23 = icmp ugt i32 %y, 8
  br i1 %23, label %L.LB1_509, label %L.LB1_515

L.LB1_515:                                        ; preds = %L.LB1_509
  ret void
}

declare <8 x float> @llvm.masked.gather.v8f32.v8p0(<8 x ptr>, i32 immarg, <8 x i1>, <8 x float>)

This currently generates:

	mul	z1.s, p0/m, z1.s, z0.s
	sunpklo	z0.d, z1.s
	ext	z1.b, z1.b, z1.b, #16
	sunpklo	z1.d, z1.s
.LBB0_1:
	ld1w	{ z2.s }, p0/z, [x2]
	ld1w	{ z3.d }, p1/z, [x9, z0.d, lsl #2]
	ld1w	{ z4.d }, p1/z, [x9, z1.d, lsl #2]
	sub	w10, w10, #8
	add	x9, x9, x8
	add	x2, x2, #32
	cmp	w10, #8
	uzp1	z3.s, z3.s, z3.s
	uzp1	z4.s, z4.s, z4.s
	splice	z3.s, p2, z3.s, z4.s
	fadd	z2.s, z3.s, z2.s
	st1w	{ z2.s }, p0, [x0]
	add	x0, x0, #32
	b.hi	.LBB0_1

The extend used for the pointer arithmetic creates a double-wide <8 x i64> index vector in the header, which forces the gather to be split into two half-width gathers whose results must then be unzipped and spliced back into a single vector of 32-bit elements in the loop.

But if we sink that extend (the %9 = sext <8 x i32> %3 to <8 x i64> in L.entry) into the loop, then ISel will pick up the index vector as <8 x i32> and do the right thing. E.g.:

	mul	z0.s, p0/m, z0.s, z1.s
.LBB0_1:
	ld1w	{ z1.s }, p0/z, [x2]
	ld1w	{ z2.s }, p0/z, [x9, z0.s, sxtw #2]
	add	x9, x9, x8
	sub	w10, w10, #8
	add	x2, x2, #32
	cmp	w10, #8
	fadd	z1.s, z2.s, z1.s
	st1w	{ z1.s }, p0, [x0]
	add	x0, x0, #32
	b.hi	.LBB0_1

I prototyped this with the following patch, but it is obviously a big hammer, since it sinks all extends whose result type would be split, not just those that feed masked gather/scatter intrinsics. I'm not sure where this selective sinking should be done, so I'll turn it over to the experts at Arm for a real solution.

index dd431cc6f4f5..bfe1b9b60c1b 100644
--- a/llvm/lib/CodeGen/CodeGenPrepare.cpp
+++ b/llvm/lib/CodeGen/CodeGenPrepare.cpp
@@ -8058,9 +8058,12 @@ bool CodeGenPrepare::optimizeInst(Instruction *I, ModifyDT &ModifiedDT) {
     if (isa<ZExtInst>(I) || isa<SExtInst>(I)) {
       /// Sink a zext or sext into its user blocks if the target type doesn't
       /// fit in one register
-      if (TLI->getTypeAction(CI->getContext(),
-                             TLI->getValueType(*DL, CI->getType())) ==
-          TargetLowering::TypeExpandInteger) {
+      EVT VT = TLI->getValueType(*DL, CI->getType());
+      TargetLowering::LegalizeTypeAction Action =
+          TLI->getTypeAction(CI->getContext(), VT);
+
+      if (Action == TargetLowering::TypeExpandInteger ||
+          Action == TargetLowering::TypeSplitVector) {
         return SinkCast(CI);
       } else {
         if (TLI->optimizeExtendOrTruncateConversion(
llvmbot (Member) commented May 26, 2023

@llvm/issue-subscribers-backend-aarch64

davemgreen (Collaborator) commented

There is an AArch64TargetLowering::shouldSinkOperands method that could be taught to sink the address extends of a gather/scatter operation.
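
A rough sketch of the shape that could take is below. The helper name shouldSinkGatherScatterIndexExtend and the exact checks are illustrative only, not the change that eventually landed as the candidate fix mentioned further down; the point is simply which hook gets taught about masked gather/scatter.

// Sketch only -- not the committed change. When the instruction being
// examined is a masked gather/scatter, look through its vector-of-pointers
// operand for a GEP whose index is a sext/zext defined in another block, and
// queue that extend (and the GEP use) so CodeGenPrepare sinks the extend next
// to the memory operation.
static bool shouldSinkGatherScatterIndexExtend(IntrinsicInst *II,
                                               SmallVectorImpl<Use *> &Ops) {
  // The vector of pointers is operand 0 for gathers and operand 1 for
  // scatters.
  unsigned PtrIdx =
      II->getIntrinsicID() == Intrinsic::masked_gather ? 0 : 1;
  auto *GEP = dyn_cast<GetElementPtrInst>(II->getArgOperand(PtrIdx));
  if (!GEP || GEP->getNumIndices() != 1)
    return false;

  // Only handle an index that is a sign/zero extend defined outside the
  // gather/scatter's block; that is what currently forces the double-wide
  // index through the loop.
  auto *Ext = dyn_cast<CastInst>(GEP->getOperand(1));
  if (!Ext || !(isa<SExtInst>(Ext) || isa<ZExtInst>(Ext)) ||
      Ext->getParent() == II->getParent())
    return false;

  // Restrict to the case in this issue: 64-bit indices extended from 32-bit
  // (or narrower) elements, which ISel can fold into the sxtw/uxtw
  // addressing form.
  if (Ext->getType()->getScalarSizeInBits() != 64 ||
      Ext->getOperand(0)->getType()->getScalarSizeInBits() > 32)
    return false;

  // Dominating use first, then the intrinsic's use of the GEP, matching how
  // CodeGenPrepare::tryToSinkFreeOperands walks a chain of uses.
  Ops.push_back(&GEP->getOperandUse(1));
  Ops.push_back(&II->getArgOperandUse(PtrIdx));
  return true;
}

// ...wired up from AArch64TargetLowering::shouldSinkOperands():
//   case Intrinsic::masked_gather:
//   case Intrinsic::masked_scatter:
//     return shouldSinkGatherScatterIndexExtend(II, Ops);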

EugeneZelenko added the SVE (ARM Scalable Vector Extensions) label May 30, 2023
paulwalker-arm (Collaborator) commented

Candidate fix: #66996
