[SVE] Sink Extends when they feed masked.gather/masked.scatter intrinsics #62960

Closed · mcinally opened this issue May 26, 2023 · 3 comments
Labels: backend:AArch64, SVE (ARM Scalable Vector Extensions)

mcinally (Contributor) commented May 26, 2023

Masked gather/scatter intrinsics could generate better code if the index vector's type is sunk into the gather/scatter's block.

For example:

; llc test.llvm -O3 -mcpu=neoverse-v1 -aarch64-sve-vector-bits-min=256 -aarch64-sve-vector-bits-max=256 -o test.s
target triple = "aarch64-unknown-linux-gnu"

define void @vector_(ptr %a, ptr %b, ptr %c, ptr %j) {
L.entry:
  %0 = load i32, ptr %j, align 4
  %1 = insertelement <8 x i32> poison, i32 %0, i64 0
  %2 = shufflevector <8 x i32> %1, <8 x i32> poison, <8 x i32> zeroinitializer
  %3 = mul <8 x i32> %2, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %4 = sext i32 %0 to i64
  %5 = shl nsw i64 %4, 5
  %6 = getelementptr i8, ptr %b, i64 -4
  %7 = shl nsw i64 %4, 2
  %8 = getelementptr i8, ptr %a, i64 2
  %9 = sext <8 x i32> %3 to <8 x i64>
  br label %L.LB1_509

L.LB1_509:                                        ; preds = %L.LB1_509, %L.entry
  %x = phi ptr [ null, %L.entry ], [ %21, %L.LB1_509 ]
  %y = phi i32 [ 9, %L.entry ], [ %22, %L.LB1_509 ]
  %z = phi i32 [ 0, %L.entry ], [ %20, %L.LB1_509 ]
  %10 = zext i32 %z to i64
  %11 = shl nuw nsw i64 %10, 2
  %12 = getelementptr i8, ptr %c, i64 %11
  %13 = load <8 x float>, ptr %12, align 4
  %14 = ptrtoint ptr %x to i64
  %15 = getelementptr i8, ptr %8, i64 %14
  %16 = getelementptr float, ptr %15, <8 x i64> %9
  %17 = tail call <8 x float> @llvm.masked.gather.v8f32.v8p0(<8 x ptr> %16, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef)
  %18 = fadd fast <8 x float> %17, %13
  %19 = getelementptr i8, ptr %a, i64 %11
  store <8 x float> %18, ptr %19, align 1
  %20 = add nuw nsw i32 %z, 8
  %21 = getelementptr i8, ptr %x, i64 %5
  %22 = add nsw i32 %y, -8
  %23 = icmp ugt i32 %y, 8
  br i1 %23, label %L.LB1_509, label %L.LB1_515

L.LB1_515:                                        ; preds = %L.LB1_509
  ret void
}

declare <8 x float> @llvm.masked.gather.v8f32.v8p0(<8 x ptr>, i32 immarg, <8 x i1>, <8 x float>)

This currently generates:

	mul	z1.s, p0/m, z1.s, z0.s
	sunpklo	z0.d, z1.s
	ext	z1.b, z1.b, z1.b, #16
	sunpklo	z1.d, z1.s
.LBB0_1:
	ld1w	{ z2.s }, p0/z, [x2]
	ld1w	{ z3.d }, p1/z, [x9, z0.d, lsl #2]
	ld1w	{ z4.d }, p1/z, [x9, z1.d, lsl #2]
	sub	w10, w10, #8
	add	x9, x9, x8
	add	x2, x2, #32
	cmp	w10, #8
	uzp1	z3.s, z3.s, z3.s
	uzp1	z4.s, z4.s, z4.s
	splice	z3.s, p2, z3.s, z4.s
	fadd	z2.s, z3.s, z2.s
	st1w	{ z2.s }, p0, [x0]
	add	x0, x0, #32
	b.hi	.LBB0_1

The extend used for the pointer arithmetic creates a double-wide <8 x i64> index vector in the header, which forces the gather to be split into two half-width gathers whose results must then be unzipped and spliced back into a single vector of 32-bit elements in the loop.

But if we sink that extend (the %9 = sext <8 x i32> %3 to <8 x i64> in L.entry) into the loop, then ISel will pick up the index vector as <8 x i32> and do the right thing. E.g.:

	mul	z0.s, p0/m, z0.s, z1.s
.LBB0_1:
	ld1w	{ z1.s }, p0/z, [x2]
	ld1w	{ z2.s }, p0/z, [x9, z0.s, sxtw #2]
	add	x9, x9, x8
	sub	w10, w10, #8
	add	x2, x2, #32
	cmp	w10, #8
	fadd	z1.s, z2.s, z1.s
	st1w	{ z1.s }, p0, [x0]
	add	x0, x0, #32
	b.hi	.LBB0_1

I prototyped this with the following patch, but it is obviously a big hammer, since it sinks all extends whose result type would be split, not just those that feed masked gather/scatter intrinsics. I'm not sure where this selective sinking should be done, so I'll turn it over to the experts at Arm for a real solution.

index dd431cc6f4f5..bfe1b9b60c1b 100644
--- a/llvm/lib/CodeGen/CodeGenPrepare.cpp
+++ b/llvm/lib/CodeGen/CodeGenPrepare.cpp
@@ -8058,9 +8058,12 @@ bool CodeGenPrepare::optimizeInst(Instruction *I, ModifyDT &ModifiedDT) {
     if (isa<ZExtInst>(I) || isa<SExtInst>(I)) {
       /// Sink a zext or sext into its user blocks if the target type doesn't
       /// fit in one register
-      if (TLI->getTypeAction(CI->getContext(),
-                             TLI->getValueType(*DL, CI->getType())) ==
-          TargetLowering::TypeExpandInteger) {
+      EVT VT = TLI->getValueType(*DL, CI->getType());
+      TargetLowering::LegalizeTypeAction Action =
+          TLI->getTypeAction(CI->getContext(), VT);
+
+      if (Action == TargetLowering::TypeExpandInteger ||
+          Action == TargetLowering::TypeSplitVector) {
         return SinkCast(CI);
       } else {
         if (TLI->optimizeExtendOrTruncateConversion(
llvmbot (Member) commented May 26, 2023

@llvm/issue-subscribers-backend-aarch64

davemgreen (Collaborator) commented

There is an AArch64TargetLowering::shouldSinkOperands method that could be taught to sink the address extends of a gather/scatter operation.
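
A rough sketch of the shape that could take is below. The helper name shouldSinkGatherScatterIndexExtend and the exact checks are illustrative only, not the change that eventually landed as the candidate fix mentioned further down; the point is simply which hook gets taught about masked gather/scatter.

// Sketch only -- not the committed change. When the instruction being
// examined is a masked gather/scatter, look through its vector-of-pointers
// operand for a GEP whose index is a sext/zext defined in another block, and
// queue that extend (and the GEP use) so CodeGenPrepare sinks the extend next
// to the memory operation.
static bool shouldSinkGatherScatterIndexExtend(IntrinsicInst *II,
                                               SmallVectorImpl<Use *> &Ops) {
  // The vector of pointers is operand 0 for gathers and operand 1 for
  // scatters.
  unsigned PtrIdx =
      II->getIntrinsicID() == Intrinsic::masked_gather ? 0 : 1;
  auto *GEP = dyn_cast<GetElementPtrInst>(II->getArgOperand(PtrIdx));
  if (!GEP || GEP->getNumIndices() != 1)
    return false;

  // Only handle an index that is a sign/zero extend defined outside the
  // gather/scatter's block; that is what currently forces the double-wide
  // index through the loop.
  auto *Ext = dyn_cast<CastInst>(GEP->getOperand(1));
  if (!Ext || !(isa<SExtInst>(Ext) || isa<ZExtInst>(Ext)) ||
      Ext->getParent() == II->getParent())
    return false;

  // Restrict to the case in this issue: 64-bit indices extended from 32-bit
  // (or narrower) elements, which ISel can fold into the sxtw/uxtw
  // addressing form.
  if (Ext->getType()->getScalarSizeInBits() != 64 ||
      Ext->getOperand(0)->getType()->getScalarSizeInBits() > 32)
    return false;

  // Dominating use first, then the intrinsic's use of the GEP, matching how
  // CodeGenPrepare::tryToSinkFreeOperands walks a chain of uses.
  Ops.push_back(&GEP->getOperandUse(1));
  Ops.push_back(&II->getArgOperandUse(PtrIdx));
  return true;
}

// ...wired up from AArch64TargetLowering::shouldSinkOperands():
//   case Intrinsic::masked_gather:
//   case Intrinsic::masked_scatter:
//     return shouldSinkGatherScatterIndexExtend(II, Ops);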

EugeneZelenko added the SVE (ARM Scalable Vector Extensions) label May 30, 2023
paulwalker-arm (Collaborator) commented

Candidate fix: #66996
