
[AMDGPU] Add intrinsics for atomic struct buffer loads #100140

Merged: 2 commits into llvm:main on Jul 24, 2024

Conversation

OutOfCache
Contributor

Mark these intrinsics as atomic loads within LLVM to prevent hoisting out of loops in cases where the load is considered invariant.

Similar to #97707, but for struct buffer loads.
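For illustration, here is a minimal IR sketch of the pattern this change protects, mirroring the new tests (the kernel name is hypothetical): the load sits in a wait loop, and modeling it as an atomic load keeps passes from treating it as loop-invariant and hoisting it out.

declare i32 @llvm.amdgcn.workitem.id.x()
declare i32 @llvm.amdgcn.struct.atomic.buffer.load.i32(<4 x i32>, i32, i32, i32, i32)

define amdgpu_kernel void @wait_for_id(<4 x i32> %rsrc, i32 %index) {
entry:
  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
  br label %loop
loop:
  ; operands: rsrc, vindex, offset, soffset, cachepolicy (bit 0 = glc)
  %val = call i32 @llvm.amdgcn.struct.atomic.buffer.load.i32(<4 x i32> %rsrc, i32 %index, i32 0, i32 0, i32 1)
  ; the load must be re-issued every iteration, not hoisted above the loop
  %again = icmp eq i32 %val, %id
  br i1 %again, label %loop, label %exit
exit:
  ret void
}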

@llvmbot
Member

llvmbot commented Jul 23, 2024

@llvm/pr-subscribers-llvm-ir

Author: Jessica Del (OutOfCache)

Changes

Mark these intrinsics as atomic loads within LLVM to prevent hoisting out of loops in cases where the load is considered invariant.

Similar to #97707, but for struct buffer loads.


Patch is 36.82 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/100140.diff

6 Files Affected:

  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+35)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+3-1)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+6-2)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.atomic.buffer.load.ll (+364)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.atomic.buffer.load.ll (+364)
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index ab2620fdcf6b3..8c25467cc5e4b 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -1200,6 +1200,23 @@ class AMDGPUStructBufferLoad<LLVMType data_ty = llvm_any_ty> : DefaultAttrsIntri
 def int_amdgcn_struct_buffer_load_format : AMDGPUStructBufferLoad;
 def int_amdgcn_struct_buffer_load : AMDGPUStructBufferLoad;
 
+class AMDGPUStructAtomicBufferLoad<LLVMType data_ty = llvm_any_ty> : Intrinsic <
+  [data_ty],
+  [llvm_v4i32_ty,    // rsrc(SGPR)
+   llvm_i32_ty,      // vindex(VGPR)
+   llvm_i32_ty,      // offset(VGPR/imm, included in bounds checking and swizzling)
+   llvm_i32_ty,      // soffset(SGPR/imm, excluded from bounds checking and swizzling)
+   llvm_i32_ty],     // auxiliary/cachepolicy(imm):
+                     //                bit 0 = glc, bit 1 = slc, bit 2 = dlc (gfx10/gfx11),
+                     //                bit 3 = swz, bit 4 = scc (gfx90a)
+                     //        gfx940: bit 0 = sc0, bit 1 = nt, bit 3 = swz, bit 4 = sc1
+                     //        gfx12+: bits [0-2] = th, bits [3-4] = scope,
+                     //                bit 6 = swz
+                     //           all: volatile op (bit 31, stripped at lowering)
+  [ImmArg<ArgIndex<4>>, IntrWillReturn, IntrNoCallback, IntrNoFree], "", [SDNPMemOperand]>,
+  AMDGPURsrcIntrinsic<0>;
+def int_amdgcn_struct_atomic_buffer_load : AMDGPUStructAtomicBufferLoad;
+
 class AMDGPUStructPtrBufferLoad<LLVMType data_ty = llvm_any_ty> : DefaultAttrsIntrinsic <
   [data_ty],
   [AMDGPUBufferRsrcTy,    // rsrc(SGPR)
@@ -1219,6 +1236,24 @@ class AMDGPUStructPtrBufferLoad<LLVMType data_ty = llvm_any_ty> : DefaultAttrsIn
 def int_amdgcn_struct_ptr_buffer_load_format : AMDGPUStructPtrBufferLoad;
 def int_amdgcn_struct_ptr_buffer_load : AMDGPUStructPtrBufferLoad;
 
+class AMDGPUStructPtrAtomicBufferLoad<LLVMType data_ty = llvm_any_ty> : Intrinsic <
+  [data_ty],
+  [AMDGPUBufferRsrcTy,    // rsrc(SGPR)
+   llvm_i32_ty,           // vindex(VGPR)
+   llvm_i32_ty,           // offset(VGPR/imm, included in bounds checking and swizzling)
+   llvm_i32_ty,           // soffset(SGPR/imm, excluded from bounds checking and swizzling)
+   llvm_i32_ty],          // auxiliary/cachepolicy(imm):
+                          //                bit 0 = glc, bit 1 = slc, bit 2 = dlc (gfx10/gfx11),
+                          //                bit 3 = swz, bit 4 = scc (gfx90a)
+                          //        gfx940: bit 0 = sc0, bit 1 = nt, bit 3 = swz, bit 4 = sc1
+                          //        gfx12+: bits [0-2] = th, bits [3-4] = scope,
+                          //                bit 6 = swz
+                          //           all: volatile op (bit 31, stripped at lowering)
+  [IntrArgMemOnly, NoCapture<ArgIndex<0>>,
+   ImmArg<ArgIndex<4>>, IntrWillReturn, IntrNoCallback, IntrNoFree], "", [SDNPMemOperand]>,
+  AMDGPURsrcIntrinsic<0>;
+def int_amdgcn_struct_ptr_atomic_buffer_load : AMDGPUStructPtrAtomicBufferLoad;
+
 class AMDGPURawBufferStore<LLVMType data_ty = llvm_any_ty> : DefaultAttrsIntrinsic <
   [],
   [data_ty,          // vdata(VGPR)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 74e93b0620d26..90cee9905feba 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -7370,6 +7370,8 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
   case Intrinsic::amdgcn_raw_ptr_atomic_buffer_load:
   case Intrinsic::amdgcn_struct_buffer_load:
   case Intrinsic::amdgcn_struct_ptr_buffer_load:
+  case Intrinsic::amdgcn_struct_atomic_buffer_load:
+  case Intrinsic::amdgcn_struct_ptr_atomic_buffer_load:
     return legalizeBufferLoad(MI, MRI, B, false, false);
   case Intrinsic::amdgcn_raw_buffer_load_format:
   case Intrinsic::amdgcn_raw_ptr_buffer_load_format:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index aa329a58547f3..4a3f327e4c591 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -5020,7 +5020,9 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
     case Intrinsic::amdgcn_struct_buffer_load:
     case Intrinsic::amdgcn_struct_ptr_buffer_load:
     case Intrinsic::amdgcn_struct_tbuffer_load:
-    case Intrinsic::amdgcn_struct_ptr_tbuffer_load: {
+    case Intrinsic::amdgcn_struct_ptr_tbuffer_load:
+    case Intrinsic::amdgcn_struct_atomic_buffer_load:
+    case Intrinsic::amdgcn_struct_ptr_atomic_buffer_load: {
       OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
       OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
       OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 7f95442401dbc..8a811f7a7c02d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1278,7 +1278,9 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
         return true;
       }
       case Intrinsic::amdgcn_raw_atomic_buffer_load:
-      case Intrinsic::amdgcn_raw_ptr_atomic_buffer_load: {
+      case Intrinsic::amdgcn_raw_ptr_atomic_buffer_load:
+      case Intrinsic::amdgcn_struct_atomic_buffer_load:
+      case Intrinsic::amdgcn_struct_ptr_atomic_buffer_load: {
         Info.memVT =
             memVTFromLoadIntrReturn(*this, MF.getDataLayout(), CI.getType(),
                                     std::numeric_limits<unsigned>::max());
@@ -8925,7 +8927,9 @@ SDValue SITargetLowering::LowerINTRINSIC_W_CHAIN(SDValue Op,
   case Intrinsic::amdgcn_struct_buffer_load:
   case Intrinsic::amdgcn_struct_ptr_buffer_load:
   case Intrinsic::amdgcn_struct_buffer_load_format:
-  case Intrinsic::amdgcn_struct_ptr_buffer_load_format: {
+  case Intrinsic::amdgcn_struct_ptr_buffer_load_format:
+  case Intrinsic::amdgcn_struct_atomic_buffer_load:
+  case Intrinsic::amdgcn_struct_ptr_atomic_buffer_load: {
     const bool IsFormat =
         IntrID == Intrinsic::amdgcn_struct_buffer_load_format ||
         IntrID == Intrinsic::amdgcn_struct_ptr_buffer_load_format;
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.atomic.buffer.load.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.atomic.buffer.load.ll
new file mode 100644
index 0000000000000..e300c4d2f5a15
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.atomic.buffer.load.ll
@@ -0,0 +1,364 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -march=amdgcn -mcpu=gfx1100 -global-isel=0 | FileCheck %s -check-prefix=CHECK
+; RUN: llc < %s -march=amdgcn -mcpu=gfx1100 -global-isel=1 | FileCheck %s -check-prefix=CHECK
+
+define amdgpu_kernel void @struct_atomic_buffer_load_i32(<4 x i32> %addr, i32 %index) {
+; CHECK-LABEL: struct_atomic_buffer_load_i32:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_clause 0x1
+; CHECK-NEXT:    s_load_b32 s4, s[2:3], 0x34
+; CHECK-NEXT:    s_load_b128 s[0:3], s[2:3], 0x24
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_dual_mov_b32 v1, s4 :: v_dual_and_b32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:  .LBB0_1: ; %bb1
+; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    buffer_load_b32 v2, v1, s[0:3], 0 idxen glc
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, v2, v0
+; CHECK-NEXT:    s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT:    s_cbranch_execnz .LBB0_1
+; CHECK-NEXT:  ; %bb.2: ; %bb2
+; CHECK-NEXT:    s_endpgm
+bb:
+  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+  br label %bb1
+bb1:
+  %load = call i32 @llvm.amdgcn.struct.atomic.buffer.load.i32(<4 x i32> %addr, i32 %index, i32 0, i32 0, i32 1)
+  %cmp = icmp eq i32 %load, %id
+  br i1 %cmp, label %bb1, label %bb2
+bb2:
+  ret void
+}
+
+define amdgpu_kernel void @struct_atomic_buffer_load_i32_const_idx(<4 x i32> %addr) {
+; CHECK-LABEL: struct_atomic_buffer_load_i32_const_idx:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_load_b128 s[0:3], s[2:3], 0x24
+; CHECK-NEXT:    v_dual_mov_b32 v1, 15 :: v_dual_and_b32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:  .LBB1_1: ; %bb1
+; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_b32 v2, v1, s[0:3], 0 idxen glc
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, v2, v0
+; CHECK-NEXT:    s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT:    s_cbranch_execnz .LBB1_1
+; CHECK-NEXT:  ; %bb.2: ; %bb2
+; CHECK-NEXT:    s_endpgm
+bb:
+  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+  br label %bb1
+bb1:
+  %load = call i32 @llvm.amdgcn.struct.atomic.buffer.load.i32(<4 x i32> %addr, i32 15, i32 0, i32 0, i32 1)
+  %cmp = icmp eq i32 %load, %id
+  br i1 %cmp, label %bb1, label %bb2
+bb2:
+  ret void
+}
+
+define amdgpu_kernel void @struct_atomic_buffer_load_i32_off(<4 x i32> %addr, i32 %index) {
+; CHECK-LABEL: struct_atomic_buffer_load_i32_off:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_clause 0x1
+; CHECK-NEXT:    s_load_b32 s4, s[2:3], 0x34
+; CHECK-NEXT:    s_load_b128 s[0:3], s[2:3], 0x24
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_dual_mov_b32 v1, s4 :: v_dual_and_b32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:  .LBB2_1: ; %bb1
+; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    buffer_load_b32 v2, v1, s[0:3], 0 idxen glc
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, v2, v0
+; CHECK-NEXT:    s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT:    s_cbranch_execnz .LBB2_1
+; CHECK-NEXT:  ; %bb.2: ; %bb2
+; CHECK-NEXT:    s_endpgm
+bb:
+  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+  br label %bb1
+bb1:
+  %load = call i32 @llvm.amdgcn.struct.atomic.buffer.load.i32(<4 x i32> %addr, i32 %index, i32 0, i32 0, i32 1)
+  %cmp = icmp eq i32 %load, %id
+  br i1 %cmp, label %bb1, label %bb2
+bb2:
+  ret void
+}
+
+define amdgpu_kernel void @struct_atomic_buffer_load_i32_soff(<4 x i32> %addr, i32 %index) {
+; CHECK-LABEL: struct_atomic_buffer_load_i32_soff:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_clause 0x1
+; CHECK-NEXT:    s_load_b32 s4, s[2:3], 0x34
+; CHECK-NEXT:    s_load_b128 s[0:3], s[2:3], 0x24
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_dual_mov_b32 v1, s4 :: v_dual_and_b32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:  .LBB3_1: ; %bb1
+; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    buffer_load_b32 v2, v1, s[0:3], 4 idxen offset:4 glc
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, v2, v0
+; CHECK-NEXT:    s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT:    s_cbranch_execnz .LBB3_1
+; CHECK-NEXT:  ; %bb.2: ; %bb2
+; CHECK-NEXT:    s_endpgm
+bb:
+  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+  br label %bb1
+bb1:
+  %load = call i32 @llvm.amdgcn.struct.atomic.buffer.load.i32(<4 x i32> %addr, i32 %index, i32 4, i32 4, i32 1)
+  %cmp = icmp eq i32 %load, %id
+  br i1 %cmp, label %bb1, label %bb2
+bb2:
+  ret void
+}
+define amdgpu_kernel void @struct_atomic_buffer_load_i32_dlc(<4 x i32> %addr, i32 %index) {
+; CHECK-LABEL: struct_atomic_buffer_load_i32_dlc:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_clause 0x1
+; CHECK-NEXT:    s_load_b32 s4, s[2:3], 0x34
+; CHECK-NEXT:    s_load_b128 s[0:3], s[2:3], 0x24
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_dual_mov_b32 v1, s4 :: v_dual_and_b32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:  .LBB4_1: ; %bb1
+; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    buffer_load_b32 v2, v1, s[0:3], 0 idxen offset:4 dlc
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, v2, v0
+; CHECK-NEXT:    s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT:    s_cbranch_execnz .LBB4_1
+; CHECK-NEXT:  ; %bb.2: ; %bb2
+; CHECK-NEXT:    s_endpgm
+bb:
+  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+  br label %bb1
+bb1:
+  %load = call i32 @llvm.amdgcn.struct.atomic.buffer.load.i32(<4 x i32> %addr, i32 %index, i32 4, i32 0, i32 4)
+  %cmp = icmp eq i32 %load, %id
+  br i1 %cmp, label %bb1, label %bb2
+bb2:
+  ret void
+}
+
+define amdgpu_kernel void @struct_nonatomic_buffer_load_i32(<4 x i32> %addr, i32 %index) {
+; CHECK-LABEL: struct_nonatomic_buffer_load_i32:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_clause 0x1
+; CHECK-NEXT:    s_load_b32 s4, s[2:3], 0x34
+; CHECK-NEXT:    s_load_b128 s[0:3], s[2:3], 0x24
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_dual_mov_b32 v1, s4 :: v_dual_and_b32 v0, 0x3ff, v0
+; CHECK-NEXT:    buffer_load_b32 v1, v1, s[0:3], 0 idxen offset:4 glc
+; CHECK-NEXT:    s_mov_b32 s0, 0
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, v1, v0
+; CHECK-NEXT:  .LBB5_1: ; %bb1
+; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    s_and_b32 s1, exec_lo, vcc_lo
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; CHECK-NEXT:    s_or_b32 s0, s1, s0
+; CHECK-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s0
+; CHECK-NEXT:    s_cbranch_execnz .LBB5_1
+; CHECK-NEXT:  ; %bb.2: ; %bb2
+; CHECK-NEXT:    s_endpgm
+bb:
+  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+  br label %bb1
+bb1:
+  %load = call i32 @llvm.amdgcn.struct.buffer.load.i32(<4 x i32> %addr, i32 %index, i32 4, i32 0, i32 1)
+  %cmp = icmp eq i32 %load, %id
+  br i1 %cmp, label %bb1, label %bb2
+bb2:
+  ret void
+}
+
+define amdgpu_kernel void @struct_atomic_buffer_load_i64(<4 x i32> %addr, i32 %index) {
+; CHECK-LABEL: struct_atomic_buffer_load_i64:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_clause 0x1
+; CHECK-NEXT:    s_load_b32 s4, s[2:3], 0x34
+; CHECK-NEXT:    s_load_b128 s[0:3], s[2:3], 0x24
+; CHECK-NEXT:    v_dual_mov_b32 v1, 0 :: v_dual_and_b32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_mov_b32_e32 v2, s4
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:  .LBB6_1: ; %bb1
+; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    buffer_load_b64 v[3:4], v2, s[0:3], 0 idxen offset:4 glc
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_cmp_ne_u64_e32 vcc_lo, v[3:4], v[0:1]
+; CHECK-NEXT:    s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT:    s_cbranch_execnz .LBB6_1
+; CHECK-NEXT:  ; %bb.2: ; %bb2
+; CHECK-NEXT:    s_endpgm
+bb:
+  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+  %id.zext = zext i32 %id to i64
+  br label %bb1
+bb1:
+  %load = call i64 @llvm.amdgcn.struct.atomic.buffer.load.i64(<4 x i32> %addr, i32 %index, i32 4, i32 0, i32 1)
+  %cmp = icmp eq i64 %load, %id.zext
+  br i1 %cmp, label %bb1, label %bb2
+bb2:
+  ret void
+}
+
+define amdgpu_kernel void @struct_atomic_buffer_load_v2i16(<4 x i32> %addr, i32 %index) {
+; CHECK-LABEL: struct_atomic_buffer_load_v2i16:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_clause 0x1
+; CHECK-NEXT:    s_load_b32 s4, s[2:3], 0x34
+; CHECK-NEXT:    s_load_b128 s[0:3], s[2:3], 0x24
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_dual_mov_b32 v1, s4 :: v_dual_and_b32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:  .LBB7_1: ; %bb1
+; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    buffer_load_b32 v2, v1, s[0:3], 0 idxen glc
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, v2, v0
+; CHECK-NEXT:    s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT:    s_cbranch_execnz .LBB7_1
+; CHECK-NEXT:  ; %bb.2: ; %bb2
+; CHECK-NEXT:    s_endpgm
+bb:
+  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+  br label %bb1
+bb1:
+  %load = call <2 x i16> @llvm.amdgcn.struct.atomic.buffer.load.v2i16(<4 x i32> %addr, i32 %index, i32 0, i32 0, i32 1)
+  %bitcast = bitcast <2 x i16> %load to i32
+  %cmp = icmp eq i32 %bitcast, %id
+  br i1 %cmp, label %bb1, label %bb2
+bb2:
+  ret void
+}
+
+define amdgpu_kernel void @struct_atomic_buffer_load_v4i16(<4 x i32> %addr, i32 %index) {
+; CHECK-LABEL: struct_atomic_buffer_load_v4i16:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_clause 0x1
+; CHECK-NEXT:    s_load_b32 s4, s[2:3], 0x34
+; CHECK-NEXT:    s_load_b128 s[0:3], s[2:3], 0x24
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_dual_mov_b32 v1, s4 :: v_dual_and_b32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:  .LBB8_1: ; %bb1
+; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    buffer_load_b64 v[2:3], v1, s[0:3], 0 idxen offset:4 glc
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_and_b32_e32 v2, 0xffff, v2
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; CHECK-NEXT:    v_lshl_or_b32 v2, v3, 16, v2
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, v2, v0
+; CHECK-NEXT:    s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT:    s_cbranch_execnz .LBB8_1
+; CHECK-NEXT:  ; %bb.2: ; %bb2
+; CHECK-NEXT:    s_endpgm
+bb:
+  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+  br label %bb1
+bb1:
+  %load = call <4 x i16> @llvm.amdgcn.struct.atomic.buffer.load.v4i16(<4 x i32> %addr, i32 %index, i32 4, i32 0, i32 1)
+  %shortened = shufflevector <4 x i16> %load, <4 x i16> poison, <2 x i32> <i32 0, i32 2>
+  %bitcast = bitcast <2 x i16> %shortened to i32
+  %cmp = icmp eq i32 %bitcast, %id
+  br i1 %cmp, label %bb1, label %bb2
+bb2:
+  ret void
+}
+
+define amdgpu_kernel void @struct_atomic_buffer_load_v4i32(<4 x i32> %addr, i32 %index) {
+; CHECK-LABEL: struct_atomic_buffer_load_v4i32:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_clause 0x1
+; CHECK-NEXT:    s_load_b32 s4, s[2:3], 0x34
+; CHECK-NEXT:    s_load_b128 s[0:3], s[2:3], 0x24
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_dual_mov_b32 v1, s4 :: v_dual_and_b32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:  .LBB9_1: ; %bb1
+; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    buffer_load_b128 v[2:5], v1, s[0:3], 0 idxen offset:4 glc
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, v5, v0
+; CHECK-NEXT:    s_or_b32 s4, vcc_lo, s4
+; CHECK-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; CHECK-NEXT:    s_cbranch_execnz .LBB9_1
+; CHECK-NEXT:  ; %bb.2: ; %bb2
+; CHECK-NEXT:    s_endpgm
+bb:
+  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+  br label %bb1
+bb1:
+  %load = call <4 x i32> @llvm.amdgcn.struct.atomic.buffer.load.v4i32(<4 x i32> %addr, i32 %index, i32 4, i32 0, i32 1)
+  %extracted = extractelement <4 x i32> %load, i32 3
+  %cmp = icmp eq i32 %extracted, %id
+  br i1 %cmp, label %bb1, label %bb2
+bb2:
+  ret void
+}
+
+define amdgpu_kernel void @struct_atomic_buffer_load_ptr(<4 x i32> %addr, i32 %index) {
+; CHECK-LABEL: struct_atomic_buffer_load_ptr:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_clause 0x1
+; CHECK-NEXT:    s_load_b32 s4, s[2:3], 0x34
+; CHECK-NEXT:    s_load_b128 s[0:3], s[2:3], 0x24
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_dual_mov_b32 v1, s4 :: v_dual_and_b32 v0, 0x3ff, v0
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:  .LBB10_1: ; %bb1
+; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    buffer_load_b64 v[2:3], v1, s[0:3], 0 idxen offset:4 glc
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    flat_load_b32 v2, v[2:3]
+; CHECK-N...
[truncated]

@llvmbot
Member

llvmbot commented Jul 23, 2024

@llvm/pr-subscribers-backend-amdgpu


Contributor

@arsenm arsenm left a comment

lgtm with nits

ret void
}

; Function Attrs: nounwind readonly
Contributor

Comment is misleading/dead?

Comment on lines 2 to 3
; RUN: llc < %s -march=amdgcn -mcpu=gfx1100 -global-isel=0 | FileCheck %s -check-prefix=CHECK
; RUN: llc < %s -march=amdgcn -mcpu=gfx1100 -global-isel=1 | FileCheck %s -check-prefix=CHECK
Contributor

-global-isel args should be first, < %s should be last.
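For reference, one way the RUN lines could be reordered to match that suggestion (a sketch; the committed form may differ):

; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx1100 < %s | FileCheck %s -check-prefix=CHECK
; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx1100 < %s | FileCheck %s -check-prefix=CHECK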

@OutOfCache OutOfCache merged commit 6a1b119 into llvm:main Jul 24, 2024
7 checks passed
@OutOfCache OutOfCache deleted the struct-atomic-loads branch July 24, 2024 09:05
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
Summary:
Mark these intrinsics as atomic loads within LLVM to prevent hoisting out of loops in cases where the load is considered invariant.

Similar to #97707, but for struct buffer loads.

Test Plan: 

Reviewers: 

Subscribers: 

Tasks: 

Tags: 


Differential Revision: https://phabricator.intern.facebook.com/D60250668
Harini0924 pushed a commit to Harini0924/llvm-project that referenced this pull request Aug 1, 2024
Mark these intrinsics as atomic loads within LLVM to prevent hoisting out of loops in cases where the load is considered invariant.

Similar to llvm#97707, but for struct buffer loads.