
[RISCV] Support postRA vsetvl insertion pass #70549

Merged: 3 commits merged on May 21, 2024

Conversation

BeMg
Contributor

@BeMg BeMg commented Oct 28, 2023

This patch tries to get rid of the vsetvl implicit vl/vtype def-use chain and improve register allocation quality by moving the vsetvl insertion pass after RVV register allocation.

This unblocks the following optimizations:

  1. unblock scheduler's constraints by removing vl/vtype def-use chain
  2. Support RVV re-materialization
  3. Support partial spill

This patch adds a new option, -riscv-vsetvl-after-rvv-regalloc=<1|0>, to control this feature; it defaults to disabled.
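To make the option concrete, below is a rough sketch of how such a flag is typically wired into RISCVPassConfig. The option string comes from this PR, but the variable name, description text, and surrounding pass-ordering code are illustrative assumptions rather than the merged implementation.

// Hedged sketch, not the merged code: a cl::opt decides whether vsetvl
// insertion keeps its old pre-RA placement or is scheduled later.
static cl::opt<bool> VSetVLAfterRVVRegAlloc(
    "riscv-vsetvl-after-rvv-regalloc",
    cl::desc("Insert vsetvl instructions after RVV register allocation"),
    cl::init(false), cl::Hidden);

void RISCVPassConfig::addPreRegAlloc() {
  // ... other pre-RA passes elided ...
  // Keep the old pre-RA placement when the feature is disabled; when it is
  // enabled, the pass instead runs between RVV register allocation and the
  // remaining (scalar) register allocation.
  if (!VSetVLAfterRVVRegAlloc)
    addPass(createRISCVInsertVSETVLIPass());
}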

@github-actions

github-actions bot commented Oct 28, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@llvmbot
Member

llvmbot commented Oct 28, 2023

@llvm/pr-subscribers-debuginfo

@llvm/pr-subscribers-backend-risc-v

Author: Piyou Chen (BeMg)

Changes

This patch tries to get rid of the vsetvl implicit vl/vtype def-use chain and improve register allocation quality by

  1. move the vsetvl insertion pass after RVV register allocation
  2. split register allocation into RVV and other

This unblocks the following optimizations:

  1. unblock scheduler's constraints by removing vl/vtype def-use chain
  2. Support RVV re-materialization
  3. Support partial spill

This patch adds a new option, -riscv-split-RA=<1|0>, to control this feature; it defaults to disabled.

It can compile the local test suite and some related large benchmarks, but it still generates worse code in some cases (llvm regression tests).


Note

Register allocation

  • The allocator is split by passing a filter function that controls which virtual registers are allocated in each run (see the sketch after this note).

VSETVL insertion pass

  • Because the MIR is no longer in SSA form, several places need to be updated.
  • LiveIntervals must be kept correct, because this pass inserts and removes instructions.
  • LiveDebugVariables also needs to be preserved, otherwise an assertion failure is triggered.
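As a companion to the first bullet under "Register allocation" above, here is a minimal sketch of the filter-function idea. The helper names and the exact callback signature are assumptions based on how LLVM's greedy allocator is parameterized, not this patch's code.

// Hedged sketch: the greedy allocator can be constructed with a predicate that
// restricts which register classes it will assign, so one run handles only RVV
// registers and a later, unfiltered run handles everything else.
static bool onlyAllocateRVVReg(const TargetRegisterInfo &TRI,
                               const TargetRegisterClass &RC) {
  // Accept only vector register classes; everything else stays virtual for the
  // second register allocation run.
  return RISCVRegisterInfo::isRVVRegClass(&RC);
}

static FunctionPass *createRVVRegAllocPass() {
  return createGreedyRegisterAllocator(onlyAllocateRVVReg);
}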

Patch is 31.97 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/70549.diff

9 Files Affected:

  • (renamed) llvm/include/llvm/CodeGen/LiveDebugVariables.h (+1-1)
  • (modified) llvm/lib/CodeGen/LiveDebugVariables.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/RegAllocBasic.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/RegAllocGreedy.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/VirtRegMap.cpp (+1-1)
  • (modified) llvm/lib/Target/RISCV/RISCV.h (+1-1)
  • (modified) llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp (+208-44)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (+110-1)
  • (added) llvm/test/CodeGen/RISCV/rvv/splitRA-vsetvl.ll (+116)
diff --git a/llvm/lib/CodeGen/LiveDebugVariables.h b/llvm/include/llvm/CodeGen/LiveDebugVariables.h
similarity index 96%
rename from llvm/lib/CodeGen/LiveDebugVariables.h
rename to llvm/include/llvm/CodeGen/LiveDebugVariables.h
index 9998ce9e8dad861..3643a2cd0981f19 100644
--- a/llvm/lib/CodeGen/LiveDebugVariables.h
+++ b/llvm/include/llvm/CodeGen/LiveDebugVariables.h
@@ -29,7 +29,7 @@ template <typename T> class ArrayRef;
 class LiveIntervals;
 class VirtRegMap;
 
-class LLVM_LIBRARY_VISIBILITY LiveDebugVariables : public MachineFunctionPass {
+class LiveDebugVariables : public MachineFunctionPass {
   void *pImpl = nullptr;
 
 public:
diff --git a/llvm/lib/CodeGen/LiveDebugVariables.cpp b/llvm/lib/CodeGen/LiveDebugVariables.cpp
index 9603c1f01e08569..5f5a858c7bade6e 100644
--- a/llvm/lib/CodeGen/LiveDebugVariables.cpp
+++ b/llvm/lib/CodeGen/LiveDebugVariables.cpp
@@ -18,7 +18,7 @@
 //
 //===----------------------------------------------------------------------===//
 
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/IntervalMap.h"
diff --git a/llvm/lib/CodeGen/RegAllocBasic.cpp b/llvm/lib/CodeGen/RegAllocBasic.cpp
index 6661991396302ec..a4d4a218fea5584 100644
--- a/llvm/lib/CodeGen/RegAllocBasic.cpp
+++ b/llvm/lib/CodeGen/RegAllocBasic.cpp
@@ -12,7 +12,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "AllocationOrder.h"
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
 #include "RegAllocBase.h"
 #include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/CalcSpillWeights.h"
diff --git a/llvm/lib/CodeGen/RegAllocGreedy.cpp b/llvm/lib/CodeGen/RegAllocGreedy.cpp
index 349d8b0975f3a10..4c25fcfeba89073 100644
--- a/llvm/lib/CodeGen/RegAllocGreedy.cpp
+++ b/llvm/lib/CodeGen/RegAllocGreedy.cpp
@@ -14,7 +14,7 @@
 #include "RegAllocGreedy.h"
 #include "AllocationOrder.h"
 #include "InterferenceCache.h"
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
 #include "RegAllocBase.h"
 #include "RegAllocEvictionAdvisor.h"
 #include "RegAllocPriorityAdvisor.h"
diff --git a/llvm/lib/CodeGen/VirtRegMap.cpp b/llvm/lib/CodeGen/VirtRegMap.cpp
index 48f4ee29fbe95d4..a2b046a59c9d773 100644
--- a/llvm/lib/CodeGen/VirtRegMap.cpp
+++ b/llvm/lib/CodeGen/VirtRegMap.cpp
@@ -16,7 +16,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "llvm/CodeGen/VirtRegMap.h"
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/CodeGen/LiveInterval.h"
diff --git a/llvm/lib/Target/RISCV/RISCV.h b/llvm/lib/Target/RISCV/RISCV.h
index 3d8e33dc716ea44..9c9906ff0bc6430 100644
--- a/llvm/lib/Target/RISCV/RISCV.h
+++ b/llvm/lib/Target/RISCV/RISCV.h
@@ -60,7 +60,7 @@ void initializeRISCVPreRAExpandPseudoPass(PassRegistry &);
 FunctionPass *createRISCVExpandAtomicPseudoPass();
 void initializeRISCVExpandAtomicPseudoPass(PassRegistry &);
 
-FunctionPass *createRISCVInsertVSETVLIPass();
+FunctionPass *createRISCVInsertVSETVLIPass(bool IsSplitRA = false);
 void initializeRISCVInsertVSETVLIPass(PassRegistry &);
 
 FunctionPass *createRISCVPostRAExpandPseudoPass();
diff --git a/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp b/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
index 9d2b2c3b3f59264..03c8d72a32dc714 100644
--- a/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
+++ b/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
@@ -26,6 +26,7 @@
 
 #include "RISCV.h"
 #include "RISCVSubtarget.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
 #include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include <queue>
@@ -44,6 +45,86 @@ static cl::opt<bool> UseStrictAsserts(
 
 namespace {
 
+// Return true when CandidateDefMI can arrive UserMI without reach other defMI
+// thought CFG. Otherwise, if it:
+//    1. Encounter the DefMI again
+//    2. Seen the all BasicBlock can reach
+// return false
+static bool isReachingDef(const MachineInstr &UserMI,
+                          const MachineInstr &CandidateDefMI, Register Reg,
+                          const MachineRegisterInfo *MRI) {
+  std::queue<std::pair<const MachineBasicBlock *, bool>> MBBQ;
+  llvm::DenseMap<int, bool> SeenMBB;
+
+  MBBQ.push(std::make_pair(CandidateDefMI.getParent(), false));
+  while (!MBBQ.empty()) {
+
+    const MachineBasicBlock *CurrMBB = MBBQ.front().first;
+    bool SeenDef = MBBQ.front().second;
+    MBBQ.pop();
+    SeenMBB[CurrMBB->getNumber()] = true;
+    bool NeedSkip = false;
+    for (auto &MI : *CurrMBB) {
+      // If we encounter DefMI more than once, this CandidateDefMI is not
+      // reaching definition for UserMI.
+      if (SeenDef && llvm::any_of(MRI->def_instructions(Reg),
+                                  [&MI](const MachineInstr &DefMI) {
+                                    return DefMI.isIdenticalTo(MI);
+                                  })) {
+        NeedSkip = true;
+        break;
+      }
+
+      if (MI.isIdenticalTo(CandidateDefMI))
+        SeenDef = true;
+
+      if (SeenDef && MI.isIdenticalTo(UserMI))
+        return true;
+    }
+
+    if (NeedSkip)
+      continue;
+
+    for (auto *Succ : CurrMBB->successors()) {
+      if (SeenMBB[Succ->getNumber()])
+        continue;
+      MBBQ.push(std::make_pair(Succ, SeenDef));
+    }
+  }
+
+  return false;
+}
+
+// Return MachineInstr* when we found the unique reaching definition.
+// Otherwise, return nullptr.
+// FIXME: Could it detect the PHI node situation?
+static MachineInstr *findReachingDef(Register Reg, const MachineInstr *MI,
+                                     const MachineRegisterInfo *MRI) {
+
+  MachineInstr *ReachingDefMI = nullptr;
+  for (auto &DefMI : MRI->def_instructions(Reg)) {
+    if (isReachingDef(*MI, DefMI, Reg, MRI)) {
+      if (!ReachingDefMI)
+        ReachingDefMI = &DefMI;
+      else
+        return nullptr;
+    }
+  }
+
+  return ReachingDefMI;
+}
+
+// For the SSA form, we could just use the getVRegDef to take Reaching
+// definition. For the non-SSA, we create this pass own reaching definition
+// searching function.
+static MachineInstr *getReachingDefMI(Register VReg, const MachineInstr *MI,
+                                      const MachineRegisterInfo *MRI) {
+  if (MRI->isSSA())
+    return MRI->getVRegDef(VReg);
+
+  return findReachingDef(VReg, MI, MRI);
+}
+
 static unsigned getVLOpNum(const MachineInstr &MI) {
   return RISCVII::getVLOpNum(MI.getDesc());
 }
@@ -187,13 +268,17 @@ static bool hasUndefinedMergeOp(const MachineInstr &MI,
   if (UseMO.getReg() == RISCV::NoRegister)
     return true;
 
-  if (MachineInstr *UseMI = MRI.getVRegDef(UseMO.getReg())) {
+  if (!MRI.isSSA() && UseMO.isUndef())
+    return true;
+
+  if (MachineInstr *UseMI = getReachingDefMI(UseMO.getReg(), &MI, &MRI)) {
     if (UseMI->isImplicitDef())
       return true;
 
     if (UseMI->isRegSequence()) {
       for (unsigned i = 1, e = UseMI->getNumOperands(); i < e; i += 2) {
-        MachineInstr *SourceMI = MRI.getVRegDef(UseMI->getOperand(i).getReg());
+        MachineInstr *SourceMI =
+            getReachingDefMI(UseMI->getOperand(i).getReg(), UseMI, &MRI);
         if (!SourceMI || !SourceMI->isImplicitDef())
           return false;
       }
@@ -494,9 +579,9 @@ class VSETVLIInfo {
     if (hasAVLReg()) {
       if (getAVLReg() == RISCV::X0)
         return true;
-      if (MachineInstr *MI = MRI.getVRegDef(getAVLReg());
-          MI && MI->getOpcode() == RISCV::ADDI &&
-          MI->getOperand(1).isReg() && MI->getOperand(2).isImm() &&
+      if (MachineInstr *MI = MRI.getUniqueVRegDef(getAVLReg());
+          MI && MI->getOpcode() == RISCV::ADDI && MI->getOperand(1).isReg() &&
+          MI->getOperand(2).isImm() &&
           MI->getOperand(1).getReg() == RISCV::X0 &&
           MI->getOperand(2).getImm() != 0)
         return true;
@@ -727,20 +812,32 @@ class RISCVInsertVSETVLI : public MachineFunctionPass {
   const RISCVSubtarget *ST;
   const TargetInstrInfo *TII;
   MachineRegisterInfo *MRI;
+  LiveIntervals *LIS = nullptr;
 
   std::vector<BlockData> BlockInfo;
   std::queue<const MachineBasicBlock *> WorkList;
+  bool IsSplitRA = false;
 
 public:
   static char ID;
 
-  RISCVInsertVSETVLI() : MachineFunctionPass(ID) {
+  RISCVInsertVSETVLI(bool IsSplitRA = false)
+      : MachineFunctionPass(ID), IsSplitRA(IsSplitRA) {
     initializeRISCVInsertVSETVLIPass(*PassRegistry::getPassRegistry());
   }
   bool runOnMachineFunction(MachineFunction &MF) override;
 
   void getAnalysisUsage(AnalysisUsage &AU) const override {
     AU.setPreservesCFG();
+
+    if (IsSplitRA) {
+      AU.addRequired<LiveIntervals>();
+      AU.addPreserved<LiveIntervals>();
+      AU.addRequired<SlotIndexes>();
+      AU.addPreserved<SlotIndexes>();
+      AU.addPreserved<LiveDebugVariables>();
+    }
+
     MachineFunctionPass::getAnalysisUsage(AU);
   }
 
@@ -864,6 +961,30 @@ static VSETVLIInfo getInfoForVSETVLI(const MachineInstr &MI) {
   return NewInfo;
 }
 
+static void fixupModifyVRegLI(Register VReg, LiveIntervals *LIS) {
+  if (!LIS)
+    return;
+
+  if (LIS->hasInterval(VReg))
+    LIS->removeInterval(VReg);
+  LIS->createAndComputeVirtRegInterval(VReg);
+}
+
+static void fixupModifyVRegLIFromVSETVL(MachineInstr *MI, LiveIntervals *LIS) {
+
+  if (!LIS)
+    return;
+
+  if (LIS->isNotInMIMap(*MI))
+    LIS->InsertMachineInstrInMaps(*MI);
+  for (auto &MO : MI->operands()) {
+    if (!MO.isReg() || MO.getReg() == 0 || !MO.getReg().isVirtual())
+      continue;
+    Register Reg = MO.getReg();
+    fixupModifyVRegLI(Reg, LIS);
+  }
+}
+
 void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
                      MachineBasicBlock::iterator InsertPt, DebugLoc DL,
                      const VSETVLIInfo &Info, const VSETVLIInfo &PrevInfo) {
@@ -872,11 +993,13 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
     // Use X0, X0 form if the AVL is the same and the SEW+LMUL gives the same
     // VLMAX.
     if (Info.hasSameAVL(PrevInfo) && Info.hasSameVLMAX(PrevInfo)) {
-      BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
-          .addReg(RISCV::X0, RegState::Define | RegState::Dead)
-          .addReg(RISCV::X0, RegState::Kill)
-          .addImm(Info.encodeVTYPE())
-          .addReg(RISCV::VL, RegState::Implicit);
+      auto NeedFixupMI =
+          BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
+              .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+              .addReg(RISCV::X0, RegState::Kill)
+              .addImm(Info.encodeVTYPE())
+              .addReg(RISCV::VL, RegState::Implicit);
+      fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
       return;
     }
 
@@ -885,15 +1008,18 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
     // same, we can use the X0, X0 form.
     if (Info.hasSameVLMAX(PrevInfo) && Info.hasAVLReg() &&
         Info.getAVLReg().isVirtual()) {
-      if (MachineInstr *DefMI = MRI->getVRegDef(Info.getAVLReg())) {
+      if (MachineInstr *DefMI =
+              getReachingDefMI(Info.getAVLReg(), &(*InsertPt), MRI)) {
         if (isVectorConfigInstr(*DefMI)) {
           VSETVLIInfo DefInfo = getInfoForVSETVLI(*DefMI);
           if (DefInfo.hasSameAVL(PrevInfo) && DefInfo.hasSameVLMAX(PrevInfo)) {
-            BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
-                .addReg(RISCV::X0, RegState::Define | RegState::Dead)
-                .addReg(RISCV::X0, RegState::Kill)
-                .addImm(Info.encodeVTYPE())
-                .addReg(RISCV::VL, RegState::Implicit);
+            auto NeedFixupMI =
+                BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
+                    .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+                    .addReg(RISCV::X0, RegState::Kill)
+                    .addImm(Info.encodeVTYPE())
+                    .addReg(RISCV::VL, RegState::Implicit);
+            fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
             return;
           }
         }
@@ -902,10 +1028,12 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
   }
 
   if (Info.hasAVLImm()) {
-    BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
-        .addReg(RISCV::X0, RegState::Define | RegState::Dead)
-        .addImm(Info.getAVLImm())
-        .addImm(Info.encodeVTYPE());
+    auto NeedFixupMI =
+        BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
+            .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+            .addImm(Info.getAVLImm())
+            .addImm(Info.encodeVTYPE());
+    fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
     return;
   }
 
@@ -915,18 +1043,22 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
     // the previous vl to become invalid.
     if (PrevInfo.isValid() && !PrevInfo.isUnknown() &&
         Info.hasSameVLMAX(PrevInfo)) {
-      BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
-          .addReg(RISCV::X0, RegState::Define | RegState::Dead)
-          .addReg(RISCV::X0, RegState::Kill)
-          .addImm(Info.encodeVTYPE())
-          .addReg(RISCV::VL, RegState::Implicit);
+      auto NeedFixupMI =
+          BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
+              .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+              .addReg(RISCV::X0, RegState::Kill)
+              .addImm(Info.encodeVTYPE())
+              .addReg(RISCV::VL, RegState::Implicit);
+      fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
       return;
     }
     // Otherwise use an AVL of 1 to avoid depending on previous vl.
-    BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
-        .addReg(RISCV::X0, RegState::Define | RegState::Dead)
-        .addImm(1)
-        .addImm(Info.encodeVTYPE());
+    auto NeedFixupMI =
+        BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
+            .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+            .addImm(1)
+            .addImm(Info.encodeVTYPE());
+    fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
     return;
   }
 
@@ -942,10 +1074,13 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
     DestReg = MRI->createVirtualRegister(&RISCV::GPRRegClass);
     Opcode = RISCV::PseudoVSETVLIX0;
   }
-  BuildMI(MBB, InsertPt, DL, TII->get(Opcode))
-      .addReg(DestReg, RegState::Define | RegState::Dead)
-      .addReg(AVLReg)
-      .addImm(Info.encodeVTYPE());
+  auto NeedFixupMI =
+      BuildMI(MBB, InsertPt, DL, TII->get(Opcode))
+          .addReg(DestReg, RegState::Define | RegState::Dead)
+          .addReg(AVLReg,
+                  (IsSplitRA && MRI->def_empty(AVLReg)) ? RegState::Undef : 0)
+          .addImm(Info.encodeVTYPE());
+  fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
 }
 
 static bool isLMUL1OrSmaller(RISCVII::VLMUL LMUL) {
@@ -1007,7 +1142,7 @@ bool RISCVInsertVSETVLI::needVSETVLI(const MachineInstr &MI,
   // VSETVLI here.
   if (Require.hasAVLReg() && Require.getAVLReg().isVirtual() &&
       CurInfo.hasCompatibleVTYPE(Used, Require)) {
-    if (MachineInstr *DefMI = MRI->getVRegDef(Require.getAVLReg())) {
+    if (MachineInstr *DefMI = getReachingDefMI(Require.getAVLReg(), &MI, MRI)) {
       if (isVectorConfigInstr(*DefMI)) {
         VSETVLIInfo DefInfo = getInfoForVSETVLI(*DefMI);
         if (DefInfo.hasSameAVL(CurInfo) && DefInfo.hasSameVLMAX(CurInfo))
@@ -1062,7 +1197,7 @@ void RISCVInsertVSETVLI::transferBefore(VSETVLIInfo &Info,
   // without being sure we can kill the original source reg entirely.
   if (!Info.hasAVLReg() || !Info.getAVLReg().isVirtual())
     return;
-  MachineInstr *DefMI = MRI->getVRegDef(Info.getAVLReg());
+  const MachineInstr *DefMI = getReachingDefMI(Info.getAVLReg(), &MI, MRI);
   if (!DefMI || !isVectorConfigInstr(*DefMI))
     return;
 
@@ -1187,7 +1322,7 @@ bool RISCVInsertVSETVLI::needVSETVLIPHI(const VSETVLIInfo &Require,
     return true;
 
   // We need the AVL to be produce by a PHI node in this basic block.
-  MachineInstr *PHI = MRI->getVRegDef(AVLReg);
+  MachineInstr *PHI = MRI->getUniqueVRegDef(AVLReg);
   if (!PHI || PHI->getOpcode() != RISCV::PHI || PHI->getParent() != &MBB)
     return true;
 
@@ -1202,7 +1337,7 @@ bool RISCVInsertVSETVLI::needVSETVLIPHI(const VSETVLIInfo &Require,
       return true;
 
     // We need the PHI input to the be the output of a VSET(I)VLI.
-    MachineInstr *DefMI = MRI->getVRegDef(InReg);
+    MachineInstr *DefMI = MRI->getUniqueVRegDef(InReg);
     if (!DefMI || !isVectorConfigInstr(*DefMI))
       return true;
 
@@ -1258,8 +1393,12 @@ void RISCVInsertVSETVLI::emitVSETVLIs(MachineBasicBlock &MBB) {
         MachineOperand &VLOp = MI.getOperand(getVLOpNum(MI));
         if (VLOp.isReg()) {
           // Erase the AVL operand from the instruction.
+          Register VLOpReg = VLOp.getReg();
+          bool IsVirtVLOpReg = VLOp.getReg().isVirtual();
           VLOp.setReg(RISCV::NoRegister);
           VLOp.setIsKill(false);
+          if (IsVirtVLOpReg)
+            fixupModifyVRegLI(VLOpReg, LIS);
         }
         MI.addOperand(MachineOperand::CreateReg(RISCV::VL, /*isDef*/ false,
                                                 /*isImp*/ true));
@@ -1509,8 +1648,23 @@ void RISCVInsertVSETVLI::doLocalPostpass(MachineBasicBlock &MBB) {
     Used = getDemanded(MI, MRI, ST);
   }
 
-  for (auto *MI : ToDelete)
+  std::vector<Register> NeedFixup;
+
+  for (auto *MI : ToDelete) {
+    for (auto &MO : MI->operands()) {
+      if (!MO.isReg() || MO.getReg() == 0 || !MO.getReg().isVirtual())
+        continue;
+      Register Reg = MO.getReg();
+      NeedFixup.push_back(Reg);
+    }
     MI->eraseFromParent();
+    if (IsSplitRA)
+      LIS->RemoveMachineInstrFromMaps(*MI);
+  }
+
+  for (auto Reg : NeedFixup) {
+    fixupModifyVRegLI(Reg, LIS);
+  }
 }
 
 void RISCVInsertVSETVLI::insertReadVL(MachineBasicBlock &MBB) {
@@ -1518,11 +1672,16 @@ void RISCVInsertVSETVLI::insertReadVL(MachineBasicBlock &MBB) {
     MachineInstr &MI = *I++;
     if (RISCV::isFaultFirstLoad(MI)) {
       Register VLOutput = MI.getOperand(1).getReg();
-      if (!MRI->use_nodbg_empty(VLOutput))
-        BuildMI(MBB, I, MI.getDebugLoc(), TII->get(RISCV::PseudoReadVL),
-                VLOutput);
+      bool IsVirtual = MI.getOperand(1).getReg().isVirtual();
+      if (!MRI->use_nodbg_empty(VLOutput)) {
+        auto NeedFixupMI = BuildMI(MBB, I, MI.getDebugLoc(),
+                                   TII->get(RISCV::PseudoReadVL), VLOutput);
+        fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
+      }
       // We don't use the vl output of the VLEFF/VLSEGFF anymore.
       MI.getOperand(1).setReg(RISCV::X0);
+      if (IsVirtual)
+        fixupModifyVRegLI(VLOutput, LIS);
     }
   }
 }
@@ -1538,6 +1697,11 @@ bool RISCVInsertVSETVLI::runOnMachineFunction(MachineFunction &MF) {
   TII = ST->getInstrInfo();
   MRI = &MF.getRegInfo();
 
+  if (IsSplitRA)
+    LIS = &getAnalysis<LiveIntervals>();
+  else
+    LIS = nullptr;
+
   assert(BlockInfo.empty() && "Expect empty block infos");
   BlockInfo.resize(MF.getNumBlockIDs());
 
@@ -1604,6 +1768,6 @@ bool RISCVInsertVSETVLI::runOnMachineFunction(MachineFunction &MF) {
 }
 
 /// Returns an instance of the Insert VSETVLI pass.
-FunctionPass *llvm::createRISCVInsertVSETVLIPass() {
-  return new RISCVInsertVSETVLI();
+FunctionPass *llvm::createRISCVInsertVSETVLIPass(bool IsSplitRA) {
+  return new RISCVInsertVSETVLI(IsSplitRA);
 }
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 953ac097b915044..a3d5def678f745d 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -27,6 +27,7 @@
 #include "llvm/CodeGen/MIRParser/MIParser.h"
 #include "llvm/CodeGen/MIRYamlMapping.h"
 #include "llvm/CodeGen/Passes.h"
+#include "llvm/CodeGen/RegAllocRegistry.h"
 #include "llvm/CodeGen/TargetLoweringObjectFileImpl.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
 #include "llvm/InitializePasses.h"
@@ -83,6 +84,10 @@ static cl::opt<bool>
                    cl::desc("Enable sinking and folding of instruction copies"),
                    cl::init(false), cl::Hidden);
 
+static cl::opt<bool> EnableSplitRA("riscv-split-RA", cl::Hidden,
+                              ...
[truncated]

@kito-cheng kito-cheng requested a review from rofirrim October 28, 2023 12:36
@BeMg BeMg marked this pull request as ready for review October 31, 2023 06:04
@rofirrim
Collaborator

Oh, very interesting! Thanks for working on this. I didn't know we could do partial register allocation (RA)!

I have some high-level questions:

  1. unblock scheduler's constraints by removing vl/vtype def-use chain

Just for me to understand, this would not require RA of RVV registers by itself. We would still need to run the InsertVSETVL pass late, like you're doing now. Is that correct?

However, splitting the passes unlocks the following two cases:

  1. Support RVV re-materialization

I understand that here you will be able to do that because you have only RA'd RVV registers, and the GPR virtual registers that we use as carriers of the VL value make this a bit easier. Is my intuition correct here?

  1. Support partial spill

Similar as above, I imagine you plan to emit a PseudoVSPILL_VL / PseudoVRELOAD_VL instruction or something like that, right?

Did you consider partial copies as well?

I *think* some of those might be hindered by the tail-undisturbed policy, which may still require spilling/copying the whole register. Do you think it does, or would it already be handled after RA of RVV registers?

Looking forward to progress here; this looks promising!

llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (resolved review thread)

static llvm::once_flag InitializeDefaultRVVRegisterAllocatorFlag;

/// -riscv-splitRA-rvv-regalloc=... command line option.
Contributor

Is riscv-splitRA-rvv-regalloc redundant? Why not riscv-rvv-regalloc? It looks like that is what AMDGPU does with sgpr-regalloc and vgpr-regalloc.

Contributor Author

Done

@@ -427,7 +535,8 @@ void RISCVPassConfig::addPreRegAlloc() {
addPass(createRISCVPreRAExpandPseudoPass());
if (TM->getOptLevel() != CodeGenOptLevel::None)
addPass(createRISCVMergeBaseOffsetOptPass());
addPass(createRISCVInsertVSETVLIPass());
if (!EnableSplitRA)
Contributor

IIUC, there is nothing stopping us from being able to do split register allocation with the vsetvli insertion pass remaining where it was.

I think we should have an option for enabling split register allocation, and another option for enabling late vsetvli insertion. That way we can have the following functionality:

  1. Split register allocation && early vsetvli insertion
  2. Split register allocation && late vsetvli insertion
  3. Non-split register allocation && early vsetvli insertion
  4. Non-split register allocation && late vsetvli insertion

This flexibility may allow us to pinpoint exactly where the shortcomings are: do they come from split regalloc, late vsetvli insertion, or the combination of both?

@BeMg
Contributor Author

BeMg commented Nov 2, 2023

Oh, very interesting! Thanks for working on this. I didn't know we could do partial register allocation (RA)!

I have some high-level questions:

  1. unblock scheduler's constraints by removing vl/vtype def-use chain

Just for me to understand, this would not require RA of RVV registers by itself. We would still need to run the InsertVSETVL pass late, like you're doing now. Is that correct?

Yes. For the scheduler, we actually don't need to change the register allocator.

However, splitting the passes unlocks the following two cases:

  1. Support RVV re-materialization

I understand that here you will be able to do that because you have only RA'd RVV registers, and the GPR virtual registers that we use as carriers of the VL value make this a bit easier. Is my intuition correct here?

For now I only consider the trivial remat cases. Removing the implicit vl/vtype uses not only unblocks the scheduler but also makes default remat (isReallyTriviallyReMaterializable) possible for RVV instructions (for example, vid and vmv).

bool TargetInstrInfo::isReallyTriviallyReMaterializable(

The more complicated cases (those involving a def-use chain) could be future work, either by defining a RISC-V-specific remat hook or by adding new pseudos that fit the default remat.

  1. Support partial spill

Similar as above, I imagine you plan to emit a PseudoVSPILL_VL / PseudoVRELOAD_VL instruction or something like that, right?

Yes. We plan to attach concrete size information to each MachineInstr. With this information, we can update RISCVInstrInfo::storeRegToStackSlot and RISCVInstrInfo::loadRegFromStackSlot to emit pseudos based on the concrete size.

Did you consider partial copies as well?

Not yet, but it would work in the same flow.

I *think* some of those might be hindered by the tail-undisturbed policy, which may still require spilling/copying the whole register. Do you think it does, or would it already be handled after RA of RVV registers?

This is why we need the concrete size information: to make sure a partial spill/reload doesn't break correctness.

Looking forward to progress here; this looks promising!

Thanks for the comments.

@BeMg BeMg requested a review from michaelmaitland November 5, 2023 05:22
Collaborator

@preames preames left a comment

As discussed at the dev meeting, I think this is the right overall direction.

In terms of staging, I think it makes sense to first split regalloc in its own review, and then add the vsetvli variant between them.

I'm concerned by the regressions you mentioned. Have you investigated them? In particular, are there any that are caused by splitting RA on its own? If they're "only" vsetvli regressions, those are likely less scary and a bit easier to fix.

MI && MI->getOpcode() == RISCV::ADDI &&
MI->getOperand(1).isReg() && MI->getOperand(2).isImm() &&
if (MachineInstr *MI = MRI.getUniqueVRegDef(getAVLReg());
MI && MI->getOpcode() == RISCV::ADDI && MI->getOperand(1).isReg() &&
Collaborator

This looks to be a stray whitespace change. Remove it.


public:
static char ID;

RISCVInsertVSETVLI() : MachineFunctionPass(ID) {
RISCVInsertVSETVLI(bool IsSplitRA = false)
Collaborator

In terms of this flag, maybe a better name would be isPostRVVRegAlloc?

@BeMg
Contributor Author

BeMg commented Nov 8, 2023

Thanks for reviewing.

As discussed at the dev meeting, I think this is the right overall direction.

In terms of staging, I think it makes sense to first split regalloc in it's own review, and then add the vsetvli variant between them.

Sure.

I'm concerned by the regressions you mentioned. Have you investigated them? In particular, are there any that are cause by splitting RA on it's own? If they're "only" vsetvli regressions, those are likely less scary and a bit easier to fix.

I pushed some commits to show how many llvm regression tests are affected when we enable splitting RA only, splitting RA with the new vsetvl insertion pass, and the new vsetvl insertion pass only. In summary, splitting RA itself doesn't change much, but the new vsetvl insertion pass changes a lot, even on its own.

if (MRI->isSSA())
return MRI->getVRegDef(VReg);

return findReachingDef(VReg, MI, MRI);
Collaborator

Can we use VNInfo, LiveQueryResult, getInstructionIndex, etc. to get the definition?

Contributor Author

Now it uses LiveIntervals/VNInfo to get the definition.
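For reference, a reaching-definition lookup based on LiveIntervals/VNInfo looks roughly like the sketch below; this is an illustration of the approach, not the exact code that landed in the pass, and the function and variable names are placeholders.

// Hedged sketch: use LiveIntervals to find the single defining instruction of
// Reg that reaches UseMI once the MIR is no longer in SSA form.
static MachineInstr *getReachingDefMI(Register Reg, const MachineInstr &UseMI,
                                      const MachineRegisterInfo *MRI,
                                      const LiveIntervals *LIS) {
  // In SSA form there is exactly one definition, so the cheap query suffices.
  if (MRI->isSSA() || !LIS)
    return MRI->getVRegDef(Reg);

  // After regalloc, ask LiveIntervals which value number is live right before
  // the use and map it back to its defining instruction.
  const LiveInterval &LI = LIS->getInterval(Reg);
  SlotIndex UseIdx = LIS->getInstructionIndex(UseMI);
  const VNInfo *VNI = LI.getVNInfoBefore(UseIdx);
  if (!VNI || VNI->isPHIDef())
    return nullptr; // No single non-PHI reaching definition.
  return LIS->getInstructionFromIndex(VNI->def);
}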

@lukel97
Contributor

lukel97 commented Nov 13, 2023

I tried out this patch and I agree that the register allocation split should be done in a separate PR, with the split just enabled by default: there were only four lit test diffs, and none of them were regressions. I also ran this on the llvm-test-suite for rv64gv and couldn't find any regressions, just some shuffling about of registers. Here's the diff:
regalloc-split.patch

@lukel97
Contributor

lukel97 commented Nov 13, 2023

One small regression I noticed with the post-ra insertvsetvli pass is this function from test/CodeGen/RISCV/rvv/fixed-vectors-mask-buildvec.ll:

define <4 x i1> @buildvec_mask_nonconst_v4i1(i1 %x, i1 %y) {
  %1 = insertelement <4 x i1> poison, i1 %x, i32 0
  %2 = insertelement <4 x i1> %1,  i1 %x, i32 1
  %3 = insertelement <4 x i1> %2,  i1 %y, i32 2
  %4 = insertelement <4 x i1> %3,  i1 %y, i32 3
  ret <4 x i1> %4
}

With pre-RA vsetvli insertion we have:

buildvec_mask_nonconst_v4i1:            # @buildvec_mask_nonconst_v4i1
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 4, e8, mf4, ta, ma
	vmv.v.i	v0, 3
	vmv.v.x	v8, a1
	vmerge.vxm	v8, v8, a0, v0
	vand.vi	v8, v8, 1
	vmsne.vi	v0, v8, 0
	ret

But post-RA insertion results in an extra vsetvli:

buildvec_mask_nonconst_v4i1:            # @buildvec_mask_nonconst_v4i1
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 1, e8, mf8, ta, ma
	vmv.v.i	v0, 3
	vsetivli	zero, 4, e8, mf4, ta, ma
	vmv.v.x	v8, a1
	vmerge.vxm	v8, v8, a0, v0
	vand.vi	v8, v8, 1
	vmsne.vi	v0, v8, 0
	ret

From what I can tell this is due to how the vmv.v.i gets scheduled before the vmv.v.x, e.g. the BB goes from beginning like this:

bb.0 (%ir-block.0):
  liveins: $x10, $x11
  %1:gpr = COPY $x11
  %0:gpr = COPY $x10
  %2:vr = PseudoVMV_V_X_MF4 $noreg(tied-def 0), %1:gpr, 4, 3, 0
  %3:vr = PseudoVMV_V_I_MF8 $noreg(tied-def 0), 3, 1, 3, 0

to this:

0B	bb.0 (%ir-block.0):
	  liveins: $x10, $x11
16B	  %1:gpr = COPY $x11
32B	  %0:gpr = COPY $x10
64B	  renamable $v0 = PseudoVMV_V_I_MF8 undef renamable $v0(tied-def 0), 3, 1, 3, 0
80B	  renamable $v8 = PseudoVMV_V_X_MF4 undef renamable $v8(tied-def 0), %1:gpr, 4, 3, 0

We end up with the extra vsetvli because we have an optimisation where we avoid inserting a vsetvli if we know we have a vmv.v.i with VL=1 in needVSETVLI, which is in turn called after the first instruction in transferBefore when going from the vmv.v.x -> vmv.v.i. But because the vmv.v.i instruction is now first, we don't do the check going between the vmv.v.i -> vmv.v.x.

We could probably recover this case by teaching the backwards local postpass about this.

@BeMg
Contributor Author

BeMg commented Nov 13, 2023

I have opened a separate patch for splitting regalloc only: #72096

@BeMg BeMg changed the title from "[RISCV] RISC-V split register allocation and move vsetvl pass in between" to "[RISCV] Support postRA vsetvl insertion pass" Nov 13, 2023
@lukel97
Contributor

lukel97 commented Nov 13, 2023

With -riscv-vsetvli-after-rvv-regalloc=true, we seem to lose stack slot coloring, e.g. with

declare <vscale x 8 x i64> @llvm.bitreverse.nxv8i64(<vscale x 8 x i64> %va)
define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
  %a = call <vscale x 8 x i64> @llvm.bitreverse.nxv8i64(<vscale x 8 x i64> %va)
  ret <vscale x 8 x i64> %a
}

We no longer mark one of the slots as dead:

diff --git a/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll b/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll
index 3514fa66f588..47f097087601 100644
--- a/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll
@@ -1134,9 +1134,9 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
 ; RV32-NEXT:    addi sp, sp, -16
 ; RV32-NEXT:    .cfi_def_cfa_offset 16
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    slli a0, a0, 4
 ; RV32-NEXT:    sub sp, sp, a0
-; RV32-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 8 * vlenb
+; RV32-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
 ; RV32-NEXT:    sw zero, 12(sp)
 ; RV32-NEXT:    lui a0, 1044480
 ; RV32-NEXT:    sw a0, 8(sp)
@@ -1162,6 +1162,10 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
 ; RV32-NEXT:    addi a4, sp, 16
 ; RV32-NEXT:    vl8r.v v0, (a4) # Unknown-size Folded Reload
 ; RV32-NEXT:    vor.vv v24, v24, v0
+; RV32-NEXT:    csrr a4, vlenb
+; RV32-NEXT:    slli a4, a4, 3
+; RV32-NEXT:    add a4, sp, a4
+; RV32-NEXT:    addi a4, a4, 16
 ; RV32-NEXT:    vs8r.v v24, (a4) # Unknown-size Folded Spill
 ; RV32-NEXT:    vand.vx v0, v8, a2
 ; RV32-NEXT:    vsll.vx v0, v0, a1
@@ -1173,7 +1177,10 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
 ; RV32-NEXT:    vsll.vi v16, v16, 8
 ; RV32-NEXT:    vor.vv v8, v8, v16
 ; RV32-NEXT:    vor.vv v8, v24, v8
-; RV32-NEXT:    addi a0, sp, 16
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 16
 ; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
 ; RV32-NEXT:    vor.vv v8, v8, v16
 ; RV32-NEXT:    vsrl.vi v16, v8, 4
@@ -1207,7 +1214,7 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
 ; RV32-NEXT:    vadd.vv v8, v8, v8
 ; RV32-NEXT:    vor.vv v8, v16, v8
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    slli a0, a0, 4
 ; RV32-NEXT:    add sp, sp, a0
 ; RV32-NEXT:    addi sp, sp, 16
 ; RV32-NEXT:    ret

It seems like it's because LiveStacks is now empty when we get to the stack slot colouring pass. Do we need to mark it as preserved?
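If preserving it turns out to be the fix, it would presumably be a small addition to the pass's getAnalysisUsage. A hedged sketch is below; it is not necessarily what the final patch does.

// Hedged sketch; requires #include "llvm/CodeGen/LiveStacks.h".
void RISCVInsertVSETVLI::getAnalysisUsage(AnalysisUsage &AU) const {
  AU.setPreservesCFG();
  AU.addPreserved<LiveIntervals>();
  AU.addPreserved<SlotIndexes>();
  AU.addPreserved<LiveDebugVariables>();
  // Keep the stack-slot liveness computed during RVV regalloc alive so the
  // later StackSlotColoring pass still sees it.
  AU.addPreserved<LiveStacks>();
  MachineFunctionPass::getAnalysisUsage(AU);
}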

@preames
Collaborator

preames commented Nov 16, 2023

Now that #72096 has landed, the next step is a change which enables the split RA by default. Once that's in, this change can be rebased on tip of tree, and we can focus review attention on the changes to InsertVSETVLI and the resulting test changes.

@BeMg BeMg requested review from lukel97 and topperc November 17, 2023 10:30
@BeMg BeMg force-pushed the riscv-split-regalloc-and-vsetvlpass-in-between branch from e84993b to c92c6af on May 15, 2024 05:17
@BeMg
Contributor Author

BeMg commented May 15, 2024

Rebase with main

@lukel97
Contributor

lukel97 commented May 15, 2024

I ran this on SPEC CPU 2017 (-O3 -march=rv64gv) and collected the dynamic instruction count (total insns), the dynamic vsetvl instruction count (vset), the dynamic count of vsetvls that are not the x0,x0 form (vl set), and the number of spills inserted.

The changes are all within ±1%, with the exception of 619.lbm_s/519.lbm_r, where 40% more vsetvls are executed but there is a 23% improvement in dynamic instruction count. The improvement seems to be due to us spilling fewer vectors in a hot loop.

On leela, there's 0.9% more vsetvls executed but 0.9% more non-x0,x0s. So it's likely that the new vsetvls aren't VL preserving.
On x264, there are more non-x0,x0 vsetvls executed, but fewer vsetvls executed overall.

So overall I think this looks good.

('Program', '') ('vset', 'prera') ('vset', 'postra') ('vset', 'diff') ('total insns', 'prera') ('total insns', 'postra') ('total insns', 'diff') ('regalloc.NumSpills', 'prera') ('regalloc.NumSpills', 'postra') ('regalloc.NumSpills', 'diff') ('vl set', 'prera') ('vl set', 'postra') ('vl set', 'diff')
test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test 3.93984e+09 5.49984e+09 0.395955 9.34441e+11 7.23061e+11 -0.22621 208 179 -0.139423 1662 1666 0.00240674
test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test 4.95512e+08 6.90512e+08 0.393532 1.17465e+11 9.08479e+10 -0.226599 208 180 -0.134615 1142 1146 0.00350263
test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test 4.22353e+08 4.26072e+08 0.00880622 4.674e+11 4.67404e+11 7.95723e-06 358 358 0 3.91329e+08 3.94923e+08 0.00918232
test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test 4.22353e+08 4.26072e+08 0.00880622 4.674e+11 4.67404e+11 7.95723e-06 358 358 0 3.91329e+08 3.94923e+08 0.00918232
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 1.88329e+09 1.88349e+09 0.000103542 7.20803e+11 7.20803e+11 3.96615e-07 13887 13888 7.20098e-05 1.4395e+09 1.43956e+09 4.23063e-05
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 1.73542e+07 1.73559e+07 9.89384e-05 1.08127e+10 1.08127e+10 1.69088e-06 13720 13728 0.00058309 1.69636e+07 1.69654e+07 0.000108586
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 1.73542e+07 1.73559e+07 9.89384e-05 1.08124e+10 1.08124e+10 1.69092e-06 13720 13728 0.00058309 1.69636e+07 1.69654e+07 0.000108586
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 6.54343e+07 6.54382e+07 5.89141e-05 3.15461e+10 3.15416e+10 -0.000142572 1916 1902 -0.00730689 5.5646e+07 5.56494e+07 6.2089e-05
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1.7041e+09 1.70412e+09 1.18502e-05 1.50246e+11 1.50246e+11 -4.98108e-07 44229 44232 6.78288e-05 1.69819e+09 1.69819e+09 6.17128e-07
test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test 2.26807e+08 2.26807e+08 7.93626e-07 1.41909e+11 1.4195e+11 0.000287282 694 698 0.00576369 2.26807e+08 2.26807e+08 7.93626e-07
test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test 2.26807e+08 2.26807e+08 7.93626e-07 1.41903e+11 1.41943e+11 0.000287296 694 698 0.00576369 2.26807e+08 2.26807e+08 7.93626e-07
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1.86518e+10 1.86518e+10 1.0814e-07 2.47888e+11 2.47888e+11 1.17916e-08 4634 4670 0.00776867 2.62581e+07 2.62609e+07 0.000105491
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1.86518e+10 1.86518e+10 1.0814e-07 2.47888e+11 2.47888e+11 1.17916e-08 4634 4670 0.00776867 2.62581e+07 2.62609e+07 0.000105491
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 218721 218721 0 1.79594e+09 1.79594e+09 0 4291 4291 0 218718 218718 0
test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test 6.19906e+07 6.19906e+07 0 4.72968e+10 4.72968e+10 0 301 301 0 5.57433e+07 5.57433e+07 0
test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test 2.36169e+07 2.36169e+07 0 1.53744e+11 1.53744e+11 0 123 123 0 2.36169e+07 2.36169e+07 0
test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test 6.19906e+07 6.19906e+07 0 4.72968e+10 4.72968e+10 0 301 301 0 5.57433e+07 5.57433e+07 0
test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test 2.23429e+09 2.23429e+09 0 4.08002e+11 4.08002e+11 0 749 749 0 2.23429e+09 2.23429e+09 0
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 218721 218721 0 1.79594e+09 1.79594e+09 0 4291 4291 0 218718 218718 0
test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 2.23429e+09 2.23429e+09 0 4.08002e+11 4.08002e+11 0 749 749 0 2.23429e+09 2.23429e+09 0
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 9.48185e+07 9.48185e+07 0 2.1198e+11 2.11983e+11 1.13134e-05 6743 6744 0.000148302 9.46089e+07 9.46089e+07 0
test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test 2.36169e+07 2.36169e+07 0 1.53744e+11 1.53744e+11 0 123 123 0 2.36169e+07 2.36169e+07 0
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 4.64859e+08 4.64859e+08 -4.30238e-09 2.89851e+11 2.89851e+11 2.99464e-09 1604 1604 0 4.51635e+08 4.51635e+08 -4.42836e-09
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 4.64859e+08 4.64859e+08 -4.30238e-09 2.89851e+11 2.89851e+11 3.40174e-09 1604 1604 0 4.51635e+08 4.51635e+08 -4.42836e-09
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 6.96396e+08 6.96396e+08 -1.37853e-07 4.42779e+11 4.42818e+11 8.90763e-05 359 358 -0.00278552 6.37384e+08 6.37384e+08 0
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test 6.60366e+08 6.60366e+08 -1.45374e-07 4.10743e+11 4.10779e+11 8.81044e-05 359 358 -0.00278552 6.04725e+08 6.04725e+08 0
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 5.04246e+09 5.02065e+09 -0.00432564 2.19801e+11 2.19701e+11 -0.000456991 1894 1894 0 1.16667e+09 1.1786e+09 0.0102294
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 5.04246e+09 5.02065e+09 -0.00432564 2.19801e+11 2.19701e+11 -0.000456991 1894 1894 0 1.16667e+09 1.1786e+09 0.0102294
Geomean difference 0.0243876 -0.0181787 -0.0099226 0.00161097

Contributor

@lukel97 lukel97 left a comment

LGTM. I didn't look at all the test diffs, but in the ones I did I didn't spot anything interesting in them.

Some spills are removed, some new ones are added, but after looking at the benchmark results I don't think they're an issue. The same goes for the fluctuation in copies.

This is a big change though so probably best to wait for a second LGTM too!

llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (resolved review thread)
llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (resolved review thread)
Collaborator

@preames preames left a comment

Did a sweep through the test diffs - mostly ignored vp tests, and don't claim to have looked at everything. We could do another round of diff reduction here, but I am not currently asking for it. I'm going to apply this downstream and look at a couple workloads, another comment will follow once I've done so.

I will note that my comments flag at least one real regression caused/triggered by this patch. I'm not yet sure that regression matters but it does appear to be real.

; CHECK-NEXT: vmv1r.v v0, v24
; CHECK-NEXT: csrr a0, vlenb
Collaborator

This delta is interesting and surprising - in a good way.

I hadn't considered that this change would allow reuse of the vlenb CSR read.

Though there's also a missed opportunity here, both before and after. a0 is vlenb. With etype=e64, and emul-m8, that means the VL here is VLMAX.

Contributor Author

This delta is interesting and surprising - in a good way.

I hadn't considered that this change would allow reuse of the vlenb CSR read.

I think it is a coincidence that PseudoReadVLENB chooses $x11 as the destination register ($x10 is still used by vsetvl) during Prologue/Epilogue Insertion & Frame Finalization, allowing the Machine Late Instructions Cleanup Pass to reuse definitions from predecessors.

; CHECK-NEXT: lui a1, %hi(.LCPI10_0)
; CHECK-NEXT: flh fa5, %lo(.LCPI10_0)(a1)
; CHECK-NEXT: vmv1r.v v16, v0
Collaborator

This is another (unrelated) missed opportunity. We're moving v0 into another register, but nothing between this and restore appears to actually write to v0. Do we have an imprecise clobber specification on some instruction?

Contributor Author

During instruction selection, the compiler emits $v0 = COPY %1:vr for each masked RVV instruction. These COPYs are eliminated by the Machine Copy Propagation pass, but $v16 = COPY $v0 can't be eliminated due to PseudoVMFLT_VFPR16_M8_MASK's early-clobber flag.

Could PseudoVMFLT_VFPR16_M8_MASK accept $v0 as the destination register, or is there another chance to remove that COPY?

; After Instruction Selection
bb.0 (%ir-block.0):
  liveins: $v8m8, $v0, $x10
  %2:gprnox0 = COPY $x10
  %1:vr = COPY $v0
  %0:vrm8 = COPY $v8m8
  %3:vrm8 = COPY %0:vrm8
  $v0 = COPY %1:vr
  %4:vrm8nov0 = PseudoVFSGNJX_VV_M8_E16_MASK $noreg(tied-def 0), %3:vrm8, %3:vrm8, $v0, %2:gprnox0, 4, 3
  %5:gpr = LUI target-flags(riscv-hi) %const.0
  %6:fpr16 = FLH killed %5:gpr, target-flags(riscv-lo) %const.0 :: (load (s16) from constant-pool)
  $v0 = COPY %1:vr
  early-clobber %7:vr = nofpexcept PseudoVMFLT_VFPR16_M8_MASK %1:vr(tied-def 0), killed %4:vrm8nov0, killed %6:fpr16, $v0, %2:gprnox0, 4
...
; After Machine Copy Propagation Pass
bb.0 (%ir-block.0):
  liveins: $v0, $x10, $v8m8
  renamable $x11 = LUI target-flags(riscv-hi) %const.0
  renamable $f15_h = FLH killed renamable $x11, target-flags(riscv-lo) %const.0 :: (load (s16) from constant-pool)
  renamable $v16 = COPY $v0
  dead $x0 = PseudoVSETVLI killed renamable $x10, 203, implicit-def $vl, implicit-def $vtype
  renamable $v24m8 = PseudoVFSGNJX_VV_M8_E16_MASK undef renamable $v24m8(tied-def 0), renamable $v8m8, renamable $v8m8, $v0, $noreg, 4, 3, implicit $vl, implicit $vtype
  dead $x0 = PseudoVSETVLIX0 killed $x0, 75, implicit-def $vl, implicit-def $vtype, implicit $vl
  early-clobber renamable $v16 = nofpexcept PseudoVMFLT_VFPR16_M8_MASK killed renamable $v16(tied-def 0), killed renamable $v24m8, killed renamable $f15_h, $v0, $noreg, 4, implicit $vl, implicit $vtype
...

; RV32-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
; RV32-NEXT: vslideup.vi v10, v9, 4
; RV32-NEXT: vslideup.vi v9, v10, 4
Collaborator

Another (unrelated) opportunity. We're inserting a subvec into a zero vector just to compute a sum reduction where the zero elements are nops. We could have computed the sum on the sub-vec directly.

@@ -197,28 +197,51 @@ entry:
define void @test_compresstore_v256i8(ptr %p, <256 x i1> %mask, <256 x i8> %data) {
; RV64-LABEL: test_compresstore_v256i8:
; RV64: # %bb.0: # %entry
; RV64-NEXT: vmv1r.v v7, v8
Collaborator

This change is a bit concerning: we have no actual reordering happening here, but we chose a strictly less optimal register allocation solution. I don't see any obvious reason for this one; does anyone else?

Contributor Author

Looking at the register allocation process, the main reason for the less optimal regalloc is the register allocator making different eviction decisions. These decisions differ because of different LiveInterval spill weights (the interval with the smaller weight is evicted). One factor in the weight calculation is the LiveInterval's size, and that changed between the pre-RA and post-RA vsetvl flows.


In this test case, %12 is the interval doing the evicting:

selectOrSplit VRM8:%12 [240e,288r:0) 0@240e  weight:4.454343e-03 w=4.454343e-03

The pre-RA flow chooses to evict %2:

VR:%2 [48r,336r:0) 0@48r  weight:4.404070e-03 w=4.404070e-03
VRM8:%3 [32r,240r:0) 0@32r  weight:3.322368e-03 w=3.322368e-03

...

unassigning %2 from $v8: V8
assigning %12 to $v8m8: V8 [240e,288r:0) 0@240e V9 [240e,288r:0) 0@240e V10 [240e,288r:0) 0@240e V11 [240e,288r:0) 0@240e V12 [240e,288r:0) 0@240e V13 [240e,288r:0) 0@240e V14 [240e,288r:0) 0@240e V15 [240e,288r:0) 0@240e

The post-RA flow chooses to evict %3, because %2's spill weight is heavier than %12's:

VR:%2 [48r,256r:0) 0@48r  weight:4.983553e-03 w=4.983553e-03
VRM8:%3 [32r,192r:0) 0@32r  weight:3.607143e-03 w=3.607143e-03

...

unassigning %3 from $v16m8: V16 V17 V18 V19 V20 V21 V22 V23
assigning %12 to $v16m8: V16 [192e,224r:0) 0@192e V17 [192e,224r:0) 0@192e V18 [192e,224r:0) 0@192e V19 [192e,224r:0) 0@192e V20 [192e,224r:0) 0@192e V21 [192e,224r:0) 

The second-round assignment of %3 then causes the spill code insertion.
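For context on the size factor mentioned above, the default normalization in CalcSpillWeights divides the accumulated use/def frequency by the interval's size, roughly as in the sketch below; the exact constant and signature in LLVM may differ, so treat this as an illustration only.

// Hedged sketch of the default spill-weight normalization: a longer live
// interval with the same uses gets a smaller weight and therefore becomes the
// preferred eviction victim; the added slack keeps very short intervals from
// getting disproportionately large weights.
static float normalizeSpillWeight(float UseDefFreq, unsigned Size,
                                  unsigned /*NumInstr*/) {
  return UseDefFreq / (Size + 25 * SlotIndex::InstrDist);
}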

@@ -35,10 +35,10 @@ define <512 x i8> @single_source(<512 x i8> %a) {
; CHECK-NEXT: vslidedown.vi v16, v16, 4
; CHECK-NEXT: li a0, 466
; CHECK-NEXT: li a1, 465
; CHECK-NEXT: lbu a2, 1012(sp)
Collaborator

This is a slight regression; it's the same push-back heuristic I raised previously. Non-blocking, but this would be good to address.

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-mask-buildvec.ll (resolved review thread)
; CHECK-NEXT: lui a0, 11
; CHECK-NEXT: addi a0, a0, -1366
; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
Collaborator

Again, real (if minor) regression.

Contributor

@lukel97 lukel97 May 17, 2024

It looks like we're actually removing a vsetvli overall here. But we go from 1 VL toggle to 2.

Although we could go a step further and teach the doLocalPostpass to change the AVL and LMUL in this case to avoid a toggle:

vsetivli zero, 16, e16, m2, ta, ma
vmv.s.x v0, a0
vsetvli zero, zero, e8, m1, ta, ma
vslidedown.vi v8, v8, 1

This isn't something that it was able to do before, since doLocalPostpass will only mutate vsetvlis if it can remove a vsetvli entirely.

@BeMg BeMg force-pushed the riscv-split-regalloc-and-vsetvlpass-in-between branch from 601d59c to 757d1d1 on May 17, 2024 07:12
@preames
Collaborator

preames commented May 20, 2024

Following up to my prior comment, thank you @BeMg and @lukel97 for the analysis on each of the cases I flagged. For my purposes, the understanding demonstrated on the one remaining regression is sufficient. Knowing that it's an artifact of scheduling combined with an existing missed optimization gives me confidence we'll be able to address this.

Given that, LGTM!

I want to compliment both of you for working through a long and challenging process of splitting this up into reviewable pieces, landing and rebasing many, many times, and driving this to conclusion. Thank you!

@BeMg BeMg force-pushed the riscv-split-regalloc-and-vsetvlpass-in-between branch from 757d1d1 to 918f004 on May 21, 2024 03:07
lukel97 added a commit to lukel97/llvm-project that referenced this pull request May 21, 2024
…CVInsertVSETVLI

We have two rules in needVSETVLI where we can relax the demanded fields for slides and splats when VL=1.

However these aren't present in getDemanded which prevents us from coalescing some vsetvlis around slides and splats in the backwards pass.

The reasoning as to why they weren't in getDemanded is that these require us to check the value of the AVL operand, which may be stale in the backwards pass: the actual VL or VTYPE value may differ from what was precisely requested in the pseudo's operands.

Using the original operands should actually be fine though, as we only care about what was originally demanded by the instruction. The current value of VL or VTYPE shouldn't influence this.

This addresses some of the regressions we are seeing in llvm#70549 from splats and slides getting reordered.
lukel97 added a commit that referenced this pull request May 21, 2024
…CVInsertVSETVLI (#92860)

We have two rules in needVSETVLI where we can relax the demanded fields
for slides and splats when VL=1.

However these aren't present in getDemanded which prevents us from
coalescing some vsetvlis around slides and splats in the backwards pass.

The reasoning as to why they weren't in getDemanded is that these
require us to check the value of the AVL operand, which may be stale in
the backwards pass: the actual VL or VTYPE value may differ from what
was precisely requested in the pseudo's operands.

Using the original operands should actually be fine though, as we only
care about what was originally demanded by the instruction. The current
value of VL or VTYPE shouldn't influence this.

This addresses some of the regressions we are seeing in #70549 from
splats and slides getting reordered.
@BeMg BeMg force-pushed the riscv-split-regalloc-and-vsetvlpass-in-between branch from 918f004 to f2a60bd on May 21, 2024 06:17
@BeMg BeMg merged commit 675e7bd into llvm:main May 21, 2024
3 of 4 checks passed
@lukel97
Contributor

lukel97 commented May 21, 2024

Nice, great work :)

Now that rematerialization and partial spilling are unblocked, were you planning to take a look at either of them?
I have half a proof-of-concept patch lying around for rematerialization that I could pick up again, but only if you haven't already started that work.

lukel97 added a commit to lukel97/llvm-project that referenced this pull request May 21, 2024
We no longer need to separate the passes now that llvm#70549 is landed and this will unblock llvm#89089.

It's not strictly NFC because it will move coalescing before register allocation when -riscv-vsetvl-after-rvv-regalloc is disabled. But this makes it closer to the original behaviour.
@BeMg
Contributor Author

BeMg commented May 21, 2024

@lukel97 Thanks for collaborating on these patches. The reviewers' comments also made this patch possible.

Nice, great work :)

Now that rematerialization and partial spilling are unblocked, were you planning to take a look at either of them? I have half a proof-of-concept patch lying around for rematerialization that I could pick up again, but only if you haven't already started that work.

Feel free to pick up on rematerialization. I haven't started on it yet. :)

I will explore whether we can do something in the scheduler to control vsetvl insertion.

@topperc
Collaborator

topperc commented May 22, 2024

Thank you @lukel97 and @BeMg for pushing this through.

@michalt
Contributor

michalt commented May 23, 2024

Yeah, this is pretty exciting, thanks for all the work everyone! 🎉

lukel97 added a commit to lukel97/llvm-project that referenced this pull request May 29, 2024
We no longer need to separate the passes now that llvm#70549 is landed and this will unblock llvm#89089.

It's not strictly NFC because it will move coalescing before register allocation when -riscv-vsetvl-after-rvv-regalloc is disabled. But this makes it closer to the original behaviour.
lukel97 added a commit that referenced this pull request May 29, 2024
We no longer need to separate the passes now that #70549 is landed and
this will unblock #89089.

It's not strictly NFC because it will move coalescing before register
allocation when -riscv-vsetvl-after-rvv-regalloc is disabled. But this
makes it closer to the original behaviour.