[RISCV] Support postRA vsetvl insertion pass #70549
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
@llvm/pr-subscribers-debuginfo @llvm/pr-subscribers-backend-risc-v

Author: Piyou Chen (BeMg)

Changes

This patch tries to get rid of the vsetvl implicit vl/vtype def-use chain and to improve register allocation quality by moving the vsetvl insertion pass after RVV register allocation.

It will gain the benefit for the following optimizations:

This patch adds a new option to control this feature. It can compile the local test suite and some related big benchmarks, but it still generates worse code in some cases (LLVM regression tests).

Note

Register allocation

VSETVL insertion pass
Patch is 31.97 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/70549.diff

9 Files Affected:
diff --git a/llvm/lib/CodeGen/LiveDebugVariables.h b/llvm/include/llvm/CodeGen/LiveDebugVariables.h
similarity index 96%
rename from llvm/lib/CodeGen/LiveDebugVariables.h
rename to llvm/include/llvm/CodeGen/LiveDebugVariables.h
index 9998ce9e8dad861..3643a2cd0981f19 100644
--- a/llvm/lib/CodeGen/LiveDebugVariables.h
+++ b/llvm/include/llvm/CodeGen/LiveDebugVariables.h
@@ -29,7 +29,7 @@ template <typename T> class ArrayRef;
class LiveIntervals;
class VirtRegMap;
-class LLVM_LIBRARY_VISIBILITY LiveDebugVariables : public MachineFunctionPass {
+class LiveDebugVariables : public MachineFunctionPass {
void *pImpl = nullptr;
public:
diff --git a/llvm/lib/CodeGen/LiveDebugVariables.cpp b/llvm/lib/CodeGen/LiveDebugVariables.cpp
index 9603c1f01e08569..5f5a858c7bade6e 100644
--- a/llvm/lib/CodeGen/LiveDebugVariables.cpp
+++ b/llvm/lib/CodeGen/LiveDebugVariables.cpp
@@ -18,7 +18,7 @@
//
//===----------------------------------------------------------------------===//
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/IntervalMap.h"
diff --git a/llvm/lib/CodeGen/RegAllocBasic.cpp b/llvm/lib/CodeGen/RegAllocBasic.cpp
index 6661991396302ec..a4d4a218fea5584 100644
--- a/llvm/lib/CodeGen/RegAllocBasic.cpp
+++ b/llvm/lib/CodeGen/RegAllocBasic.cpp
@@ -12,7 +12,7 @@
//===----------------------------------------------------------------------===//
#include "AllocationOrder.h"
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
#include "RegAllocBase.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/CodeGen/CalcSpillWeights.h"
diff --git a/llvm/lib/CodeGen/RegAllocGreedy.cpp b/llvm/lib/CodeGen/RegAllocGreedy.cpp
index 349d8b0975f3a10..4c25fcfeba89073 100644
--- a/llvm/lib/CodeGen/RegAllocGreedy.cpp
+++ b/llvm/lib/CodeGen/RegAllocGreedy.cpp
@@ -14,7 +14,7 @@
#include "RegAllocGreedy.h"
#include "AllocationOrder.h"
#include "InterferenceCache.h"
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
#include "RegAllocBase.h"
#include "RegAllocEvictionAdvisor.h"
#include "RegAllocPriorityAdvisor.h"
diff --git a/llvm/lib/CodeGen/VirtRegMap.cpp b/llvm/lib/CodeGen/VirtRegMap.cpp
index 48f4ee29fbe95d4..a2b046a59c9d773 100644
--- a/llvm/lib/CodeGen/VirtRegMap.cpp
+++ b/llvm/lib/CodeGen/VirtRegMap.cpp
@@ -16,7 +16,7 @@
//===----------------------------------------------------------------------===//
#include "llvm/CodeGen/VirtRegMap.h"
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/CodeGen/LiveInterval.h"
diff --git a/llvm/lib/Target/RISCV/RISCV.h b/llvm/lib/Target/RISCV/RISCV.h
index 3d8e33dc716ea44..9c9906ff0bc6430 100644
--- a/llvm/lib/Target/RISCV/RISCV.h
+++ b/llvm/lib/Target/RISCV/RISCV.h
@@ -60,7 +60,7 @@ void initializeRISCVPreRAExpandPseudoPass(PassRegistry &);
FunctionPass *createRISCVExpandAtomicPseudoPass();
void initializeRISCVExpandAtomicPseudoPass(PassRegistry &);
-FunctionPass *createRISCVInsertVSETVLIPass();
+FunctionPass *createRISCVInsertVSETVLIPass(bool IsSplitRA = false);
void initializeRISCVInsertVSETVLIPass(PassRegistry &);
FunctionPass *createRISCVPostRAExpandPseudoPass();
diff --git a/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp b/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
index 9d2b2c3b3f59264..03c8d72a32dc714 100644
--- a/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
+++ b/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
@@ -26,6 +26,7 @@
#include "RISCV.h"
#include "RISCVSubtarget.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
#include "llvm/CodeGen/LiveIntervals.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include <queue>
@@ -44,6 +45,86 @@ static cl::opt<bool> UseStrictAsserts(
namespace {
+// Return true when CandidateDefMI can reach UserMI through the CFG without
+// passing another definition of Reg. Otherwise return false, i.e. when the
+// walk either:
+// 1. encounters another def of Reg first, or
+// 2. has visited every reachable basic block without finding UserMI.
+static bool isReachingDef(const MachineInstr &UserMI,
+ const MachineInstr &CandidateDefMI, Register Reg,
+ const MachineRegisterInfo *MRI) {
+ std::queue<std::pair<const MachineBasicBlock *, bool>> MBBQ;
+ llvm::DenseMap<int, bool> SeenMBB;
+
+ MBBQ.push(std::make_pair(CandidateDefMI.getParent(), false));
+ while (!MBBQ.empty()) {
+
+ const MachineBasicBlock *CurrMBB = MBBQ.front().first;
+ bool SeenDef = MBBQ.front().second;
+ MBBQ.pop();
+ SeenMBB[CurrMBB->getNumber()] = true;
+ bool NeedSkip = false;
+ for (auto &MI : *CurrMBB) {
+ // If we encounter another def of Reg after CandidateDefMI, then
+ // CandidateDefMI is not a reaching definition for UserMI along this path.
+ if (SeenDef && llvm::any_of(MRI->def_instructions(Reg),
+ [&MI](const MachineInstr &DefMI) {
+ return DefMI.isIdenticalTo(MI);
+ })) {
+ NeedSkip = true;
+ break;
+ }
+
+ if (MI.isIdenticalTo(CandidateDefMI))
+ SeenDef = true;
+
+ if (SeenDef && MI.isIdenticalTo(UserMI))
+ return true;
+ }
+
+ if (NeedSkip)
+ continue;
+
+ for (auto *Succ : CurrMBB->successors()) {
+ if (SeenMBB[Succ->getNumber()])
+ continue;
+ MBBQ.push(std::make_pair(Succ, SeenDef));
+ }
+ }
+
+ return false;
+}
+
+// Return the MachineInstr* when we find the unique reaching definition.
+// Otherwise, return nullptr.
+// FIXME: Could it detect the PHI node situation?
+static MachineInstr *findReachingDef(Register Reg, const MachineInstr *MI,
+ const MachineRegisterInfo *MRI) {
+
+ MachineInstr *ReachingDefMI = nullptr;
+ for (auto &DefMI : MRI->def_instructions(Reg)) {
+ if (isReachingDef(*MI, DefMI, Reg, MRI)) {
+ if (!ReachingDefMI)
+ ReachingDefMI = &DefMI;
+ else
+ return nullptr;
+ }
+ }
+
+ return ReachingDefMI;
+}
+
+// In SSA form we can simply use getVRegDef to obtain the reaching
+// definition. For non-SSA MIR, fall back to this pass's own reaching
+// definition search.
+static MachineInstr *getReachingDefMI(Register VReg, const MachineInstr *MI,
+ const MachineRegisterInfo *MRI) {
+ if (MRI->isSSA())
+ return MRI->getVRegDef(VReg);
+
+ return findReachingDef(VReg, MI, MRI);
+}
+
static unsigned getVLOpNum(const MachineInstr &MI) {
return RISCVII::getVLOpNum(MI.getDesc());
}
@@ -187,13 +268,17 @@ static bool hasUndefinedMergeOp(const MachineInstr &MI,
if (UseMO.getReg() == RISCV::NoRegister)
return true;
- if (MachineInstr *UseMI = MRI.getVRegDef(UseMO.getReg())) {
+ if (!MRI.isSSA() && UseMO.isUndef())
+ return true;
+
+ if (MachineInstr *UseMI = getReachingDefMI(UseMO.getReg(), &MI, &MRI)) {
if (UseMI->isImplicitDef())
return true;
if (UseMI->isRegSequence()) {
for (unsigned i = 1, e = UseMI->getNumOperands(); i < e; i += 2) {
- MachineInstr *SourceMI = MRI.getVRegDef(UseMI->getOperand(i).getReg());
+ MachineInstr *SourceMI =
+ getReachingDefMI(UseMI->getOperand(i).getReg(), UseMI, &MRI);
if (!SourceMI || !SourceMI->isImplicitDef())
return false;
}
@@ -494,9 +579,9 @@ class VSETVLIInfo {
if (hasAVLReg()) {
if (getAVLReg() == RISCV::X0)
return true;
- if (MachineInstr *MI = MRI.getVRegDef(getAVLReg());
- MI && MI->getOpcode() == RISCV::ADDI &&
- MI->getOperand(1).isReg() && MI->getOperand(2).isImm() &&
+ if (MachineInstr *MI = MRI.getUniqueVRegDef(getAVLReg());
+ MI && MI->getOpcode() == RISCV::ADDI && MI->getOperand(1).isReg() &&
+ MI->getOperand(2).isImm() &&
MI->getOperand(1).getReg() == RISCV::X0 &&
MI->getOperand(2).getImm() != 0)
return true;
@@ -727,20 +812,32 @@ class RISCVInsertVSETVLI : public MachineFunctionPass {
const RISCVSubtarget *ST;
const TargetInstrInfo *TII;
MachineRegisterInfo *MRI;
+ LiveIntervals *LIS = nullptr;
std::vector<BlockData> BlockInfo;
std::queue<const MachineBasicBlock *> WorkList;
+ bool IsSplitRA = false;
public:
static char ID;
- RISCVInsertVSETVLI() : MachineFunctionPass(ID) {
+ RISCVInsertVSETVLI(bool IsSplitRA = false)
+ : MachineFunctionPass(ID), IsSplitRA(IsSplitRA) {
initializeRISCVInsertVSETVLIPass(*PassRegistry::getPassRegistry());
}
bool runOnMachineFunction(MachineFunction &MF) override;
void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesCFG();
+
+ if (IsSplitRA) {
+ AU.addRequired<LiveIntervals>();
+ AU.addPreserved<LiveIntervals>();
+ AU.addRequired<SlotIndexes>();
+ AU.addPreserved<SlotIndexes>();
+ AU.addPreserved<LiveDebugVariables>();
+ }
+
MachineFunctionPass::getAnalysisUsage(AU);
}
@@ -864,6 +961,30 @@ static VSETVLIInfo getInfoForVSETVLI(const MachineInstr &MI) {
return NewInfo;
}
+static void fixupModifyVRegLI(Register VReg, LiveIntervals *LIS) {
+ if (!LIS)
+ return;
+
+ if (LIS->hasInterval(VReg))
+ LIS->removeInterval(VReg);
+ LIS->createAndComputeVirtRegInterval(VReg);
+}
+
+static void fixupModifyVRegLIFromVSETVL(MachineInstr *MI, LiveIntervals *LIS) {
+
+ if (!LIS)
+ return;
+
+ if (LIS->isNotInMIMap(*MI))
+ LIS->InsertMachineInstrInMaps(*MI);
+ for (auto &MO : MI->operands()) {
+ if (!MO.isReg() || MO.getReg() == 0 || !MO.getReg().isVirtual())
+ continue;
+ Register Reg = MO.getReg();
+ fixupModifyVRegLI(Reg, LIS);
+ }
+}
+
void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
MachineBasicBlock::iterator InsertPt, DebugLoc DL,
const VSETVLIInfo &Info, const VSETVLIInfo &PrevInfo) {
@@ -872,11 +993,13 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
// Use X0, X0 form if the AVL is the same and the SEW+LMUL gives the same
// VLMAX.
if (Info.hasSameAVL(PrevInfo) && Info.hasSameVLMAX(PrevInfo)) {
- BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
- .addReg(RISCV::X0, RegState::Define | RegState::Dead)
- .addReg(RISCV::X0, RegState::Kill)
- .addImm(Info.encodeVTYPE())
- .addReg(RISCV::VL, RegState::Implicit);
+ auto NeedFixupMI =
+ BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
+ .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+ .addReg(RISCV::X0, RegState::Kill)
+ .addImm(Info.encodeVTYPE())
+ .addReg(RISCV::VL, RegState::Implicit);
+ fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
return;
}
@@ -885,15 +1008,18 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
// same, we can use the X0, X0 form.
if (Info.hasSameVLMAX(PrevInfo) && Info.hasAVLReg() &&
Info.getAVLReg().isVirtual()) {
- if (MachineInstr *DefMI = MRI->getVRegDef(Info.getAVLReg())) {
+ if (MachineInstr *DefMI =
+ getReachingDefMI(Info.getAVLReg(), &(*InsertPt), MRI)) {
if (isVectorConfigInstr(*DefMI)) {
VSETVLIInfo DefInfo = getInfoForVSETVLI(*DefMI);
if (DefInfo.hasSameAVL(PrevInfo) && DefInfo.hasSameVLMAX(PrevInfo)) {
- BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
- .addReg(RISCV::X0, RegState::Define | RegState::Dead)
- .addReg(RISCV::X0, RegState::Kill)
- .addImm(Info.encodeVTYPE())
- .addReg(RISCV::VL, RegState::Implicit);
+ auto NeedFixupMI =
+ BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
+ .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+ .addReg(RISCV::X0, RegState::Kill)
+ .addImm(Info.encodeVTYPE())
+ .addReg(RISCV::VL, RegState::Implicit);
+ fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
return;
}
}
@@ -902,10 +1028,12 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
}
if (Info.hasAVLImm()) {
- BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
- .addReg(RISCV::X0, RegState::Define | RegState::Dead)
- .addImm(Info.getAVLImm())
- .addImm(Info.encodeVTYPE());
+ auto NeedFixupMI =
+ BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
+ .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+ .addImm(Info.getAVLImm())
+ .addImm(Info.encodeVTYPE());
+ fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
return;
}
@@ -915,18 +1043,22 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
// the previous vl to become invalid.
if (PrevInfo.isValid() && !PrevInfo.isUnknown() &&
Info.hasSameVLMAX(PrevInfo)) {
- BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
- .addReg(RISCV::X0, RegState::Define | RegState::Dead)
- .addReg(RISCV::X0, RegState::Kill)
- .addImm(Info.encodeVTYPE())
- .addReg(RISCV::VL, RegState::Implicit);
+ auto NeedFixupMI =
+ BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
+ .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+ .addReg(RISCV::X0, RegState::Kill)
+ .addImm(Info.encodeVTYPE())
+ .addReg(RISCV::VL, RegState::Implicit);
+ fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
return;
}
// Otherwise use an AVL of 1 to avoid depending on previous vl.
- BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
- .addReg(RISCV::X0, RegState::Define | RegState::Dead)
- .addImm(1)
- .addImm(Info.encodeVTYPE());
+ auto NeedFixupMI =
+ BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
+ .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+ .addImm(1)
+ .addImm(Info.encodeVTYPE());
+ fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
return;
}
@@ -942,10 +1074,13 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
DestReg = MRI->createVirtualRegister(&RISCV::GPRRegClass);
Opcode = RISCV::PseudoVSETVLIX0;
}
- BuildMI(MBB, InsertPt, DL, TII->get(Opcode))
- .addReg(DestReg, RegState::Define | RegState::Dead)
- .addReg(AVLReg)
- .addImm(Info.encodeVTYPE());
+ auto NeedFixupMI =
+ BuildMI(MBB, InsertPt, DL, TII->get(Opcode))
+ .addReg(DestReg, RegState::Define | RegState::Dead)
+ .addReg(AVLReg,
+ (IsSplitRA && MRI->def_empty(AVLReg)) ? RegState::Undef : 0)
+ .addImm(Info.encodeVTYPE());
+ fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
}
static bool isLMUL1OrSmaller(RISCVII::VLMUL LMUL) {
@@ -1007,7 +1142,7 @@ bool RISCVInsertVSETVLI::needVSETVLI(const MachineInstr &MI,
// VSETVLI here.
if (Require.hasAVLReg() && Require.getAVLReg().isVirtual() &&
CurInfo.hasCompatibleVTYPE(Used, Require)) {
- if (MachineInstr *DefMI = MRI->getVRegDef(Require.getAVLReg())) {
+ if (MachineInstr *DefMI = getReachingDefMI(Require.getAVLReg(), &MI, MRI)) {
if (isVectorConfigInstr(*DefMI)) {
VSETVLIInfo DefInfo = getInfoForVSETVLI(*DefMI);
if (DefInfo.hasSameAVL(CurInfo) && DefInfo.hasSameVLMAX(CurInfo))
@@ -1062,7 +1197,7 @@ void RISCVInsertVSETVLI::transferBefore(VSETVLIInfo &Info,
// without being sure we can kill the original source reg entirely.
if (!Info.hasAVLReg() || !Info.getAVLReg().isVirtual())
return;
- MachineInstr *DefMI = MRI->getVRegDef(Info.getAVLReg());
+ const MachineInstr *DefMI = getReachingDefMI(Info.getAVLReg(), &MI, MRI);
if (!DefMI || !isVectorConfigInstr(*DefMI))
return;
@@ -1187,7 +1322,7 @@ bool RISCVInsertVSETVLI::needVSETVLIPHI(const VSETVLIInfo &Require,
return true;
// We need the AVL to be produce by a PHI node in this basic block.
- MachineInstr *PHI = MRI->getVRegDef(AVLReg);
+ MachineInstr *PHI = MRI->getUniqueVRegDef(AVLReg);
if (!PHI || PHI->getOpcode() != RISCV::PHI || PHI->getParent() != &MBB)
return true;
@@ -1202,7 +1337,7 @@ bool RISCVInsertVSETVLI::needVSETVLIPHI(const VSETVLIInfo &Require,
return true;
// We need the PHI input to the be the output of a VSET(I)VLI.
- MachineInstr *DefMI = MRI->getVRegDef(InReg);
+ MachineInstr *DefMI = MRI->getUniqueVRegDef(InReg);
if (!DefMI || !isVectorConfigInstr(*DefMI))
return true;
@@ -1258,8 +1393,12 @@ void RISCVInsertVSETVLI::emitVSETVLIs(MachineBasicBlock &MBB) {
MachineOperand &VLOp = MI.getOperand(getVLOpNum(MI));
if (VLOp.isReg()) {
// Erase the AVL operand from the instruction.
+ Register VLOpReg = VLOp.getReg();
+ bool IsVirtVLOpReg = VLOp.getReg().isVirtual();
VLOp.setReg(RISCV::NoRegister);
VLOp.setIsKill(false);
+ if (IsVirtVLOpReg)
+ fixupModifyVRegLI(VLOpReg, LIS);
}
MI.addOperand(MachineOperand::CreateReg(RISCV::VL, /*isDef*/ false,
/*isImp*/ true));
@@ -1509,8 +1648,23 @@ void RISCVInsertVSETVLI::doLocalPostpass(MachineBasicBlock &MBB) {
Used = getDemanded(MI, MRI, ST);
}
- for (auto *MI : ToDelete)
+ std::vector<Register> NeedFixup;
+
+ for (auto *MI : ToDelete) {
+ for (auto &MO : MI->operands()) {
+ if (!MO.isReg() || MO.getReg() == 0 || !MO.getReg().isVirtual())
+ continue;
+ Register Reg = MO.getReg();
+ NeedFixup.push_back(Reg);
+ }
MI->eraseFromParent();
+ if (IsSplitRA)
+ LIS->RemoveMachineInstrFromMaps(*MI);
+ }
+
+ for (auto Reg : NeedFixup) {
+ fixupModifyVRegLI(Reg, LIS);
+ }
}
void RISCVInsertVSETVLI::insertReadVL(MachineBasicBlock &MBB) {
@@ -1518,11 +1672,16 @@ void RISCVInsertVSETVLI::insertReadVL(MachineBasicBlock &MBB) {
MachineInstr &MI = *I++;
if (RISCV::isFaultFirstLoad(MI)) {
Register VLOutput = MI.getOperand(1).getReg();
- if (!MRI->use_nodbg_empty(VLOutput))
- BuildMI(MBB, I, MI.getDebugLoc(), TII->get(RISCV::PseudoReadVL),
- VLOutput);
+ bool IsVirtual = MI.getOperand(1).getReg().isVirtual();
+ if (!MRI->use_nodbg_empty(VLOutput)) {
+ auto NeedFixupMI = BuildMI(MBB, I, MI.getDebugLoc(),
+ TII->get(RISCV::PseudoReadVL), VLOutput);
+ fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
+ }
// We don't use the vl output of the VLEFF/VLSEGFF anymore.
MI.getOperand(1).setReg(RISCV::X0);
+ if (IsVirtual)
+ fixupModifyVRegLI(VLOutput, LIS);
}
}
}
@@ -1538,6 +1697,11 @@ bool RISCVInsertVSETVLI::runOnMachineFunction(MachineFunction &MF) {
TII = ST->getInstrInfo();
MRI = &MF.getRegInfo();
+ if (IsSplitRA)
+ LIS = &getAnalysis<LiveIntervals>();
+ else
+ LIS = nullptr;
+
assert(BlockInfo.empty() && "Expect empty block infos");
BlockInfo.resize(MF.getNumBlockIDs());
@@ -1604,6 +1768,6 @@ bool RISCVInsertVSETVLI::runOnMachineFunction(MachineFunction &MF) {
}
/// Returns an instance of the Insert VSETVLI pass.
-FunctionPass *llvm::createRISCVInsertVSETVLIPass() {
- return new RISCVInsertVSETVLI();
+FunctionPass *llvm::createRISCVInsertVSETVLIPass(bool IsSplitRA) {
+ return new RISCVInsertVSETVLI(IsSplitRA);
}
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 953ac097b915044..a3d5def678f745d 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -27,6 +27,7 @@
#include "llvm/CodeGen/MIRParser/MIParser.h"
#include "llvm/CodeGen/MIRYamlMapping.h"
#include "llvm/CodeGen/Passes.h"
+#include "llvm/CodeGen/RegAllocRegistry.h"
#include "llvm/CodeGen/TargetLoweringObjectFileImpl.h"
#include "llvm/CodeGen/TargetPassConfig.h"
#include "llvm/InitializePasses.h"
@@ -83,6 +84,10 @@ static cl::opt<bool>
cl::desc("Enable sinking and folding of instruction copies"),
cl::init(false), cl::Hidden);
+static cl::opt<bool> EnableSplitRA("riscv-split-RA", cl::Hidden,
+ ...
[truncated]
Oh, very interesting! Thanks for working on this. I didn't know we could do partial register allocation (RA)! I have some high-level questions:
Just for me to understand: this would not require RA of RVV registers by itself; we would still need to run the InsertVSETVLI pass late, like you're doing now. Is that correct? However, splitting the passes unlocks the two following cases:

I understand that here you will be able to do that because you have only RA'd RVV registers, and the GPR virtual registers that we use as carriers of the VL value make this a bit easier. Is my intuition correct here?

Similar to the above. Did you consider partial copies as well? I *think* some of those might be hindered by the tail-undisturbed policy, which may still require spilling/copying the whole register. Do you think it does, or would that already be handled after RA of RVV registers? Looking forward to progress here, this looks promising!
static llvm::once_flag InitializeDefaultRVVRegisterAllocatorFlag;

/// -riscv-splitRA-rvv-regalloc=... command line option.
Is `riscv-splitRA-rvv-regalloc` redundant? Why not `riscv-rvv-regalloc`? It looks like that is what AMDGPU does with `sgpr-regalloc` and `vgpr-regalloc`.
Done
@@ -427,7 +535,8 @@ void RISCVPassConfig::addPreRegAlloc() {
addPass(createRISCVPreRAExpandPseudoPass());
if (TM->getOptLevel() != CodeGenOptLevel::None)
addPass(createRISCVMergeBaseOffsetOptPass());
addPass(createRISCVInsertVSETVLIPass());
if (!EnableSplitRA)
IIUC, there is nothing stopping us from being able to do split register allocation with the vsetvli insertion pass remaining where it was.
I think we should have an option for enabling split register allocation, and another option for enabling late vsetvli insertion. That way we can have the following functionality:
- Split register allocation && early vsetvli insertion
- Split register allocation && late vsetvli insertion
- Non-split register allocation && early vsetvli insertion
- Non-split register allocation && late vsetvli insertion
This flexibility may allow us to pinpoint exactly where the shortcomings are (do they come from split regalloc, late vsetvli insertion, or the combination of both)?
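For illustration, a minimal sketch of what such a pair of independent flags could look like in RISCVTargetMachine.cpp. The flag names are hypothetical (this patch currently uses a single option), and the snippet assumes the file's existing includes and `using namespace llvm;`:

```cpp
// Hypothetical flags, one per feature, so every combination listed above
// can be exercised independently.
static cl::opt<bool> EnableSplitRegAlloc(
    "riscv-split-regalloc", cl::Hidden,
    cl::desc("Run RVV register allocation separately from scalar regalloc"),
    cl::init(false));

static cl::opt<bool> EnableLateVSETVLInsertion(
    "riscv-late-vsetvli-insertion", cl::Hidden,
    cl::desc("Insert vsetvli instructions after RVV register allocation"),
    cl::init(false));

void RISCVPassConfig::addPreRegAlloc() {
  // ... existing pre-RA passes ...
  // Only schedule the early insertion pass when late insertion is off; the
  // late variant would instead be added after the RVV regalloc passes.
  if (!EnableLateVSETVLInsertion)
    addPass(createRISCVInsertVSETVLIPass(/*IsSplitRA=*/false));
}
```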
Yes. For the scheduler, it actually doesn't need a change to the register allocator.

I am only considering the trivial remat case for now ("Remove Implicit use", llvm-project/llvm/lib/CodeGen/TargetInstrInfo.cpp, line 1130 in e98f3bf).

The more complicated cases (involving a def-use chain) could be future work, either by defining a RISC-V-specific remat hook or by adding new pseudos that are suitable for the default remat.

Yes. We plan to add concrete size information to each MachineInstr. With that information, we can update RISCVInstrInfo::storeRegToStackSlot and RISCVInstrInfo::loadRegFromStackSlot to emit pseudos based on the concrete size.

No, but it works in the same flow.

This is why we need the concrete size information: to make sure a partial spill/reload doesn't break correctness.

Thanks for the comments.
As discussed at the dev meeting, I think this is the right overall direction.
In terms of staging, I think it makes sense to first split regalloc in its own review, and then add the vsetvli variant between them.

I'm concerned by the regressions you mentioned. Have you investigated them? In particular, are there any caused by splitting RA on its own? If they're "only" vsetvli regressions, those are likely less scary and a bit easier to fix.
MI && MI->getOpcode() == RISCV::ADDI &&
MI->getOperand(1).isReg() && MI->getOperand(2).isImm() &&
if (MachineInstr *MI = MRI.getUniqueVRegDef(getAVLReg());
MI && MI->getOpcode() == RISCV::ADDI && MI->getOperand(1).isReg() &&
This looks to be a stray whitespace change. Remove it.
public:
static char ID;

RISCVInsertVSETVLI() : MachineFunctionPass(ID) {
RISCVInsertVSETVLI(bool IsSplitRA = false)
In terms of this flag, maybe a better name would be isPostRVVRegAlloc?
Thanks for reviewing.
Sure.
I pushed some commits to show how many LLVM regression tests are affected when we enable splitting RA only, splitting RA with the new vsetvl insertion pass, and the new vsetvl insertion pass only. In summary, splitting RA by itself doesn't change much, but the new vsetvl insertion changes a lot, even on its own.
if (MRI->isSSA())
return MRI->getVRegDef(VReg);

return findReachingDef(VReg, MI, MRI);
Can we use `VNInfo`, `LiveQueryResult`, `getInstructionIndex`, etc. to get the definition?
Now it uses LiveIntervals/VNInfo to get the definition.
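For reference, a rough sketch of that kind of lookup, written from memory against the LiveIntervals API rather than copied from the updated patch, so treat it as illustrative:

```cpp
#include "llvm/CodeGen/LiveIntervals.h"
#include "llvm/CodeGen/SlotIndexes.h"
using namespace llvm;

// Find the instruction defining the value of Reg that is live at MI.
// Returns nullptr when the incoming value is a PHI merge (no single def).
static MachineInstr *lookupReachingDef(Register Reg, const MachineInstr &MI,
                                       LiveIntervals &LIS) {
  const LiveInterval &LI = LIS.getInterval(Reg);
  SlotIndex Idx = LIS.getInstructionIndex(MI);
  // The value number live immediately before MI reads Reg.
  const VNInfo *VNI = LI.getVNInfoBefore(Idx);
  if (!VNI || VNI->isPHIDef())
    return nullptr;
  return LIS.getInstructionFromIndex(VNI->def);
}
```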
I tried out this patch and I agree that the register allocation split should be done in a separate PR, with the split just enabled by default: there were only four lit test diffs, and none of them were regressions. I also ran this on the llvm-test-suite for rv64gv and couldn't find any regressions, just some shuffling about of registers; here's the diff:
One small regression I noticed with the post-RA insertvsetvli pass is this function:

define <4 x i1> @buildvec_mask_nonconst_v4i1(i1 %x, i1 %y) {
%1 = insertelement <4 x i1> poison, i1 %x, i32 0
%2 = insertelement <4 x i1> %1, i1 %x, i32 1
%3 = insertelement <4 x i1> %2, i1 %y, i32 2
%4 = insertelement <4 x i1> %3, i1 %y, i32 3
ret <4 x i1> %4
}

With pre-RA vsetvli insertion we have:
.cfi_startproc
# %bb.0:
vsetivli zero, 4, e8, mf4, ta, ma
vmv.v.i v0, 3
vmv.v.x v8, a1
vmerge.vxm v8, v8, a0, v0
vand.vi v8, v8, 1
vmsne.vi v0, v8, 0
ret

But post-RA insertion results in an extra vsetvli:

buildvec_mask_nonconst_v4i1: # @buildvec_mask_nonconst_v4i1
.cfi_startproc
# %bb.0:
vsetivli zero, 1, e8, mf8, ta, ma
vmv.v.i v0, 3
vsetivli zero, 4, e8, mf4, ta, ma
vmv.v.x v8, a1
vmerge.vxm v8, v8, a0, v0
vand.vi v8, v8, 1
vmsne.vi v0, v8, 0
ret

From what I can tell this is due to how the vmv.v.i gets scheduled before the vmv.v.x, e.g. the BB goes from beginning like this:
to this:
We end up with the extra vsetvli because we have an optimisation in needVSETVLI where we avoid inserting a vsetvli if we know we have a vmv.v.i with VL=1; needVSETVLI is in turn called after the first instruction in transferBefore when going from the vmv.v.x -> vmv.v.i. But because the vmv.v.i instruction is now first, we don't do the check going between the vmv.v.i -> vmv.v.x. We could probably recover this case by teaching the backwards local postpass about this.
I have opened a separate patch for splitting regalloc only: #72096
With this change, for:

declare <vscale x 8 x i64> @llvm.bitreverse.nxv8i64(<vscale x 8 x i64> %va)
define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
%a = call <vscale x 8 x i64> @llvm.bitreverse.nxv8i64(<vscale x 8 x i64> %va)
ret <vscale x 8 x i64> %a
}

We no longer mark one of the slots as dead:

diff --git a/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll b/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll
index 3514fa66f588..47f097087601 100644
--- a/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll
@@ -1134,9 +1134,9 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
; RV32-NEXT: addi sp, sp, -16
; RV32-NEXT: .cfi_def_cfa_offset 16
; RV32-NEXT: csrr a0, vlenb
-; RV32-NEXT: slli a0, a0, 3
+; RV32-NEXT: slli a0, a0, 4
; RV32-NEXT: sub sp, sp, a0
-; RV32-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 8 * vlenb
+; RV32-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
; RV32-NEXT: sw zero, 12(sp)
; RV32-NEXT: lui a0, 1044480
; RV32-NEXT: sw a0, 8(sp)
@@ -1162,6 +1162,10 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
; RV32-NEXT: addi a4, sp, 16
; RV32-NEXT: vl8r.v v0, (a4) # Unknown-size Folded Reload
; RV32-NEXT: vor.vv v24, v24, v0
+; RV32-NEXT: csrr a4, vlenb
+; RV32-NEXT: slli a4, a4, 3
+; RV32-NEXT: add a4, sp, a4
+; RV32-NEXT: addi a4, a4, 16
; RV32-NEXT: vs8r.v v24, (a4) # Unknown-size Folded Spill
; RV32-NEXT: vand.vx v0, v8, a2
; RV32-NEXT: vsll.vx v0, v0, a1
@@ -1173,7 +1177,10 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
; RV32-NEXT: vsll.vi v16, v16, 8
; RV32-NEXT: vor.vv v8, v8, v16
; RV32-NEXT: vor.vv v8, v24, v8
-; RV32-NEXT: addi a0, sp, 16
+; RV32-NEXT: csrr a0, vlenb
+; RV32-NEXT: slli a0, a0, 3
+; RV32-NEXT: add a0, sp, a0
+; RV32-NEXT: addi a0, a0, 16
; RV32-NEXT: vl8r.v v16, (a0) # Unknown-size Folded Reload
; RV32-NEXT: vor.vv v8, v8, v16
; RV32-NEXT: vsrl.vi v16, v8, 4
@@ -1207,7 +1214,7 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
; RV32-NEXT: vadd.vv v8, v8, v8
; RV32-NEXT: vor.vv v8, v16, v8
; RV32-NEXT: csrr a0, vlenb
-; RV32-NEXT: slli a0, a0, 3
+; RV32-NEXT: slli a0, a0, 4
; RV32-NEXT: add sp, sp, a0
; RV32-NEXT: addi sp, sp, 16
; RV32-NEXT: ret

It seems like it's because LiveStacks is now empty when we get to the stack slot colouring pass. Do we need to mark it as preserved?
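If that is the cause, one possible fix would be for the pass to declare LiveStacks preserved so StackSlotColoring still sees valid data. A sketch of how the in-class override could grow one line follows; this is an assumption on my part, not something verified against this PR:

```cpp
#include "llvm/CodeGen/LiveStacks.h"

// Sketch of RISCVInsertVSETVLI::getAnalysisUsage with the assumed addition.
void getAnalysisUsage(AnalysisUsage &AU) const override {
  AU.setPreservesCFG();
  // ... the LiveIntervals/SlotIndexes requirements the patch already adds ...
  AU.addPreserved<LiveStacks>(); // keep LiveStacks valid for StackSlotColoring
  MachineFunctionPass::getAnalysisUsage(AU);
}
```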
Now that #72096 has landed, the next step is a change which enables the split RA by default. Once that's in, this change can be rebased on tip of tree, and we can focus review attention on the changes to InsertVSETVLI and the resulting test changes.
Rebase with main
I ran this on SPEC CPU 2017 (-O3 -march=rv64gv) and collected the dynamic instruction count (total insns), the dynamic vsetvl instruction count (vset), the dynamic count of vsetvls that are not the x0,x0 form (vl set), and the number of spills inserted. The changes are all less than ±1%, with the exception of 619.lbm_s/519.lbm_r, where there are 40% more vsetvls executed but a 23% improvement in dynamic instruction count. The improvement seems to be due to us spilling fewer vectors in a hot loop. On leela, there are 0.9% more vsetvls executed and 0.9% more non-x0,x0s, so it's likely that the new vsetvls aren't VL-preserving. So overall I think this looks good.
LGTM. I didn't look at all the test diffs, but in the ones I did I didn't spot anything interesting in them.
Some spills are removed, some new ones are added, but after looking at the benchmark results I don't think they're an issue. The same goes for the fluctuation in copies.
This is a big change though so probably best to wait for a second LGTM too!
Did a sweep through the test diffs - mostly ignored vp tests, and don't claim to have looked at everything. We could do another round of diff reduction here, but I am not currently asking for it. I'm going to apply this downstream and look at a couple workloads, another comment will follow once I've done so.
I will note that my comments flag at least one real regression caused/triggered by this patch. I'm not yet sure that regression matters but it does appear to be real.
; CHECK-NEXT: vmv1r.v v0, v24
; CHECK-NEXT: csrr a0, vlenb
This delta is interesting and surprising - in a good way.
I hadn't considered that this change would allow reuse of the vlenb CSR read.
Though there's also a missed opportunity here, both before and after: a0 is vlenb. With etype=e64 and emul=m8, that means the VL here is VLMAX.
> This delta is interesting and surprising - in a good way.
> I hadn't considered that this change would allow reuse of the vlenb CSR read.
I think it is a coincidence that `PseudoReadVLENB` chooses `$x11` as the destination register (`$x10` is still used by vsetvl) during Prologue/Epilogue Insertion & Frame Finalization, allowing the Machine Late Instructions Cleanup Pass to reuse definitions from predecessors.
; CHECK-NEXT: lui a1, %hi(.LCPI10_0)
; CHECK-NEXT: flh fa5, %lo(.LCPI10_0)(a1)
; CHECK-NEXT: vmv1r.v v16, v0
This is another (unrelated) missed opportunity. We're moving v0 into another register, but nothing between this and restore appears to actually write to v0. Do we have an imprecise clobber specification on some instruction?
During instruction selection, the compiler will emit `$v0 = COPY %1:vr` for each RVV mask instruction. These COPYs will be eliminated by the Machine Copy Propagation Pass, but `$v16 = COPY $v0` can't be eliminated due to the early-clobber flag on `PseudoVMFLT_VFPR16_M8_MASK`.

Could `PseudoVMFLT_VFPR16_M8_MASK` accept `$v0` as the destination register, or is there another chance to remove that COPY?
; After Instruction Selection
bb.0 (%ir-block.0):
liveins: $v8m8, $v0, $x10
%2:gprnox0 = COPY $x10
%1:vr = COPY $v0
%0:vrm8 = COPY $v8m8
%3:vrm8 = COPY %0:vrm8
$v0 = COPY %1:vr
%4:vrm8nov0 = PseudoVFSGNJX_VV_M8_E16_MASK $noreg(tied-def 0), %3:vrm8, %3:vrm8, $v0, %2:gprnox0, 4, 3
%5:gpr = LUI target-flags(riscv-hi) %const.0
%6:fpr16 = FLH killed %5:gpr, target-flags(riscv-lo) %const.0 :: (load (s16) from constant-pool)
$v0 = COPY %1:vr
early-clobber %7:vr = nofpexcept PseudoVMFLT_VFPR16_M8_MASK %1:vr(tied-def 0), killed %4:vrm8nov0, killed %6:fpr16, $v0, %2:gprnox0, 4
...
; After Machine Copy Propagation Pass
bb.0 (%ir-block.0):
liveins: $v0, $x10, $v8m8
renamable $x11 = LUI target-flags(riscv-hi) %const.0
renamable $f15_h = FLH killed renamable $x11, target-flags(riscv-lo) %const.0 :: (load (s16) from constant-pool)
renamable $v16 = COPY $v0
dead $x0 = PseudoVSETVLI killed renamable $x10, 203, implicit-def $vl, implicit-def $vtype
renamable $v24m8 = PseudoVFSGNJX_VV_M8_E16_MASK undef renamable $v24m8(tied-def 0), renamable $v8m8, renamable $v8m8, $v0, $noreg, 4, 3, implicit $vl, implicit $vtype
dead $x0 = PseudoVSETVLIX0 killed $x0, 75, implicit-def $vl, implicit-def $vtype, implicit $vl
early-clobber renamable $v16 = nofpexcept PseudoVMFLT_VFPR16_M8_MASK killed renamable $v16(tied-def 0), killed renamable $v24m8, killed renamable $f15_h, $v0, $noreg, 4, implicit $vl, implicit $vtype
...
; RV32-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
; RV32-NEXT: vslideup.vi v10, v9, 4
; RV32-NEXT: vslideup.vi v9, v10, 4
Another (unrelated) opportunity. We're inserting a subvec into a zero vector just to compute a sum reduction where the zero elements are nops. We could have computed the sum on the sub-vec directly.
@@ -197,28 +197,51 @@ entry:
define void @test_compresstore_v256i8(ptr %p, <256 x i1> %mask, <256 x i8> %data) {
; RV64-LABEL: test_compresstore_v256i8:
; RV64: # %bb.0: # %entry
; RV64-NEXT: vmv1r.v v7, v8
This change is a bit concerning: we have no actual reordering happening here, but we chose a strictly less optimal register allocation solution. I don't see any obvious reason for this one; does anyone else?
Looking at the register allocation process, the main reason for the less optimal regalloc is that the register allocator makes different eviction decisions. These decisions differ because of different LiveInterval spill weights (the interval with the smaller weight is evicted). One factor in the weight calculation is the LiveInterval's size, and that changes between the pre-RA and post-RA vsetvl flows.
In this test case, `%12` is the interval being allocated and needs to evict something:
selectOrSplit VRM8:%12 [240e,288r:0) 0@240e weight:4.454343e-03 w=4.454343e-03
The pre-RA flow chooses to evict `%2`:
VR:%2 [48r,336r:0) 0@48r weight:4.404070e-03 w=4.404070e-03
VRM8:%3 [32r,240r:0) 0@32r weight:3.322368e-03 w=3.322368e-03
...
unassigning %2 from $v8: V8
assigning %12 to $v8m8: V8 [240e,288r:0) 0@240e V9 [240e,288r:0) 0@240e V10 [240e,288r:0) 0@240e V11 [240e,288r:0) 0@240e V12 [240e,288r:0) 0@240e V13 [240e,288r:0) 0@240e V14 [240e,288r:0) 0@240e V15 [240e,288r:0) 0@240e
The post-RA flow chooses to evict `%3` instead, because `%2`'s spill weight is now heavier than `%12`'s:
VR:%2 [48r,256r:0) 0@48r weight:4.983553e-03 w=4.983553e-03
VRM8:%3 [32r,192r:0) 0@32r weight:3.607143e-03 w=3.607143e-03
...
unassigning %3 from $v16m8: V16 V17 V18 V19 V20 V21 V22 V23
assigning %12 to $v16m8: V16 [192e,224r:0) 0@192e V17 [192e,224r:0) 0@192e V18 [192e,224r:0) 0@192e V19 [192e,224r:0) 0@192e V20 [192e,224r:0) 0@192e V21 [192e,224r:0)
The second round of assignment for `%3` then causes the spill code insertion.
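For readers following along, the reason interval size feeds into this: the default spill-weight normalization divides the accumulated use/def frequency by a measure of the interval's length, so when the two flows produce different interval extents, the weights shift and eviction decisions can flip. The sketch below paraphrases that normalization from memory; see VirtRegAuxInfo/CalcSpillWeights for the authoritative formula:

```cpp
// Approximate shape of the default normalization (paraphrased, not copied):
// a shorter interval with the same use/def frequency gets a larger weight
// and is therefore harder to evict.
static float approxNormalizedSpillWeight(float UseDefFreq, unsigned Size,
                                         unsigned InstrDistSlack) {
  return UseDefFreq / (Size + 25.0f * InstrDistSlack);
}
```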
@@ -35,10 +35,10 @@ define <512 x i8> @single_source(<512 x i8> %a) {
; CHECK-NEXT: vslidedown.vi v16, v16, 4
; CHECK-NEXT: li a0, 466
; CHECK-NEXT: li a1, 465
; CHECK-NEXT: lbu a2, 1012(sp)
This is a slight regression - it's the same push back heuristic I raised previously. Non blocking, but this would be good to address.
; CHECK-NEXT: lui a0, 11
; CHECK-NEXT: addi a0, a0, -1366
; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
Again, real (if minor) regression.
It looks like we're actually removing a vsetvli overall here. But we go from 1 VL toggle to 2.
Although we could go a step further and teach the doLocalPostpass to change the AVL and LMUL in this case to avoid a toggle:
vsetivli zero, 16, e16, m2, ta, ma
vmv.s.x v0, a0
vsetvli zero, zero, e8, m1, ta, ma
vslidedown.vi v8, v8, 1
This isn't something that it was able to do before, since doLocalPostpass will only mutate vsetvlis if it can remove a vsetvli entirely.
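A rough sketch of that idea, using the PseudoVSETIVLI operand layout visible in the diff above (GPR def, AVL immediate, encoded VTYPE immediate); this is an illustration, not tested code:

```cpp
// Instead of only erasing a fully redundant vsetvli, the backwards pass
// could rewrite the earlier PseudoVSETIVLI's AVL and VTYPE so the following
// one can use the VL-preserving x0,x0 form rather than a full toggle.
static void rewriteVSETIVLI(MachineInstr &MI, int64_t NewAVL,
                            int64_t NewVTypeImm) {
  assert(MI.getOpcode() == RISCV::PseudoVSETIVLI && "expected PseudoVSETIVLI");
  MI.getOperand(1).setImm(NewAVL);      // AVL immediate operand
  MI.getOperand(2).setImm(NewVTypeImm); // encoded VTYPE operand
}
```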
Following up on my prior comment: thank you @BeMg and @lukel97 for the analysis on each of the cases I flagged. For my purposes, the understanding demonstrated on the one remaining regression is sufficient. Knowing that it's an artifact of scheduling combined with an existing missed optimization gives me confidence we'll be able to address it. Given that, LGTM! I want to compliment both of you for working through a long and challenging process of splitting this up into reviewable pieces, landing and rebasing many, many times, and driving this to conclusion. Thank you!
…CVInsertVSETVLI (#92860) We have two rules in needVSETVLI where we can relax the demanded fields for slides and splats when VL=1. However these aren't present in getDemanded which prevents us from coalescing some vsetvlis around slides and splats in the backwards pass. The reasoning as to why they weren't in getDemanded is that these require us to check the value of the AVL operand, which may be stale in the backwards pass: the actual VL or VTYPE value may differ from what was precisely requested in the pseudo's operands. Using the original operands should actually be fine though, as we only care about what was originally demanded by the instruction. The current value of VL or VTYPE shouldn't influence this. This addresses some of the regressions we are seeing in #70549 from splats and slides getting reordered.
Nice, great work :) Now that rematerialization and partial spilling are unblocked, were you planning to take a look at either of them?
We no longer need to separate the passes now that llvm#70549 is landed and this will unblock llvm#89089. It's not strictly NFC because it will move coalescing before register allocation when -riscv-vsetvl-after-rvv-regalloc is disabled. But this makes it closer to the original behaviour.
@lukel97 Thanks for collaborating on these patches. And the reviewers' comments also made this patch possible.
Feel free to pick up rematerialization; I haven't started on it yet. :) I will explore whether we could do something in the scheduler to control vsetvl insertion.
Yeah, this is pretty exciting, thanks for all the work everyone! 🎉
This patch tries to get rid of the vsetvl implicit vl/vtype def-use chain and to improve register allocation quality by moving the vsetvl insertion pass after RVV register allocation.

It will gain the benefit for the following optimizations:

This patch adds a new option `-riscv-vsetvl-after-rvv-regalloc=<1|0>` to control this feature; it defaults to disabled.
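Usage note (my own example, not from the PR): with a build of `llc` that includes the RISC-V backend, the behaviour can be toggled with something like `llc -mtriple=riscv64 -mattr=+v -riscv-vsetvl-after-rvv-regalloc=1 input.ll`; omitting the flag (or passing `=0`) keeps the pre-regalloc insertion point.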