[LoopVectorize] Add support for vectorisation of more early exit loops
This patch follows on from PR llvm#107004 by adding support for
vectorisation of a simple class of loops that typically involves
searching for something, i.e.

  for (int i = 0; i < n; i++) {
    if (p[i] == val)
      return i;
  }
  return n;

or

  for (int i = 0; i < n; i++) {
    if (p1[i] != p2[i])
      return i;
  }
  return n;

In this initial commit we will only consider early exit loops legal to
vectorise if they meet these criteria:

1. There are no stores in the loop.
2. The loop must have only one early uncountable exit like those shown
in the above example.
3. The early exit block dominates the latch block.
4. The latch block must have an exact exit count.
6. The loop must not contain reductions or recurrences.
7. We must be able to prove at compile-time that loops will not contain
faulting loads.

For point 7, once this patch lands I intend to follow up by supporting
some limited cases of potentially faulting loops where we can version the
loop based on pointer alignment. For example, in the SPEC2017 benchmark
(xalancbmk) there is a std::find loop that we can vectorise provided we
add SCEV checks that the initial pointer is aligned to a multiple of
the VF. In practice, the pointer is regularly aligned to at least 32/64
bytes and, since the VF is a power of 2, any vector load <= 32/64 bytes
in size that faults will always fault on the first lane, matching the
behaviour of the scalar loop. Given we already do such speculative
versioning for loops with unknown strides, alignment-based versioning
doesn't seem to be any worse, at least for loops with only one load.
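
The alignment argument above can be checked with a small standalone sketch.
It assumes faults are raised at page granularity and uses a 4096-byte page
purely as an example; the helper names are hypothetical and are not part of
LLVM. If an aligned power-of-2 access never spans two pages, then whenever
any lane would fault the first lane lies in the same page and faults too.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of LLVM: returns true if the byte range
// [Addr, Addr + Size) touches more than one page of PageSize bytes.
constexpr bool crossesPageBoundary(std::uint64_t Addr, std::uint64_t Size,
                                   std::uint64_t PageSize) {
  return (Addr / PageSize) != ((Addr + Size - 1) / PageSize);
}

// Checks every Size-aligned address in a window of four pages and reports
// whether all of those accesses stay within a single page.
bool alignedAccessesStayInOnePage(std::uint64_t Size, std::uint64_t PageSize) {
  assert(Size != 0 && (Size & (Size - 1)) == 0 && "Size must be a power of 2");
  assert(PageSize % Size == 0 && "Size must divide the page size");
  for (std::uint64_t Addr = 0; Addr < 4 * PageSize; Addr += Size)
    if (crossesPageBoundary(Addr, Size, PageSize))
      return false;
  return true;
}
```

An unaligned access such as 32 bytes starting at address 4090 does cross a
page boundary under the same assumptions, which is why the versioning check
on the initial pointer is needed.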

This patch makes use of the existing experimental_cttz_elts intrinsic,
which is required in the vectorised early exit block to determine the
first lane that triggered the exit. This intrinsic has generic lowering
support, so it's guaranteed to work for all targets.
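
As a rough illustration of how the first exiting lane is found, the sketch
below models the vectorised search loop in plain scalar C++. It is not the
code the vectoriser emits: VF, firstActiveLane and findFirst are hypothetical
names, and firstActiveLane only stands in for the role the
count-trailing-zero-elements intrinsic plays in the vector early exit block.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t VF = 4; // assumed vectorisation factor for this sketch

// Scalar stand-in for the "first set lane" query: returns the index of the
// first true lane, or VF if no lane is set.
std::size_t firstActiveLane(const std::array<bool, VF> &Mask) {
  for (std::size_t Lane = 0; Lane < VF; ++Lane)
    if (Mask[Lane])
      return Lane;
  return VF;
}

// Models: for (i = 0; i < n; i++) if (p[i] == val) return i; return n;
std::size_t findFirst(const int *P, std::size_t N, int Val) {
  assert((P != nullptr || N == 0) && "non-empty input needs a valid pointer");
  std::size_t I = 0;
  // "Vector" body: only whole chunks are processed, so every lane that is
  // read is in bounds.
  for (; I + VF <= N; I += VF) {
    std::array<bool, VF> Mask{};
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      Mask[Lane] = P[I + Lane] == Val;
    std::size_t Lane = firstActiveLane(Mask);
    if (Lane != VF)
      return I + Lane; // early exit: chunk base plus first matching lane
  }
  // Scalar remainder for the final N % VF elements.
  for (; I < N; ++I)
    if (P[I] == Val)
      return I;
  return N;
}
```

The sketch sidesteps the faulting-load problem by never reading past N; the
real transformation instead relies on proving the loads are dereferenceable.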

Tests have been updated here:

Transforms/LoopVectorize/simple_early_exit.ll
david-arm committed Sep 23, 2024
1 parent f4eeae1 commit 02d078e
Showing 12 changed files with 2,000 additions and 372 deletions.
4 changes: 4 additions & 0 deletions llvm/include/llvm/Support/GenericLoopInfo.h
@@ -294,6 +294,10 @@ template <class BlockT, class LoopT> class LoopBase {
/// Otherwise return null.
BlockT *getUniqueExitBlock() const;

/// Return the unique exit block for the latch, or null if there are multiple
/// different exit blocks.
BlockT *getUniqueLatchExitBlock() const;

/// Return true if this loop does not have any exit blocks.
bool hasNoExitBlocks() const;

10 changes: 10 additions & 0 deletions llvm/include/llvm/Support/GenericLoopInfoImpl.h
@@ -159,6 +159,16 @@ BlockT *LoopBase<BlockT, LoopT>::getUniqueExitBlock() const {
return getExitBlockHelper(this, true).first;
}

template <class BlockT, class LoopT>
BlockT *LoopBase<BlockT, LoopT>::getUniqueLatchExitBlock() const {
const BlockT *Latch = getLoopLatch();
assert(Latch && "Latch block must exist");
SmallVector<BlockT *, 4> ExitBlocks;
getUniqueExitBlocksHelper(this, ExitBlocks,
[Latch](const BlockT *BB) { return BB == Latch; });
return ExitBlocks.size() == 1 ? ExitBlocks[0] : nullptr;
}

/// getExitEdges - Return all pairs of (_inside_block_,_outside_block_).
template <class BlockT, class LoopT>
void LoopBase<BlockT, LoopT>::getExitEdges(
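
To make the intent of getUniqueLatchExitBlock concrete, here is a toy model
that does not use the LLVM API. Blocks are plain integers, exit edges are
(exiting block, exit block) pairs, and Block, ExitEdge, NoBlock and
uniqueLatchExitBlock are all hypothetical names. The sketch keeps only the
exit edges that leave from the latch and returns the exit block if exactly
one distinct block remains.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using Block = int;                        // hypothetical block id
using ExitEdge = std::pair<Block, Block>; // (exiting block, exit block)
constexpr Block NoBlock = -1;             // plays the role of a null result

// Returns the single exit block reachable from Latch, or NoBlock if the
// latch has no exit edge or exits to several different blocks.
Block uniqueLatchExitBlock(const std::vector<ExitEdge> &Edges, Block Latch) {
  assert(Latch != NoBlock && "Latch block must exist");
  std::vector<Block> Exits;
  for (const ExitEdge &E : Edges) {
    if (E.first != Latch)
      continue; // ignore exits taken from other blocks, e.g. an early exit
    if (std::find(Exits.begin(), Exits.end(), E.second) == Exits.end())
      Exits.push_back(E.second);
  }
  return Exits.size() == 1 ? Exits.front() : NoBlock;
}
```

For a loop with an early exit from block 1 to block 10 and a latch exit from
block 3 to block 20, the loop as a whole has two exit blocks, yet the latch
still has the unique exit block 20.
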
20 changes: 15 additions & 5 deletions llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -79,6 +79,12 @@ static cl::opt<LoopVectorizeHints::ScalableForceKind>
"Scalable vectorization is available and favored when the "
"cost is inconclusive.")));

static cl::opt<bool> AssumeNoMemFault(
"vectorizer-no-mem-fault", cl::init(false), cl::Hidden,
cl::desc("Assume vectorized loops will not have memory faults, which is "
"potentially unsafe but can be useful for testing vectorization "
"of early exit loops."));

/// Maximum vectorization interleave count.
static const unsigned MaxInterleaveFactor = 16;

@@ -1579,11 +1585,15 @@ bool LoopVectorizationLegality::isVectorizableEarlyExitLoop() {
Predicates.clear();
if (!isDereferenceableReadOnlyLoop(TheLoop, PSE.getSE(), DT, AC,
&Predicates)) {
reportVectorizationFailure(
"Loop may fault",
"Cannot vectorize potentially faulting early exit loop",
"PotentiallyFaultingEarlyExitLoop", ORE, TheLoop);
return false;
if (!AssumeNoMemFault) {
reportVectorizationFailure(
"Loop may fault",
"Cannot vectorize potentially faulting early exit loop",
"PotentiallyFaultingEarlyExitLoop", ORE, TheLoop);
return false;
} else
LLVM_DEBUG(dbgs() << "LV: Assuming early exit vector loop will not "
<< "fault\n");
}

[[maybe_unused]] const SCEV *SymbolicMaxBTC =
425 changes: 377 additions & 48 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Large diffs are not rendered by default.

80 changes: 66 additions & 14 deletions llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -421,7 +421,14 @@ VPBasicBlock::createEmptyBasicBlock(VPTransformState::CFGState &CFG) {

// Hook up the new basic block to its predecessors.
for (VPBlockBase *PredVPBlock : getHierarchicalPredecessors()) {
VPBasicBlock *PredVPBB = PredVPBlock->getExitingBasicBlock();
auto *VPRB = dyn_cast<VPRegionBlock>(PredVPBlock);

// The exiting block that leads to this block might be an early exit from
// a loop region.
VPBasicBlock *PredVPBB = VPRB && VPRB->getEarlyExit() == this
? cast<VPBasicBlock>(VPRB->getEarlyExiting())
: PredVPBlock->getExitingBasicBlock();

auto &PredVPSuccessors = PredVPBB->getHierarchicalSuccessors();
BasicBlock *PredBB = CFG.VPBB2IRBB[PredVPBB];

@@ -443,6 +450,11 @@ VPBasicBlock::createEmptyBasicBlock(VPTransformState::CFGState &CFG) {
// Set each forward successor here when it is created, excluding
// backedges. A backward successor is set when the branch is created.
unsigned idx = PredVPSuccessors.front() == this ? 0 : 1;
VPRegionBlock *PredParentRegion =
dyn_cast_or_null<VPRegionBlock>(PredVPBB->getParent());
if (PredParentRegion && PredParentRegion->getEarlyExiting() == PredVPBB) {
idx = 1 - idx;
}
assert(!TermBr->getSuccessor(idx) &&
"Trying to reset an existing successor block.");
TermBr->setSuccessor(idx, NewBB);
@@ -499,6 +511,7 @@ void VPBasicBlock::execute(VPTransformState *State) {
!((SingleHPred = getSingleHierarchicalPredecessor()) &&
SingleHPred->getExitingBasicBlock() == PrevVPBB &&
PrevVPBB->getSingleHierarchicalSuccessor() &&
PrevVPBB != getEnclosingLoopRegion()->getEarlyExiting() &&
(SingleHPred->getParent() == getEnclosingLoopRegion() &&
!IsLoopRegion(SingleHPred))) && /* B */
!(Replica && getPredecessors().empty())) { /* C */
@@ -517,7 +530,8 @@ void VPBasicBlock::execute(VPTransformState *State) {
UnreachableInst *Terminator = State->Builder.CreateUnreachable();
// Register NewBB in its loop. In innermost loops it's the same for all
// BBs.
if (State->CurrentVectorLoop)
if (State->CurrentVectorLoop &&
this != getEnclosingLoopRegion()->getEarlyExit())
State->CurrentVectorLoop->addBasicBlockToLoop(NewBB, *State->LI);
State->Builder.SetInsertPoint(Terminator);
State->CFG.PrevBB = NewBB;
@@ -635,7 +649,11 @@ const VPRecipeBase *VPBasicBlock::getTerminator() const {
}

bool VPBasicBlock::isExiting() const {
return getParent() && getParent()->getExitingBasicBlock() == this;
const VPRegionBlock *VPRB = getParent();
if (!VPRB)
return false;
return VPRB->getExitingBasicBlock() == this ||
VPRB->getEarlyExiting() == this;
}

#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
@@ -876,13 +894,15 @@ static VPIRBasicBlock *createVPIRBasicBlockFor(BasicBlock *BB) {
VPlanPtr VPlan::createInitialVPlan(Type *InductionTy,
PredicatedScalarEvolution &PSE,
bool RequiresScalarEpilogueCheck,
bool TailFolded, Loop *TheLoop) {
bool TailFolded, Loop *TheLoop,
BasicBlock *EarlyExitingBB,
BasicBlock *EarlyExitBB) {
VPIRBasicBlock *Entry = createVPIRBasicBlockFor(TheLoop->getLoopPreheader());
VPBasicBlock *VecPreheader = new VPBasicBlock("vector.ph");
auto Plan = std::make_unique<VPlan>(Entry, VecPreheader);

// Create SCEV and VPValue for the trip count.
const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount();
const SCEV *BackedgeTakenCount = PSE.getSymbolicMaxBackedgeTakenCount();
assert(!isa<SCEVCouldNotCompute>(BackedgeTakenCount) && "Invalid loop count");
ScalarEvolution &SE = *PSE.getSE();
const SCEV *TripCount =
@@ -902,6 +922,13 @@ VPlanPtr VPlan::createInitialVPlan(Type *InductionTy,
VPBasicBlock *MiddleVPBB = new VPBasicBlock("middle.block");
VPBlockUtils::insertBlockAfter(MiddleVPBB, TopRegion);

if (EarlyExitingBB) {
VPBasicBlock *EarlyExitVPBB = new VPBasicBlock("vector.early.exit");
TopRegion->setEarlyExit(EarlyExitVPBB);
VPBlockUtils::connectBlocks(TopRegion, EarlyExitVPBB);
TopRegion->setOrigEarlyExit(EarlyExitBB);
}

VPBasicBlock *ScalarPH = new VPBasicBlock("scalar.ph");
if (!RequiresScalarEpilogueCheck) {
VPBlockUtils::connectBlocks(MiddleVPBB, ScalarPH);
@@ -916,7 +943,7 @@ VPlanPtr VPlan::createInitialVPlan(Type *InductionTy,
// 2) If we require a scalar epilogue, there is no conditional branch as
// we unconditionally branch to the scalar preheader. Do nothing.
// 3) Otherwise, construct a runtime check.
BasicBlock *IRExitBlock = TheLoop->getUniqueExitBlock();
BasicBlock *IRExitBlock = TheLoop->getUniqueLatchExitBlock();
auto *VPExitBlock = createVPIRBasicBlockFor(IRExitBlock);
// The connection order corresponds to the operands of the conditional branch.
VPBlockUtils::insertBlockAfter(VPExitBlock, MiddleVPBB);
@@ -992,7 +1019,8 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
/// VPBB are moved to the end of the newly created VPIRBasicBlock. VPBB must
/// have a single predecessor, which is rewired to the new VPIRBasicBlock. All
/// successors of VPBB, if any, are rewired to the new VPIRBasicBlock.
static void replaceVPBBWithIRVPBB(VPBasicBlock *VPBB, BasicBlock *IRBB) {
static VPIRBasicBlock *replaceVPBBWithIRVPBB(VPBasicBlock *VPBB,
BasicBlock *IRBB) {
VPIRBasicBlock *IRVPBB = createVPIRBasicBlockFor(IRBB);
for (auto &R : make_early_inc_range(*VPBB)) {
assert(!R.isPhi() && "Tried to move phi recipe to end of block");
@@ -1006,6 +1034,7 @@ static void replaceVPBBWithIRVPBB(VPBasicBlock *VPBB, BasicBlock *IRBB) {
VPBlockUtils::disconnectBlocks(VPBB, Succ);
}
delete VPBB;
return IRVPBB;
}

/// Generate the code inside the preheader and body of the vectorized loop.
@@ -1029,7 +1058,7 @@ void VPlan::execute(VPTransformState *State) {
// VPlan execution rather than earlier during VPlan construction.
BasicBlock *MiddleBB = State->CFG.ExitBB;
VPBasicBlock *MiddleVPBB =
cast<VPBasicBlock>(getVectorLoopRegion()->getSingleSuccessor());
cast<VPBasicBlock>(getVectorLoopRegion()->getSuccessors()[0]);
// Find the VPBB for the scalar preheader, relying on the current structure
// when creating the middle block and its successors: if there's a single
// predecessor, it must be the scalar preheader. Otherwise, the second
@@ -1043,7 +1072,14 @@ void VPlan::execute(VPTransformState *State) {
assert(!isa<VPIRBasicBlock>(ScalarPhVPBB) &&
"scalar preheader cannot be wrapped already");
replaceVPBBWithIRVPBB(ScalarPhVPBB, ScalarPh);
replaceVPBBWithIRVPBB(MiddleVPBB, MiddleBB);
MiddleVPBB = replaceVPBBWithIRVPBB(MiddleVPBB, MiddleBB);

// Ensure the middle block is still the first successor.
for (auto *Succ : getVectorLoopRegion()->getSuccessors())
if (Succ == MiddleVPBB) {
getVectorLoopRegion()->moveSuccessorToFront(MiddleVPBB);
break;
}

// Disconnect the middle block from its single successor (the scalar loop
// header) in both the CFG and DT. The branch will be recreated during VPlan
@@ -1104,6 +1140,20 @@ void VPlan::execute(VPTransformState *State) {
cast<PHINode>(Phi)->addIncoming(Val, VectorLatchBB);
}

// Patch up early exiting vector block to jump to the original scalar loop's
// early exit block.
if (getVectorLoopRegion()->getEarlyExit()) {
VPBasicBlock *EarlyExitVPBB =
cast<VPBasicBlock>(getVectorLoopRegion()->getEarlyExit());
BasicBlock *VectorEarlyExitBB = State->CFG.VPBB2IRBB[EarlyExitVPBB];
BasicBlock *OrigEarlyExitBB = getVectorLoopRegion()->getOrigEarlyExit();
BranchInst *BI = BranchInst::Create(OrigEarlyExitBB);
BI->insertBefore(VectorEarlyExitBB->getTerminator());
VectorEarlyExitBB->getTerminator()->eraseFromParent();
State->CFG.DTU.applyUpdates(
{{DominatorTree::Insert, VectorEarlyExitBB, OrigEarlyExitBB}});
}

State->CFG.DTU.flush();
assert(State->CFG.DTU.getDomTree().verify(
DominatorTree::VerificationLevel::Fast) &&
@@ -1212,9 +1262,10 @@ LLVM_DUMP_METHOD
void VPlan::dump() const { print(dbgs()); }
#endif

void VPlan::addLiveOut(PHINode *PN, VPValue *V) {
assert(LiveOuts.count(PN) == 0 && "an exit value for PN already exists");
LiveOuts.insert({PN, new VPLiveOut(PN, V)});
void VPlan::addLiveOut(PHINode *PN, VPValue *V, VPBasicBlock *IncomingBlock) {
auto Key = std::pair<PHINode *, VPBasicBlock *>(PN, IncomingBlock);
assert(LiveOuts.count(Key) == 0 && "an exit value for PN already exists");
LiveOuts.insert({Key, new VPLiveOut(PN, V)});
}

static void remapOperands(VPBlockBase *Entry, VPBlockBase *NewEntry,
@@ -1285,8 +1336,9 @@ VPlan *VPlan::duplicate() {
remapOperands(Entry, NewEntry, Old2NewVPValues);

// Clone live-outs.
for (const auto &[_, LO] : LiveOuts)
NewPlan->addLiveOut(LO->getPhi(), Old2NewVPValues[LO->getOperand(0)]);
for (const auto &[Key, LO] : LiveOuts)
NewPlan->addLiveOut(LO->getPhi(), Old2NewVPValues[LO->getOperand(0)],
Key.second);

// Initialize remaining fields of cloned VPlan.
NewPlan->VFs = VFs;
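
The addLiveOut change in this file keys live-outs on a (phi, incoming block)
pair instead of on the phi alone. The sketch below illustrates that idea with
standard containers only. LiveOutTable and the string ids are hypothetical
stand-ins for the VPlan types, and the motivation stated in the comments
(one exit phi receiving values from more than one vector exit block) is an
assumption rather than something the diff spells out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins for PHINode *, VPBasicBlock * and VPValue *.
using PhiId = std::string;
using BlockId = std::string;
using ValueId = std::string;

class LiveOutTable {
  // Keyed on (phi, incoming block), so the same phi may appear once per
  // incoming block (assumed use: middle block and early exit block).
  std::map<std::pair<PhiId, BlockId>, ValueId> LiveOuts;

public:
  void add(const PhiId &Phi, const ValueId &V, const BlockId &Incoming) {
    std::pair<PhiId, BlockId> Key(Phi, Incoming);
    assert(LiveOuts.count(Key) == 0 &&
           "an exit value for this (phi, block) pair already exists");
    LiveOuts.insert({Key, V});
  }

  std::size_t size() const { return LiveOuts.size(); }

  const ValueId &get(const PhiId &Phi, const BlockId &Incoming) const {
    return LiveOuts.at(std::make_pair(Phi, Incoming));
  }
};
```

With a phi-only key the second registration for the same phi would trip the
assertion; with the pair key both entries coexist and can be looked up
independently.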