Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LoopVectorize] Teach LoopVectorizationLegality about more early exits #107004

Merged
merged 5 commits into from
Sep 19, 2024

Conversation

david-arm
Copy link
Contributor

@david-arm david-arm commented Sep 2, 2024

This patch is split off from PR #88385 and concerns only the code related to the legality of vectorising early exit loops. It is the first step in adding support for vectorisation of a simple class of loops that typically involves searching for something, i.e.

  for (int i = 0; i < n; i++) {
    if (p[i] == val)
      return i;
  }
  return n;

or

  for (int i = 0; i < n; i++) {
    if (p1[i] != p2[i])
      return i;
  }
  return n;

In this initial commit LoopVectorizationLegality will only consider early exit loops legal for vectorising if they follow these criteria:

  1. There are no stores in the loop.
  2. The loop must have only one early exit like those shown in the above example. I have referred to such exits as speculative early exits, to distinguish from existing support for early exits where the exit-not-taken count is known exactly at compile time.
  3. The early exit block dominates the latch block.
  4. The latch block must have an exact exit count.
  5. There are no loads after the early exit block.
  6. The loop must not contain reductions or recurrences. I don't see anything fundamental blocking vectorisation of such loops, but I just haven't done the work to support them yet.
  7. We must be able to prove at compile-time that loops will not contain faulting loads.

Tests have been added here:

Transforms/LoopVectorize/AArch64/simple_early_exit.ll

@llvmbot
Copy link
Member

llvmbot commented Sep 2, 2024

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: David Sherwood (david-arm)

Changes

This patch is split off from PR #88385 and concerns only the code related to the legality of vectorising early exit loops. It is the first step in adding support for vectorisation of a simple class of loops that typically involves searching for something, i.e.

for (int i = 0; i < n; i++) {
if (p[i] == val)
return i;
}
return n;

or

for (int i = 0; i < n; i++) {
if (p1[i] != p2[i])
return i;
}
return n;

In this initial commit LoopVectorizationLegality will only consider early exit loops legal for vectorising if they follow these criteria:

  1. There are no stores in the loop.
  2. The loop must have only one early exit like those shown in the above example. I have referred to such exits as speculative early exits, to distinguish from existing support for early exits where the exit-not-taken count is known exactly at compile time.
  3. The early exit block dominates the latch block.
  4. The latch block must have an exact exit count.
  5. There are no loads after the early exit block.
  6. The loop must not contain reductions or recurrences. I don't see anything fundamental blocking vectorisation of such loops, but I just haven't done the work to support them yet.
  7. We must be able to prove at compile-time that loops will not contain faulting loads.

Tests have been added here:

Transforms/LoopVectorize/AArch64/simple_early_exit.ll


Patch is 82.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/107004.diff

9 Files Affected:

  • (modified) llvm/include/llvm/Analysis/LoopAccessAnalysis.h (+3-1)
  • (modified) llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h (+49-1)
  • (modified) llvm/lib/Analysis/LoopAccessAnalysis.cpp (+5-2)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp (+146-12)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+6)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll (+1584)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/vectorization-remarks-missed.ll (+5-5)
  • (modified) llvm/test/Transforms/LoopVectorize/control-flow.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/remarks-multi-exit-loops.ll (+1-1)
diff --git a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
index 73d9c26ed6b1b7..5ea2202c4d4559 100644
--- a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
+++ b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
@@ -660,7 +660,8 @@ class LoopAccessInfo {
   bool isInvariant(Value *V) const;
 
   unsigned getNumStores() const { return NumStores; }
-  unsigned getNumLoads() const { return NumLoads;}
+  unsigned getNumLoads() const { return NumLoads; }
+  unsigned getNumCalls() const { return NumCalls; }
 
   /// The diagnostics report generated for the analysis.  E.g. why we
   /// couldn't analyze the loop.
@@ -753,6 +754,7 @@ class LoopAccessInfo {
 
   unsigned NumLoads = 0;
   unsigned NumStores = 0;
+  unsigned NumCalls = 0;
 
   /// Cache the result of analyzeLoop.
   bool CanVecMem = false;
diff --git a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
index 0f4d1355dd2bfe..bb21f97c0ae521 100644
--- a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
+++ b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
@@ -377,6 +377,24 @@ class LoopVectorizationLegality {
     return LAI->getDepChecker().getMaxSafeVectorWidthInBits();
   }
 
+  /// Returns true if the loop has a speculative early exit, i.e. an
+  /// uncountable exit that isn't the latch block.
+  bool hasSpeculativeEarlyExit() const { return HasSpeculativeEarlyExit; }
+
+  /// Returns the speculative early exiting block.
+  BasicBlock *getSpeculativeEarlyExitingBlock() const {
+    assert(getUncountableExitingBlocks().size() == 1 &&
+           "Expected only a single uncountable exiting block");
+    return getUncountableExitingBlocks()[0];
+  }
+
+  /// Returns the destination of a speculative early exiting block.
+  BasicBlock *getSpeculativeEarlyExitBlock() const {
+    assert(getUncountableExitBlocks().size() == 1 &&
+           "Expected only a single uncountable exit block");
+    return getUncountableExitBlocks()[0];
+  }
+
   /// Returns true if vector representation of the instruction \p I
   /// requires mask.
   bool isMaskRequired(const Instruction *I) const {
@@ -404,6 +422,22 @@ class LoopVectorizationLegality {
 
   DominatorTree *getDominatorTree() const { return DT; }
 
+  /// Returns all exiting blocks with a countable exit, i.e. the
+  /// exit-not-taken count is known exactly at compile time.
+  const SmallVector<BasicBlock *, 4> &getCountableExitingBlocks() const {
+    return CountableExitingBlocks;
+  }
+
+  /// Returns all the exiting blocks with an uncountable exit.
+  const SmallVector<BasicBlock *, 4> &getUncountableExitingBlocks() const {
+    return UncountableExitingBlocks;
+  }
+
+  /// Returns all the exit blocks from uncountable exiting blocks.
+  SmallVector<BasicBlock *, 4> getUncountableExitBlocks() const {
+    return UncountableExitBlocks;
+  }
+
 private:
   /// Return true if the pre-header, exiting and latch blocks of \p Lp and all
   /// its nested loops are considered legal for vectorization. These legal
@@ -436,7 +470,7 @@ class LoopVectorizationLegality {
   /// we read and write from memory. This method checks if it is
   /// legal to vectorize the code, considering only memory constrains.
   /// Returns true if the loop is vectorizable
-  bool canVectorizeMemory();
+  bool canVectorizeMemory(bool IsEarlyExitLoop);
 
   /// Return true if we can vectorize this loop using the IF-conversion
   /// transformation.
@@ -446,6 +480,10 @@ class LoopVectorizationLegality {
   /// specific checks for outer loop vectorization.
   bool canVectorizeOuterLoop();
 
+  /// Returns true if this is a supported early exit loop that we can
+  /// vectorize.
+  bool isVectorizableEarlyExitLoop();
+
   /// Return true if all of the instructions in the block can be speculatively
   /// executed, and record the loads/stores that require masking.
   /// \p SafePtrs is a list of addresses that are known to be legal and we know
@@ -551,6 +589,16 @@ class LoopVectorizationLegality {
   /// (potentially) make a better decision on the maximum VF and enable
   /// the use of those function variants.
   bool VecCallVariantsFound = false;
+
+  /// Indicates whether this loop has a speculative early exit, i.e. an
+  /// uncountable exiting block that is not the latch.
+  bool HasSpeculativeEarlyExit = false;
+
+  /// Keeps track of all the exits with known or countable exit-not-taken
+  /// counts.
+  SmallVector<BasicBlock *, 4> CountableExitingBlocks;
+  SmallVector<BasicBlock *, 4> UncountableExitingBlocks;
+  SmallVector<BasicBlock *, 4> UncountableExitBlocks;
 };
 
 } // namespace llvm
diff --git a/llvm/lib/Analysis/LoopAccessAnalysis.cpp b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
index 980f142f113265..bf367a77aae1ce 100644
--- a/llvm/lib/Analysis/LoopAccessAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -2445,8 +2445,11 @@ bool LoopAccessInfo::analyzeLoop(AAResults *AA, const LoopInfo *LI,
       // vectorize a loop if it contains known function calls that don't set
       // the flag. Therefore, it is safe to ignore this read from memory.
       auto *Call = dyn_cast<CallInst>(&I);
-      if (Call && getVectorIntrinsicIDForCall(Call, TLI))
-        continue;
+      if (Call) {
+        NumCalls++;
+        if (getVectorIntrinsicIDForCall(Call, TLI))
+          continue;
+      }
 
       // If this is a load, save it. If this instruction can read from memory
       // but is not a load, then we quit. Notice that we don't handle function
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index 66a779da8c25bc..3efdfd1fa2bc6e 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -1048,7 +1048,7 @@ bool LoopVectorizationLegality::canVectorizeInstrs() {
   return true;
 }
 
-bool LoopVectorizationLegality::canVectorizeMemory() {
+bool LoopVectorizationLegality::canVectorizeMemory(bool IsEarlyExitLoop) {
   LAI = &LAIs.getInfo(*TheLoop);
   const OptimizationRemarkAnalysis *LAR = LAI->getReport();
   if (LAR) {
@@ -1070,6 +1070,50 @@ bool LoopVectorizationLegality::canVectorizeMemory() {
     return false;
   }
 
+  // For loops with uncountable early exiting blocks that are not the latch
+  // it's necessary to perform extra checks, since the vectoriser is currently
+  // only capable of handling simple search loops.
+  if (IsEarlyExitLoop) {
+    // We don't support calls or any memory accesses that write to memory.
+    if (LAI->getNumStores()) {
+      reportVectorizationFailure(
+          "Writes to memory unsupported in early exit loops",
+          "Cannot vectorize early exit loop with writes to memory",
+          "WritesInEarlyExitLoop", ORE, TheLoop);
+      return false;
+    }
+
+    if (LAI->getNumCalls()) {
+      reportVectorizationFailure(
+          "Calls unsupported in early exit loops",
+          "Cannot vectorize early exit loop with function calls",
+          "CallsInEarlyExitLoop", ORE, TheLoop);
+      return false;
+    }
+
+    // The vectoriser cannot handle loads that occur after the early exit block.
+    BasicBlock *LatchBB = TheLoop->getLoopLatch();
+    for (Instruction &I : *LatchBB) {
+      if (I.mayReadFromMemory()) {
+        reportVectorizationFailure(
+            "Loads not permitted after early exit",
+            "Cannot vectorize early exit loop with loads after early exit",
+            "LoadsAfterEarlyExit", ORE, TheLoop);
+        return false;
+      }
+    }
+
+    // The vectoriser does not yet handle loops that may fault, but this will
+    // be improved in a follow-on patch.
+    if (!isDereferenceableReadOnlyLoop(TheLoop, PSE.getSE(), DT, AC)) {
+      reportVectorizationFailure(
+          "Loop may fault",
+          "Cannot vectorize potentially faulting early exit loop",
+          "PotentiallyFaultingEarlyExitLoop", ORE, TheLoop);
+      return false;
+    }
+  }
+
   // We can vectorize stores to invariant address when final reduction value is
   // guaranteed to be stored at the end of the loop. Also, if decision to
   // vectorize loop is made, runtime checks are added so as to make sure that
@@ -1442,6 +1486,95 @@ bool LoopVectorizationLegality::canVectorizeLoopNestCFG(
   return Result;
 }
 
+bool LoopVectorizationLegality::isVectorizableEarlyExitLoop() {
+  // At least one of the exiting blocks must be the latch.
+  BasicBlock *LatchBB = TheLoop->getLoopLatch();
+  if (!LatchBB) {
+    reportVectorizationFailure("Loop does not have a latch",
+                               "Cannot vectorize early exit loop",
+                               "NoLatchEarlyExit", ORE, TheLoop);
+    return false;
+  }
+
+  SmallVector<BasicBlock *, 8> ExitingBlocks;
+  TheLoop->getExitingBlocks(ExitingBlocks);
+
+  // Keep a record of all the exiting blocks with exact exit counts, as well as
+  // those with inexact counts.
+  SmallVector<const SCEVPredicate *, 4> Predicates;
+  for (BasicBlock *BB1 : ExitingBlocks) {
+    const SCEV *EC =
+        PSE.getSE()->getPredicatedExitCount(TheLoop, BB1, &Predicates);
+    if (isa<SCEVCouldNotCompute>(EC)) {
+      UncountableExitingBlocks.push_back(BB1);
+
+      unsigned NumExitBlocks = 0;
+      for (BasicBlock *BB2 : successors(BB1)) {
+        if (!TheLoop->contains(BB2)) {
+          UncountableExitBlocks.push_back(BB2);
+          NumExitBlocks++;
+        }
+      }
+      if (NumExitBlocks > 1) {
+        reportVectorizationFailure(
+            "Early exiting block has more than one successor outside of loop",
+            "Too many successors from early exiting block",
+            "EarlyExitTooManySuccessors", ORE, TheLoop);
+        return false;
+      }
+    } else
+      CountableExitingBlocks.push_back(BB1);
+  }
+  Predicates.clear();
+
+  // We only support one uncountable early exit.
+  if (getUncountableExitingBlocks().size() != 1) {
+    reportVectorizationFailure(
+        "Loop has too many uncountable exits",
+        "Cannot vectorize early exit loop with more than one early exit",
+        "TooManyUncountableEarlyExits", ORE, TheLoop);
+    return false;
+  }
+
+  // The only supported early exit loops so far are ones where the early
+  // exiting block is a unique predecessor of the latch block.
+  BasicBlock *LatchPredBB = LatchBB->getUniquePredecessor();
+  if (!LatchPredBB || LatchPredBB != getUncountableExitingBlocks()[0]) {
+    reportVectorizationFailure("Early exit is not the latch predecessor",
+                               "Cannot vectorize early exit loop",
+                               "EarlyExitNotLatchPredecessor", ORE, TheLoop);
+    return false;
+  }
+
+  if (Reductions.size() || FixedOrderRecurrences.size()) {
+    reportVectorizationFailure(
+        "Found reductions or recurrences in early-exit loop",
+        "Cannot vectorize early exit loop with reductions or recurrences",
+        "RecurrencesInEarlyExitLoop", ORE, TheLoop);
+    return false;
+  }
+
+  LLVM_DEBUG(
+      dbgs()
+      << "LV: Found an early exit. Retrying with speculative exit count.\n");
+  if (isa<SCEVCouldNotCompute>(
+          PSE.getSE()->getPredicatedExitCount(TheLoop, LatchBB, &Predicates))) {
+    reportVectorizationFailure(
+        "Cannot determine exact exit count for latch block",
+        "Cannot vectorize early exit loop",
+        "UnknownLatchExitCountEarlyExitLoop", ORE, TheLoop);
+    return false;
+  }
+
+  const SCEV *SpecExitCount = PSE.getSymbolicMaxBackedgeTakenCount();
+  assert(!isa<SCEVCouldNotCompute>(SpecExitCount) &&
+         "Failed to get symbolic expression for backedge taken count");
+
+  LLVM_DEBUG(dbgs() << "LV: Found speculative backedge taken count: "
+                    << *SpecExitCount << '\n');
+  return true;
+}
+
 bool LoopVectorizationLegality::canVectorize(bool UseVPlanNativePath) {
   // Store the result and return it at the end instead of exiting early, in case
   // allowExtraAnalysis is used to report multiple reasons for not vectorizing.
@@ -1500,19 +1633,20 @@ bool LoopVectorizationLegality::canVectorize(bool UseVPlanNativePath) {
       return false;
   }
 
-  // Go over each instruction and look at memory deps.
-  if (!canVectorizeMemory()) {
-    LLVM_DEBUG(dbgs() << "LV: Can't vectorize due to memory conflicts\n");
-    if (DoExtraAnalysis)
-      Result = false;
-    else
-      return false;
+  HasSpeculativeEarlyExit = false;
+  if (isa<SCEVCouldNotCompute>(PSE.getBackedgeTakenCount())) {
+    if (!isVectorizableEarlyExitLoop()) {
+      if (DoExtraAnalysis)
+        Result = false;
+      else
+        return false;
+    } else
+      HasSpeculativeEarlyExit = true;
   }
 
-  if (isa<SCEVCouldNotCompute>(PSE.getBackedgeTakenCount())) {
-    reportVectorizationFailure("could not determine number of loop iterations",
-                               "could not determine number of loop iterations",
-                               "CantComputeNumberOfIterations", ORE, TheLoop);
+  // Go over each instruction and look at memory deps.
+  if (!canVectorizeMemory(HasSpeculativeEarlyExit)) {
+    LLVM_DEBUG(dbgs() << "LV: Can't vectorize due to memory conflicts\n");
     if (DoExtraAnalysis)
       Result = false;
     else
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index a8722db654f5c9..bef7aa264157ce 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -9802,6 +9802,12 @@ bool LoopVectorizePass::processLoop(Loop *L) {
     return false;
   }
 
+  if (LVL.hasSpeculativeEarlyExit()) {
+    LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Auto-vectorization of early "
+                      << "exit loops is not yet supported.\n");
+    return false;
+  }
+
   // Entrance to the VPlan-native vectorization path. Outer loops are processed
   // here. They may require CFG and instruction level transformations before
   // even evaluating whether vectorization is profitable. Since we cannot modify
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll b/llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll
new file mode 100644
index 00000000000000..d677598399f1c2
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll
@@ -0,0 +1,1584 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -S < %s -p loop-vectorize -mattr=+sve -debug-only=loop-vectorize \
+; RUN:    -mtriple aarch64-linux-gnu 2>%t | FileCheck %s --check-prefixes=CHECK
+; RUN: cat %t | FileCheck %s --check-prefix=DEBUG
+
+declare void @init_mem(ptr, i64);
+
+define i64 @same_exit_block_pre_inc_use1() {
+; DEBUG-LABEL: LV: Checking a loop in 'same_exit_block_pre_inc_use1'
+; DEBUG:       LV: Found an early exit. Retrying with speculative exit count.
+; DEBUG-NEXT:  LV: Found speculative backedge taken count: 63
+; DEBUG-NEXT:  LV: We can vectorize this loop!
+; DEBUG-NEXT:  LV: Not vectorizing: Auto-vectorization of early exit loops is not yet supported.
+; CHECK-LABEL: define i64 @same_exit_block_pre_inc_use1(
+; CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P1:%.*]] = alloca [1024 x i8], align 4
+; CHECK-NEXT:    [[P2:%.*]] = alloca [1024 x i8], align 4
+; CHECK-NEXT:    call void @init_mem(ptr [[P1]], i64 1024)
+; CHECK-NEXT:    call void @init_mem(ptr [[P2]], i64 1024)
+; CHECK-NEXT:    br label [[LAND_RHS:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[FOR_INC:%.*]] ], [ 3, [[ENTRY:%.*]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[P1]], i64 [[INDEX]]
+; CHECK-NEXT:    [[TMP38:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT:    [[ARRAYIDX1:%.*]] = getelementptr inbounds i8, ptr [[P2]], i64 [[INDEX]]
+; CHECK-NEXT:    [[TMP39:%.*]] = load i8, ptr [[ARRAYIDX1]], align 1
+; CHECK-NEXT:    [[CMP3:%.*]] = icmp eq i8 [[TMP38]], [[TMP39]]
+; CHECK-NEXT:    br i1 [[CMP3]], label [[FOR_INC]], label [[FOR_END_LOOPEXIT:%.*]]
+; CHECK:       loop.inc:
+; CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp ne i64 [[INDEX_NEXT]], 67
+; CHECK-NEXT:    br i1 [[EXITCOND]], label [[LAND_RHS]], label [[FOR_END_LOOPEXIT]]
+; CHECK:       loop.end:
+; CHECK-NEXT:    [[START_0_LCSSA:%.*]] = phi i64 [ [[INDEX]], [[LAND_RHS]] ], [ 67, [[FOR_INC]] ]
+; CHECK-NEXT:    ret i64 [[START_0_LCSSA]]
+;
+entry:
+  %p1 = alloca [1024 x i8]
+  %p2 = alloca [1024 x i8]
+  call void @init_mem(ptr %p1, i64 1024)
+  call void @init_mem(ptr %p2, i64 1024)
+  br label %loop
+
+loop:
+  %index = phi i64 [ %index.next, %loop.inc ], [ 3, %entry ]
+  %arrayidx = getelementptr inbounds i8, ptr %p1, i64 %index
+  %ld1 = load i8, ptr %arrayidx, align 1
+  %arrayidx1 = getelementptr inbounds i8, ptr %p2, i64 %index
+  %ld2 = load i8, ptr %arrayidx1, align 1
+  %cmp3 = icmp eq i8 %ld1, %ld2
+  br i1 %cmp3, label %loop.inc, label %loop.end
+
+loop.inc:
+  %index.next = add i64 %index, 1
+  %exitcond = icmp ne i64 %index.next, 67
+  br i1 %exitcond, label %loop, label %loop.end
+
+loop.end:
+  %retval = phi i64 [ %index, %loop ], [ 67, %loop.inc ]
+  ret i64 %retval
+}
+
+
+define i64 @same_exit_block_pre_inc_use1_gep_two_indices() {
+; CHECK-LABEL: define i64 @same_exit_block_pre_inc_use1_gep_two_indices(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P1:%.*]] = alloca [1024 x i8], align 4
+; CHECK-NEXT:    [[P2:%.*]] = alloca [1024 x i8], align 4
+; CHECK-NEXT:    call void @init_mem(ptr [[P1]], i64 1024)
+; CHECK-NEXT:    call void @init_mem(ptr [[P2]], i64 1024)
+; CHECK-NEXT:    br label [[LOOP:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[LOOP_INC:%.*]] ], [ 3, [[ENTRY:%.*]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [1024 x i8], ptr [[P1]], i64 0, i64 [[INDEX]]
+; CHECK-NEXT:    [[LD1:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT:    [[ARRAYIDX1:%.*]] = getelementptr inbounds [1024 x i8], ptr [[P2]], i64 0, i64 [[INDEX]]
+; CHECK-NEXT:    [[LD2:%.*]] = load i8, ptr [[ARRAYIDX1]], align 1
+; CHECK-NEXT:    [[CMP3:%.*]] = icmp eq i8 [[LD1]], [[LD2]]
+; CHECK-NEXT:    br i1 [[CMP3]], label [[LOOP_INC]], label [[LOOP_END:%.*]]
+; CHECK:       loop.inc:
+; CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp ne i64 [[INDEX_NEXT]], 67
+; CHECK-NEXT:    br i1 [[EXITCOND]], label [[LOOP]], label [[LOOP_END]]
+; CHECK:       loop.end:
+; CHECK-NEXT:    [[RETVAL:%.*]] = phi i64 [ [[INDEX]], [[LOOP]] ], [ 67, [[LOOP_INC]] ]
+; CHECK-NEXT:    ret i64 [[RETVAL]]
+;
+entry:
+  %p1 = alloca [1024 x i8]
+  %p2 = alloca [1024 x i8]
+  call void @init_mem(ptr %p1, i64 1024)
+  call void @init_mem(ptr %p2, i64 1024)
+  br label %loop
+
+loop:
+  %index = phi i64 [ %index.next, %loop.inc ], [ 3, %entry ]
+  %arrayidx = getelementptr inbounds [1024 x i8], ptr %p1, i64 0, i64 %index
+  %ld1 = load i8, ptr %arrayidx, align 1
+  %arrayidx1 = getelementptr inbounds [1024 x i8], ptr %p2, i64 0, i64 %index
+  %ld2 = load i8, ptr %arrayidx1, align 1
+  %cmp3 = icmp eq i8 %ld1, %ld2
+  br i1 %cmp3, label %loop.inc, label %loop.end
+
+loop.inc:
+  %index.next = add i64 %index, 1
+  %exitcond = icmp ne i64 %index.next, 67
+  br i1 %exitcond, label %loop, label %loop.end
+
+loop.end:
+  %retval = phi i64 [ %index, %loop ], [ 67, %loop.inc ]
+  ret i64 %retval
+}
+
+
+define i64 @same_exit_block_pre_inc_use1_alloca_diff_type() {
+; CHECK-LABEL: define i64 @same_exit_block_pre_inc_use1_alloca_diff_type(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P1:%.*]] = alloca [40 x i32], align 4
+; CHECK-NEXT:    [[P2:%.*]] = alloca [40 x i32], align 4
+; CHECK-NEXT:    call void @init_mem(ptr [[P1]], i64 1024)
+; CHECK-NEXT:    call void @init_mem(ptr [[P2]], i64 1024)
+; CHECK-NEXT:    br label [[LAND_RHS:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[FOR_INC:%.*]] ], [ 3, [[ENTRY:%.*]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[P1]], i64 [[INDEX]]
+; CHECK-NEXT:    [[TMP38:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT:    [[ARRAYIDX1:%.*]] = getelementptr inbounds i8, ...
[truncated]

NumExitBlocks++;
}
}
if (NumExitBlocks > 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If BB1 ends with conditional BranchInst, it has two jumps. Both jumps can't go outside of the loop; otherwise, BB1 wouldn't belong to TheLoop. So, the only way BB1 can have multiple Exit Blocks is when it ends with SwitchInst. So, couldn't we check if BB1 ends with a switch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may be right at the moment that the only way this can happen is with a switch statement, but I can't 100% convince myself of that. :) That's why I wrote this in a more paranoid way by checking the number of blocks manually. What I could do is bail out early if the number of successors > 2, then add the single successor outside of the loop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch support has only be added recently, so would probably be good to add a test if possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I already have with @early_exit_has_multiple_outside_successors, unless you were thinking of a different test?

Comment on lines 483 to 484
/// Returns true if this is a supported early exit loop that we can
/// vectorize.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "supported" mean in this context? Would it suffice to say "Returns true if this is an early exit loop that can be vectorized".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

/// uncountable exiting block that is not the latch.
bool HasSpeculativeEarlyExit = false;

/// Keeps track of all the exits with known or countable exit-not-taken
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "known or uncountable"? Although if true then doesn't that cover the entire space and thus "Keeps track of all loop exits." would suffice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted the comment - hopefully makes more sense!

@@ -1070,6 +1070,50 @@ bool LoopVectorizationLegality::canVectorizeMemory() {
return false;
}

// For loops with uncountable early exiting blocks that are not the latch
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"that are not the latch" can be dropped because by definition the latch block cannot be early exiting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is true, because you can have loops like this:

while.cond:                                       ; preds = %while.body, %entry
  %i.0 = phi i32 [ %start, %entry ], [ %inc, %while.body ]
  %inc = add i32 %i.0, 1
  %cmp.not = icmp eq i32 %inc, %end
  br i1 %cmp.not, label %cleanup, label %while.body

while.body:                                       ; preds = %while.cond
  %idxprom = zext i32 %inc to i64
  %arrayidx = getelementptr inbounds i32, ptr %p, i64 %idxprom
  %0 = load i32, ptr %arrayidx, align 4, !tbaa !6
  %cmp1 = icmp eq i32 %0, 7
  br i1 %cmp1, label %cleanup, label %while.cond, !llvm.loop !10

cleanup:                                          ; preds = %while.cond, %while.body
  %retval.0 = phi i32 [ 1, %while.body ], [ 0, %while.cond ]
  ret i32 %retval.0

We do see code like this being generated from C:

unsigned foo(int *p, unsigned start, unsigned end) {
  unsigned i = start;
  while (++i != end)
    if (p[i] == 7)
      return 1;
  return 0;
}

In this case the latch block either jumps to the beginning of the loop or takes an early exit.

if (Call && getVectorIntrinsicIDForCall(Call, TLI))
continue;
if (Call) {
NumCalls++;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the use, is this too broad? For example, would it prevent loops that call sqrt intrinsics from being vectorisable? What about keeping a count of non-store instructions that can modify memory? This can be added to the code below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I was probably being overly cautious here, but I'm worried about more than just stores. The function call could have side-effects, such as changing system state or generating exceptions. I might be able to relax this to permit calls to intrinsics that have no side-effects, and do not write to memory. I'll see how easy this is to do - it would require adding new tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now makes me wonder about operations such as integer divides that may generate exceptions and whether or not such operations should be considered legal in an early exit block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can allow any instruction that may trigger UB or a trap in the header if it would be executed unconditionally in the vectorized loop, but this should probably be checked outside of LAA, as it is irrelevant for memory dependence analysis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It might make more sense to analyse all instructions in the early exiting block by checking for I.mayHaveSideEffects(), I.mayThrow(), I.mayReadOrWriteMemory(), etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted this code in favour of calling isSafeToSpeculativelyExecute from LoopVectorizationLegality. This covers more cases such as divides, while at the same time permitting calls to intrinsics such as sqrt.

Comment on lines 1106 to 1107
// The vectoriser does not yet handle loops that may fault, but this will
// be improved in a follow-on patch.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest "TODO: Handle loops that may fault."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you but my suggestion was to replace the original comment with the TODO.

Comment on lines 1095 to 1096
BasicBlock *LatchBB = TheLoop->getLoopLatch();
for (Instruction &I : *LatchBB) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a naive question but is the latch block the only one that needs to be check? Are there no other blocks between the existing an latch blocks?

I answered my own question but leaving comment here because it is referenced below.

Comment on lines 1502 to 1503
// Keep a record of all the exiting blocks with exact exit counts, as well as
// those with inexact counts.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, isn't this just all exiting blocks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

const SCEV *EC =
PSE.getSE()->getPredicatedExitCount(TheLoop, BB1, &Predicates);
if (isa<SCEVCouldNotCompute>(EC)) {
UncountableExitingBlocks.push_back(BB1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter if BB1 has no out-of-loop successors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's an exiting block at least one successor must be outside the loop. I've added an assert for this.

Comment on lines +1539 to +1508
// The only supported early exit loops so far are ones where the early
// exiting block is a unique predecessor of the latch block.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This answers my earlier question but that means the code in canVectorizeMemory is fragile. Can an assert be added to canVectorizeMemory to ensure this is the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 1549 to 1555
if (Reductions.size() || FixedOrderRecurrences.size()) {
reportVectorizationFailure(
"Found reductions or recurrences in early-exit loop",
"Cannot vectorize early exit loop with reductions or recurrences",
"RecurrencesInEarlyExitLoop", ORE, TheLoop);
return false;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a quick test that's worth having early before having to iterate across basic blocks etc...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// The vectoriser cannot handle loads that occur after the early exit block.
BasicBlock *LatchBB = TheLoop->getLoopLatch();
assert(LatchBB->getUniquePredecessor() ==
getUncountableExitingBlocks()[0] &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getSpeculativeEarlyExitingBlock()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 1106 to 1107
// The vectoriser does not yet handle loops that may fault, but this will
// be improved in a follow-on patch.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you but my suggestion was to replace the original comment with the TODO.

// The only supported early exit loops so far are ones where the early
// exiting block is a unique predecessor of the latch block.
BasicBlock *LatchPredBB = LatchBB->getUniquePredecessor();
if (!LatchPredBB || LatchPredBB != getUncountableExitingBlocks()[0]) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use getUncountableExitingBlocks()? Also, perhaps !LatchPredBB is redundant given it'll fail the != test anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 9806 to 9807
LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Auto-vectorization of early "
<< "exit loops is not yet supported.\n");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a reportVectorizationFailure required here until vectorisation is enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

SmallVector<const SCEVPredicate *, 4> Predicates;
for (BasicBlock *BB1 : ExitingBlocks) {
const SCEV *EC =
PSE.getSE()->getPredicatedExitCount(TheLoop, BB1, &Predicates);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given Predicates is not used, can you pass nullptr here? From what I can see all the functions getPredicatedExitCount represent support not caring about the predicates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's true because if the loop really does require predicates then getPredicatedExitCount will return SCEVCouldNotCompute for the exit count and we will incorrectly treat this as an uncountable exiting block. If I pass in nullptr here the simple_early_exit.ll test file starts failing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I didn't see this parameter controlling any code paths but I do see an assert that would fail without it.

What worried me was that you're clearing Predicates after iterating across all blocks and that made me assume it's contents didn't matter, but then I wondered why you need to manually clear it at all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment here or below why predicates is passed and dropped below?

// For loops with uncountable early exiting blocks that are not the latch
// it's necessary to perform extra checks, since the vectoriser is currently
// only capable of handling simple search loops.
if (IsEarlyExitLoop) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what others think but I see canVectorizeMemory as being about load/store address analysis rather than "the only place to analyse load/store instructions". In this case you don't care about the addresses, you just want to guard against introducing side effects which I think can all be contained in isVectorizableEarlyExitLoop?

It kind of feels odd for canVectorizeMemory to depend on work done by isVectorizableEarlyExitLoop, when the "reason to reject an instruction" is largely the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it would make sense to keep the checks together in isVectorizableEarlyExitLoop in particular with the extra checks recently added there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


for (auto *BB : TheLoop->blocks())
for (auto &I : *BB)
if (!IsSafeOperation(&I)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do early exiting loops introduce any special requirement here? I initially thought you'd restrict based on safe-to-speculate for only the blocks between the exit block and the latch block. Which, given the current restriction, means just the latch block along the same lines as why you disallow loads in the latch block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have to restrict for all blocks up to and including the early exit too, because any operation after the vector lane that triggers the exit could be unsafe. For example, see test @loop_contains_unsafe_div where the early exit block looks like this:

loop:
  %index = phi i64 [ %index.next, %loop.inc ], [ 3, %entry ]
  %arrayidx = getelementptr inbounds i8, ptr %p1, i64 %index
  %ld1 = load i32, ptr %arrayidx, align 1
  %div = udiv i32 20000, %ld1
  %cmp = icmp eq i32 %div, 1
  br i1 %cmp, label %loop.inc, label %loop.end

It could be lane 1 of the comparison that triggers the early exit, but lane 3 could trap with divide-by-zero. If this was a normal loop where the divide was conditionally executed, i.e. something like

  for (int i = 0; i < n; i++) {
    if (mask[i]) {
      dst[i] = 20000 / src[i];
    }
  }

then the vectoriser can at least create a vector expression for the mask and select between src and a vector of ones to avoid unnecessarily creating an exception. However, in the test above the vectoriser has to perform the divide first in order to know what the mask should be.

I suppose in future improvements if we can prove that the udiv can be safely moved after the vector comparison (i.e. because the comparison does not depend upon the udiv) then we might be able to create a mask and select between src and 1.

For now, I thought it best to avoid over-complicating the initial early exit vectorisation work by bailing out on edge cases like this.

Comment on lines +596 to +616
/// Keep track of all the loop exiting blocks.
SmallVector<BasicBlock *, 4> CountableExitingBlocks;
SmallVector<BasicBlock *, 4> UncountableExitingBlocks;

/// Keep track of the destinations of all uncountable exits.
SmallVector<BasicBlock *, 4> UncountableExitBlocks;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of the three only UncountableExitingBlocks looks to be used by this patch. I'd just like confirmation this it intentional and will be built upon in a following PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, these will be used in follow-on patches. However, if you prefer to keep this initial patch clean I can remove them and add them later. I was just trying to reduce the size of follow-on patches by moving it into this patch that's all.


/// Returns the destination of a speculative early exiting block.
BasicBlock *getSpeculativeEarlyExitBlock() const {
assert(getUncountableExitBlocks().size() == 1 &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both getSpeculativeEarlyExitBloc and getUncountableExitingBlocks assert that there's exactly one entry. Would it be simpler to store the pointers directly, rather than in a 1 element vector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could do that, and would be happy to do so if people prefer. I wrote it this way because in future I thought it's possible we may support multiple early exits. @paulwalker-arm @huntergr-arm any preference?

@@ -436,7 +470,7 @@ class LoopVectorizationLegality {
/// we read and write from memory. This method checks if it is
/// legal to vectorize the code, considering only memory constrains.
/// Returns true if the loop is vectorizable
bool canVectorizeMemory();
bool canVectorizeMemory(bool IsEarlyExitLoop);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// For loops with uncountable early exiting blocks that are not the latch
// it's necessary to perform extra checks, since the vectoriser is currently
// only capable of handling simple search loops.
if (IsEarlyExitLoop) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it would make sense to keep the checks together in isVectorizableEarlyExitLoop in particular with the extra checks recently added there

@@ -446,6 +480,9 @@ class LoopVectorizationLegality {
/// specific checks for outer loop vectorization.
bool canVectorizeOuterLoop();

/// Returns true if this is an early exit loop that can be vectorized.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be helpful to summarize the required conditions here (or somewhere else)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// Check all instructions in the loop to see if they could potentially
// generate exceptions or have side-effects.
auto IsSafeOperation = [](Instruction *I) -> bool {
// Is this a divide?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment doesn't seem to match the code below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully this reads better now!

getUncountableExitingBlocks()[0] &&
"Expected latch predecessor to be the early exiting block");

for (Instruction &I : *LatchBB) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any restrictions that there is only the latch after the early exit? If not, all blocks between early exit and latch would need to be check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if you look in isVectorizableEarlyExitLoop you'll see this code:

  // The only supported early exit loops so far are ones where the early
  // exiting block is a unique predecessor of the latch block.
  BasicBlock *LatchPredBB = LatchBB->getUniquePredecessor();
  if (LatchPredBB != getUncountableExitingBlocks()[0]) {
    reportVectorizationFailure("Early exit is not the latch predecessor",
                               "Cannot vectorize early exit loop",
                               "EarlyExitNotLatchPredecessor", ORE, TheLoop);
    return false;
  }

which is why a few lines above I am able to add this assert:

    assert(LatchBB->getUniquePredecessor() ==
               getUncountableExitingBlocks()[0] &&
           "Expected latch predecessor to be the early exiting block");

Although if I move the new code from canVectorizeMemory to isVectorizableEarlyExitLoop I can get rid of the assert.

I added this restriction as a simple optimisation to avoid messier CFGs, such as conditionally executed blocks after the early exit:

loop:
...
  br i1 %cmp, label %early.exit, label %loop.bb1

loop.bb1:
...
  br i1 %cmp2, label %do.stuff, label %loop.latch

do.stuff:
...
  br label %loop.latch

loop.latch:
...
  br i1 %cmp3, label %normal.exit, label %loop

In practice we can't yet permit any useful work after the early exit anyway until we can teach the vectoriser that subsequent blocks are effectively if-converted and require a mask, and the operations would need to occur before the early exit.

@@ -1442,6 +1487,126 @@ bool LoopVectorizationLegality::canVectorizeLoopNestCFG(
return Result;
}

bool LoopVectorizationLegality::isVectorizableEarlyExitLoop() {
// At least one of the exiting blocks must be the latch.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below just checks if there's a latch, not if it is exiting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I should move the comment lower down to the code where it's checked:

  if (isa<SCEVCouldNotCompute>(
          PSE.getSE()->getPredicatedExitCount(TheLoop, LatchBB, &Predicates))) {
    reportVectorizationFailure(
        "Cannot determine exact exit count for latch block",
        "Cannot vectorize early exit loop",
        "UnknownLatchExitCountEarlyExitLoop", ORE, TheLoop);
    return false;
  }

There is a test for this - see @early_exit_infinite_loop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

@paulwalker-arm paulwalker-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few final comments but otherwise this looks good to me.

SmallVector<const SCEVPredicate *, 4> Predicates;
for (BasicBlock *BB1 : ExitingBlocks) {
const SCEV *EC =
PSE.getSE()->getPredicatedExitCount(TheLoop, BB1, &Predicates);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I didn't see this parameter controlling any code paths but I do see an assert that would fail without it.

What worried me was that you're clearing Predicates after iterating across all blocks and that made me assume it's contents didn't matter, but then I wondered why you need to manually clear it at all?

This patch is split off from PR llvm#88385 and concerns only the code
related to the legality of vectorising early exit loops. It is the
first step in adding support for vectorisation of a simple class of
loops that typically involves searching for something, i.e.

  for (int i = 0; i < n; i++) {
    if (p[i] == val)
      return i;
  }
  return n;

or

  for (int i = 0; i < n; i++) {
    if (p1[i] != p2[i])
      return i;
  }
  return n;

In this initial commit LoopVectorizationLegality will only consider
early exit loops legal for vectorising if they follow these criteria:

1. There are no stores in the loop.
2. The loop must have only one early exit like those shown in the above
example. I have referred to such exits as speculative early exits, to
distinguish from existing support for early exits where the
exit-not-taken count is known exactly at compile time.
3. The early exit block dominates the latch block.
4. The latch block must have an exact exit count.
5. There are no loads after the early exit block.
6. The loop must not contain reductions or recurrences. I don't see
anything fundamental blocking vectorisation of such loops, but I just
haven't done the work to support them yet.
7. We must be able to prove at compile-time that loops will not contain
faulting loads.

Tests have been added here:

  Transforms/LoopVectorize/AArch64/simple_early_exit.ll
* Reverted LoopAccessAnalysis changes in favour of new code in
LoopVectorizationLegality.cpp that checks if instructions are
safe to speculatively execute.
* Updated some comments and done minor code refactoring.
* Added new tests for demonstrating under which circumstances
we can speculatively execute calls and divides.
* Added new test where the early exit is in a conditional block.
* Added new test containing store instructions.
* Added new test containing loads after early exit.
* Combined changes from canVectorizeMemory into
isVectorizableEarlyExitLoop.
* Documented restrictions for isVectorizableEarlyExitLoop in
header file.
* Added reportVectorizationFailure call in LoopVectorize when
we bail out.
* General code refactor, clean-up.
@david-arm
Copy link
Contributor Author

Rebased to fix the conflict (that wasn't a conflict) warning

* Remove code to check loads after early exit, since at the
moment this is redundant. We only vectorise if we can prove
that none of the loads will fault, so there is no harm in
speculatively executing. Any outside users of a load in the
latch block must have come from a normal exit and so we
will extract the last lane of the vectorised in the same way
as a normal loop.
* Re-wrote the test loop_contains_load_after_early_exit to
show it is legal to vectorise when the load in the latch
block is dereferenceable. The test also contains an outside
use of that load. At least once a patch lands to start
vectorising this loop it will be obvious whether the code
is correct or not.
* Updated a comment.
@@ -0,0 +1,1969 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
; RUN: opt -S < %s -p loop-vectorize -mattr=+sve -debug-only=loop-vectorize \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing target specific in this test at the moment, can it be moved out so we have test coverage for the legality checks independent of the target?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, seems reasonable. We will need a few SVE specific tests once we start vectorising these loops just to ensure the cost model decides it's worth vectorising, but that's fine!

* Move simple_early_exit.ll test file to top level directory and
changed command line to be target-independent. As a result of this
I had to update the tests due to some alignment differences.
@david-arm david-arm merged commit e762d4d into llvm:main Sep 19, 2024
8 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Sep 19, 2024

LLVM Buildbot has detected a new failure on builder clang-hip-vega20 running on hip-vega20-0 while building llvm at step 3 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/123/builds/5918

Here is the relevant piece of the build log for the reference
Step 3 (annotate) failure: '../llvm-zorg/zorg/buildbot/builders/annotated/hip-build.sh --jobs=' (failure)
...
[38/40] : && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/clang++ -O3 -DNDEBUG  External/HIP/CMakeFiles/memmove-hip-6.0.2.dir/memmove.hip.o -o External/HIP/memmove-hip-6.0.2  --rocm-path=/buildbot/Externals/hip/rocm-6.0.2 --hip-link -rtlib=compiler-rt -unwindlib=libgcc -frtlib-add-rpath && cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /usr/local/bin/cmake -E create_symlink /buildbot/llvm-test-suite/External/HIP/memmove.reference_output /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/memmove.reference_output-hip-6.0.2
[39/40] /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/clang++ -DNDEBUG  -O3 -DNDEBUG   -w -Werror=date-time --rocm-path=/buildbot/Externals/hip/rocm-6.0.2 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 -xhip -mfma -MD -MT External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o -MF External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o.d -o External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o -c /buildbot/llvm-test-suite/External/HIP/workload/ray-tracing/TheNextWeek/main.cc
[40/40] : && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/clang++ -O3 -DNDEBUG  External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o -o External/HIP/TheNextWeek-hip-6.0.2  --rocm-path=/buildbot/Externals/hip/rocm-6.0.2 --hip-link -rtlib=compiler-rt -unwindlib=libgcc -frtlib-add-rpath && cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /usr/local/bin/cmake -E create_symlink /buildbot/llvm-test-suite/External/HIP/TheNextWeek.reference_output /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/TheNextWeek.reference_output-hip-6.0.2
+ build_step 'Testing HIP test-suite'
+ echo '@@@BUILD_STEP Testing HIP test-suite@@@'
@@@BUILD_STEP Testing HIP test-suite@@@
+ ninja -v check-hip-simple
[0/1] cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/llvm-lit -sv empty-hip-6.0.2.test with-fopenmp-hip-6.0.2.test saxpy-hip-6.0.2.test memmove-hip-6.0.2.test InOneWeekend-hip-6.0.2.test TheNextWeek-hip-6.0.2.test blender.test
-- Testing: 7 tests, 7 workers --
Testing:  0.. 10.. 20.. 30.. 40
FAIL: test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test (4 of 7)
******************** TEST 'test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test' FAILED ********************

/buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/timeit-target --timeout 7200 --limit-core 0 --limit-cpu 7200 --limit-file-size 209715200 --limit-rss-size 838860800 --append-exitstatus --redirect-output /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out --redirect-input /dev/null --summary /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.time /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/InOneWeekend-hip-6.0.2
cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP ; /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out InOneWeekend.reference_output-hip-6.0.2

+ cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP
+ /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out InOneWeekend.reference_output-hip-6.0.2
/buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target: Comparison failed, textual difference between 'M' and 'i'

Input 1:
Memory access fault by GPU node-1 (Agent handle: 0x561b1190eac0) on address (nil). Reason: Page not present or supervisor privilege.
exit 134

Input 2:
image width = 1200 height = 675
block size = (16, 16) grid size = (75, 43)
Start rendering by GPU.
Done.
gpu.ppm and ref.ppm are the same.
exit 0

********************
/usr/bin/strip: /bin/bash.stripped: Bad file descriptor
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.. 
********************
Failed Tests (1):
  test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test


Testing Time: 336.24s

Total Discovered Tests: 7
  Passed: 6 (85.71%)
  Failed: 1 (14.29%)
FAILED: External/HIP/CMakeFiles/check-hip-simple-hip-6.0.2 
cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/llvm-lit -sv empty-hip-6.0.2.test with-fopenmp-hip-6.0.2.test saxpy-hip-6.0.2.test memmove-hip-6.0.2.test InOneWeekend-hip-6.0.2.test TheNextWeek-hip-6.0.2.test blender.test
ninja: build stopped: subcommand failed.
Step 12 (Testing HIP test-suite) failure: Testing HIP test-suite (failure)
@@@BUILD_STEP Testing HIP test-suite@@@
+ ninja -v check-hip-simple
[0/1] cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/llvm-lit -sv empty-hip-6.0.2.test with-fopenmp-hip-6.0.2.test saxpy-hip-6.0.2.test memmove-hip-6.0.2.test InOneWeekend-hip-6.0.2.test TheNextWeek-hip-6.0.2.test blender.test
-- Testing: 7 tests, 7 workers --
Testing:  0.. 10.. 20.. 30.. 40
FAIL: test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test (4 of 7)
******************** TEST 'test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test' FAILED ********************

/buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/timeit-target --timeout 7200 --limit-core 0 --limit-cpu 7200 --limit-file-size 209715200 --limit-rss-size 838860800 --append-exitstatus --redirect-output /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out --redirect-input /dev/null --summary /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.time /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/InOneWeekend-hip-6.0.2
cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP ; /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out InOneWeekend.reference_output-hip-6.0.2

+ cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP
+ /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out InOneWeekend.reference_output-hip-6.0.2
/buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target: Comparison failed, textual difference between 'M' and 'i'

Input 1:
Memory access fault by GPU node-1 (Agent handle: 0x561b1190eac0) on address (nil). Reason: Page not present or supervisor privilege.
exit 134

Input 2:
image width = 1200 height = 675
block size = (16, 16) grid size = (75, 43)
Start rendering by GPU.
Done.
gpu.ppm and ref.ppm are the same.
exit 0

********************
/usr/bin/strip: /bin/bash.stripped: Bad file descriptor
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.. 
********************
Failed Tests (1):
  test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test


Testing Time: 336.24s

Total Discovered Tests: 7
  Passed: 6 (85.71%)
  Failed: 1 (14.29%)
FAILED: External/HIP/CMakeFiles/check-hip-simple-hip-6.0.2 
cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/llvm-lit -sv empty-hip-6.0.2.test with-fopenmp-hip-6.0.2.test saxpy-hip-6.0.2.test memmove-hip-6.0.2.test InOneWeekend-hip-6.0.2.test TheNextWeek-hip-6.0.2.test blender.test
ninja: build stopped: subcommand failed.
program finished with exit code 1
elapsedTime=454.757562

@@ -0,0 +1,1941 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need REQUIRES:asserts I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, you're right! Good spot. I'll fix asap.


/// Indicates whether this loop has a speculative early exit, i.e. an
/// uncountable exiting block that is not the latch.
bool HasSpeculativeEarlyExit = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, with SCEV terminology and the variables below, Speculative -> Uncountable?

/// uncountable exiting block that is not the latch.
bool HasSpeculativeEarlyExit = false;

/// Keep track of all the loop exiting blocks.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment accurate? AFAICT they only keep track of a single countable and uncountable exiting block, if the backedge taken count is not computable.

SmallVector<const SCEVPredicate *, 4> Predicates;
for (BasicBlock *BB1 : ExitingBlocks) {
const SCEV *EC =
PSE.getSE()->getPredicatedExitCount(TheLoop, BB1, &Predicates);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment here or below why predicates is passed and dropped below?

}

// The latch block must have a countable exit.
if (isa<SCEVCouldNotCompute>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move simpler checks earlier before iterating over the whole loop? Can also check if LatchBB is in the collected countable exiting blocks?


LLVM_DEBUG(
dbgs()
<< "LV: Found an early exit. Retrying with speculative exit count.\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to be consistent with SCEV terminology, i.e. symbolic max BTC. Or more specifically the computable exiting count for the loop latch, as with the current restrictions getSymbolicMaxBackedgeTakenCount() == >getPredicatedExitCount(TheLoop, Latch)?

return false;
}

BasicBlock *BB2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clearer to name ExitBlock?

@david-arm
Copy link
Contributor Author

Hi @fhahn, I'll take a look at addressing your comments in a follow-up patch. I had used the term speculative for things like HasSpeculativeEarlyExit to try and indicate that even the loop has to have more than just an uncountable early exit, because the exit count from the latch needs to be countable too in order for this flag to be true. But perhaps that's unnecessary and a bit pedantic and I think I'm happy to rename for consistency.

david-arm added a commit to david-arm/llvm-project that referenced this pull request Sep 19, 2024
* Rename Speculative -> Uncountable and update tests.
* Add comments explaining why it's safe to ignore the predicates
when building up a list of exiting blocks.
* Reshuffle some code to do (hopefully) cheaper checks first.
tmsri pushed a commit to tmsri/llvm-project that referenced this pull request Sep 19, 2024
llvm#107004)

This patch is split off from PR llvm#88385 and concerns only the code
related to the legality of vectorising early exit loops. It is the first
step in adding support for vectorisation of a simple class of loops that
typically involves searching for something, i.e.

  for (int i = 0; i < n; i++) {
    if (p[i] == val)
      return i;
  }
  return n;

or

  for (int i = 0; i < n; i++) {
    if (p1[i] != p2[i])
      return i;
  }
  return n;

In this initial commit LoopVectorizationLegality will only consider
early exit loops legal for vectorising if they follow these criteria:

1. There are no stores in the loop.
2. The loop must have only one early exit like those shown in the above
example. I have referred to such exits as speculative early exits, to
distinguish from existing support for early exits where the
exit-not-taken count is known exactly at compile time.
3. The early exit block dominates the latch block.
4. The latch block must have an exact exit count.
5. There are no loads after the early exit block.
6. The loop must not contain reductions or recurrences. I don't see
anything fundamental blocking vectorisation of such loops, but I just
haven't done the work to support them yet.
7. We must be able to prove at compile-time that loops will not contain
faulting loads.

Tests have been added here:

  Transforms/LoopVectorize/AArch64/simple_early_exit.ll
david-arm added a commit to david-arm/llvm-project that referenced this pull request Sep 20, 2024
This patch follows on from PR llvm#107004 by adding support for
vectorisation of a simple class of loops that typically involves
searching for something, i.e.

  for (int i = 0; i < n; i++) {
    if (p[i] == val)
      return i;
  }
  return n;

or

  for (int i = 0; i < n; i++) {
    if (p1[i] != p2[i])
      return i;
  }
  return n;

In this initial commit we will only vectorise early exit loops legal
if they follow these criteria:

1. There are no stores in the loop.
2. The loop must have only one early uncountable exit like those shown
in the above example.
3. The early exit block dominates the latch block.
4. The latch block must have an exact exit count.
6. The loop must not contain reductions or recurrences.
7. We must be able to prove at compile-time that loops will not contain
faulting loads.

For point 7 once this patch lands I intend to follow up by supporting
some limited cases of faulting loops where we can version the loop based
on pointer alignment. For example, it turns out in the SPEC2017 benchmark
(xalancbmk) there is a std::find loop that we can vectorise provided we
add SCEV checks for the initial pointer being aligned to a multiple of
the VF. In practice, the pointer is regularly aligned to at least 32/64
bytes and since the VF is a power of 2, any vector loads <= 32/64 bytes
in size will always fault on the first lane, following the same behaviour
as the scalar loop. Given we already do such speculative versioning for
loops with unknown strides, alignment-based versioning doesn't seem to be
any worse at least for loops with only one load.

This patch makes use of the existing experimental_cttz_elems intrinsic
that's required in the vectorised early exit block to determine the first
lane that triggered the exit. This intrinsic has generic lowering support
so it's guaranteed to work for all targets.

Tests have been updated here:

Transforms/LoopVectorize/simple_early_exit.ll
david-arm added a commit to david-arm/llvm-project that referenced this pull request Sep 23, 2024
* Rename Speculative -> Uncountable and update tests.
* Add comments explaining why it's safe to ignore the predicates
when building up a list of exiting blocks.
* Reshuffle some code to do (hopefully) cheaper checks first.
david-arm added a commit that referenced this pull request Sep 23, 2024
)

* Rename Speculative -> Uncountable and update tests.
* Add comments explaining why it's safe to ignore the predicates when
building up a list of exiting blocks.
* Reshuffle some code to do (hopefully) cheaper checks first.
david-arm added a commit to david-arm/llvm-project that referenced this pull request Sep 23, 2024
This patch follows on from PR llvm#107004 by adding support for
vectorisation of a simple class of loops that typically involves
searching for something, i.e.

  for (int i = 0; i < n; i++) {
    if (p[i] == val)
      return i;
  }
  return n;

or

  for (int i = 0; i < n; i++) {
    if (p1[i] != p2[i])
      return i;
  }
  return n;

In this initial commit we will only vectorise early exit loops legal
if they follow these criteria:

1. There are no stores in the loop.
2. The loop must have only one early uncountable exit like those shown
in the above example.
3. The early exit block dominates the latch block.
4. The latch block must have an exact exit count.
6. The loop must not contain reductions or recurrences.
7. We must be able to prove at compile-time that loops will not contain
faulting loads.

For point 7 once this patch lands I intend to follow up by supporting
some limited cases of faulting loops where we can version the loop based
on pointer alignment. For example, it turns out in the SPEC2017 benchmark
(xalancbmk) there is a std::find loop that we can vectorise provided we
add SCEV checks for the initial pointer being aligned to a multiple of
the VF. In practice, the pointer is regularly aligned to at least 32/64
bytes and since the VF is a power of 2, any vector loads <= 32/64 bytes
in size will always fault on the first lane, following the same behaviour
as the scalar loop. Given we already do such speculative versioning for
loops with unknown strides, alignment-based versioning doesn't seem to be
any worse at least for loops with only one load.

This patch makes use of the existing experimental_cttz_elems intrinsic
that's required in the vectorised early exit block to determine the first
lane that triggered the exit. This intrinsic has generic lowering support
so it's guaranteed to work for all targets.

Tests have been updated here:

Transforms/LoopVectorize/simple_early_exit.ll
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
…lvm#109300)

* Rename Speculative -> Uncountable and update tests.
* Add comments explaining why it's safe to ignore the predicates when
building up a list of exiting blocks.
* Reshuffle some code to do (hopefully) cheaper checks first.
david-arm added a commit to david-arm/llvm-project that referenced this pull request Sep 27, 2024
This patch follows on from PR llvm#107004 by adding support for
vectorisation of a simple class of loops that typically involves
searching for something, i.e.

  for (int i = 0; i < n; i++) {
    if (p[i] == val)
      return i;
  }
  return n;

or

  for (int i = 0; i < n; i++) {
    if (p1[i] != p2[i])
      return i;
  }
  return n;

In this initial commit we will only vectorise early exit loops legal
if they follow these criteria:

1. There are no stores in the loop.
2. The loop must have only one early uncountable exit like those shown
in the above example.
3. The early exit block dominates the latch block.
4. The latch block must have an exact exit count.
6. The loop must not contain reductions or recurrences.
7. We must be able to prove at compile-time that loops will not contain
faulting loads.

For point 7 once this patch lands I intend to follow up by supporting
some limited cases of faulting loops where we can version the loop based
on pointer alignment. For example, it turns out in the SPEC2017 benchmark
(xalancbmk) there is a std::find loop that we can vectorise provided we
add SCEV checks for the initial pointer being aligned to a multiple of
the VF. In practice, the pointer is regularly aligned to at least 32/64
bytes and since the VF is a power of 2, any vector loads <= 32/64 bytes
in size will always fault on the first lane, following the same behaviour
as the scalar loop. Given we already do such speculative versioning for
loops with unknown strides, alignment-based versioning doesn't seem to be
any worse at least for loops with only one load.

This patch makes use of the existing experimental_cttz_elems intrinsic
that's required in the vectorised early exit block to determine the first
lane that triggered the exit. This intrinsic has generic lowering support
so it's guaranteed to work for all targets.

Tests have been updated here:

Transforms/LoopVectorize/simple_early_exit.ll
@david-arm david-arm deleted the ee_legal branch October 3, 2024 08:53
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
…lvm#109300)

* Rename Speculative -> Uncountable and update tests.
* Add comments explaining why it's safe to ignore the predicates when
building up a list of exiting blocks.
* Reshuffle some code to do (hopefully) cheaper checks first.
david-arm added a commit to david-arm/llvm-project that referenced this pull request Oct 8, 2024
This patch follows on from PR llvm#107004 by adding support for
vectorisation of a simple class of loops that typically involves
searching for something, i.e.

  for (int i = 0; i < n; i++) {
    if (p[i] == val)
      return i;
  }
  return n;

or

  for (int i = 0; i < n; i++) {
    if (p1[i] != p2[i])
      return i;
  }
  return n;

In this initial commit we will only vectorise early exit loops legal
if they follow these criteria:

1. There are no stores in the loop.
2. The loop must have only one early uncountable exit like those shown
in the above example.
3. The early exit block dominates the latch block.
4. The latch block must have an exact exit count.
6. The loop must not contain reductions or recurrences.
7. We must be able to prove at compile-time that loops will not contain
faulting loads.

For point 7 once this patch lands I intend to follow up by supporting
some limited cases of faulting loops where we can version the loop based
on pointer alignment. For example, it turns out in the SPEC2017 benchmark
(xalancbmk) there is a std::find loop that we can vectorise provided we
add SCEV checks for the initial pointer being aligned to a multiple of
the VF. In practice, the pointer is regularly aligned to at least 32/64
bytes and since the VF is a power of 2, any vector loads <= 32/64 bytes
in size will always fault on the first lane, following the same behaviour
as the scalar loop. Given we already do such speculative versioning for
loops with unknown strides, alignment-based versioning doesn't seem to be
any worse at least for loops with only one load.

This patch makes use of the existing experimental_cttz_elems intrinsic
that's required in the vectorised early exit block to determine the first
lane that triggered the exit. This intrinsic has generic lowering support
so it's guaranteed to work for all targets.

Tests have been updated here:

Transforms/LoopVectorize/simple_early_exit.ll
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants