
[ValueTracking] Add dominating condition support in computeKnownBits() #73662

Merged (2 commits, Dec 6, 2023)

Conversation

@nikic nikic commented Nov 28, 2023

This adds support for using dominating conditions in computeKnownBits() when called from InstCombine. The implementation uses a DomConditionCache, which stores which branches may provide information that is relevant for a given value.

DomConditionCache is similar to AssumptionCache, but does not try to do any kind of automatic tracking. Relevant branches have to be explicitly registered and invalidated values explicitly removed. The necessary tracking is done inside InstCombine.

The reason this doesn't simply mirror AssumptionCache is that far more transforms touch branches and branch conditions than touch assumptions. AssumptionCache is an immutable analysis and mostly gets away with that because only a handful of places have to register additional assumptions (mostly as a result of cloning). This is very much not the case for branches.

This change has some non-trivial impact on compile-time. The first-order impact is about 0.2%. This is the impact if we compute the KnownBits, but don't use them. The second-order impact is about 0.1% -- however, on stage2-O0-g it's actually a -0.2% improvement, which indicates that this change results in additional optimizations inside clang itself.

Fixes #74242.

@nikic nikic requested review from dtcxzyw and goldsteinn November 28, 2023 15:45
@llvmbot llvmbot added clang Clang issues not falling into any other category PGO Profile Guided Optimizations llvm:analysis llvm:transforms labels Nov 28, 2023
llvmbot commented Nov 28, 2023

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-pgo
@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-clang

Author: Nikita Popov (nikic)



Patch is 89.02 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/73662.diff

41 Files Affected:

  • (modified) clang/test/CodeGenCXX/RelativeVTablesABI/member-function-pointer.cpp (+1-1)
  • (added) llvm/include/llvm/Analysis/DomConditionCache.h (+56)
  • (modified) llvm/include/llvm/Analysis/SimplifyQuery.h (+4-2)
  • (modified) llvm/include/llvm/Transforms/InstCombine/InstCombiner.h (+5-1)
  • (modified) llvm/lib/Analysis/CMakeLists.txt (+1)
  • (added) llvm/lib/Analysis/DomConditionCache.cpp (+74)
  • (modified) llvm/lib/Analysis/ValueTracking.cpp (+28-3)
  • (modified) llvm/lib/Transforms/InstCombine/InstCombineInternal.h (+1)
  • (modified) llvm/lib/Transforms/InstCombine/InstructionCombining.cpp (+1)
  • (modified) llvm/test/CodeGen/BPF/loop-exit-cond.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/2007-10-31-RangeCrash.ll (+5-2)
  • (modified) llvm/test/Transforms/InstCombine/2009-02-20-InstCombine-SROA.ll (+2-2)
  • (modified) llvm/test/Transforms/InstCombine/cast_phi.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/icmp-binop.ll (+1-2)
  • (modified) llvm/test/Transforms/InstCombine/icmp-dom.ll (+3-2)
  • (modified) llvm/test/Transforms/InstCombine/icmp-mul-zext.ll (+1-2)
  • (modified) llvm/test/Transforms/InstCombine/icmp-ne-pow2.ll (+4-8)
  • (modified) llvm/test/Transforms/InstCombine/icmp-of-or-x.ll (+2-6)
  • (modified) llvm/test/Transforms/InstCombine/known-non-zero.ll (+4-6)
  • (modified) llvm/test/Transforms/InstCombine/minmax-of-xor-x.ll (+1-4)
  • (modified) llvm/test/Transforms/InstCombine/shift.ll (+3-5)
  • (modified) llvm/test/Transforms/InstCombine/sink_instruction.ll (+5-1)
  • (modified) llvm/test/Transforms/InstCombine/sub-of-negatible-inseltpoison.ll (+3-4)
  • (modified) llvm/test/Transforms/InstCombine/sub-of-negatible.ll (+3-4)
  • (modified) llvm/test/Transforms/LoopUnroll/peel-loop-inner.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopUnroll/peel-loop.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/ARM/mve-reductions.ll (+20-20)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/small-size.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll (+20-20)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-store-accesses-with-gaps.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/float-induction.ll (+20-20)
  • (modified) llvm/test/Transforms/LoopVectorize/if-conversion-nest.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/induction.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction.ll (+11-11)
  • (modified) llvm/test/Transforms/LoopVectorize/runtime-check.ll (+1-1)
  • (modified) llvm/test/Transforms/PGOProfile/chr.ll (+3-3)
  • (modified) llvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll (+1-1)
  • (modified) llvm/test/Transforms/SimpleLoopUnswitch/2007-08-01-LCSSA.ll (+5-2)
  • (modified) llvm/test/Transforms/SimplifyCFG/merge-cond-stores.ll (+2-3)
diff --git a/clang/test/CodeGenCXX/RelativeVTablesABI/member-function-pointer.cpp b/clang/test/CodeGenCXX/RelativeVTablesABI/member-function-pointer.cpp
index 24f884a8d23befb..000568b3b6bf09c 100644
--- a/clang/test/CodeGenCXX/RelativeVTablesABI/member-function-pointer.cpp
+++ b/clang/test/CodeGenCXX/RelativeVTablesABI/member-function-pointer.cpp
@@ -14,7 +14,7 @@
 
 // The loading of the virtual function here should be replaced with a llvm.load.relative() call.
 // CHECK-NEXT:   [[vtable:%.+]] = load ptr, ptr [[this_adj]], align 8
-// CHECK-NEXT:   [[offset:%.+]] = add i64 [[fn_ptr]], -1
+// CHECK-NEXT:   [[offset:%.+]] = add nsw i64 [[fn_ptr]], -1
 // CHECK-NEXT:   [[ptr:%.+]] = tail call ptr @llvm.load.relative.i64(ptr [[vtable]], i64 [[offset]])
 // CHECK-NEXT:   br label %[[memptr_end:.+]]
 // CHECK:      [[nonvirt]]:
diff --git a/llvm/include/llvm/Analysis/DomConditionCache.h b/llvm/include/llvm/Analysis/DomConditionCache.h
new file mode 100644
index 000000000000000..68e34f773af8ae8
--- /dev/null
+++ b/llvm/include/llvm/Analysis/DomConditionCache.h
@@ -0,0 +1,56 @@
+//===- llvm/Analysis/DomConditionCache.h ------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Cache for branch conditions that affect a certain value for use by
+// ValueTracking. Unlike AssumptionCache, this class does not perform any
+// automatic analysis or invalidation. The caller is responsible for registering
+// all relevant branches (and re-registering them if they change), and for
+// removing invalidated values from the cache.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_ANALYSIS_DOMCONDITIONCACHE_H
+#define LLVM_ANALYSIS_DOMCONDITIONCACHE_H
+
+#include "llvm/ADT/ArrayRef.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseMapInfo.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/IR/ValueHandle.h"
+
+namespace llvm {
+
+class Value;
+class BranchInst;
+
+class DomConditionCache {
+private:
+  /// A map of values about which a branch might be providing information.
+  using AffectedValuesMap = DenseMap<Value *, SmallVector<BranchInst *, 1>>;
+  AffectedValuesMap AffectedValues;
+
+public:
+  /// Add a branch condition to the cache.
+  void registerBranch(BranchInst *BI);
+
+  /// Remove a value from the cache, e.g. because it will be erased.
+  void removeValue(Value *V) { AffectedValues.erase(V); }
+
+  /// Access the list of branches which affect this value.
+  ArrayRef<BranchInst *> conditionsFor(const Value *V) const {
+    auto AVI = AffectedValues.find_as(const_cast<Value *>(V));
+    if (AVI == AffectedValues.end())
+      return ArrayRef<BranchInst *>();
+
+    return AVI->second;
+  }
+};
+
+} // end namespace llvm
+
+#endif // LLVM_ANALYSIS_DOMCONDITIONCACHE_H
diff --git a/llvm/include/llvm/Analysis/SimplifyQuery.h b/llvm/include/llvm/Analysis/SimplifyQuery.h
index f9cc3029221d679..e5e6ae0d3d8e3e8 100644
--- a/llvm/include/llvm/Analysis/SimplifyQuery.h
+++ b/llvm/include/llvm/Analysis/SimplifyQuery.h
@@ -14,6 +14,7 @@
 namespace llvm {
 
 class AssumptionCache;
+class DomConditionCache;
 class DominatorTree;
 class TargetLibraryInfo;
 
@@ -62,6 +63,7 @@ struct SimplifyQuery {
   const DominatorTree *DT = nullptr;
   AssumptionCache *AC = nullptr;
   const Instruction *CxtI = nullptr;
+  const DomConditionCache *DC = nullptr;
 
   // Wrapper to query additional information for instructions like metadata or
   // keywords like nsw, which provides conservative results if those cannot
@@ -80,8 +82,8 @@ struct SimplifyQuery {
                 const DominatorTree *DT = nullptr,
                 AssumptionCache *AC = nullptr,
                 const Instruction *CXTI = nullptr, bool UseInstrInfo = true,
-                bool CanUseUndef = true)
-      : DL(DL), TLI(TLI), DT(DT), AC(AC), CxtI(CXTI), IIQ(UseInstrInfo),
+                bool CanUseUndef = true, const DomConditionCache *DC = nullptr)
+      : DL(DL), TLI(TLI), DT(DT), AC(AC), CxtI(CXTI), DC(DC), IIQ(UseInstrInfo),
         CanUseUndef(CanUseUndef) {}
 
   SimplifyQuery(const DataLayout &DL, const DominatorTree *DT,
diff --git a/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h b/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
index 160fc2ebe493521..7ba4b49932c1a89 100644
--- a/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
+++ b/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
@@ -18,6 +18,7 @@
 #ifndef LLVM_TRANSFORMS_INSTCOMBINE_INSTCOMBINER_H
 #define LLVM_TRANSFORMS_INSTCOMBINE_INSTCOMBINER_H
 
+#include "llvm/Analysis/DomConditionCache.h"
 #include "llvm/Analysis/InstructionSimplify.h"
 #include "llvm/Analysis/TargetFolder.h"
 #include "llvm/Analysis/ValueTracking.h"
@@ -76,6 +77,7 @@ class LLVM_LIBRARY_VISIBILITY InstCombiner {
   OptimizationRemarkEmitter &ORE;
   BlockFrequencyInfo *BFI;
   ProfileSummaryInfo *PSI;
+  DomConditionCache DC;
 
   // Optional analyses. When non-null, these can both be used to do better
   // combining and will be updated to reflect any changes.
@@ -98,7 +100,9 @@ class LLVM_LIBRARY_VISIBILITY InstCombiner {
                const DataLayout &DL, LoopInfo *LI)
       : TTI(TTI), Builder(Builder), Worklist(Worklist),
         MinimizeSize(MinimizeSize), AA(AA), AC(AC), TLI(TLI), DT(DT), DL(DL),
-        SQ(DL, &TLI, &DT, &AC), ORE(ORE), BFI(BFI), PSI(PSI), LI(LI) {}
+        SQ(DL, &TLI, &DT, &AC, nullptr, /*UseInstrInfo*/ true,
+           /*CanUseUndef*/ true, &DC),
+        ORE(ORE), BFI(BFI), PSI(PSI), LI(LI) {}
 
   virtual ~InstCombiner() = default;
 
diff --git a/llvm/lib/Analysis/CMakeLists.txt b/llvm/lib/Analysis/CMakeLists.txt
index 9d8c9cfda66c921..34ff6bb74c106f4 100644
--- a/llvm/lib/Analysis/CMakeLists.txt
+++ b/llvm/lib/Analysis/CMakeLists.txt
@@ -55,6 +55,7 @@ add_llvm_component_library(LLVMAnalysis
   DependenceAnalysis.cpp
   DependenceGraphBuilder.cpp
   DevelopmentModeInlineAdvisor.cpp
+  DomConditionCache.cpp
   DomPrinter.cpp
   DomTreeUpdater.cpp
   DominanceFrontier.cpp
diff --git a/llvm/lib/Analysis/DomConditionCache.cpp b/llvm/lib/Analysis/DomConditionCache.cpp
new file mode 100644
index 000000000000000..c47dcf6767a8ede
--- /dev/null
+++ b/llvm/lib/Analysis/DomConditionCache.cpp
@@ -0,0 +1,74 @@
+//===- DomConditionCache.cpp ----------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Analysis/DomConditionCache.h"
+#include "llvm/IR/PatternMatch.h"
+
+using namespace llvm;
+using namespace llvm::PatternMatch;
+
+// TODO: This code is very similar to findAffectedValues() in
+// AssumptionCache, but currently specialized to just the patterns that
+// computeKnownBits() supports, and without the notion of result elem indices
+// that are AC specific. Deduplicate this code once we have a clearer picture
+// of how much they can be shared.
+static void findAffectedValues(Value *Cond,
+                               SmallVectorImpl<Value *> &Affected) {
+  auto AddAffected = [&Affected](Value *V) {
+    if (isa<Argument>(V) || isa<GlobalValue>(V)) {
+      Affected.push_back(V);
+    } else if (auto *I = dyn_cast<Instruction>(V)) {
+      Affected.push_back(I);
+
+      // Peek through unary operators to find the source of the condition.
+      Value *Op;
+      if (match(I, m_PtrToInt(m_Value(Op)))) {
+        if (isa<Instruction>(Op) || isa<Argument>(Op))
+          Affected.push_back(Op);
+      }
+    }
+  };
+
+  ICmpInst::Predicate Pred;
+  Value *A;
+  Constant *C;
+  if (match(Cond, m_ICmp(Pred, m_Value(A), m_Constant(C)))) {
+    AddAffected(A);
+
+    if (Pred == ICmpInst::ICMP_EQ) {
+      Value *X;
+      // (X & C) or (X | C) or (X ^ C).
+      // (X << C) or (X >>_s C) or (X >>_u C).
+      if (match(A, m_BitwiseLogic(m_Value(X), m_ConstantInt())) ||
+          match(A, m_Shift(m_Value(X), m_ConstantInt())))
+        AddAffected(X);
+    } else if (Pred == ICmpInst::ICMP_NE) {
+      Value *X;
+      // Handle (X & pow2 != 0).
+      if (match(A, m_And(m_Value(X), m_Power2())) && match(C, m_Zero()))
+        AddAffected(X);
+    } else if (Pred == ICmpInst::ICMP_ULT) {
+      Value *X;
+      // Handle (A + C1) u< C2, which is the canonical form of A > C3 && A < C4,
+      // and recognized by LVI at least.
+      if (match(A, m_Add(m_Value(X), m_ConstantInt())))
+        AddAffected(X);
+    }
+  }
+}
+
+void DomConditionCache::registerBranch(BranchInst *BI) {
+  assert(BI->isConditional() && "Must be conditional branch");
+  SmallVector<Value *, 16> Affected;
+  findAffectedValues(BI->getCondition(), Affected);
+  for (Value *V : Affected) {
+    auto &AV = AffectedValues[V];
+    if (!is_contained(AV, BI))
+      AV.push_back(BI);
+  }
+}
diff --git a/llvm/lib/Analysis/ValueTracking.cpp b/llvm/lib/Analysis/ValueTracking.cpp
index ccb5ae3ba0a11dd..57b29a92ff61a43 100644
--- a/llvm/lib/Analysis/ValueTracking.cpp
+++ b/llvm/lib/Analysis/ValueTracking.cpp
@@ -26,6 +26,7 @@
 #include "llvm/Analysis/AssumeBundleQueries.h"
 #include "llvm/Analysis/AssumptionCache.h"
 #include "llvm/Analysis/ConstantFolding.h"
+#include "llvm/Analysis/DomConditionCache.h"
 #include "llvm/Analysis/GuardUtils.h"
 #include "llvm/Analysis/InstructionSimplify.h"
 #include "llvm/Analysis/Loads.h"
@@ -713,9 +714,33 @@ static void computeKnownBitsFromCmp(const Value *V, CmpInst::Predicate Pred,
 
 void llvm::computeKnownBitsFromContext(const Value *V, KnownBits &Known,
                                       unsigned Depth, const SimplifyQuery &Q) {
-  // Use of assumptions is context-sensitive. If we don't have a context, we
-  // cannot use them!
-  if (!Q.AC || !Q.CxtI)
+  if (!Q.CxtI)
+    return;
+
+  if (Q.DC && Q.DT) {
+    // Handle dominating conditions.
+    for (BranchInst *BI : Q.DC->conditionsFor(V)) {
+      auto *Cmp = dyn_cast<ICmpInst>(BI->getCondition());
+      if (!Cmp)
+        continue;
+
+      BasicBlockEdge Edge0(BI->getParent(), BI->getSuccessor(0));
+      if (Q.DT->dominates(Edge0, Q.CxtI->getParent()))
+        computeKnownBitsFromCmp(V, Cmp->getPredicate(), Cmp->getOperand(0),
+                                Cmp->getOperand(1), Known, Depth, Q);
+
+      BasicBlockEdge Edge1(BI->getParent(), BI->getSuccessor(1));
+      if (Q.DT->dominates(Edge1, Q.CxtI->getParent()))
+        computeKnownBitsFromCmp(V, Cmp->getInversePredicate(),
+                                Cmp->getOperand(0), Cmp->getOperand(1), Known,
+                                Depth, Q);
+    }
+
+    if (Known.hasConflict())
+      Known.resetAll();
+  }
+
+  if (!Q.AC)
     return;
 
   unsigned BitWidth = Known.getBitWidth();
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineInternal.h b/llvm/lib/Transforms/InstCombine/InstCombineInternal.h
index 0bbb22be71569f6..aecb33db1031918 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineInternal.h
+++ b/llvm/lib/Transforms/InstCombine/InstCombineInternal.h
@@ -459,6 +459,7 @@ class LLVM_LIBRARY_VISIBILITY InstCombinerImpl final
     // use counts.
     SmallVector<Value *> Ops(I.operands());
     Worklist.remove(&I);
+    DC.removeValue(&I);
     I.eraseFromParent();
     for (Value *Op : Ops)
       Worklist.handleUseCountDecrement(Op);
diff --git a/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
index e6088541349784b..aadebcb598c383c 100644
--- a/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
@@ -3070,6 +3070,7 @@ Instruction *InstCombinerImpl::visitBranchInst(BranchInst &BI) {
     return nullptr;
   }
 
+  DC.registerBranch(&BI);
   return nullptr;
 }
 
diff --git a/llvm/test/CodeGen/BPF/loop-exit-cond.ll b/llvm/test/CodeGen/BPF/loop-exit-cond.ll
index 7666d961753ac40..ff34432009a1117 100644
--- a/llvm/test/CodeGen/BPF/loop-exit-cond.ll
+++ b/llvm/test/CodeGen/BPF/loop-exit-cond.ll
@@ -43,7 +43,7 @@ define dso_local i32 @test(i32 %len, ptr %data) #0 {
 ; CHECK-NEXT:    store i64 [[CONV2]], ptr [[D]], align 8, !tbaa [[TBAA6:![0-9]+]]
 ; CHECK-NEXT:    call void @foo(ptr nonnull @.str, i32 [[I_05]], ptr nonnull [[D]]) #[[ATTR3]]
 ; CHECK-NEXT:    call void @llvm.lifetime.end.p0(i64 8, ptr nonnull [[D]]) #[[ATTR3]]
-; CHECK-NEXT:    [[INC]] = add nuw nsw i32 [[I_05]], 1
+; CHECK-NEXT:    [[INC]] = add nuw i32 [[I_05]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC]], [[LEN]]
 ; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[IF_END]], label [[FOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
 ; CHECK:       if.end:
diff --git a/llvm/test/Transforms/InstCombine/2007-10-31-RangeCrash.ll b/llvm/test/Transforms/InstCombine/2007-10-31-RangeCrash.ll
index f2548f6f8c8c8c3..8b472aa5af09024 100644
--- a/llvm/test/Transforms/InstCombine/2007-10-31-RangeCrash.ll
+++ b/llvm/test/Transforms/InstCombine/2007-10-31-RangeCrash.ll
@@ -1,5 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
-; RUN: opt < %s -S -passes=instcombine | FileCheck %s
+; RUN: opt < %s -S -passes='instcombine<no-verify-fixpoint>' | FileCheck %s
+
+; We do not reach a fixpoint, because we first have to infer nsw on the IV add,
+; and could eliminate the icmp slt afterwards, but don't revisit it.
 
 target datalayout = "E-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f128:64:128"
 
@@ -10,7 +13,7 @@ define i32 @test() {
 ; CHECK:       bb.i:
 ; CHECK-NEXT:    br label [[BB51_I_I:%.*]]
 ; CHECK:       bb27.i.i:
-; CHECK-NEXT:    [[TMP50_I_I:%.*]] = add i32 [[X_0_I_I:%.*]], 2
+; CHECK-NEXT:    [[TMP50_I_I:%.*]] = add nsw i32 [[X_0_I_I:%.*]], 2
 ; CHECK-NEXT:    br label [[BB51_I_I]]
 ; CHECK:       bb51.i.i:
 ; CHECK-NEXT:    [[X_0_I_I]] = phi i32 [ [[TMP50_I_I]], [[BB27_I_I:%.*]] ], [ 0, [[BB_I]] ]
diff --git a/llvm/test/Transforms/InstCombine/2009-02-20-InstCombine-SROA.ll b/llvm/test/Transforms/InstCombine/2009-02-20-InstCombine-SROA.ll
index 13eca92f5535906..9f69d67f61e115f 100644
--- a/llvm/test/Transforms/InstCombine/2009-02-20-InstCombine-SROA.ll
+++ b/llvm/test/Transforms/InstCombine/2009-02-20-InstCombine-SROA.ll
@@ -80,7 +80,7 @@ define ptr @_Z3fooRSt6vectorIiSaIiEE(ptr %X) {
 ; IC-NEXT:    [[TMP31:%.*]] = load ptr, ptr [[__FIRST_ADDR_I_I]], align 4
 ; IC-NEXT:    [[TMP32:%.*]] = getelementptr i32, ptr [[TMP31]], i32 1
 ; IC-NEXT:    store ptr [[TMP32]], ptr [[__FIRST_ADDR_I_I]], align 4
-; IC-NEXT:    [[TMP33:%.*]] = add i32 [[__TRIP_COUNT_0_I_I:%.*]], -1
+; IC-NEXT:    [[TMP33:%.*]] = add nsw i32 [[__TRIP_COUNT_0_I_I:%.*]], -1
 ; IC-NEXT:    br label [[BB12_I_I]]
 ; IC:       bb12.i.i:
 ; IC-NEXT:    [[__TRIP_COUNT_0_I_I]] = phi i32 [ [[TMP7]], [[ENTRY:%.*]] ], [ [[TMP33]], [[BB11_I_I]] ]
@@ -188,7 +188,7 @@ define ptr @_Z3fooRSt6vectorIiSaIiEE(ptr %X) {
 ; IC_SROA-NEXT:    br label [[_ZST4FINDIN9__GNU_CXX17__NORMAL_ITERATORIPIST6VECTORIISAIIEEEEIET_S7_S7_RKT0__EXIT]]
 ; IC_SROA:       bb11.i.i:
 ; IC_SROA-NEXT:    [[TMP18:%.*]] = getelementptr i32, ptr [[TMP15]], i32 1
-; IC_SROA-NEXT:    [[TMP19:%.*]] = add i32 [[__TRIP_COUNT_0_I_I:%.*]], -1
+; IC_SROA-NEXT:    [[TMP19:%.*]] = add nsw i32 [[__TRIP_COUNT_0_I_I:%.*]], -1
 ; IC_SROA-NEXT:    br label [[BB12_I_I]]
 ; IC_SROA:       bb12.i.i:
 ; IC_SROA-NEXT:    [[__FIRST_ADDR_I_I_SROA_0_0]] = phi ptr [ [[TMP2]], [[ENTRY:%.*]] ], [ [[TMP18]], [[BB11_I_I]] ]
diff --git a/llvm/test/Transforms/InstCombine/cast_phi.ll b/llvm/test/Transforms/InstCombine/cast_phi.ll
index feeee16e27f23c8..5b4425b4d83082c 100644
--- a/llvm/test/Transforms/InstCombine/cast_phi.ll
+++ b/llvm/test/Transforms/InstCombine/cast_phi.ll
@@ -319,7 +319,7 @@ define i8 @trunc_in_loop_exit_block() {
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i32 [[IV]], 100
 ; CHECK-NEXT:    br i1 [[CMP]], label [[LOOP_LATCH]], label [[EXIT:%.*]]
 ; CHECK:       loop.latch:
-; CHECK-NEXT:    [[IV_NEXT]] = add i32 [[IV]], 1
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i32 [[IV]], 1
 ; CHECK-NEXT:    br label [[LOOP]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    [[TRUNC:%.*]] = trunc i32 [[PHI]] to i8
diff --git a/llvm/test/Transforms/InstCombine/icmp-binop.ll b/llvm/test/Transforms/InstCombine/icmp-binop.ll
index 60a12411ee910d2..878f39bb7c9a564 100644
--- a/llvm/test/Transforms/InstCombine/icmp-binop.ll
+++ b/llvm/test/Transforms/InstCombine/icmp-binop.ll
@@ -132,8 +132,7 @@ define i1 @mul_broddV_unkV_eq(i16 %v, i16 %v2) {
 ; CHECK-NEXT:    [[ODD_NOT:%.*]] = icmp eq i16 [[LB]], 0
 ; CHECK-NEXT:    br i1 [[ODD_NOT]], label [[FALSE:%.*]], label [[TRUE:%.*]]
 ; CHECK:       true:
-; CHECK-NEXT:    [[MUL:%.*]] = mul i16 [[V:%.*]], [[V2]]
-; CHECK-NEXT:    [[CMP:%.*]] = icmp eq i16 [[MUL]], 0
+; CHECK-NEXT:    [[CMP:%.*]] = icmp eq i16 [[V:%.*]], 0
 ; CHECK-NEXT:    ret i1 [[CMP]]
 ; CHECK:       false:
 ; CHECK-NEXT:    call void @use64(i16 [[V]])
diff --git a/llvm/test/Transforms/InstCombine/icmp-dom.ll b/llvm/test/Transforms/InstCombine/icmp-dom.ll
index f4b9022d14349b2..c30b2d724afbd59 100644
--- a/llvm/test/Transforms/InstCombine/icmp-dom.ll
+++ b/llvm/test/Transforms/InstCombine/icmp-dom.ll
@@ -75,7 +75,8 @@ define void @idom_sign_bit_check_edge_dominates_select(i64 %a, i64 %b) {
 ; CHECK:       land.lhs.true:
 ; CHECK-NEXT:    br label [[LOR_END:%.*]]
 ; CHECK:       lor.rhs:
-; CHECK-NEXT:    [[CMP3_NOT:%.*]] = icmp eq i64 [[A]], [[B:%.*]]
+; CHECK-NEXT:    [[SELECT:%.*]] = call i64 @llvm.umax.i64(i64 [[A]], i64 5)
+; CHECK-NEXT:    [[CMP3_NOT:%.*]] = icmp eq i64 [[SELECT]], [[B:%.*]]
 ; CHECK-NEXT:    br i1 [[CMP3_NOT]], label [[LOR_END]], label [[LAND_RHS:%.*]]
 ; CHECK:       land.rhs:
 ; CHECK-NEXT:    br label [[LOR_END]]
@@ -385,7 +386,7 @@ define i8 @PR48900_alt(i8 %i, ptr %p) {
 ; CHECK-NEXT:    [[I4:%.*]] = icmp ugt i8 [[SMAX]], -128
 ; CHECK-NEXT:    br i1 [[I4]], label [[TRUELABEL:%.*]], label [[FALSELABEL:%.*]]
 ; CHECK:       truelabel:
-; CHECK-NEXT:    [[UMIN:%.*]] = call i8 @llvm.smin.i8(i8 [[SMAX]], i8 -126)
+; CHECK-NEXT:    [[UMIN:%.*]] = call i8 @llvm.umin.i8(i8 [[SMAX]], i8 -126)
 ; CHECK-NEXT:    ret i8 [[UMIN]]
 ; CHECK:       falselabel:
 ; CHECK-NEXT:    ret i8 0
diff --git a/llvm/test/Transforms/InstCombine/icmp-mul-zext.ll b/llvm/test/Transforms/InstCombine/icmp-mul-zext.ll
index 095ac5b27f59635..adf78723b1302aa 100644
--- a/llvm/test/Transforms/InstCombine/icmp-mul-zext.ll
+++ b/llvm/test/Transforms/InstCombine/icmp-mul-zext.ll
@@ -17,8 +17,7 @@ define i32 @sterix(i32, i8, i64) {
 ; CHECK-NEXT:    br i1 [[TOBOOL_NOT]], label [[LOR_RHS:%.*]], label [[LOR_END:%.*]]
 ; CHECK:       lor.rhs:
 ; CHECK-NEXT:    [[AND:%.*]] = and i64 [[MUL3]], [[TMP2]]
-; CHECK-NEXT:    [[CONV4:%.*]] = trunc i64 [[AND]] to i32
-; CHECK-NEXT:    [[TOBOOL7_NOT:%.*]] = icmp eq i32 [[CONV4]], 0
+; CHECK-NEXT:    [[TOBOOL7_NOT:%.*]] = icmp eq i64 [[AND]], 0
 ; CHECK-NEXT:    [[TMP3:%.*]] = zext i1 [[TOBOOL7_NOT]] to i32
 ; CHECK-NEXT:    br label [[LOR_END]]
 ; CHECK:       lor.end:
diff --git a/llvm/test/Transforms/InstCombine/icmp-ne-pow2.ll b/llvm/test/Transforms/InstCombine/icmp-ne-pow2.ll
index 224ea3cd76cc6d1..70a2b33d17dd7ca 100644
--- a/llvm/test/Transforms/InstCombine/icmp-ne-pow2.ll
+++ b/llvm/test/Transforms/InstCombine/icmp-ne-pow2.ll
@@ -125,8 +125,7 @@ define i32 @pow2_32_br(i32 %x) {
 ; CHECK-NEXT:    [[CMP_NOT:%.*]] = icmp eq i32 [[AND]], 0
 ; CHECK-NEXT:    br i1 [[CMP_NOT]], label [[FALSE:%.*]], label [[TRUE:%.*]]
 ; CHECK:       True:
-; CHECK-NEXT:    [[OR:%.*]] = or i32 [[X]], 4
-; CHECK-NEXT:    ret i32 [[OR]]
+; CHECK-NEXT:    ret i32 [[X]]
 ; CHECK:       False:
 ; CHECK-NEXT:    ret i32 0
 ;
@@ -167,8 +166,7 @@ define i64 @pow2_64_br(i64 %x) {
 ; CHECK-NEXT:    [[CMP_NOT:%.*]] = icmp eq i64 [[AND]], 0
 ; CHECK-NEXT:    br i1 [[CMP_NOT]], label [[FALSE:%.*]], label [[TRUE:%.*]]
 ; CHECK:       True:
-; CHECK-NEXT:    [[AND2:%.*]] = and i64 [[X]], 1
-; CHECK-NEXT:    ret i64 [[AND2]]
+; CHECK-NEXT:    ret i64 1
 ; CHECK:       False:
 ; CHECK-NEXT:    ret i64 0
 ;
@@ -209,8 +207,7 @@ define i16 @pow2_16_br(i16 %x) {
 ; CHECK-NEXT:    [[CMP_NO...
[truncated]

nikic commented Nov 28, 2023

Just realized that this doesn't cover uses of isKnownNonNegative() in InstCombine yet, as it currently doesn't go through SimplifyQuery. I'll see about migrating those APIs tomorrow.

Resolved review threads: llvm/lib/Analysis/DomConditionCache.cpp, llvm/test/Transforms/InstCombine/icmp-dom.ll
@@ -16,11 +16,15 @@ define i32 @test_asr(i32 %a, i32 %b) {
; CHECK-NEXT: [[C:%.*]] = icmp slt i32 [[A]], 0
; CHECK-NEXT: br i1 [[C]], label [[BB2:%.*]], label [[BB3:%.*]]
; CHECK: bb2:
; CHECK-NEXT: [[NOT:%.*]] = xor i32 [[A]], -1
; CHECK-NEXT: [[D:%.*]] = lshr i32 [[NOT]], [[B]]
; CHECK-NEXT: [[NOT2:%.*]] = xor i32 [[D]], -1
nikic (Contributor Author):

Regression.

nikic (Contributor Author):

Fixed by faebb1b.

nikic (Contributor Author):

Oops, I saw that the output changed and thought that it was fixed, but this wasn't sufficient. We now have ashr in one branch and lshr in the other.

nikic (Contributor Author):

However, the remaining pattern is already handled by an extra simplifycfg+instcombine run, because we have this select fold:

static Value *foldSelectICmpLshrAshr(const ICmpInst *IC, Value *TrueVal,

I've adjusted the test to check the combination of the three passes, to show that it folds in that case. But I'd also be open to adding a variant of this fold for phi nodes rather than selects, if you think it's necessary.
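For reference, the select fold being discussed collapses a sign-tested pair of shifts. A rough illustrative sketch (not taken from the patch):

```llvm
; When %x is negative the select takes the ashr arm; when %x is
; non-negative, ashr and lshr agree. So the whole select simplifies
; to a single ashr.
define i32 @select_lshr_ashr(i32 %x, i32 %c) {
  %cmp = icmp slt i32 %x, 0
  %a = ashr i32 %x, %c
  %l = lshr i32 %x, %c
  %r = select i1 %cmp, i32 %a, i32 %l
  ; instcombine folds %r to: ashr i32 %x, %c
  ret i32 %r
}
```

The regression above is the same pattern split across two predecessors of a phi instead of a select, which this fold does not match.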

nikic commented Nov 29, 2023

Just realized that this doesn't cover uses of isKnownNonNegative() in InstCombine yet, as it currently doesn't go through SimplifyQuery. I'll see about migrating those APIs tomorrow.

This is done now, but introduced an additional regression in idioms.ll.

@nikic nikic force-pushed the dom-cond branch 3 times, most recently from b51bd83 to 9a3689f (December 1, 2023 15:10)
@dtcxzyw dtcxzyw left a comment


The implementation looks good to me. Waiting for the result of my CI.
https://github.com/dtcxzyw/llvm-ci/actions/runs/7066692655

@goldsteinn Any comments?

dtcxzyw commented Dec 2, 2023

My CI detected some significant regressions caused by this patch:
plctlab/llvm-ci#839 (comment)

dtcxzyw commented Dec 3, 2023

Could you please rebase this patch on #74246 and add a test for #74242?

dtcxzyw added a commit that referenced this pull request Dec 3, 2023
…` directly. NFC. (#74246)

This patch passes `SimplifyQuery` to `computeKnownBits` directly in
`InstSimplify` and `InstCombine`.
As the `DomConditionCache` in #73662 is only used in `InstCombine`, it
is inconvenient to introduce a new argument `DC` to `computeKnownBits`.
nikic commented Dec 4, 2023

Could you please rebase this patch on #74246 and add a test for #74242?

Done. The new test is @div_by_zero_or_one_from_dom_cond.

nikic commented Dec 4, 2023

@dtcxzyw Could you please run another test with this branch? https://github.com/nikic/llvm-project/tree/perf/dom-cond-3

I believe the issue with DILATE is that we have some icmps that get canonicalized from a signed to an unsigned predicate, after which IndVars is no longer able to perform some of its reasoning. I think a change like 7fa42da may help, but possibly not completely, and it has compile-time impact. So I'm considering disabling the use of dominating conditions for the icmp known-bits simplification, to not go down this rabbit hole. (The proper solution is probably to have something similar to zext nneg for icmp.)
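A hedged sketch of the canonicalization in question (illustrative IR, not taken from the patch): signed and unsigned orderings agree once both operands are known non-negative, so a dominating condition can flip a signed predicate to an unsigned one that downstream passes no longer pattern-match:

```llvm
define i1 @cmp(i32 %n) {
entry:
  %nonneg = icmp sgt i32 %n, -1
  br i1 %nonneg, label %then, label %exit
then:
  ; Here %n is known non-negative, so this compare may be
  ; canonicalized to the unsigned form: icmp ult i32 %n, 100.
  ; Passes that look for the signed form (e.g. in IndVars) then
  ; lose information.
  %c = icmp slt i32 %n, 100
  ret i1 %c
exit:
  ret i1 false
}
```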

nikic commented Dec 4, 2023

I went ahead and pushed a cleaned-up version of that change to this PR, so you can just re-test it.

I've found that dropping the icmp case removes most of the second-order compile-time regressions and also the (very minor) regression in the BPF test, so those are additional points in favor of doing this.


github-actions bot commented Dec 4, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

dtcxzyw commented Dec 5, 2023

Looks like the regression in DILATE has been addressed.
Could you please have a look at MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode?

@nikic
Contributor Author

nikic commented Dec 5, 2023

The problem for mpeg2decode seems to be that we now perform more `add` to `or disjoint` conversions. Support for `or disjoint` is still being rolled out, though, so e.g. SCEV doesn't recognize it yet and fails to create an add SCEV for it. So I think we need to do some more work on `or disjoint` and then try again.

@yonghong-song
Contributor

Hi, @nikic,
This patch caused a BPF verifier regression in one of the BPF selftests. The details can be found on the kernel BPF mailing list:
https://lore.kernel.org/bpf/0ff5f011-7524-4550-89eb-bb2c89f699d6@linux.dev/

Note that the BPF verification failure here does not mean the generated code is incorrect. It means LLVM generates 'more complex' code that the current verifier cannot handle.

I created a unit test like below:

 $ cat iters.bpf.o.i
 typedef unsigned long long __u64;
 
 struct bpf_iter_num {
  __u64 __opaque[1];
 } __attribute__((aligned(8)));
 
 extern int bpf_iter_num_new(struct bpf_iter_num *it, int start, int end) __attribute__((weak))  __attribute__((section(".ksyms")));
 extern int *bpf_iter_num_next(struct bpf_iter_num *it) __attribute__((weak)) __attribute__((section(".ksyms")));
 extern void bpf_iter_num_destroy(struct bpf_iter_num *it) __attribute__((weak)) __attribute__((section(".ksyms")));
 
 struct {
  int data[32];
  int n;
 } loop_data;
 
 int iter_arr_with_actual_elem_count(const void *ctx)
 {
  int i, n = loop_data.n, sum = 0;
 
  if (n > (sizeof(loop_data.data) / sizeof((loop_data.data)[0])))
   return 0;
 
  for ( struct bpf_iter_num ___it __attribute__((aligned(8), cleanup(bpf_iter_num_destroy))), *___p __attribute__((unused))  = ( bpf_iter_num_new(&___it, (0), (n)), (void)bpf_iter_num_destroy, (void *)0); ({ int *___t = bpf_iter_num_next(&___it);  (___t && ((i) = *___t, (i) >= (0) && (i) < (n))); }); ) {
 
   sum += loop_data.data[i];
  }
 
  return sum;
 }

The compilation flag:
clang -O2 -mcpu=v4 iters.bpf.o.i -c --target=bpf -mllvm -print-after-all

With an LLVM tree whose top commit is this patch, the assembly looks like:

0000000000000000 <iter_arr_with_actual_elem_count>:
       0:       b4 07 00 00 00 00 00 00 w7 = 0x0
       1:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
       3:       61 16 80 00 00 00 00 00 r6 = *(u32 *)(r1 + 0x80)
       4:       26 06 1b 00 20 00 00 00 if w6 > 0x20 goto +0x1b <LBB0_4>
       5:       bf a8 00 00 00 00 00 00 r8 = r10
       6:       07 08 00 00 f8 ff ff ff r8 += -0x8
       7:       bf 81 00 00 00 00 00 00 r1 = r8
       8:       b4 02 00 00 00 00 00 00 w2 = 0x0
       9:       bc 63 00 00 00 00 00 00 w3 = w6
      10:       85 10 00 00 ff ff ff ff call -0x1
      11:       bf 81 00 00 00 00 00 00 r1 = r8
      12:       85 10 00 00 ff ff ff ff call -0x1
      13:       15 00 02 00 00 00 00 00 if r0 == 0x0 goto +0x2 <LBB0_3>

0000000000000070 <LBB0_2>:
      14:       81 01 00 00 00 00 00 00 r1 = *(s32 *)(r0 + 0x0)  <=== a sign extension here, upper 32bit may be 0xffffffff
      15:       ae 61 04 00 00 00 00 00 if w1 < w6 goto +0x4 <LBB0_5>  <=== refine the value range of r1/w1 depending on which branch is taken.

0000000000000080 <LBB0_3>:
      16:       bf a1 00 00 00 00 00 00 r1 = r10
      17:       07 01 00 00 f8 ff ff ff r1 += -0x8
      18:       85 10 00 00 ff ff ff ff call -0x1
      19:       05 00 0c 00 00 00 00 00 goto +0xc <LBB0_4>

00000000000000a0 <LBB0_5>:
      20:       67 01 00 00 02 00 00 00 r1 <<= 0x2 <=== reaching here from insn 15, the verifier assumes top 32bit is not zero
                                                                               <=== and verification will fail at insn 23.
      21:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
      23:       0f 12 00 00 00 00 00 00 r2 += r1
      24:       61 28 00 00 00 00 00 00 r8 = *(u32 *)(r2 + 0x0)
      25:       0c 78 00 00 00 00 00 00 w8 += w7
      26:       bf a1 00 00 00 00 00 00 r1 = r10
      27:       07 01 00 00 f8 ff ff ff r1 += -0x8
      28:       85 10 00 00 ff ff ff ff call -0x1
      29:       bc 87 00 00 00 00 00 00 w7 = w8
      30:       15 00 f1 ff 00 00 00 00 if r0 == 0x0 goto -0xf <LBB0_3>
      31:       05 00 ee ff 00 00 00 00 goto -0x12 <LBB0_2>

0000000000000100 <LBB0_4>:
      32:       bc 70 00 00 00 00 00 00 w0 = w7
      33:       95 00 00 00 00 00 00 00 exit

In the above, I added a few comments to show why verification fails. More detail can be found in
https://lore.kernel.org/bpf/0ff5f011-7524-4550-89eb-bb2c89f699d6@linux.dev/
as well.

With an LLVM tree whose top commit is the patch just below this one, the assembly looks like:

0000000000000000 <iter_arr_with_actual_elem_count>:
       0:       b4 07 00 00 00 00 00 00 w7 = 0x0
       1:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
       3:       61 16 80 00 00 00 00 00 r6 = *(u32 *)(r1 + 0x80)
       4:       26 06 1c 00 20 00 00 00 if w6 > 0x20 goto +0x1c <LBB0_5>
       5:       bf a8 00 00 00 00 00 00 r8 = r10
       6:       07 08 00 00 f8 ff ff ff r8 += -0x8
       7:       bf 81 00 00 00 00 00 00 r1 = r8
       8:       b4 02 00 00 00 00 00 00 w2 = 0x0
       9:       bc 63 00 00 00 00 00 00 w3 = w6
      10:       85 10 00 00 ff ff ff ff call -0x1
      11:       bf 81 00 00 00 00 00 00 r1 = r8
      12:       85 10 00 00 ff ff ff ff call -0x1
      13:       15 00 03 00 00 00 00 00 if r0 == 0x0 goto +0x3 <LBB0_4>

0000000000000070 <LBB0_2>:
      14:       61 01 00 00 00 00 00 00 r1 = *(u32 *)(r0 + 0x0)  <=== unsigned extension, upper 32bit of r1 is 0.
      15:       c6 01 01 00 00 00 00 00 if w1 s< 0x0 goto +0x1 <LBB0_4>
      16:       ce 61 04 00 00 00 00 00 if w1 s< w6 goto +0x4 <LBB0_6>
                     <=== w1 (lower 32bit of r1) value range is determined here.

0000000000000088 <LBB0_4>:
      17:       bf a1 00 00 00 00 00 00 r1 = r10
      18:       07 01 00 00 f8 ff ff ff r1 += -0x8
      19:       85 10 00 00 ff ff ff ff call -0x1
      20:       05 00 0c 00 00 00 00 00 goto +0xc <LBB0_5>

00000000000000a8 <LBB0_6>:
      21:       67 01 00 00 02 00 00 00 r1 <<= 0x2 <=== reaching from insn 16, now we have a much smaller range of 'r1'
                                                                               <=== and verifier will be okay with insn 24 below.
      22:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
      24:       0f 12 00 00 00 00 00 00 r2 += r1
      25:       61 28 00 00 00 00 00 00 r8 = *(u32 *)(r2 + 0x0)
      26:       0c 78 00 00 00 00 00 00 w8 += w7
      27:       bf a1 00 00 00 00 00 00 r1 = r10
      28:       07 01 00 00 f8 ff ff ff r1 += -0x8
      29:       85 10 00 00 ff ff ff ff call -0x1
      30:       bc 87 00 00 00 00 00 00 w7 = w8
      31:       15 00 f1 ff 00 00 00 00 if r0 == 0x0 goto -0xf <LBB0_4>
      32:       05 00 ed ff 00 00 00 00 goto -0x13 <LBB0_2>

0000000000000108 <LBB0_5>:
      33:       bc 70 00 00 00 00 00 00 w0 = w7
      34:       95 00 00 00 00 00 00 00 exit

With '-mllvm -print-after-all', I was able to find why this patch causes the problem.
With this patch, the IR ends up using a sext right before BPF DAG->DAG Pattern Instruction Selection:

...
land.end8:                                        ; preds = %land.end8.preheader, %for.body
  %call222 = phi ptr [ %call2, %for.body ], [ %call219, %land.end8.preheader ]
  %sum.021 = phi i32 [ %add, %for.body ], [ 0, %land.end8.preheader ]
  %1 = load i32, ptr %call222, align 4, !tbaa !8
  %idxprom = sext i32 %1 to i64
  %2 = icmp ult i32 %1, %0
  br i1 %2, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %land.end8, %for.body, %if.end
  %sum.0.lcssa = phi i32 [ 0, %if.end ], [ %sum.021, %land.end8 ], [ %add, %for.body ]
  call void @bpf_iter_num_destroy(ptr noundef nonnull %___it) #3
  call void @llvm.lifetime.end.p0(i64 8, ptr nonnull %___it) #3
  br label %cleanup

for.body:                                         ; preds = %land.end8
  %arrayidx = getelementptr inbounds [32 x i32], ptr @loop_data, i64 0, i64 %idxprom
  %3 = load i32, ptr %arrayidx, align 4, !tbaa !8
  %add = add nsw i32 %3, %sum.021
  %call2 = call ptr @bpf_iter_num_next(ptr noundef nonnull %___it) #3
  %tobool.not = icmp eq ptr %call2, null
  br i1 %tobool.not, label %for.cond.cleanup, label %land.end8, !llvm.loop !9

You can see that %idxprom is sign extended and then used to construct the array index. The current
verifier is not able to handle this, so verification fails.

Without this patch, the sext is eventually removed, since the value range is known by the time CorrelatedValuePropagationPass runs:

...
land.end8:                                        ; preds = %for.cond
  %1 = load i32, ptr %call2, align 4, !tbaa !8
  %cmp3 = icmp sgt i32 %1, -1
  %cmp6 = icmp slt i32 %1, %0
  %2 = and i1 %cmp3, %cmp6 
  br i1 %2, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond, %land.end8
  call void @bpf_iter_num_destroy(ptr noundef nonnull %___it) #3
  call void @llvm.lifetime.end.p0(i64 8, ptr nonnull %___it) #3
  br label %cleanup

for.body:                                         ; preds = %land.end8
  %idxprom = zext nneg i32 %1 to i64
  %arrayidx = getelementptr inbounds [32 x i32], ptr @loop_data, i64 0, i64 %idxprom
  %3 = load i32, ptr %arrayidx, align 4, !tbaa !8
  %add = add nsw i32 %sum.0, %3
  br label %for.cond, !llvm.loop !9
...

So my question: we would like to undo this transformation in LLVM so we can satisfy the verifier. Any suggestions on how to do this? Thanks!

cc @4ast cc @eddyz87 @anakryiko @jemarch

@nikic
Contributor Author

nikic commented Dec 11, 2023

@bjope It looks like the InstCombine changes enable IndVars to perform LFTR, which is unprofitable in this case. The umax(1) call is actually completely unnecessary here, but SCEV doesn't realize it. I've put up #75039 to fix that. Does that improve things for you?

@nikic
Contributor Author

nikic commented Dec 11, 2023

@yonghong-song I think it may be possible to improve CVP to handle this better, in which case we won't need BPF workarounds. I'll look into it.

@bjope
Collaborator

bjope commented Dec 11, 2023

@bjope It looks like the InstCombine changes enable IndVars to perform LFTR, which is unprofitable in this case. The umax(1) call is actually completely unnecessary here, but SCEV doesn't realize it. I've put up #75039 to fix that. Does that improve things for you?

@nikic , thanks!
I was thinking that maybe I would need to do something to appease SCEVExpander::isHighCostExpansionHelper to prevent the IndVars transform. But with #75039 I no longer see the problematic umax instructions appearing (and increasing the IR instruction count).
I still see some other regressions, but that could be due to limitations in the backend. I haven't found anything more that I can blame on this patch.

@yonghong-song
Contributor

@yonghong-song I think it may be possible to improve CVP to handle this better, in which case we won't need BPF workarounds. I'll look into it.

@nikic and @bjope. Indeed, #75039 didn't resolve the issue and the generated bpf code is the same as without this patch, so some more works are needed. Thank you both for looking at the issue!

@eddyz87
Contributor

eddyz87 commented Dec 13, 2023

Some additional details on BPF verifier failure.
Consider the following example:

@a = global i32 0, align 4
@g = global i32 0, align 4

define i64 @foo() {
entry:
  %a = load i32, ptr @a
  %a.cmp = icmp ugt i32 %a, 32
  br i1 %a.cmp, label %end, label %l1 ; establish %a in range [0, 32]

l1:
  %g = load i32, ptr @g
  %g.cmp.1 = icmp sgt i32 %g, -1
  %g.cmp.a = icmp slt i32 %g, %a
  %cond = and i1 %g.cmp.a, %g.cmp.1
  br i1 %cond, label %l2, label %end  ; establish %g in range [0, %a]

l2:
  %g.64 = sext i32 %g to i64
  ret i64 %g.64

end:
  ret i64 0
}

Also consider the following opt command:

opt -passes=instcombine,correlated-propagation -mtriple=bpf-pc-linux -S -o - t.ll

Before this MR generated code looked as follows:

...
entry:
  %a = load i32, ptr @a, align 4
  %a.cmp = icmp ugt i32 %a, 32
  br i1 %a.cmp, label %end, label %l1

l1:
  %g = load i32, ptr @g, align 4
  %g.cmp.1 = icmp sgt i32 %g, -1
  %g.cmp.a = icmp slt i32 %g, %a
  %cond = and i1 %g.cmp.a, %g.cmp.1
  br i1 %cond, label %l2, label %end

l2:
  %g.64 = zext nneg i32 %g to i64 ; <--- note zext nneg
  ret i64 %g.64
...

After this MR generated code looks as follows:

...
entry:
  %a = load i32, ptr @a, align 4
  %a.cmp = icmp ugt i32 %a, 32
  br i1 %a.cmp, label %end, label %l1

l1:
  %g = load i32, ptr @g, align 4
  %cond = icmp ult i32 %g, %a ; <--- conditions merged by instcombine
  br i1 %cond, label %l2, label %end

l2:
  %g.64 = sext i32 %g to i64  ; <--- note sext
  ret i64 %g.64
...

The updated instcombine replaces %g.cmp.1 and %g.cmp.a with a single ult; however, the correlated-propagation pass can no longer infer that %g is non-negative and thus does not replace %g.64 = sext with %g.64 = zext nneg.
This is probably because of the variable %a used in %cond = icmp (replacing this variable with the constant 32 allows correlated-propagation to add the zext). As far as I understand, the non-negativity of %g was previously inferred from the %g.cmp.1 = icmp sgt i32 %g, -1 condition, which is now deleted.

I think there are no issues with the transformation applied by the updated instcombine.
To resolve the BPF regression we either need to extend correlated-propagation to better figure out range bounds (I don't know how hard that might be), or teach the verifier to understand that a sext in this situation is equivalent to a zext.

@nikic
Contributor Author

nikic commented Dec 13, 2023

@eddyz87 Right, this is exactly the change I have been working on, see #75311. Unfortunately, it doesn't actually fix the BPF case, because that one involves a loop, and LVI is currently terrible at handling those. (Basically, even though the relevant condition is outside the loop, just querying something in the loop will form a cycle, which will be resolved to overdefined.)

@joanahalili

Heads-up, we are seeing some performance regressions of about 6-7% due to this patch! We will add more details about this soon.

@bjope
Collaborator

bjope commented Dec 13, 2023

Here is another thing that I noticed after this patch: https://godbolt.org/z/1P7bnKGjh

So early instcombine is eliminating an and operation (in the foo example), resulting in simplifycfg not being able to collapse the control flow any longer.

Maybe I should file a separate issue for that?

Not sure if it should be considered a phase ordering problem. It would perhaps have been nice if simplifycfg were able to understand that it could still match the pattern by treating the missing and as a no-op with a -1 mask, etc.

@XChy
Member

XChy commented Dec 13, 2023

Here is another thing that I noticed after this patch: https://godbolt.org/z/1P7bnKGjh

So early instcombine is eliminating an and operation (in the foo example), resulting in simplifycfg not being able to collapse the control flow any longer.

I don't think it's a profitable transformation for simplifycfg when a larger number of instructions is involved. On x86, the backend generates much larger code for the example with the collapsed CFG: https://godbolt.org/z/YP9GjjP8c.

@bjope
Collaborator

bjope commented Dec 13, 2023

Here is another thing that I noticed after this patch: https://godbolt.org/z/1P7bnKGjh
So early instcombine is eliminating an and operation (in the foo example), resulting in simplifycfg not being able to collapse the control flow any longer.

I don't think it's a profitable transformation in simplifycfg with larger amount of instructions. In x86, backend generates much larger code for the example with collapsed cfg: https://godbolt.org/z/YP9GjjP8c.

The optimization pipeline is doing simplifications and canonicalizations. If you for example use -target amdgcn, I think you will see that codegen is impacted negatively when the control flow is not simplified. So whether one form is profitable depends on the backend.
I don't really know which form should be considered best (simplest and easiest for most backends to deal with) here; I'm just saying that it changed, and that could indeed be one reason for regressions (as it is for our backend).

@XChy
Member

XChy commented Dec 14, 2023

The optimization pipeline is doing simplifications and canonicalizations. If you for example use -target amdgcn, I think you will see that codegen is impacted negatively when the control flow is not simplified. So whether one form is profitable depends on the backend. I don't really know which form should be considered best (simplest and easiest for most backends to deal with) here; I'm just saying that it changed, and that could indeed be one reason for regressions (as it is for our backend).

You're right, it depends on the backend. For GPUs, hoisting the common operations and turning the branch into selects sounds good. But I'm not sure whether such a transformation should happen in the simplifycfg pass with target info (and an option) specified, or just in the backend. If done in simplifycfg, it may inhibit other optimizations across basic blocks.

dtcxzyw added a commit that referenced this pull request Dec 14, 2023
…ache (#75370)

This patch uses affected values from DomConditionCache (introduced by #73662) instead of the cheap/incomplete `getSinglePredecessor` check.
nikic added a commit that referenced this pull request Jan 2, 2024
Currently, LVI will only use conditions like "X < C" to constrain the
value of X on the relevant edge. This patch extends it to handle
conditions like "X < Y" by querying the known range of Y.

This means that getValueFromCondition() and various related APIs can now
return nullopt to indicate that they have pushed to the worklist, and
need to be called again later. This behavior is currently controlled by
a UseBlockValue option, and only enabled for actual edge value handling.
All other places deriving constraints from conditions keep using the
previous logic for now.

This change was originally motivated as a fix for the regression
reported in
#73662 (comment).
Unfortunately, it doesn't actually fix it, because we run into another
issue there (LVI currently is really bad at handling values used in
loops).

This change has some compile-time impact, but it's fairly small,
in the 0.05% range.
dtcxzyw added a commit that referenced this pull request Feb 5, 2024
…mily (#80657)

This patch refactors the interface of the `computeKnownFPClass` family
to pass `SimplifyQuery` directly.
The motivation of this patch is to compute known fpclass with
`DomConditionCache`, which was introduced by
#73662. With
`DomConditionCache`, we can do more optimization with context-sensitive
information.

Example (extracted from
[fmt/format.h](https://github.com/fmtlib/fmt/blob/e17bc67547a66cdd378ca6a90c56b865d30d6168/include/fmt/format.h#L3555-L3566)):
```
define float @test(float %x, i1 %cond) {
  %i32 = bitcast float %x to i32
  %cmp = icmp slt i32 %i32, 0
  br i1 %cmp, label %if.then1, label %if.else

if.then1:
  %fneg = fneg float %x
  br label %if.end

if.else:
  br i1 %cond, label %if.then2, label %if.end

if.then2:
  br label %if.end

if.end:
  %value = phi float [ %fneg, %if.then1 ], [ %x, %if.then2 ], [ %x, %if.else ]
  %ret = call float @llvm.fabs.f32(float %value)
  ret float %ret
}
```
We can prove the signbit of `%value` is always zero. Then the fabs can
be eliminated.
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
…mily (llvm#80657)

(Commit message identical to the one referenced above.)
@alexfh
Contributor

alexfh commented Sep 13, 2024

Apologies for resurrecting this old thread, but I found a problem with this patch. It seems like it makes a comparator violate strict weak ordering requirements. This manifests as an assertion failure when Clang is compiled with the corresponding libc++ check:

clang -O1 -c -o /dev/null -x ir -
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: cold noreturn nounwind
declare void @llvm.ubsantrap(i8 immarg) #0

define void @f(ptr %cpi, ptr %x, i32 %i.0, i1 %cmp125.not) {
entry:
  br label %for.cond

trap3:                                            ; preds = %if.end147, %if.else141, %for.body
  call void @llvm.ubsantrap(i8 0)
  unreachable

for.cond:                                         ; preds = %cont93, %entry
  %i.01 = phi i32 [ 0, %entry ], [ %inc, %cont93 ]
  %cmp91 = icmp ult i32 %i.01, 20
  br i1 %cmp91, label %for.body, label %for.cond116

for.body:                                         ; preds = %for.cond
  %0 = icmp eq ptr %x, null
  br i1 %0, label %cont93, label %trap3

cont93:                                           ; preds = %for.body
  %inc = or i32 %i.0, 1
  br label %for.cond

for.cond116:                                      ; preds = %cont148, %for.cond
  %i.1 = phi i32 [ %inc158, %cont148 ], [ 0, %for.cond ]
  %cmp117 = icmp ult i32 %i.1, 20
  br i1 %cmp117, label %cont120, label %cont213

cont120:                                          ; preds = %for.cond116
  br i1 %cmp125.not, label %if.else141, label %if.end147

if.else141:                                       ; preds = %cont120
  %1 = ptrtoint ptr %x to i64
  %2 = and i64 %1, 1
  %3 = icmp eq i64 %2, 0
  br i1 %3, label %cont146, label %trap3

cont146:                                          ; preds = %if.else141
  store i32 0, ptr %cpi, align 4
  br label %if.end147

if.end147:                                        ; preds = %cont146, %cont120
  %4 = ptrtoint ptr %x to i64
  %5 = and i64 %4, 1
  %6 = icmp eq i64 %5, 0
  br i1 %6, label %cont148, label %trap3

cont148:                                          ; preds = %if.end147
  %inc158 = or i32 %i.0, 1
  br label %for.cond116

cont213:                                          ; preds = %for.cond116
  ret void
}

attributes #0 = { cold noreturn nounwind }
include/c++/v1/__debug_utils/strict_weak_ordering_check.h:59: assertion __comp(*(__first + __a), *(__first + __b)) failed: Your comparator is not a valid strict-weak ordering
PLEASE submit a bug report to ... and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: blaze-bin/third_party/llvm/llvm-project/clang/clang -O1 -c -o /dev/null -x ir -
1.      Optimizer

@alexfh
Contributor

alexfh commented Sep 13, 2024

Filed #108618


Successfully merging this pull request may close these issues.

missed optimization for y / x when the divisor has [0, 1] range