
Optimize tiling sizes heuristics for elementwise dispatches. #10179

Merged 7 commits on Aug 29, 2022. The diff below shows changes from 4 commits.
116 changes: 100 additions & 16 deletions compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.cpp
@@ -933,22 +933,20 @@ static LogicalResult setRootConfig(func::FuncOp entryPointFn,
 }
 
 static void setX86WorkgroupTileSizes(
-    linalg::GenericOp genericOp, unsigned numLoops,
-    ArrayRef<int64_t> flowTileSizes, ArrayRef<int64_t> minTileSizes,
-    ArrayRef<int64_t> maxTileSizes,
-    SmallVectorImpl<int64_t> &workgroupTileSizes) {
+    linalg::GenericOp genericOp, ArrayRef<int64_t> flowTileSizes,
Review thread on this signature change:

Contributor: What's the difference between flowTileSizes, tileSizes, and workgroupTileSizes? At least the first and last mean the same thing to me.

Contributor (Author): I should revert this change and send out a cleanup fix. I think they should be flowTileSizes, secondTileSizes, and resultTileSizes: the function looks at the information from flowTileSizes and secondTileSizes, and stores the tile sizes into resultTileSizes.

Contributor: That makes sense. Do you want to add it to this PR?

Contributor (Author): I'll undo it in this PR and send the rename as a separate PR, because it is not related to this change.

Contributor: Yeah, I've been scratching my head about this multiple times. A cleanup would be very welcome!

Contributor: Wait, I know I wrote this! :D OK, I missed the return of workgroupTileSizes. My bad.
+    ArrayRef<int64_t> tileSizes, SmallVectorImpl<int64_t> &workgroupTileSizes,
+    bool allowIncompleteTile = false) {
+  unsigned numLoops = genericOp.getNumLoops();
   workgroupTileSizes.append(numLoops, 0);
   SmallVector<int64_t, 4> staticLoopRanges = genericOp.getStaticLoopRanges();
-  for (auto loopNum : llvm::seq<unsigned>(0, numLoops)) {
-    if (flowTileSizes[loopNum]) {
-      workgroupTileSizes[loopNum] =
-          getMaxTileSize(0, flowTileSizes[loopNum], minTileSizes[loopNum],
-                         minTileSizes[loopNum]);
+  for (auto i : llvm::seq<unsigned>(0, numLoops)) {
+    if (flowTileSizes[i]) {
+      workgroupTileSizes[i] = getMaxTileSize(0, flowTileSizes[i], tileSizes[i],
+                                             tileSizes[i], allowIncompleteTile);
     } else {
-      // If the flow level tile size is zero, and static loop range is 0 as
+      // If the flow level tile size is zero, and static loop range is 1 as
       // well, set the tile sizes here to zero as well.
-      workgroupTileSizes[loopNum] =
-          staticLoopRanges[loopNum] == 1 ? 0 : minTileSizes[loopNum];
+      workgroupTileSizes[i] = staticLoopRanges[i] == 1 ? 0 : tileSizes[i];
     }
   }
 }
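For illustration, here is a rough Python sketch of the per-loop selection that setX86WorkgroupTileSizes performs. Note that `get_max_tile_size` is a simplified stand-in for IREE's `getMaxTileSize` helper (its exact semantics are an assumption here: largest multiple of the vector size that divides the range, falling back to a partial tile when incomplete tiles are allowed); the real helper may differ in detail.

```python
def get_max_tile_size(lb, ub, max_size, vector_size, allow_incomplete_tile=False):
    """Simplified stand-in for IREE's getMaxTileSize (assumed semantics)."""
    size = ub - lb
    best = 0
    # Largest multiple of vector_size, up to max_size, dividing the range.
    for candidate in range(vector_size, max_size + 1, vector_size):
        if size % candidate == 0:
            best = candidate
    if best == 0 and allow_incomplete_tile:
        best = min(max_size, size)  # accept a partial last tile
    return best

def set_x86_workgroup_tile_sizes(flow_tile_sizes, tile_sizes,
                                 static_loop_ranges,
                                 allow_incomplete_tile=False):
    workgroup_tile_sizes = []
    for flow, tile, rng in zip(flow_tile_sizes, tile_sizes, static_loop_ranges):
        if flow:
            workgroup_tile_sizes.append(
                get_max_tile_size(0, flow, tile, tile, allow_incomplete_tile))
        else:
            # Flow-level tile size is zero: keep 0 for unit loop ranges,
            # otherwise fall back to the per-dimension tile size.
            workgroup_tile_sizes.append(0 if rng == 1 else tile)
    return workgroup_tile_sizes
```

For example, a loop distributed with flow tile size 64 and inner tile size 8 yields a workgroup tile of 8, while an undistributed unit-trip loop stays at 0.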
@@ -1010,8 +1008,8 @@ static LogicalResult setDefaultGenericOpRootConfig(
   // Set the next level tile sizes.
   SmallVector<int64_t> parallelTileSizes;
   SmallVector<int64_t> reductionTileSizes;
-  setX86WorkgroupTileSizes(genericOp, numLoops, flowTileSizes, minTileSizes,
-                           maxTileSizes, parallelTileSizes);
+  setX86WorkgroupTileSizes(genericOp, flowTileSizes, minTileSizes,
+                           parallelTileSizes);
   splitParallelAndReductionTiles(genericOp, parallelTileSizes,
                                  reductionTileSizes);

@@ -1089,8 +1087,8 @@ static LogicalResult setTransposeLikeOpRootConfig(func::FuncOp entryPointFn,

   // Set the next level tile sizes.
   SmallVector<int64_t> parallelTileSizes;
-  setX86WorkgroupTileSizes(genericOp, numLoops, flowTileSizes, minTileSizes,
-                           maxTileSizes, parallelTileSizes);
+  setX86WorkgroupTileSizes(genericOp, flowTileSizes, minTileSizes,
+                           parallelTileSizes);
 
   TileSizesListType tileSizes;
   tileSizes.push_back(flowTileSizes);
@@ -1106,11 +1104,97 @@ static LogicalResult setTransposeLikeOpRootConfig(func::FuncOp entryPointFn,
tileSizes, passPipeline);
}

static LogicalResult setElementwiseGenericOpRootConfig(
func::FuncOp entryPointFn, linalg::GenericOp genericOp) {
if (getLoweringConfig(genericOp)) {
return success();
}

unsigned numLoops = genericOp.getNumLoops();
if (numLoops == 0) return success();
if (!linalg::isElementwise(genericOp)) return success();

// Set the flow level tiling to the default.
SmallVector<int64_t> minTileSizes =
getMinTilingSizesForEachDim(entryPointFn, genericOp);
SmallVector<int64_t> maxTileSizes(numLoops, defaultWorkgroupTileSize);
SmallVector<int64_t> flowTileSizes =
getDefaultDistributedLevelTileSizes(genericOp, minTileSizes, maxTileSizes,
/*allowIncompleteTile=*/true);

  // Adjust the tile sizes so that the workload per workgroup is at least 4096.
  constexpr int64_t kMinimumWorkload = 4096;
auto shape = genericOp.getStaticLoopRanges();
int64_t numWorkload = 1;
for (auto en : llvm::enumerate(shape)) {
int64_t size = en.value();
if (size == ShapedType::kDynamicSize) {
numWorkload = ShapedType::kDynamicSize;
break;
}
int index = en.index();
if (flowTileSizes[index]) {
size = flowTileSizes[index];
}
numWorkload *= size;
}
for (unsigned currDim = 0;
numWorkload < kMinimumWorkload && currDim < numLoops;) {
int64_t currSize = flowTileSizes[currDim];
if (currSize == shape[currDim] || currSize == 0 ||
shape[currDim] == ShapedType::kDynamicSize ||
numWorkload == ShapedType::kDynamicSize) {
currDim++;
continue;
}
int64_t newSize = std::min<int64_t>(currSize * 2, shape[currDim]);
numWorkload = numWorkload / currSize * newSize;
flowTileSizes[currDim] = newSize;
}

// Adjust tiling sizes of vector levels to avoid large unroll factors.
SmallVector<int64_t> vecTileSizes(minTileSizes.begin(), minTileSizes.end());
for (auto operand : genericOp.getOutputOperands()) {
constexpr int64_t kMaxUnrollFactor = 8;
AffineMap map = genericOp.getTiedIndexingMap(operand);
int64_t vecSize = getVectorSize(entryPointFn, operand->get().getType());
int64_t currSize = 1;
for (auto dimExpr : llvm::reverse(map.getResults().drop_back())) {
unsigned pos = dimExpr.cast<AffineDimExpr>().getPosition();
if (vecTileSizes[pos] * currSize > vecSize * kMaxUnrollFactor) {
vecTileSizes[pos] = 1;
currSize = vecSize * kMaxUnrollFactor;
}
}
int fastestPos =
map.getResults().back().cast<AffineDimExpr>().getPosition();
vecTileSizes[fastestPos] =
std::min<int64_t>(vecTileSizes[fastestPos], kMaxUnrollFactor);
}

  // Setting reduction tile sizes is a workaround to kick in the peeling
  // transform; the tiling itself is a no-op because the sizes are all zero.
SmallVector<int64_t> zeros(numLoops, 0);

TileSizesListType tileSizes;
tileSizes.push_back(flowTileSizes);
tileSizes.push_back(vecTileSizes);
tileSizes.push_back(zeros);

auto passPipeline =
genericOp.hasTensorSemantics()
? DispatchLoweringPassPipeline::CPUDoubleTilingPeelingExpert
: DispatchLoweringPassPipeline::CPUBufferOpsTileAndVectorize;
return setOpConfigAndEntryPointFnTranslation(entryPointFn, genericOp,
tileSizes, passPipeline);
}
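The minimum-workload loop in setElementwiseGenericOpRootConfig can be sketched in Python as follows. This mirrors the diff's logic under the assumption that dynamic dimensions (modelled here as None, standing in for ShapedType::kDynamicSize) make the workload count unknowable; the function name is mine, not IREE's.

```python
K_MINIMUM_WORKLOAD = 4096  # mirrors kMinimumWorkload in the diff

def adjust_for_minimum_workload(flow_tile_sizes, shape):
    """Double flow-level tile sizes (capped at the loop range) until each
    workgroup covers at least K_MINIMUM_WORKLOAD elements."""
    sizes = list(flow_tile_sizes)
    workload = 1
    for size, tile in zip(shape, sizes):
        if size is None:        # dynamic dimension: workload is unknown
            workload = None
            break
        workload *= tile if tile else size
    dim = 0
    while (workload is not None and workload < K_MINIMUM_WORKLOAD
           and dim < len(sizes)):
        curr = sizes[dim]
        # Skip dims that cannot grow: untiled, dynamic, or already full.
        if curr == 0 or shape[dim] is None or curr == shape[dim]:
            dim += 1
            continue
        new = min(curr * 2, shape[dim])
        workload = workload // curr * new
        sizes[dim] = new
    return sizes
```

Note that, as in the C++ loop, a dimension keeps doubling in place until it saturates; only then does the scan move to the next dimension.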

/// Sets the lowering configuration for a generic op to use
/// CPUDoubleTilingExpert pipeline.
static LogicalResult setRootConfig(func::FuncOp entryPointFn,
linalg::GenericOp genericOp) {
if (failed(setTransposeLikeOpRootConfig(entryPointFn, genericOp)) ||
failed(setElementwiseGenericOpRootConfig(entryPointFn, genericOp)) ||
failed(setDefaultGenericOpRootConfig(entryPointFn, genericOp))) {
return failure();
}
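The unroll-factor capping over the output operand's indexing map (in setElementwiseGenericOpRootConfig above) can likewise be sketched in Python. This assumes `map_results` lists loop positions from slowest- to fastest-varying, as an identity-like indexing map would; the function name and this simplification are mine.

```python
K_MAX_UNROLL_FACTOR = 8  # mirrors kMaxUnrollFactor in the diff

def cap_vector_tile_sizes(vec_tile_sizes, map_results, vec_size):
    """Reset oversized vector tile sizes to 1 on the slower-varying dims and
    clamp the fastest-varying dim to the maximum unroll factor."""
    sizes = list(vec_tile_sizes)
    curr = 1
    budget = vec_size * K_MAX_UNROLL_FACTOR
    # Walk from the second-fastest-varying dim toward the slowest.
    for pos in reversed(map_results[:-1]):
        if sizes[pos] * curr > budget:
            sizes[pos] = 1      # too much unrolling: fall back to size 1
            curr = budget
    fastest = map_results[-1]
    sizes[fastest] = min(sizes[fastest], K_MAX_UNROLL_FACTOR)
    return sizes
```

For instance, with a vector size of 4 the budget is 32, so a tile size of 64 on a slow dim collapses to 1 while the innermost dim is clamped to at most 8.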
1 change: 1 addition & 0 deletions compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp
@@ -290,6 +290,7 @@ void addCPUBufferOpsTileAndVectorizePipeline(OpPassManager &passManager) {
   LinalgSingleTilingExpertPassOptions options;
   options.tilingLevel =
       static_cast<int64_t>(StrategyTilingLevel::ParallelTiles);
+  options.peel = true;
   options.vectorize = true;
   nestedModulePM.addNestedPass<func::FuncOp>(
       createLinalgSingleTilingExpertPass(options));
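The one-line Passes.cpp change enables peeling in the buffer-ops pipeline. Since elementwise dispatches are now tiled with allowIncompleteTile, the last tile may be partial; peeling splits the loop so the main body sees only full tiles (clean to vectorize) and a small epilogue handles the remainder. A schematic illustration of the technique, not IREE's actual codegen:

```python
def tiled_sum(xs, tile=4):
    """Sum a list using a peeled tiled loop."""
    n = len(xs)
    main = n - n % tile          # peeled boundary: end of the last full tile
    total = 0
    for i in range(0, main, tile):
        # Main loop: every iteration processes exactly `tile` elements,
        # so it could be vectorized without masks or bounds checks.
        total += sum(xs[i:i + tile])
    for i in range(main, n):
        # Peeled epilogue: the leftover n % tile elements, handled scalar.
        total += xs[i]
    return total
```

With n = 10 and tile = 4, the main loop covers indices 0..7 in two full tiles and the epilogue covers 8..9.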
@@ -217,7 +217,7 @@ hal.executable private @add {
}
}
}
-// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[32, 32], [1, 4], [0, 0]]>
+// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[64, 64], [1, 4], [0, 0]]>
// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info<CPUDoubleTilingPeelingExpert>
// CHECK: hal.executable.export public @add
// CHECK-SAME: translation_info = #[[TRANSLATION]]
@@ -275,7 +275,7 @@ hal.executable private @add4D {
}
}

-// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[0, 32, 32, 32], [1, 1, 1, 4], [0, 0, 0, 0]]>
+// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[0, 64, 64, 64], [1, 1, 1, 4], [0, 0, 0, 0]]>
// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info<CPUDoubleTilingPeelingExpert>
// CHECK: hal.executable.export public @add4D
// CHECK-SAME: translation_info = #[[TRANSLATION]]
@@ -316,8 +316,8 @@ hal.executable private @add_static {
}
}
}
-// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[0, 8, 16, 32], [1, 1, 1, 4], [0, 0, 0, 0]]>
-// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info<CPUDoubleTilingExpert>
+// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[0, 8, 16, 64], [1, 1, 1, 4], [0, 0, 0, 0]]>
+// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info<CPUDoubleTilingPeelingExpert>
// CHECK: hal.executable.export public @add_static
// CHECK-SAME: translation_info = #[[TRANSLATION]]
// CHECK: linalg.generic
@@ -408,7 +408,7 @@ hal.executable @copy_op_dynamic {
}
}

-// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[32, 32], [1, 1], [0, 0]{{\]}}>
+// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[64, 64], [1, 4], [0, 0]{{\]}}>
// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info<CPUBufferOpsTileAndVectorize>
// CHECK: hal.executable.export public @copy_op_dynamic
// CHECK-SAME: translation_info = #[[TRANSLATION]]
@@ -738,8 +738,8 @@ hal.executable private @generic_static {
}
}
}
-// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[16, 32], [16, 16], [0, 0]]>
-// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info<CPUDoubleTilingExpert>
+// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[16, 96], [16, 8], [0, 0]]>
+// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info<CPUDoubleTilingPeelingExpert>
// CHECK: hal.executable.export public @generic_static
// CHECK-SAME: translation_info = #[[TRANSLATION]]
// CHECK: linalg.generic
@@ -1088,8 +1088,8 @@ hal.executable private @generic_unit_dims_dynamic {
}
}
}
-// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[0, 0, 0, 0, 32, 32, 0, 32], [0, 1, 0, 0, 1, 1, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0]{{\]}}>
-// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info<CPUDoubleTilingExpert>
+// CHECK-DAG: #[[CONFIG:.+]] = #iree_codegen.lowering_config<tile_sizes = {{\[}}[0, 0, 0, 0, 64, 64, 0, 64], [1, 1, 1, 1, 1, 1, 1, 4], [0, 0, 0, 0, 0, 0, 0, 0]{{\]}}>
+// CHECK-DAG: #[[TRANSLATION:.+]] = #iree_codegen.translation_info<CPUDoubleTilingPeelingExpert>
// CHECK: hal.executable.export public @generic_unit_dims_dynamic
// CHECK-SAME: translation_info = #[[TRANSLATION]]
// CHECK: linalg.generic