Scheduling directive to hoist the storage of the function #7915
Conversation
Thanks a lot for the review (and especially for the extra tests)! I addressed all of the comments, PTAL.
src/StorageFlattening.cpp
@@ -400,13 +488,31 @@ class FlattenDimensions : public IRMutator {
    }

    Stmt visit(const For *op) override {
        for (auto &p : hoisted_storages) {
I think scope is too large here. Consider:
let a = b
hoist (f) {
    ..
    let c = a
    realize f[c] {
    }
}
We'd want that to lower to:
let a = b
allocate f[a] {
    ...
}
but I think with the current state of the PR it will lower to:
let a = b
allocate f[b] {
    ...
}
This matters if there are a lot of lets outside the hoist level and b ends up as a very large expr.
Not sure what the best solution is. I guess there could be a separate scope per hoist node and then in the LetStmt visitor you push things onto the scopes of all live hoist nodes. That has the pitfall of doing a lot of repeated expand_expr+simplify work if you have lots of hoisted storage nodes at the same scope.
If you do go with one global scope, then loop_bounds is loop-invariant here and should be pulled out of this loop and only computed if !hoisted_storages.empty(). Then in the LetStmt visitor you should only be pushing things onto the scope if !hoisted_storages.empty(). That approach is a bit weird though, because the lowering for an inner hoist node will vary depending on whether or not an unrelated outer hoist node exists.
Ooh, what if hoisted_storages is a stack of pairs instead of a map, with the innermost hoist node on top, you have separate scopes per hoist node, but in the letstmt visitor you only add things to the scope of the topmost hoisted_storage node, and then in this for loop you incrementally expand exprs down to the correct level for each given hoist node:
Expr expanded_min = op->min;
Expr expanded_extent = op->extent;
// Iterate from innermost outwards
for (auto it = hoisted_storages.rbegin(); it != hoisted_storages.rend(); it++) {
    expanded_min = simplify(expand_expr(expanded_min, it->second.scope));
    expanded_extent = expand_expr(expanded_extent, it->second.scope);
    ...
}
Ooh, what if hoisted_storages is a stack of pairs instead of a map, with the innermost hoist node on top, you have
separate scopes per hoist node, but in the letstmt visitor you only add things to the scope of the topmost hoisted_storage node, and then in this for loop you incrementally expand exprs down to the correct level for each given hoist node.
That's a nice idea. I've implemented it this way, PTAL.
I was actually thinking of a loop that was linear in the number of hoisted storage nodes, not quadratic:
So instead of this:
for (auto &p : hoisted_storages) {
    Expr expanded_min = op->min;
    Expr expanded_extent = op->extent;
    // Iterate from innermost outwards
    for (auto it = hoisted_storages.rbegin(); it != hoisted_storages.rend(); it++) {
        expanded_min = expand_expr(expanded_min, it->scope);
        expanded_extent = expand_expr(expanded_extent, it->scope);
    }
    expanded_min = simplify(expanded_min);
    Interval loop_bounds = Interval(expanded_min, simplify(expanded_min + expanded_extent - 1));
    p.loop_vars.push(op->name, loop_bounds);
}
Something like this:
Expr expanded_min = op->min;
Expr expanded_extent = op->extent;
// Iterate from innermost outwards
for (auto it = hoisted_storages.rbegin(); it != hoisted_storages.rend(); it++) {
    expanded_min = simplify(expand_expr(expanded_min, it->scope));
    expanded_extent = expand_expr(expanded_extent, it->scope);
    Interval loop_bounds = Interval(expanded_min, simplify(expanded_min + expanded_extent - 1));
    it->loop_vars.push(op->name, loop_bounds);
}
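The linear incremental-expansion pattern above can be sketched in plain Python (not Halide code; string substitution stands in for expand_expr, and the scopes and node names are illustrative). Each hoist node carries only the lets visible at its own level, and a single pass from innermost outward reuses the partially expanded expression for the next node out:

```python
import re

def expand_expr(expr, scope):
    """Substitute let-bound variables from `scope` until a fixpoint."""
    changed = True
    while changed:
        changed = False
        for var, val in scope.items():
            new = re.sub(rf"\b{re.escape(var)}\b", val, expr)
            if new != expr:
                expr, changed = new, True
    return expr

# Stack of hoist nodes, innermost last; each carries only the lets
# in scope at its own level (hypothetical example data).
hoisted_storages = [
    {"scope": {"a": "b"}, "expanded": None},  # outer node sees: let a = b
    {"scope": {"c": "a"}, "expanded": None},  # inner node sees: let c = a
]

expanded = "c + 1"
# Single linear pass from innermost outward: each node records the
# expression expanded only down to its level, and that partial result
# feeds the next node out (no nested loop, so cost is linear).
for node in reversed(hoisted_storages):
    expanded = expand_expr(expanded, node["scope"])
    node["expanded"] = expanded

print(hoisted_storages[1]["expanded"])  # a + 1
print(hoisted_storages[0]["expanded"])  # b + 1
```

Note how the inner node ends up with the expression expanded only through its own lets ("a + 1"), matching the desired lowering where an inner hoist node does not pick up unrelated outer lets.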
Oh, yeah, that's better for sure.
Some very minor performance gains, but mostly in the noise. Also switched the apps makefiles to emit stmt html by default instead of stmt, to take advantage of the new and improved stmt html.
Huh, looks like a legit test failure on the Windows build bot. I can't find much information in the build bot logs other than that it was a SEGFAULT, and it's not reproducible for me on Mac or Linux. I guess I will have to push a commit with more debugging info enabled.
Is it possible the custom malloc is returning something with insufficient alignment on x86-32? I don't see anywhere that should matter though...
Yes, I think you're right. Thanks a lot for the suggestion. I have reverted the debugging commits and am re-running the buildbots (there is one failure so far, but I don't think it's related).
Yes, that failure is because I can't figure out how to set up webgpu testing correctly on one of the x86 mac minis |
Everything else is passing now, so it should be ready to merge.
* Minimal hoist_storage plumbing
* HoistedStorage placeholder IR node
* Basic hoist_storage test
* Fully plumb through the HoistedStorage node
* IRPrinter for HoistedStorage
* Insert hoisted storage at the correct loop level
* Progress
* Formatted
* Move out common code for creating Allocate node
* Format
* Emit Allocate at the HoistedStorage site
* Collect all dependant vars
* Basic test working
* Progress
* Substitute lets into allocation extents instead of lifting stuff
* Infer bounds for the extends dependant on loop variables
* Update tests
* Remove old code
* Remove old code
* Better tests
* More tests
* Validate schedules with hoist_storage
* Error test
* Fix stupid mistake
* More tests
* Remove debug prints
* Better errors
* Add missing handler for inlined functions
* Format
* Comments
* Format
* Add some missing visit handlers
* New line
* Fix comment
* Luckily we only have two build systems
* Adds hoist_storage_root
* Comment for IR node
* Serialization support for HoistedStorage
* Handle hoist_storage fo tuples
* Handle multiple realize nodes
* Move assert up
* Better error message
* Better loop bounds
* Format
* Updated error message
* Happy clang-tidy happy me
* An error message when compute is inlined, but store is not inlined
* Only mutate lets which are needed
* Update apps to use hoist_storage. Some very minor performance gains, but mostly in the noise. Also switched the apps makefiles to emit stmt html by default instead of stmt, to take advantage of the new and improved stmt html.
* Switch to stack of hoisted storages
* Limit scope of lets for expansion
* Break early
* Skip substitute_in_all_lets
* Re-use expanded min/extents
* WebAssembly JIT does not support custom allocators
* Change debug level to get more info about segfault
* More debugging prints
* Let's try aligned malloc
* Revert "Change debug level to get more info about segfault" (reverts commit a5a689b)
* Revert "More debugging prints" (reverts commit bb6b8c1)
--------- Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>
This PR adds a new scheduling directive which allows hoisting the storage of a function to a specified loop level. This is different from Func::store_at, because hoist_storage simply moves the actual allocation to the given loop level and doesn't trigger any of the optimizations such as sliding window. Hoisting storage is optional and can be used as an optimization to avoid unnecessary allocations by moving them out of an inner loop.
hoist_storage can be used together with Func::store_at and Func::fold_storage (for example, to hoist the storage allocated after the sliding window optimization).
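The effect of hoisting an allocation can be illustrated with a plain Python sketch (not Halide code; the function names and loop structure are purely illustrative). Both versions produce the same result, but the hoisted version performs a single allocation instead of one per row of the outer loop:

```python
def produce_alloc_inside(width, height):
    # Without hoisting: a fresh scratch buffer is allocated on every
    # iteration of the outer loop.
    out, n_allocs = [], 0
    for y in range(height):
        scratch = [0] * width      # allocation inside the loop
        n_allocs += 1
        for x in range(width):
            scratch[x] = x + y
        out.append(sum(scratch))
    return out, n_allocs

def produce_alloc_hoisted(width, height):
    # With hoisting: one buffer, allocated once and reused by every
    # iteration of the outer loop.
    out = []
    scratch = [0] * width          # allocation hoisted out of the loop
    n_allocs = 1
    for y in range(height):
        for x in range(width):
            scratch[x] = x + y
        out.append(sum(scratch))
    return out, n_allocs

a, na = produce_alloc_inside(4, 3)
b, nb = produce_alloc_hoisted(4, 3)
print(a == b, na, nb)  # True 3 1
```

This is the sense in which hoist_storage is a pure allocation-placement optimization: unlike store_at with sliding window, the buffer contents and the computation are unchanged; only where the allocation happens moves.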