Scheduling directive to hoist the storage of the function #7915
Conversation
Thanks a lot for the review (and especially for the extra tests)! I addressed all of the comments, PTAL.
src/StorageFlattening.cpp
@@ -400,13 +488,31 @@ class FlattenDimensions : public IRMutator {
    }

    Stmt visit(const For *op) override {
        for (auto &p : hoisted_storages) {
I think scope is too large here. Consider:
let a = b
hoist (f) {
    ..
    let c = a
    realize f[c] {
    }
}
We'd want that to lower to:
let a = b
allocate f[a] {
    ...
}
but I think with the current state of the PR it will lower to:
let a = b
allocate f[b] {
    ...
}
This matters if there are a lot of lets outside the hoist level and b ends up as a very large expr.
Not sure what the best solution is. I guess there could be a separate scope per hoist node and then in the LetStmt visitor you push things onto the scopes of all live hoist nodes. That has the pitfall of doing a lot of repeated expand_expr+simplify work if you have lots of hoisted storage nodes at the same scope.
If you do go with one global scope, then loop_bounds is loop-invariant here and should be pulled out of this loop and only computed if !hoisted_storages.empty(). Then in the LetStmt visitor you should only be pushing things onto the scope if !hoisted_storages.empty(). That approach is a bit weird though, because the lowering for an inner hoist node will vary depending on whether or not an unrelated outer hoist node exists.
Ooh, what if hoisted_storages is a stack of pairs instead of a map, with the innermost hoist node on top, you have separate scopes per hoist node, but in the letstmt visitor you only add things to the scope of the topmost hoisted_storage node, and then in this for loop you incrementally expand exprs down to the correct level for each given hoist node:
Expr expanded_min = op->min;
Expr expanded_extent = op->extent;
// Iterate from innermost outwards
for (auto it = hoisted_storages.rbegin(); it != hoisted_storages.rend(); it++) {
    expanded_min = simplify(expand_expr(expanded_min, it->second.scope));
    expanded_extent = expand_expr(expanded_extent, it->second.scope);
    ...
}
Ooh, what if hoisted_storages is a stack of pairs instead of a map, with the innermost hoist node on top, you have
separate scopes per hoist node, but in the letstmt visitor you only add things to the scope of the topmost hoisted_storage node, and then in this for loop you incrementally expand exprs down to the correct level for each given hoist node.
That's a nice idea. I've implemented it this way, PTAL.
I was actually thinking of a loop that was linear in the number of hoisted storage nodes, not quadratic:
So instead of this:
for (auto &p : hoisted_storages) {
    Expr expanded_min = op->min;
    Expr expanded_extent = op->extent;
    // Iterate from innermost outwards
    for (auto it = hoisted_storages.rbegin(); it != hoisted_storages.rend(); it++) {
        expanded_min = expand_expr(expanded_min, it->scope);
        expanded_extent = expand_expr(expanded_extent, it->scope);
    }
    expanded_min = simplify(expanded_min);
    Interval loop_bounds = Interval(expanded_min, simplify(expanded_min + expanded_extent - 1));
    p.loop_vars.push(op->name, loop_bounds);
}
Something like this:
Expr expanded_min = op->min;
Expr expanded_extent = op->extent;
// Iterate from innermost outwards
for (auto it = hoisted_storages.rbegin(); it != hoisted_storages.rend(); it++) {
    expanded_min = simplify(expand_expr(expanded_min, it->scope));
    expanded_extent = expand_expr(expanded_extent, it->scope);
    Interval loop_bounds = Interval(expanded_min, simplify(expanded_min + expanded_extent - 1));
    it->loop_vars.push(op->name, loop_bounds);
}
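The linear incremental-expansion pattern above can be sketched in plain Python (not Halide code; string substitution stands in for expand_expr, and the scopes and node names are illustrative). Each hoist node carries only the lets visible at its own level, and a single pass from innermost outward reuses the partially expanded expression for the next node out:

```python
import re

def expand_expr(expr, scope):
    """Substitute let-bound variables from `scope` until a fixpoint."""
    changed = True
    while changed:
        changed = False
        for var, val in scope.items():
            new = re.sub(rf"\b{re.escape(var)}\b", val, expr)
            if new != expr:
                expr, changed = new, True
    return expr

# Stack of hoist nodes, innermost last; each carries only the lets
# in scope at its own level (hypothetical example data).
hoisted_storages = [
    {"scope": {"a": "b"}, "expanded": None},  # outer node sees: let a = b
    {"scope": {"c": "a"}, "expanded": None},  # inner node sees: let c = a
]

expanded = "c + 1"
# Single linear pass from innermost outward: each node records the
# expression expanded only down to its level, and that partial result
# feeds the next node out (no nested loop, so cost is linear).
for node in reversed(hoisted_storages):
    expanded = expand_expr(expanded, node["scope"])
    node["expanded"] = expanded

print(hoisted_storages[1]["expanded"])  # a + 1
print(hoisted_storages[0]["expanded"])  # b + 1
```

Note how the inner node ends up with the expression expanded only through its own lets ("a + 1"), matching the desired lowering where an inner hoist node does not pick up unrelated outer lets.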
Oh, yeah, that's better for sure.
Some very minor performance gains, but mostly in the noise. Also switched the apps makefiles to emit stmt html by default instead of stmt, to take advantage of the new and improved stmt html.
Huh, looks like a legit test failure on the Windows build bot. I can't find much information in the build bot logs other than that it was a SEGFAULT, and it's not reproducible for me on Mac or Linux. I guess I will have to push a commit with more debugging info enabled.
Is it possible the custom malloc is returning something with insufficient alignment on x86-32? I don't see anywhere that should matter though...
Yes, I think you're right. Thanks a lot for the suggestion. I have reverted the debugging commits and am re-running the buildbots (there is one failure so far, but I don't think it's related).
Yes, that failure is because I can't figure out how to set up webgpu testing correctly on one of the x86 mac minis |
Everything else is passing now, so it should be ready to merge.
* Minimal hoist_storage plumbing
* HoistedStorage placeholder IR node
* Basic hoist_storage test
* Fully plumb through the HoistedStorage node
* IRPrinter for HoistedStorage
* Insert hoisted storage at the correct loop level
* Progress
* Formatted
* Move out common code for creating Allocate node
* Format
* Emit Allocate at the HoistedStorage site
* Collect all dependant vars
* Basic test working
* Progress
* Substitute lets into allocation extents instead of lifting stuff
* Infer bounds for the extends dependant on loop variables
* Update tests
* Remove old code
* Remove old code
* Better tests
* More tests
* Validate schedules with hoist_storage
* Error test
* Fix stupid mistake
* More tests
* Remove debug prints
* Better errors
* Add missing handler for inlined functions
* Format
* Comments
* Format
* Add some missing visit handlers
* New line
* Fix comment
* Luckily we only have two build systems
* Adds hoist_storage_root
* Comment for IR node
* Serialization support for HoistedStorage
* Handle hoist_storage fo tuples
* Handle multiple realize nodes
* Move assert up
* Better error message
* Better loop bounds
* Format
* Updated error message
* Happy clang-tidy happy me
* An error message when compute is inlined, but store is not inlined
* Only mutate lets which are needed
* Update apps to use hoist_storage. Some very minor performance gains, but mostly in the noise. Also switched the apps makefiles to emit stmt html by default instead of stmt, to take advantage of the new and improved stmt html.
* Switch to stack of hoisted storages
* Limit scope of lets for expansion
* Break early
* Skip substitute_in_all_lets
* Re-use expanded min/extents
* WebAssembly JIT does not support custom allocators
* Change debug level to get more info about segfault
* More debugging prints
* Let's try aligned malloc
* Revert "Change debug level to get more info about segfault" (reverts commit a5a689b)
* Revert "More debugging prints" (reverts commit bb6b8c1)
--------- Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>
This PR adds a new scheduling directive which allows hoisting the storage of a function to a specified loop level. This is different from Func::store_at, because hoist_storage simply moves the actual allocation to the given loop level and doesn't trigger any of the optimizations such as sliding window. Hoisting storage is optional and can be used as an optimization to avoid unnecessary allocations by moving them out of an inner loop.
hoist_storage can be used together with Func::store_at and Func::fold_storage (for example, to hoist the storage allocated after the sliding window optimization).
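The effect of hoisting an allocation can be illustrated with a plain Python sketch (not Halide code; the function names and loop structure are purely illustrative). Both versions produce the same result, but the hoisted version performs a single allocation instead of one per row of the outer loop:

```python
def produce_alloc_inside(width, height):
    # Without hoisting: a fresh scratch buffer is allocated on every
    # iteration of the outer loop.
    out, n_allocs = [], 0
    for y in range(height):
        scratch = [0] * width      # allocation inside the loop
        n_allocs += 1
        for x in range(width):
            scratch[x] = x + y
        out.append(sum(scratch))
    return out, n_allocs

def produce_alloc_hoisted(width, height):
    # With hoisting: one buffer, allocated once and reused by every
    # iteration of the outer loop.
    out = []
    scratch = [0] * width          # allocation hoisted out of the loop
    n_allocs = 1
    for y in range(height):
        for x in range(width):
            scratch[x] = x + y
        out.append(sum(scratch))
    return out, n_allocs

a, na = produce_alloc_inside(4, 3)
b, nb = produce_alloc_hoisted(4, 3)
print(a == b, na, nb)  # True 3 1
```

This is the sense in which hoist_storage is a pure allocation-placement optimization: unlike store_at with sliding window, the buffer contents and the computation are unchanged; only where the allocation happens moves.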