Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduling directive to hoist the storage of the function #7915

Merged
merged 62 commits into from
Oct 27, 2023

Conversation

vksnk
Copy link
Member

@vksnk vksnk commented Oct 24, 2023

This PR adds a new directive which allows to hoist a storage of the function to the specified loop level. This is different from Func::store_at, because hoist_storage simply moves an actual allocation to a given loop level and doesn't trigger any of the optimizations such as sliding window. Hoisting storage is optional and can be used as an optimization to avoid unnecessary allocations by moving it out from an inner loop.

hoist_storage can be used together with Func::store_at and Func::fold_storage (for example, to hoist the storage allocated
after sliding window optimization).

@vksnk
Copy link
Member Author

vksnk commented Oct 25, 2023

Thanks a lot for the review (and especially for the extra tests)! I addressed all of the comments, PTAL.

@@ -400,13 +488,31 @@ class FlattenDimensions : public IRMutator {
}

Stmt visit(const For *op) override {
for (auto &p : hoisted_storages) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think scope is too large here. Consider:

let a = b
hoist (f) {
  ..
  let c = a
  realize f[c] {
  }
}

We'd want that to lower to:

let a = b
allocate f[a] {
  ...
}

but I think with the current state of the PR it will lower to:

let a = b
allocate f[b] {
  ...
}

This matters if there are a lot of lets outside the hoist level and b ends up as a very large expr.

Not sure what the best solution is. I guess there could be a separate scope per hoist node and then in the LetStmt visitor you push things onto the scopes of all live hoist nodes. That has the pitfall of doing a lot of repeated expand_expr+simplify work if you have lots of hoisted storage nodes at the same scope.

If you do go with one global scope then loop_bounds is a loop_invariant here and should be pulled out of this loop and only computed if !hoisted_storages.empty(). Then in the LetStmt visitor you should only be pushing things onto the scope if !hoisted_storages.empty(). That approach is a bit weird though because the lowering for an inner hoist node will vary depending on whether or not an unrelated outer hoist node exists.

Ooh, what if hoisted_storages is a stack of pairs instead of a map, with the innermost hoist node on top, you have separate scopes per hoist node, but in the letstmt visitor you only add things to the scope of the topmost hoisted_storage node, and then in this for loop you incrementally expand exprs down to the correct level for each given hoist node:

Expr expanded_min = op->min;
Expr expanded_extent = op->extent;
// Iterate from innermost outwards
for (auto it = hoisted_storages.rbegin(); it != hoisted_storages.rend(); it++) {
  expanded_min = simplify(expand_expr(expanded_min, it->second.scope));
  expanded_extent = expand_expr(expanded_extent, it->second.scope));
  ...
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, what if hoisted_storages is a stack of pairs instead of a map, with the innermost hoist node on top, you have
separate scopes per hoist node, but in the letstmt visitor you only add things to the scope of the topmost hoisted_storage node, and then in this for loop you incrementally expand exprs down to the correct level for each given hoist node.

That's a nice idea. I've implemented it this way, PTAL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually thinking of a loop that was linear in the number of hoisted storage nodes, not quadratic:

So instead of this:

      for (auto &p : hoisted_storages) {
            Expr expanded_min = op->min;
            Expr expanded_extent = op->extent;
            // Iterate from innermost outwards
            for (auto it = hoisted_storages.rbegin(); it != hoisted_storages.rend(); it++) {
                expanded_min = expand_expr(expanded_min, it->scope);
                expanded_extent = expand_expr(expanded_extent, it->scope);
            }
            expanded_min = simplify(expanded_min);
            Interval loop_bounds = Interval(expanded_min, simplify(expanded_min + expanded_extent - 1));
            p.loop_vars.push(op->name, loop_bounds);
        }

Something like this:

        Expr expanded_min = op->min;
        Expr expanded_extent = op->extent;
        // Iterate from innermost outwards
        for (auto it = hoisted_storages.rbegin(); it != hoisted_storages.rend(); it++) {
            expanded_min = simplify(expand_expr(expanded_min, it->scope));
            expanded_extent = expand_expr(expanded_extent, it->scope);
            Interval loop_bounds = Interval(expanded_min, simplify(expanded_min + expanded_extent - 1));
            it->loop_vars.push(op->name, loop_bounds);
        }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah, that's better for sure.

abadams and others added 3 commits October 25, 2023 13:23
Some very minor performance gains, but mostly in the noise.

Also switched the apps makefiles to emit stmt html by default instead of
stmt, to take advantage of the new and improved stmt html.
src/StorageFlattening.cpp Outdated Show resolved Hide resolved
@abadams abadams added the release_notes For changes that may warrant a note in README for official releases. label Oct 25, 2023
@vksnk
Copy link
Member Author

vksnk commented Oct 26, 2023

Huh, looks like a legit test failure on the windows build bot. I can't find much information in the build bot logs other than it was a SEGFAULT and it's not reproducible for me on mac or linux, I guess I will have to push a commit with more debugging info enabled.

@abadams
Copy link
Member

abadams commented Oct 26, 2023

Is it possible the custom malloc is returning something with insufficient alignment on x86-32? I don't see anywhere that should matter though...

@vksnk
Copy link
Member Author

vksnk commented Oct 27, 2023

Is it possible the custom malloc is returning something with insufficient alignment on x86-32? I don't see anywhere that should matter though...

Yes, I think you're right. Thanks a lot for suggestion.

I have reverted debugging commits and re-running the buildbots (there is one failure so far, but I don't think it's related).

@abadams
Copy link
Member

abadams commented Oct 27, 2023

Yes, that failure is because I can't figure out how to set up webgpu testing correctly on one of the x86 mac minis

@vksnk
Copy link
Member Author

vksnk commented Oct 27, 2023

Yes, that failure is because I can't figure out how to set up webgpu testing correctly on one of the x86 mac minis

Everything else is passing now, so should be ready to merge.

@vksnk vksnk merged commit 97573c6 into main Oct 27, 2023
3 checks passed
@vksnk vksnk deleted the vksnk/hoist_storage branch October 27, 2023 21:30
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
* Minimal hoist_storage plumbing

* HoistedStorage placeholder IR node

* Basic hoist_storage test

* Fully plumb through the HoistedStorage node

* IRPrinter for HoistedStorage

* Insert hoisted storage at the correct loop level

* Progress

* Formatted

* Move out common code for creating Allocate node

* Format

* Emit Allocate at the HoistedStorage site

* Collect all dependant vars

* Basic test working

* Progress

* Substitute lets into allocation extents instead of lifting stuff

* Infer bounds for the extends dependant on loop variables

* Update tests

* Remove old code

* Remove old code

* Better tests

* More tests

* Validate schedules with hoist_storage

* Error test

* Fix stupid mistake

* More tests

* Remove debug prints

* Better errors

* Add missing handler for inlined functions

* Format

* Comments

* Format

* Add some missing visit handlers

* New line

* Fix comment

* Luckily we only have two build systems

* Adds hoist_storage_root

* Comment for IR node

* Serialization support for HoistedStorage

* Handle hoist_storage fo tuples

* Handle multiple realize nodes

* Move assert up

* Better error message

* Better loop bounds

* Format

* Updated error message

* Happy clang-tidy happy me

* An error message when compute is inlined, but store is not inlined

* Only mutate lets which are needed

* Update apps to use hoist_storage

Some very minor performance gains, but mostly in the noise.

Also switched the apps makefiles to emit stmt html by default instead of
stmt, to take advantage of the new and improved stmt html.

* Switch to stack of hoisted storages

* Limit scope of lets for expansion

* Break early

* Skip substitute_in_all_lets

* Re-use expanded min/extents

* WebAssembly JIT does not support custom allocators

* Change debug level to get more info about segfault

* More debugging prints

* Let's try aligned malloc

* Revert "Change debug level to get more info about segfault"

This reverts commit a5a689b.

* Revert "More debugging prints"

This reverts commit bb6b8c1.

---------

Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_notes For changes that may warrant a note in README for official releases.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants