
[TIR] Implement API for padded layout transformations #12720

Merged
27 commits merged into apache:main on Sep 19, 2022

Conversation

Lunderberg
Contributor

@Lunderberg Lunderberg commented Sep 6, 2022

Implementation of the API in tvm.tir.schedule for layout transformations with padding, as part of #12261, item "Insert pad value into generated TIR, using tir::if_then_else, builtin::assume, and builtin::undef".

Following the RFC discussion in apache/tvm-rfcs#77, this commit preferentially rewrites the loops that surround a padded transformation where possible, in order to express the padding in terms of tir::if_then_else.
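
For concreteness (this example is not from the PR itself), here is a minimal sketch of how the new pad_value argument might be exercised from Python; the workload, buffer names, and the padded (4, 4) target layout are invented purely for illustration.

import tvm
from tvm import te, tir

# Hypothetical 1-D workload; shapes and names are illustrative only.
A = te.placeholder((14,), name="A", dtype="int32")
B = te.compute((14,), lambda i: A[i] + 1, name="B")
func = te.create_prim_func([A, B])

sch = tir.Schedule(func)
block = sch.get_block("B")

# Transform B from a flat (14,) layout to a padded (4, 4) layout, filling the
# two padding elements with 0.  Where the surrounding loops can be rewritten,
# the pad value is expressed through a tir::if_then_else in the producer loop.
sch.transform_layout(
    block,
    buffer=("write", 0),
    index_map=lambda i: [i // 4, i % 4],
    pad_value=0,
)
print(sch.mod.script())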

cc @Hzfengsy @junrushao1994

@Lunderberg Lunderberg requested a review from vinx13 September 6, 2022 21:27
@Hzfengsy
Member

Hzfengsy commented Sep 7, 2022

cc @wrongtest-intellif @vinx13

Unless specifically testing opaque blocks, all unit tests for the
transform layout scheduling primitive now operate on non-opaque
blocks.
include/tvm/tir/schedule/schedule.h
Comment on lines +2493 to +2495
If an IndexMap or Callable, the transformation is the
value to be present in the padding in terms of the
transformed index.
Member

The C++ side only accepts Optional[PrimExpr]; it seems this is not supported?

Contributor Author

Good point. I had been thinking of it as the (const Array<Var>&, const Array<PrimExpr>&) call signature on the TE side for the transformation, and was avoiding introducing additional structures. I had forgotten that the TIR schedule accepts an IndexMap for the transformation, and agree that the C++ side would be better expressed as an Optional<IndexMap> instead.

Contributor Author

Updates made to pass Optional<IndexMap> pad_value throughout the C++ API, mimicking how IndexMap index_map is passed, along with a unit test to validate the functionality.
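
Continuing the earlier sketch (still illustrative, and assuming the callable form described in the docstring below), the pad value itself can be supplied as a callable over the transformed indices; on the Python side it is converted to an IndexMap before crossing the FFI boundary as Optional<IndexMap>, mirroring how index_map is passed.

sch2 = tir.Schedule(func)
block2 = sch2.get_block("B")
sch2.transform_layout(
    block2,
    buffer=("write", 0),
    index_map=lambda i: [i // 4, i % 4],
    # Padding elements receive a value computed from the transformed indices
    # (io, ii); a constant such as pad_value=0 works the same way.
    pad_value=lambda io, ii: io * 4 + ii,
)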

Comment on lines 584 to 588
std::vector<WriteInfo> write_info_;
std::vector<For> active_loops_;
std::unordered_map<const VarNode*, std::pair<size_t, size_t>> loop_depth_lookup_;
std::unordered_map<const VarNode*, PrimExpr> active_let_bindings_;
Optional<BlockRealize> innermost_block_realize_{NullOpt};
Member

document these fields

Contributor Author

Thank you, and documentation added here for member vars, along with how they are used when collecting WriteInfo.

#include "../../../arith/ir_mutator_with_analyzer.h"
#include "../utils.h"

namespace tvm {
namespace tir {

class LayoutTransformPlanner : private StmtExprVisitor {
Member

document the high level algorithm

Contributor Author

Thank you, and documentation has been added here for the general algorithm, and for when each handling of padding may be used.

Specifically calling attention to how `pad_value` interacts with input
buffers: correctness depends on the calling scope providing the
specified `pad_value`.
The previous name `LayoutTransformPlanner` didn't follow the pattern
of `TransformLayoutWriter`.  Therefore, renaming to
`TransformLayoutPlanner`.
@Lunderberg
Contributor Author

Looks like the final failing unit test is due to an incorrect mapping in tir.tensor_intrin.cuda.shared_32x16_to_ldmatrix_32x16_layout. It currently returns [(i % 4) + 4 * (j % 8), 8 * (j // 8) + (i // 16) * 4 + i % 4], which doesn't fill all 32x16 indices, and fails when NonSurjectiveInverse is applied to it.

@@ -36,7 +36,7 @@ def shared_16x32_to_ldmatrix_32x16_layout(i, j):


 def shared_32x16_to_ldmatrix_32x16_layout(i, j):
-    thread_id = (i % 4) + 4 * (j % 8)
+    thread_id = (i % 16) // 4 + 4 * (j % 8)
Member

cc @masahi

Contributor Author

Thank you for tagging @masahi, I had forgotten to do so. I think I have it set up correctly, based on Nvidia documentation and similarity to the (16,32) shape, but couldn't verify definitively.


Member

@masahi masahi Sep 13, 2022

Ah sorry, I was talking about shared_16x16_to_ldmatrix_32x8_layout. I need to remember how I came up with shared_32x16_to_ldmatrix_32x16_layout. I think it is used for int8 MMA.

Member

Even if the index map is incorrect, it doesn't affect the correctness of tensorized MMA, since the index map is only used for pattern-matching purposes...

Contributor Author

Thank you for looking into it! I wasn't able to find any tests that explicitly validate the transform (e.g. use the transform to generate data in a specific layout, then pass through the mma), as all the tests either started with transformed data, only used the 16x16 shape, or replaced everything with the tensor intrinsic.

I had put together this standalone test to convince myself of it. The main issue with the current index map is that it doesn't map to unique locations (512 input indices map to only 128 output indices). It only arose as an issue in this PR because this PR generates the inverse in order to determine whether and where padding is required.
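
To make the non-uniqueness concrete, a quick brute-force count over the 32x16 index space (this is not the author's linked test, just a sketch; the second output coordinate is assumed unchanged by the fix, since the diff above only touches the thread_id expression):

def old_layout(i, j):
    thread_id = (i % 4) + 4 * (j % 8)
    return thread_id, 8 * (j // 8) + (i // 16) * 4 + i % 4

def fixed_layout(i, j):
    thread_id = (i % 16) // 4 + 4 * (j % 8)
    return thread_id, 8 * (j // 8) + (i // 16) * 4 + i % 4

coords = [(i, j) for i in range(32) for j in range(16)]
print(len({old_layout(i, j) for i, j in coords}))    # 128 distinct outputs for 512 inputs
print(len({fixed_layout(i, j) for i, j in coords}))  # 512, i.e. a bijection on the 32x16 grid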

Previous version mapped the 512 input indices in a `(32,16)` array to
only 128 output indices.  This wasn't caught before, because the
bijectivity assertion was only triggered for TE schedules.
Member

@vinx13 vinx13 left a comment

Overall LGTM, just some comments

)

try:
    iter(mapping)
Member

What's the use case for this? According to the doc, the mapping function should return a List, so the doc might also need an update.

Contributor Author

This was to allow the mapping function to return a single PrimExpr, or something that the FFI can convert into a PrimExpr. Since it wouldn't make sense for the pad value to provide multiple outputs, I found myself frequently writing lambda i, j: i + j instead of lambda i, j: [i + j]. Since I was frequently making that mistake, later users would likely make it as well, so it seemed best to support that functionality.

Good call on the documentation, and I'll update the documentation for from_func and from_func_with_separators accordingly.
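
A small illustration of the resulting behavior as I understand it (the names here are only for demonstration): the mapping function passed to IndexMap.from_func may return either a list of expressions or a bare expression, with the latter wrapped into a single-output mapping.

from tvm import tir

as_list = tir.IndexMap.from_func(lambda i, j: [i + j])  # documented form: return a list
as_expr = tir.IndexMap.from_func(lambda i, j: i + j)    # also accepted: a single expression
assert len(as_expr.final_indices) == 1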

@@ -2479,6 +2480,31 @@ def transform_layout(
primitive will be called in addition to the
TransformLayout primitive.

pad_value: Optional[Union[int, float, PrimExpr, IndexMap, Callable]]
Member

Document the assumption when pad_value is an IndexMap. I remember that in the RFC we assumed it should contain no BufferLoad from any buffer except the current buffer.

Contributor Author

Thank you, and the docstring has been updated. I've also added two unit tests, one that validates that an error is raised when the pad value loads from a different buffer, and one that specifies the intended behavior for pad value that loads from the transformed buffer. The latter is currently marked with pytest.mark.xfail, as the support isn't implemented yet.

@Lunderberg
Contributor Author

@Hzfengsy Can you review/verify that the requested changes (use non-opaque blocks in unit tests) have been made? I think that's the only item remaining on the PR.

@Lunderberg Lunderberg merged commit 2af9b90 into apache:main Sep 19, 2022
@Lunderberg Lunderberg deleted the padded_layout_api branch September 19, 2022 13:20
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022