
[TIR][Schedule] DecomposePadding #12174

Merged · 1 commit into apache:main · Jul 27, 2022

Conversation

wrongtest-intellif (Contributor)

Hi there, this PR introduces a new TIR schedule primitive, sch.decompose_padding(block, loop).

For padded conv or pooling ops, there is a typical padding pattern:

Pad[_] = T.if_then_else(pad_predicate, X[_], pad_value)

which could be decomposed into two parts:

  • One filling the pad value with a memset routine
  • One filling the in-bound values with a memcpy routine

The primitive's signature is similar to decompose_reduction: it takes a target block and a loop position at which to insert the newly created "init" block. It is helpful for infrastructures with high-performance memset/memcpy routines, and it lifts the complexity of handling padding conditions out of the main compute block (see the usage sketch after the example below).

Example

  • Before
@T.prim_func
def before_decompose(x: T.Buffer[128, "int32"], y: T.Buffer[140, "int32"]):
    for i in range(140):
        with T.block("block"):
            vi = T.axis.remap("S", [i])
            y[vi] = T.if_then_else(vi >= 6 and vi < 134, x[vi - 6], 0, dtype="int32")
  • After decompose_padding(block, i)
@T.prim_func
def after_decompose(x: T.Buffer[128, "int32"], y: T.Buffer[140, "int32"]):
    for i in T.serial(140):
        with T.block("block_pad_const"):
            vi = T.axis.spatial(140, i)
            y[vi] = 0
    for i in T.serial(128):
        with T.block("block"):
            vi = T.axis.spatial(128, i)
            y[vi + 6] = x[vi]
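
For reference, here is a minimal sketch of driving the transformation through the schedule API. Assumptions: a TVM build that includes this primitive, before_decompose wrapped as a TVMScript PrimFunc (via from tvm.script import tir as T), and the primitive exposed as tir.Schedule.decompose_padding(block, loop) as described above:

import tvm
from tvm import tir

# Create a schedule over the padded PrimFunc shown in "Before".
sch = tir.Schedule(before_decompose)
pad_block = sch.get_block("block")
# The single spatial loop `i`; pass it as the position for the new pad-const block.
(i,) = sch.get_loops(pad_block)
sch.decompose_padding(pad_block, i)
# The resulting module should roughly match the "After" function above.
print(sch.mod.script())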

Alternatives and drawbacks

  • From the graph perspective, one may be able to fold out the block that pads the input buffer entirely. The primitive is more useful when one wants to perform padding on intra-PrimFunc buffers.

  • One could also compute-inline the block that performs the padding, but this introduces conditions into the main computation block, which may or may not get optimized well depending on the concrete target.

  • Currently there are scheduling limitations on the blocks created by the decomposition: they cannot subsequently be compute-at-ed or compute-inlined, because multiple blocks writing to the same buffer break the stage-pipeline property.

@vinx13 (Member) commented Jul 25, 2022

cc @Lunderberg

@vinx13 self-assigned this Jul 25, 2022
@Hzfengsy (Member) left a comment


LGTM. cc @junrushao1994

@vinx13 merged commit ca30e5e into apache:main Jul 27, 2022
wrongtest-intellif added a commit that referenced this pull request Aug 30, 2022
Co-authored-by: baoxinqi <wrongtest@intellif.com>
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
Co-authored-by: baoxinqi <wrongtest@intellif.com>