
[MetaSchedule] Refactor ScheduleRule Attributes #13195

Merged

Conversation

@junrushao junrushao (Member) commented Oct 25, 2022

This PR refactors the existing `schedule_rule` block annotation-based dispatch into a more organized folder structure, which follows the exact same mechanism as AutoTVM.

In the example below, when the target is `Target("llvm -keys=cuda,gpu")` and the block annotation is `conv2d_nchw_winograd_inverse`:

```python
with T.block("some_block"):
    T.block_attr({
        "schedule_rule": "conv2d_nchw_winograd_inverse",
    })
```

the system will find the following global packed functions in order:

- `meta_schedule.cuda.conv2d_nchw_winograd_inverse`
- `meta_schedule.gpu.conv2d_nchw_winograd_inverse`

whose function signatures are:
```python
def schedule_rule(
    sch: tvm.tir.Schedule,
    block: tvm.tir.schedule.BlockRV,
) -> List[tvm.tir.Schedule]:
```
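
For instance, a minimal sketch of hooking a function into this dispatch — the registry name and signature follow the convention above, while the body is only a placeholder:

```python
from typing import List

import tvm
from tvm import tir


# Registered under "meta_schedule.<target key>.<block annotation>"; the
# dispatcher calls this for blocks annotated accordingly. override=True is
# used only because the real function may already be registered.
@tvm.register_func("meta_schedule.cuda.conv2d_nchw_winograd_inverse", override=True)
def _winograd_inverse(sch: tir.Schedule, block: tir.schedule.BlockRV) -> List[tir.Schedule]:
    # Apply target-specific scheduling to `block` here; returning multiple
    # schedules lets the search explore several candidates.
    return [sch]
```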

In terms of code organization, for example, for target key `cuda`, the schedule functions are supposed to be defined in:

- `include/tvm/meta_schedule/schedule/cuda` for public methods
- `src/meta_schedule/schedule/cuda` for private methods
- `python/tvm/meta_schedule/schedule/cuda` for direct Python-side definitions


tvm-bot commented Oct 25, 2022

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@junrushao force-pushed the feature/2022-10-10/autotvm-debug branch 9 times, most recently from 3d002b9 to 25335d1 (November 1, 2022)
@junrushao force-pushed the feature/2022-10-10/autotvm-debug branch 13 times, most recently from 061af8f to 10e2d45 (November 5, 2022)
@junrushao changed the title from [WIP][MetaSchedule] Refactor ScheduleRule Attributes to [MetaSchedule] Refactor ScheduleRule Attributes (November 5, 2022)
@junrushao force-pushed the feature/2022-10-10/autotvm-debug branch from 10e2d45 to 0e16e61 (November 7, 2022)
@junrushao marked this pull request as ready for review (November 7, 2022)
@spectrometerHBH merged commit b16a64d into apache:main (November 7, 2022)
@zxybazh zxybazh (Member) left a comment

LGTM. I find the introduction of ApplyCustomRule pretty helpful -- no need to couple it with PostOrderApply. I only have two nits. And thanks for fixing the typos.

```python
(2 < kh < 8 and 2 < kw < 8 and kh == kw)
N, _, H, W = get_const_tuple(data.shape)
CO, CI, KH, KW = get_const_tuple(kernel.shape)
(_, _, judge_winograd_auto_scheduler) = judge_winograd(
```
Member

Since we are reusing the `judge_winograd` policy for MetaSchedule, it would probably be good to call it `judge_winograd_decision`.

Member Author

Certainly -- it definitely sounds better.

```python
output : tvm.te.Tensor
    4-D with shape [batch, out_height, out_width, out_channel]
"""
tile_size = 4
```
Member

Just to confirm: is `tile_size` fixed to 4 here for simplicity?

Member Author

Not for simplicity, but for consistency with the auto-scheduler's NHWC case a couple of lines above. Note that this configuration is also overridden by the CUDA dispatch, so we don't have to worry too much about it.

```diff
@@ -58,7 +58,7 @@ class EvolutionarySearch(SearchStrategy):
     def __init__(
         self,
         *,
-        population_size: int = 2048,
+        population_size: int = 512,
```
@masahi masahi (Member) commented Nov 8, 2022

cc @zxybazh Is this change intended to go with this PR, or is it a left-over from development?

This change is small and looks unrelated to this PR, but it has huge implications. Tuning time will become much shorter, which I like, but could there be a concern about perf regression due to this change?

I'm asking because I have been working on perf and tuning-time improvements for int8 TC, and after rebasing I'm directly affected by this change.

@junrushao junrushao (Member Author) commented Nov 8, 2022

@masahi Thanks for asking; it's definitely a valid concern. The reason is that we found the value had somehow been set to an unreasonably large number (we were consistently using 512 downstream), and we now want to scale it back, because it negatively affects tuning time. Also note that the population calculation differs slightly from the auto-scheduler's, which includes invalid candidates in the population, while ours always prunes them ahead of time.

We made sure we had proper numbers in hand before merging; please refer to the table below for details:

| Model | Mainline (ms) | This PR (ms) | Difference |
| --- | --- | --- | --- |
| resnet_50 | 1.829128008 | 1.73111331 | 5.66% |
| mobilenet_v2 | 0.4773168361 | 0.4806015715 | -0.68% |
| resnet_18 | 0.678600832 | 0.6301909627 | 7.68% |
| mobilenet_v3 | 0.6649458484 | 0.668666894 | -0.56% |
| wide_resnet_50 | 3.753781691 | 3.077549256 | 21.97% |
| densenet_121 | 2.361973117 | 2.288411393 | 3.21% |
| inception_v3 | 3.53431975 | 3.478368823 | 1.61% |
| resnet3d_18 | 7.791658449 | 7.465777971 | 4.36% |
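
For users who prefer the previous behavior, the old value can still be passed explicitly. A usage sketch, assuming the public constructor path `tvm.meta_schedule.search_strategy.EvolutionarySearch` matching the diff above:

```python
# Sketch: explicitly restoring the previous population size. The keyword-only
# argument matches the constructor shown in the diff; everything else keeps
# its default.
from tvm.meta_schedule.search_strategy import EvolutionarySearch

strategy = EvolutionarySearch(population_size=2048)  # new default is 512
```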

```cpp
}

TVM_REGISTER_NODE_TYPE(ApplyCustomRuleNode);
TVM_REGISTER_GLOBAL("meta_schedule.ScheduleApplyCustomRule")
```
Member

Is this a typo? Shouldn't it be `ScheduleRuleApplyCustomRule`?

`schedule_rule/apply_custom_rule.py` refers to `ScheduleRuleApplyCustomRule`, which doesn't exist.

Member

I think this is a typo, given that it's part of a ScheduleRule and the FFI call here on the Python side has no corresponding registration.
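
The mismatch can be observed from Python as well; a sketch using the global-function registry (with `allow_missing=True`, the lookup returns None instead of raising):

```python
import tvm

# The C++ side registers the (misspelled) name, so this lookup succeeds...
print(tvm.get_global_func("meta_schedule.ScheduleApplyCustomRule", allow_missing=True))
# ...while the name the Python binding expects is not registered -> None.
print(tvm.get_global_func("meta_schedule.ScheduleRuleApplyCustomRule", allow_missing=True))
```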

Member Author

Yep, this is a typo. Feel free to send a quick patch :-)

Member

Just sent a patch #13353

```cpp
applied = (*custom_schedule_fn)(sch, block_rv);
if (!ScheduleRule::IsApplyCustomRule(sch_rule)) {
if (tir::GetAnn<String>(sch->GetSRef(block_rv), "schedule_rule").defined()) {
stack.emplace_back(sch, blocks);
```
Member

It seems this change broke auto-tensorization for VNNI and Hexagon vrmpy. They target TE computes that happen to be annotated with `schedule_rule` (specifically, NCHWc int8 conv2d: https://github.com/apache/tvm/blob/main/python/tvm/topi/nn/conv2d.py#L534-L552). After this change, `MultiLevelTilingWithIntrin` will never be applied to the block corresponding to this TE compute.

So for auto-tensorization, we just want to ignore this annotation.

@junrushao junrushao (Member Author) commented Nov 10, 2022

Ah, I see! I was not aware of the use case where `schedule_rule` is hardcoded but intentionally ignored. In this particular case, how about we add the line below in ApplyCustomRule after the warning is emitted:

```cpp
sch->Unannotate(block, "schedule_rule");
```
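
For illustration, a self-contained Python sketch of the same workaround; the toy workload and names are hypothetical, and only the `schedule_rule` annotation key and the unannotate step come from this thread:

```python
import tvm
from tvm.script import tir as T


@T.prim_func
def annotated(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (4,), "float32")
    B = T.match_buffer(b, (4,), "float32")
    for i in range(4):
        with T.block("some_block"):
            vi = T.axis.spatial(4, i)
            # Hardcoded annotation, as in the NCHWc int8 conv2d TE compute.
            T.block_attr({"schedule_rule": "conv2d_nchw_winograd_inverse"})
            B[vi] = A[vi] + 1.0


sch = tvm.tir.Schedule(annotated)
block = sch.get_block("some_block")
# Python counterpart of sch->Unannotate: drop the annotation so that generic
# rules (e.g. MultiLevelTilingWithIntrin) can still apply to this block.
sch.unannotate(block, "schedule_rule")
```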

Member

Thanks, `Unannotate` fixed the issue.

```cpp
}
}
block.CopyOnWrite()->annotations.Set(topi_attr, new_buffers);
```
@masahi masahi (Member) commented Nov 15, 2022

Is this change intentional? I just found that it broke RewriteLayout when link-params = True. It reverts the change I made in #12991. We need to preserve the `layout_free_placeholders` annotation.

Member Author

The problem is that the Buffer objects in the annotation are not updated during transformations, which is why I removed this line. I was not aware it is needed for the link-params-specific hack.

Feel free to add it back if it's needed. Shall we add a flag in the pass config that defaults to off, so Hexagon-specific users can turn it on when needed?

@masahi masahi (Member) commented Nov 15, 2022

I see. For now I reverted this change in #13384 to unblock tuning with link-params. I don't have a good solution at the moment, other than requiring transformation passes to take the buffers in `layout_free_placeholders` into account.

I'd say the same problem applies to the current approach, which specifies the layout-free buffers by their indices in the parameter list: it is also technically possible (although unlikely) that some transformation changes the signature of the PrimFunc, which would invalidate the parameter indices.
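
For context, a sketch of the annotation pattern under discussion: a TE compute marking one of its placeholders as layout-free via `attrs` (the matmul workload and names here are illustrative, not from the PR):

```python
from tvm import te

A = te.placeholder((128, 128), name="A")
B = te.placeholder((128, 128), name="B")
k = te.reduce_axis((0, 128), name="k")
C = te.compute(
    (128, 128),
    lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
    name="C",
    # Mark the weight B as layout-free, so RewriteLayout may transform it.
    attrs={"layout_free_placeholders": [B]},
)
```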
