[MetaSchedule] Refactor ScheduleRule Attributes #13195
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment. Generated by tvm-bot
This PR refactors the existing `schedule_rule` block annotation-based dispatch into a more organized folder structure, which follows the exact same mechanism as AutoTVM.

In the example below, when the target is `Target("llvm -keys=cuda,gpu")` and the block annotation is `conv2d_nchw_winograd_inverse`:

```python
with T.block("some_block"):
    T.block_attr({
        "schedule_rule": "conv2d_nchw_winograd_inverse",
    })
```

the system will find the following global packed functions, in order:

- `meta_schedule.cuda.conv2d_nchw_winograd_inverse`
- `meta_schedule.gpu.conv2d_nchw_winograd_inverse`

whose function signatures are:

```python
def schedule_rule(
    sch: tvm.tir.Schedule,
    block: tvm.tir.schedule.BlockRV,
) -> List[tvm.tir.Schedule]:
```

In terms of code organization, for example, for target key `cuda`, the schedule functions are supposed to be defined in:

- `include/tvm/meta_schedule/schedule/cuda` for public methods
- `src/meta_schedule/schedule/cuda` for private methods
- `python/tvm/meta_schedule/schedule/cuda` for direct Python-side definitions
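For illustration, a minimal sketch (not part of this PR's diff) of how such a packed function could be defined and registered from the Python side, following the naming convention above; the body is a placeholder rather than a real winograd schedule:

```python
from typing import List

import tvm


@tvm.register_func("meta_schedule.cuda.conv2d_nchw_winograd_inverse")
def schedule_rule(
    sch: tvm.tir.Schedule,
    block: tvm.tir.schedule.BlockRV,
) -> List[tvm.tir.Schedule]:
    # Target-specific scheduling for the annotated block goes here;
    # returning a list lets the rule propose multiple schedule variants.
    return [sch]
```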
LGTM. I find the introduction of `ApplyCustomRule` pretty helpful -- no need to couple that with `PostOrderApply`. I only have two nits. And thanks for fixing the typos.
```python
(2 < kh < 8 and 2 < kw < 8 and kh == kw)
N, _, H, W = get_const_tuple(data.shape)
CO, CI, KH, KW = get_const_tuple(kernel.shape)
(_, _, judge_winograd_auto_scheduler) = judge_winograd(
```
Since we are reusing the policy of `judge_winograd` for MS, it's probably good to call it `judge_winograd_decision`.
Certainly - it definitely sounds better.
```python
    output : tvm.te.Tensor
        4-D with shape [batch, out_height, out_width, out_channel]
    """
    tile_size = 4
```
Just to confirm: is `tile_size` limited to 4 here for simplicity?
Not for simplicity, but for consistency with the auto-scheduler's NHWC case a couple of lines above. Note that this configuration is also overridden by CUDA dispatch, so we don't have to worry too much about it.
```diff
@@ -58,7 +58,7 @@ class EvolutionarySearch(SearchStrategy):
     def __init__(
         self,
         *,
-        population_size: int = 2048,
+        population_size: int = 512,
```
cc @zxybazh Is this change intended to go with this PR, or is it a left-over from development?

This change is small and looks unrelated to this PR, but it has huge implications. Tuning time will become much shorter, which I like, but could there be a concern about perf regression due to this change?

I'm asking because I have been working on perf and tuning-time improvements for int8 TC, and after rebasing I'm directly affected by this change.
@masahi Thanks for asking; it's definitely a valid concern. The reason is that we found it was somehow set to an unreasonably large number (we were consistently using 512 downstream), and we now want to scale it back because it negatively affects tuning time. Also note that the population calculation is slightly different from AutoScheduler's, which includes invalid candidates in the population, while ours always prunes ahead of time.

We did make sure we had proper numbers in hand before merging; please refer to the table below for details:
| Model | Mainline (ms) | This PR (ms) | Difference |
| --- | --- | --- | --- |
| resnet_50 | 1.829128008 | 1.73111331 | 5.66% |
| mobilenet_v2 | 0.4773168361 | 0.4806015715 | -0.68% |
| resnet_18 | 0.678600832 | 0.6301909627 | 7.68% |
| mobilenet_v3 | 0.6649458484 | 0.668666894 | -0.56% |
| wide_resnet_50 | 3.753781691 | 3.077549256 | 21.97% |
| densenet_121 | 2.361973117 | 2.288411393 | 3.21% |
| inception_v3 | 3.53431975 | 3.478368823 | 1.61% |
| resnet3d_18 | 7.791658449 | 7.465777971 | 4.36% |
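For anyone who prefers the previous behavior, the default can simply be overridden at construction time; a minimal sketch using the keyword-only constructor shown in the diff above:

```python
from tvm.meta_schedule.search_strategy import EvolutionarySearch

# Explicitly restore the old default; all other parameters keep their defaults.
strategy = EvolutionarySearch(population_size=2048)
```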
```c++
}

TVM_REGISTER_NODE_TYPE(ApplyCustomRuleNode);
TVM_REGISTER_GLOBAL("meta_schedule.ScheduleApplyCustomRule")
```
Is this a typo? Shouldn't it be `ScheduleRuleApplyCustomRule`? `schedule_rule/apply_custom_rule.py` refers to `ScheduleRuleApplyCustomRule`, which doesn't exist.
I think this is a typo, given that it's part of a `ScheduleRule` and the FFI function call here on the Python side doesn't have a corresponding registry entry.
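A quick way to see the failure mode (a hedged sketch, not from this PR): the Python FFI resolves packed functions by their registered global name, so a lookup under the expected name comes back empty:

```python
import tvm

# Returns None when nothing was registered under this name -- which is what
# happens while the C++ side registers "meta_schedule.ScheduleApplyCustomRule".
func = tvm.get_global_func("meta_schedule.ScheduleRuleApplyCustomRule", allow_missing=True)
print(func is None)  # True before the typo fix
```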
Yep this is a typo. Feel free to send a quick patch :-)
Just sent a patch #13353
```c++
applied = (*custom_schedule_fn)(sch, block_rv);
if (!ScheduleRule::IsApplyCustomRule(sch_rule)) {
  if (tir::GetAnn<String>(sch->GetSRef(block_rv), "schedule_rule").defined()) {
    stack.emplace_back(sch, blocks);
```
It seems this change broke auto-tensorization for VNNI and Hexagon `vrmpy`. They target TE computes that happen to be annotated with `schedule_rule` (specifically, NCHWc int8 conv2d, https://github.com/apache/tvm/blob/main/python/tvm/topi/nn/conv2d.py#L534-L552). After this change, `MultiLevelTilingWithIntrin` will never be applied to the block corresponding to this TE compute.

So for auto-tensorization, we just want to ignore this annotation.
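For context, a toy sketch of the pattern being described: a TE compute carrying a hard-coded `schedule_rule` annotation (the annotation value here is illustrative; the real one lives in `topi/nn/conv2d.py`):

```python
from tvm import te

n = te.var("n")
a = te.placeholder((n,), name="a", dtype="float32")
# The attrs dict becomes block annotations after lowering, so this block
# would be routed to custom-rule dispatch instead of MultiLevelTilingWithIntrin.
out = te.compute(
    (n,),
    lambda i: a[i] + 1.0,
    name="out",
    attrs={"schedule_rule": "meta_schedule.conv2d_NCHWc_int8"},
)
```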
Ah I see! I was not aware of the use case where `schedule_rule` is hardcoded but intentionally ignored. In this particular case, how about we add a line below in `ApplyCustomRule` after the warning is emitted: `sch->Unannotate(block, "schedule_rule");`
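In Python terms, the equivalent step looks like this (a hedged sketch on a toy PrimFunc, not the actual `ApplyCustomRule` code):

```python
import tvm
from tvm.script import tir as T


@T.prim_func
def func(a: T.handle) -> None:
    A = T.match_buffer(a, (16,), "float32")
    for i in range(16):
        with T.block("blk"):
            vi = T.axis.spatial(16, i)
            T.block_attr({"schedule_rule": "some_custom_rule"})
            A[vi] = 0.0


sch = tvm.tir.Schedule(func)
# Drop the annotation so later rules (e.g. tiling) still consider the block.
sch.unannotate(sch.get_block("blk"), "schedule_rule")
```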
Thanks, `Unannotate` fixed the issue.
```c++
  }
}
block.CopyOnWrite()->annotations.Set(topi_attr, new_buffers);
```
Is this change intentional? I just found that it broke `RewriteLayout` when `link-params=True`. It reverts the change I made in #12991. We need to preserve the `layout_free_placeholders` annotation.
The problem is that the Buffer objects in the annotation are not updated when doing transformations, which is why I removed this line. I was not aware it was useful for the link-params-specific hack.

Feel free to add it back if it's needed. Shall we add a flag in the pass config that defaults to off, so Hexagon-specific users can turn it on when needed?
I see. For now I reverted this change in #13384 to unblock tuning with link-params. I don't have a good solution at the moment, other than requiring transformation passes to take the buffers in `layout_free_placeholders` into account.

I'd say the same problem applies to the current approach, which specifies the layout-free buffers by their indices in the parameter list: it is also technically possible (although unlikely) that some transformation changes the signature of the PrimFunc, which would invalidate the parameter indices.
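To illustrate the fragility being pointed out, a sketch of the index-based marker (the attribute name follows TVM's convention; treat the snippet as illustrative):

```python
from tvm import te

A = te.placeholder((128, 128), name="A")
B = te.placeholder((128, 128), name="B")
k = te.reduce_axis((0, 128), name="k")
C = te.compute((128, 128), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

func = te.create_prim_func([A, B, C])
# Mark parameter 1 (B) as layout-free by index. Any pass that reorders or
# rewrites the PrimFunc's parameter list would silently invalidate this index.
func = func.with_attr("layout_free_buffers", [1])
```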