[BugFix][MetaSchedule] MultiLevelTilingTensorCore generates inconsistent thread-binding sketch for batched matmul #17012
The script below can be used to reproduce the issue. You may need to run it multiple times, because `sample_perfect_tile` can sometimes hide the issue depending on the sampled decision.
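Since the original script is not reproduced here, the following is only a hedged sketch of what such a repro could look like (my reconstruction, not the author's script): it tunes a batched fp16 matmul with MetaSchedule on a tensor-core GPU target so that the `MultiLevelTilingTensorCore` rule is exercised. The shapes, the target tag, and the trial budget are placeholders.

```python
# Hedged repro sketch (assumptions: shapes, target tag, trial budget).
import tvm
from tvm import te
from tvm import meta_schedule as ms


def batched_matmul(b, m, n, k, in_dtype="float16", out_dtype="float32"):
    A = te.placeholder((b, m, k), dtype=in_dtype, name="A")
    B = te.placeholder((b, n, k), dtype=in_dtype, name="B")
    kk = te.reduce_axis((0, k), name="kk")
    C = te.compute(
        (b, m, n),
        lambda vb, vi, vj: te.sum(
            A[vb, vi, kk].astype(out_dtype) * B[vb, vj, kk].astype(out_dtype),
            axis=kk,
        ),
        name="C",
    )
    return te.create_prim_func([A, B, C])


mod = batched_matmul(8, 1024, 1024, 1024)
target = tvm.target.Target("nvidia/nvidia-a100")  # any tensor-core-capable target tag
# Run tuning; with an unlucky sample_perfect_tile decision a generated sketch
# binds inconsistent extents to "threadIdx.y" and fails during postproc/build.
ms.tune_tir(mod=mod, target=target, work_dir="./ms_work_dir", max_trials_global=64)
```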
The error log looks like the following.
The root cause is that the Batch loop is treated the same way as the other two spatial loops (the M and N loops), so it is also decomposed in the SSSRRSRS fashion. However, `MultiLevelTilingTensorCoreNode::TransformIntermediateOutputLayout` assumes that the innermost two S levels contain only segments of the M and N loops. As a result, the subsequent `AddWriteReuseTensorCore` adds a "wmma.accumulator" cache-write block, fuses loop vars that belong only to the M and N loops, and binds the fused loop to "threadIdx.y". But the earlier (and first) fused loop bound to "threadIdx.y" contains a segment of the Batch loop, so the two "threadIdx.y" extents are inconsistent. The fix is simple: exclude the outer Batch loop from the `sample_perfect_tile` process and fuse it into "blockIdx.y", as sketched below.
Actually, I also tried a more complex strategy that decomposes the Batch loop into SSS, with the three segments bound to "blockIdx.y", "blockIdx.x", and "threadIdx.y" respectively, so that the innermost two S levels contain no Batch loop segment. But that strategy is less performant for several typical workloads. I think this is because `AddWriteReuseTensorCore` reorders the inner loop vars across the Batch loop segment, which reduces data locality. Below I also paste the trace before the fix (including the postproc trace); you can replay it line by line and print each loop extent to verify the inconsistency, for example as in the sketch after this paragraph.
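For reference, one possible way to replay such a trace and inspect the loop extents (a hedged sketch; the trace file path is a placeholder, and the workload is the hypothetical `batched_matmul` helper from above): the printed trace is a sequence of Python schedule calls on a variable named `sch`, so it can be executed against a fresh schedule of the same workload, after which every thread-bound loop extent can be printed.

```python
# Hedged sketch: replay a pasted MetaSchedule trace and print the extent of
# every thread-bound loop to spot the "threadIdx.y" mismatch.
from tvm import tir

sch = tir.Schedule(batched_matmul(8, 1024, 1024, 1024))
trace_text = open("trace.py").read()  # the pasted trace saved to a file (placeholder path)
exec(compile(trace_text, "trace.py", "exec"), {"sch": sch, "tir": tir})


def visit(node):
    # Print thread tag and extent for every loop bound to a thread axis.
    if isinstance(node, tir.For) and node.kind == tir.ForKind.THREAD_BINDING:
        print(node.thread_binding.thread_tag, node.extent)


tir.stmt_functor.post_order_visit(sch.mod["main"].body, visit)
```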