-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Hexagon] Add hexagon user DMA intrins for tensorization #13719
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot |
T.evaluate( | ||
T.tvm_call_packed( | ||
"device_api.hexagon.dma_copy", | ||
0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synchronous DMA uses queue ID -1. See here. This is so as not to interfere with async DMA flow which uses queue IDs starting with 0. Please use queue -1 and add some comments here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅
"""Generator of dma_load intrins""" | ||
|
||
@T.prim_func | ||
def dma_load_desc(a: T.handle, c: T.handle) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like this to be called "sync_dma_load_desc" with some comments to distinguish between async and sync (copy and immediate wait) flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅
C[vii] = A[vii] | ||
|
||
@T.prim_func | ||
def dma_load_impl(a: T.handle, c: T.handle) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sync_dma_load_impl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅
T.address_of(C[0], dtype="handle"), | ||
T.address_of(A[0], dtype="handle"), | ||
size, | ||
0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need comments, at least to indicate that this is for bypass. Better would be to tie the setting of this bit to tir.experimental_dma_bypass_cache
annotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to just add a comment about this for now. These intrins don't currently have any planned use so figure if we find one we can add increased functionality.
dtype="int32", | ||
) | ||
) | ||
T.evaluate(T.tvm_call_packed("device_api.hexagon.dma_wait", 0, 0, dtype="int32")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Queue = -1. Comments that Wait(queue, 0) means to wait for the queue to drain which is the sum total of the previous dma_copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅
# 20 * KB, | ||
# 40 * KB, | ||
# 80 * KB, | ||
# 160 * KB, | ||
# 320 * KB, | ||
640 * KB, | ||
# MB, | ||
# 2 * MB, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to uncomment this? Makes the test run longer in CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a check if running in CI.
number = 1 | ||
repeat = 1 | ||
number = 10 | ||
repeat = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to change this? Makes the test run longer in CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, should have added a check for CI awhile ago.
@@ -163,3 +204,27 @@ def dot_product_32x2_i16i16i32_vdmpy(a: T.handle, b: T.handle, c: T.handle) -> N | |||
|
|||
VRMPY_u8i8i32_VTCM_INTRIN = "dot_32x4_u8i8i32_vtcm_vrmpy" | |||
TensorIntrin.register(VRMPY_u8i8i32_VTCM_INTRIN, *generate_dot_product_32x4_u8i8i32("global.vtcm")) | |||
|
|||
DMA_READ_1_u8 = "dma_read_1_u8" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see users for most of these. Seems like it might be better to delete and allow users to create what is needed based on the test case or schedule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, removed ✅
Nice work. Just wanted to note that we will need to (and have plans to) put our heads together on a unified approach to DMA lowering. Right now, the async DMA flow stems from TIR annotations added by the InjectSoftwarePipe pass. This PR suggests a tensorization approach for synchronous DMA. I feel we could benefit from a unified approach. I recently deleted an old (TE based) tensorization approach for sync DMA scheduling. Glad to have this new TIR based version and it may in fact be the path forward for a unified approach to DMA lowering. Let's put our heads together and get the right design. |
Added some intrins for user dma on hexagon. Currently these seem to perform worse than all other options used in the test.
Added some intrins for user dma on hexagon. Currently these seem to perform worse than all other options used in the test.
@adstraw