Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Hexagon] Add hexagon user DMA intrins for tensorization #13719

Merged
merged 4 commits into from
Jan 17, 2023

Conversation

nverke
Copy link
Contributor

@nverke nverke commented Jan 6, 2023

Added some intrins for user dma on hexagon. Currently these seem to perform worse than all other options used in the test.

@adstraw

@tvm-bot
Copy link
Collaborator

tvm-bot commented Jan 6, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

T.evaluate(
T.tvm_call_packed(
"device_api.hexagon.dma_copy",
0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synchronous DMA uses queue ID -1. See here. This is so as not to interfere with async DMA flow which uses queue IDs starting with 0. Please use queue -1 and add some comments here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"""Generator of dma_load intrins"""

@T.prim_func
def dma_load_desc(a: T.handle, c: T.handle) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like this to be called "sync_dma_load_desc" with some comments to distinguish between async and sync (copy and immediate wait) flow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C[vii] = A[vii]

@T.prim_func
def dma_load_impl(a: T.handle, c: T.handle) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sync_dma_load_impl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T.address_of(C[0], dtype="handle"),
T.address_of(A[0], dtype="handle"),
size,
0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need comments, at least to indicate that this is for bypass. Better would be to tie the setting of this bit to tir.experimental_dma_bypass_cache annotation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to just add a comment about this for now. These intrins don't currently have any planned use so figure if we find one we can add increased functionality.

dtype="int32",
)
)
T.evaluate(T.tvm_call_packed("device_api.hexagon.dma_wait", 0, 0, dtype="int32"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Queue = -1. Comments that Wait(queue, 0) means to wait for the queue to drain which is the sum total of the previous dma_copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# 20 * KB,
# 40 * KB,
# 80 * KB,
# 160 * KB,
# 320 * KB,
640 * KB,
# MB,
# 2 * MB,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to uncomment this? Makes the test run longer in CI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check if running in CI.

number = 1
repeat = 1
number = 10
repeat = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to change this? Makes the test run longer in CI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, should have added a check for CI awhile ago.

@@ -163,3 +204,27 @@ def dot_product_32x2_i16i16i32_vdmpy(a: T.handle, b: T.handle, c: T.handle) -> N

VRMPY_u8i8i32_VTCM_INTRIN = "dot_32x4_u8i8i32_vtcm_vrmpy"
TensorIntrin.register(VRMPY_u8i8i32_VTCM_INTRIN, *generate_dot_product_32x4_u8i8i32("global.vtcm"))

DMA_READ_1_u8 = "dma_read_1_u8"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see users for most of these. Seems like it might be better to delete and allow users to create what is needed based on the test case or schedule?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, removed ✅

@adstraw
Copy link
Contributor

adstraw commented Jan 13, 2023

Nice work. Just wanted to note that we will need to (and have plans to) put our heads together on a unified approach to DMA lowering. Right now, the async DMA flow stems from TIR annotations added by the InjectSoftwarePipe pass. This PR suggests a tensorization approach for synchronous DMA. I feel we could benefit from a unified approach. I recently deleted an old (TE based) tensorization approach for sync DMA scheduling. Glad to have this new TIR based version and it may in fact be the path forward for a unified approach to DMA lowering. Let's put our heads together and get the right design.

@mehrdadh mehrdadh merged commit e25e618 into apache:main Jan 17, 2023
fzi-peccia pushed a commit to fzi-peccia/tvm that referenced this pull request Mar 27, 2023
Added some intrins for user dma on hexagon. Currently these seem to perform worse than all other options used in the test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants