[Hexagon] Async DMA pipelining test suite #13005
Conversation
Looks good overall! Nice work. Requesting a few small changes. And, happy to work with you on the request to schedule the mem_copy intrinsics rather than hard-coding them.
    )
    return expected_output


def get_single_dma_schedule(size_a, size_w):
This schedule looks correct. But we also have to duplicate the compute statement here and hand-code the mem_copy intrinsics for synchronous DMA. Wondering if we can schedule the mem_copy intrinsics rather than hard-coding them. There is a TE example of how to do this here. Could we apply that example to this test?
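For what it's worth, a rough sketch of the direction I have in mind, using the TensorIR schedule API rather than TE (the block name "C" and the buffer indices are assumptions, not taken from the linked example):

import tvm
from tvm import tir

def schedule_vtcm_copies(mod: tvm.IRModule) -> tir.Schedule:
    # Stage the operands through VTCM as scheduled copy stages instead of
    # hand-coded mem_copy intrinsics (block name "C" is an assumption).
    sch = tir.Schedule(mod)
    block = sch.get_block("C")
    sch.cache_read(block, 0, "global.vtcm")   # copy A: DDR -> VTCM
    sch.cache_read(block, 1, "global.vtcm")   # copy W: DDR -> VTCM
    sch.cache_write(block, 0, "global.vtcm")  # copy C: VTCM -> DDR
    return sch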
Attempted this and was unable to get the tensor desc to match because of some weirdness 😔.
    print()


class TestMatMulVec:
Nit: Class name needs updating.
Done! ✅
    T.reinterpret(W[vi, T.ramp(0, 1, 128)], dtype="int32x32"),
    dtype="int32x32",
)
T.evaluate(
You might add a comment here that this is necessary only for purposes of getting accurate timings. That is, we want to flush all async DMAs before we stop the clock to check perf.
Good idea. Added ✅
    number = 1
    repeat = 1
else:
    number = 100
Magic number 100: Should this be a parameter (even if a single value) or a constant?
This one seems a bit circular, since it would be:

timer_number = 100
timer_repeat = 100
if tvm.testing.utils.IS_IN_CI:
    # These are reduced for CI
    number = 1
    repeat = 1
else:
    number = timer_number
    repeat = timer_repeat
timer = module.time_evaluator(
    "__tvm_main__", hexagon_session.device, number=number, repeat=repeat
)
Which makes for quite a bit of unnecessary code. Also, this is to some extent a magic number; I just chose it from experience, and it is not tied to any functionality. The other option is I could do this:
if tvm.testing.utils.IS_IN_CI:
    # These are reduced for CI
    timer = module.time_evaluator(
        "__tvm_main__", hexagon_session.device, number=1, repeat=1
    )
else:
    timer = module.time_evaluator(
        "__tvm_main__", hexagon_session.device, number=100, repeat=100
    )
But I'm not sure that is better. Anyway, let me know what you think.
I like option B.
Updated! ✅
@tvm.testing.fixture
def input_a(size_a):
    return default_rng().integers(0, 8, (size_a, 128), dtype="uint8")
Magic number 128: Should this be a parameter (even if a single value) or a constant?
Changed all occurrences of 128 and 32 to be constants. Makes it a little more verbose but also a little clearer. Good idea! ✅
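For the record, a sketch of the shape this took (the constant names here are assumptions, not necessarily those committed):

import tvm.testing
from numpy.random import default_rng

VRMPY_SIZE_B = 128     # bytes read per vrmpy intrinsic (one 1024-bit HVX vector)
VRMPY_SIZE_INT32 = 32  # int32 accumulator lanes produced per vrmpy intrinsic

@tvm.testing.fixture
def input_a(size_a):
    return default_rng().integers(0, 8, (size_a, VRMPY_SIZE_B), dtype="uint8")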
T.func_attr({"global_symbol": "main", "tir.noalias": True}) | ||
A = T.match_buffer(a, [size_a, 128], dtype="uint8", align=128, mem_scope="global") | ||
W = T.match_buffer(b, [size_w, 128], dtype="uint8", align=128, mem_scope="global") | ||
C = T.match_buffer(c, [size_a, 32], dtype="int32", align=128, mem_scope="global") |
Magic number 32: Should this be a parameter (even if a single value) or a constant?
Changed where possible! T.ramp() does not allow variables. ✅
sch = get_single_dma_schedule(size_a, size_w)
single_dma_runtime = evaluate(hexagon_session, sch, input_a, input_w, size_a, expected_output)

transfer_mb = round((2 * size_a * 128 + size_w * 128) / 1e6, 2)
Some comments here about this math would be good.
✅
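For reference, one plausible reading of the math behind the added comments (the exact wording of the committed comments is not reproduced here):

# A (size_a rows x 128 B, uint8) is copied DDR -> VTCM, and C (size_a rows x
# 32 int32 = 128 B) is copied VTCM -> DDR, hence the 2 * size_a term; W
# (size_w rows x 128 B) is copied in once. Dividing by 1e6 converts bytes to MB.
transfer_mb = round((2 * size_a * 128 + size_w * 128) / 1e6, 2)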
single_dma_runtime = evaluate(hexagon_session, sch, input_a, input_w, size_a, expected_output)

transfer_mb = round((2 * size_a * 128 + size_w * 128) / 1e6, 2)
complexity = round(size_a * size_w * (128 * 4) / 1e9, 3)
Same as above.
✅
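And a hedged reading of the second line (the per-byte op-count convention here is an assumption, not taken from the committed comment):

# Each of the size_a * size_w vrmpy updates touches 128 bytes; the 4x factor
# is taken here as the ops counted per byte, and 1e9 scales to billions of ops.
complexity = round(size_a * size_w * (128 * 4) / 1e9, 3)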
LGTM!
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment. Generated by tvm-bot
* [Hexagon] Add tests to show how to properly utilize async dma pipelining on hexagon.
* Formatting updates.
* Update comments and reformatting.
* Skip long tests in CI.
A_global_vtcm = T.alloc_buffer(a_shape, dtype="uint8", mem_scope="global.vtcm")
W_global_vtcm = T.alloc_buffer(w_shape, dtype="uint8", mem_scope="global.vtcm")
C_global_vtcm = T.alloc_buffer(out_shape, dtype="int32", mem_scope="global.vtcm")
@nverke just wanted to report an issue that I happened to detect using the new TVMScript parser: T.alloc_buffer doesn't accept the parameter mem_scope; instead it uses scope. The current generation of the TVMScript parser doesn't report any error for such misuse (which is weird) but instead assumes scope="global". To debug what's happening, you can print out the method using operator.show, which gives:
@T.prim_func
def func(a_buffer: T.Buffer[(1024, 128), "uint8"], w_buffer: T.Buffer[(1, 128), "uint8"], c_buffer: T.Buffer[(1024, 32), "int32"]):
    # function attr dict
    T.func_attr({"global_symbol": "main", "tir.noalias": True})
    # body
    # with T.block("root")
    a_global_vtcm = T.alloc_buffer([1024, 128], dtype="uint8")  ## <==== note it's "global" rather than "global.vtcm"
    w_global_vtcm = T.alloc_buffer([1, 128], dtype="uint8")
    c_global_vtcm = T.alloc_buffer([1024, 32], dtype="int32")
    T.evaluate(T.tvm_call_packed("device_api.hexagon.mem_copy_DLTensor", T.tvm_stack_make_array(a_global_vtcm.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.tvm_stack_make_array(a_buffer.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.Cast("int32", 131072), dtype="int32"))
    T.evaluate(T.tvm_call_packed("device_api.hexagon.mem_copy_DLTensor", T.tvm_stack_make_array(w_global_vtcm.data, T.tvm_stack_make_shape(1, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.tvm_stack_make_array(w_buffer.data, T.tvm_stack_make_shape(1, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.Cast("int32", 128), dtype="int32"))
    for n, index_0 in T.grid(1024, 1):
        with T.block("c_buffer"):
            vn_index, vi_index = T.axis.remap("SR", [n, index_0])
            T.reads(a_global_vtcm[vn_index, 0:128], w_global_vtcm[vi_index, 0:128], c_global_vtcm[vn_index, 0:32])
            T.writes(c_global_vtcm[vn_index, 0:32])
            with T.init():
                for x in T.serial(32):
                    c_global_vtcm[vn_index, x] = 0
            c_global_vtcm[vn_index, 0:32] = c_global_vtcm[vn_index, 0:32] + T.call_llvm_intrin(3885, T.uint32(2), T.reinterpret(a_global_vtcm[vn_index, 0:128], dtype="int32x32"), T.reinterpret(w_global_vtcm[vi_index, 0:128], dtype="int32x32"), dtype="int32x32")
    T.evaluate(T.tvm_call_packed("device_api.hexagon.mem_copy_DLTensor", T.tvm_stack_make_array(c_buffer.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "int32", 0, dtype="handle"), T.tvm_stack_make_array(c_global_vtcm.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "int32", 0, dtype="handle"), T.Cast("int32", 131072), dtype="int32"))
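If that diagnosis is right, the fix is presumably a one-keyword change in the test (a sketch, assuming scope is the accepted parameter as described above):

A_global_vtcm = T.alloc_buffer(a_shape, dtype="uint8", scope="global.vtcm")
W_global_vtcm = T.alloc_buffer(w_shape, dtype="uint8", scope="global.vtcm")
C_global_vtcm = T.alloc_buffer(out_shape, dtype="int32", scope="global.vtcm")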
[Hexagon] Add tests to show how to properly utilize async dma pipelining on hexagon.
The purpose of this test is to show how to create a pipeline that utilizes async DMA copies to VTCM for a performance speedup. It compares performance results between several different schedules and should serve as a good starting point for others wishing to take advantage of the new features.
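As a rough illustration of the mechanism the schedules exercise, the async copies are driven by the TensorIR software-pipelining annotations (a sketch; the loop and block names, and the stage assignment, are hypothetical rather than the test's actual schedule):

import tvm
from tvm import tir

def annotate_async_pipeline(mod: tvm.IRModule) -> tir.Schedule:
    sch = tir.Schedule(mod)
    pipeline_loop = sch.get_loops(sch.get_block("C"))[0]
    # Three pipeline stages under the loop: copy-in, compute, copy-out,
    # kept in program order.
    sch.annotate(pipeline_loop, "software_pipeline_stage", [0, 1, 2])
    sch.annotate(pipeline_loop, "software_pipeline_order", [0, 1, 2])
    # Mark the copy stages as asynchronous so DMA overlaps the compute stage.
    sch.annotate(pipeline_loop, "software_pipeline_async_stages", [0, 2])
    return sch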
Each column specifies the data copy method used for the Activation, Weight, and Output vectors, respectively, with the following options:
N = No copying to VTCM
S = Synchronous DMA copies to VTCM
B = Basic/Naive copies to VTCM
A = Asynchronous DMA copies to VTCM
For example, B, B, A uses naive copies for the Activation and Weight input vectors and async DMA copies for the output vector (VTCM -> DDR).
cc @adstraw @tmoreau89