
[Hexagon] Async DMA pipelining test suite #13005

Merged: 4 commits merged into apache:main on Oct 17, 2022
Conversation

@nverke (Contributor) commented Oct 6, 2022:

Add tests to show how to properly utilize async DMA pipelining on Hexagon.

The purpose of this test suite is to show how to create a pipeline that uses async DMA copies to VTCM for a performance speedup. It compares performance results across several different schedules and should serve as a good starting point for others wanting to take advantage of the new features.

| Approximated Activation Shape | Approximated Weight Shape | Approximated complexity (GOPS) | Total Memory Transferred (MB) | N, N, N (ms) | S, S, S (ms) | B, B, B (ms) | A, B, B (ms) | B, B, A (ms) | A, B, A (ms) | Async DMA Speedup (vs. N, N, N) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (1, 32, 32, 128) | (1, 1, 1, 128) | 0.001 | 0.26 | 0.0547 | 0.0715 | 0.6492 | 0.6172 | 0.1398 | 0.1033 | 0.53x |
| (1, 32, 32, 128) | (1, 3, 3, 128) | 0.005 | 0.26 | 0.0678 | 0.085 | 0.6636 | 0.6328 | 0.1526 | 0.1201 | 0.56x |
| (1, 32, 32, 128) | (1, 7, 7, 128) | 0.026 | 0.27 | 0.1264 | 0.1463 | 0.7283 | 0.6952 | 0.2178 | 0.1843 | 0.69x |
| (1, 32, 32, 128) | (1, 9, 9, 128) | 0.042 | 0.27 | 0.1689 | 0.1954 | 0.7742 | 0.7426 | 0.2626 | 0.2317 | 0.73x |
| (1, 64, 64, 128) | (1, 1, 1, 128) | 0.002 | 1.05 | 0.2237 | 0.2812 | 2.6897 | 2.3175 | 0.4083 | 0.2715 | 0.82x |
| (1, 64, 64, 128) | (1, 3, 3, 128) | 0.019 | 1.05 | 0.375 | 0.3362 | 2.7396 | 2.3716 | 0.4621 | 0.3262 | 1.15x |
| (1, 64, 64, 128) | (1, 7, 7, 128) | 0.103 | 1.05 | 0.6882 | 0.6202 | 3.008 | 2.6313 | 0.7207 | 0.5853 | 1.18x |
| (1, 64, 64, 128) | (1, 9, 9, 128) | 0.17 | 1.06 | 0.879 | 0.8318 | 3.1913 | 2.8071 | 0.8981 | 0.7623 | 1.15x |
| (1, 128, 128, 128) | (1, 1, 1, 128) | 0.008 | 4.19 | 1.0594 | 1.4324 | 11.6076 | 9.2136 | 2.7771 | 0.9508 | 1.11x |
| (1, 128, 128, 128) | (1, 3, 3, 128) | 0.075 | 4.2 | 2.2994 | 2.4282 | 11.9346 | 9.4362 | 3.0186 | 1.1717 | 1.96x |
| (1, 128, 128, 128) | (1, 7, 7, 128) | 0.411 | 4.2 | 5.0044 | 5.1789 | 12.9821 | 10.4726 | 3.8554 | 2.2011 | 2.27x |
| (1, 128, 128, 128) | (1, 9, 9, 128) | 0.679 | 4.2 | 7.6593 | 7.0915 | 13.7597 | 11.1763 | 4.7059 | 2.9003 | 2.64x |

Each timing column specifies the data-copy method used for the Activation, Weight, and Output vectors, respectively, with the following options:

- N = No copying to VTCM
- S = Synchronous DMA copies to VTCM
- B = Basic/naive copies to VTCM
- A = Asynchronous DMA copies to VTCM

For example, "B, B, A" uses naive copies for the Activation and Weight input vectors and async DMA copies for the Output vector (VTCM -> DDR).
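For context on where the speedups come from: the async schedules software-pipeline the VTCM copies against the compute, so DMA transfers for one iteration overlap with the compute of the previous one. Below is a minimal sketch of the kind of TIR schedule annotations involved, assuming a `tir.Schedule` whose outer loop already contains separate copy-in and compute stages; the stage/order indices here are illustrative, not the test's exact values.

```python
import tvm
from tvm import tir


def annotate_async_pipeline(sch: tir.Schedule, outer: tir.schedule.LoopRV) -> None:
    """Mark the statements under `outer` as a two-stage software pipeline."""
    # Stage assignment: the VTCM copy-in (stage 0) runs one iteration ahead
    # of the compute that consumes it (stage 1).
    sch.annotate(outer, "software_pipeline_stage", [0, 1])
    sch.annotate(outer, "software_pipeline_order", [0, 1])
    # Stages listed here are lowered with async commit/wait scopes rather
    # than as blocking copies; on Hexagon those map to async DMA.
    sch.annotate(outer, "software_pipeline_async_stages", [0])
```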

cc @adstraw @tmoreau89

@nverke nverke changed the title [Hexagon] Add tests to show how to properly utilize async dma pipelin… [Hexagon] Async DMA pipelining test suite Oct 7, 2022
@github-actions github-actions bot requested a review from tmoreau89 October 7, 2022 16:06
@adstraw (Contributor) left a comment:

Looks good overall! Nice work. Requesting a few small changes. And, happy to work with you on the request to schedule the mem_copy rather than hard-code it.

```python
    )
    return expected_output


def get_single_dma_schedule(size_a, size_w):
```
@adstraw (Contributor): This schedule looks correct. But we also have to duplicate the compute statement here and hand-code the mem_copy intrinsics for synchronous DMA. Wondering if we can schedule the mem_copy intrinsics rather than hard-coding them. There is a TE example of how to do this here. Could we apply that example to this test?

@nverke (Contributor, Author): Attempted this and was unable to get the tensor desc to match because of some weirdness 😔.
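For reference, the scheduled-copy approach being discussed would look roughly like the sketch below, assuming a `tir.Schedule` over the test's PrimFunc; the block name "C" and the buffer indices are illustrative assumptions, not names from the patch.

```python
import tvm
from tvm import tir


def schedule_vtcm_copies(mod: tvm.IRModule) -> tir.Schedule:
    sch = tir.Schedule(mod)
    block = sch.get_block("C")  # hypothetical name of the compute block
    # cache_read/cache_write insert explicit copy stages into and out of
    # VTCM; the idea is to let later lowering turn these stages into DMA
    # intrinsics instead of hand-written mem_copy calls.
    sch.cache_read(block, 0, "global.vtcm")   # activation, DDR -> VTCM
    sch.cache_read(block, 1, "global.vtcm")   # weights, DDR -> VTCM
    sch.cache_write(block, 0, "global.vtcm")  # output, VTCM -> DDR
    return sch
```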

```python
    print()


class TestMatMulVec:
```
@adstraw (Contributor): Nit: Class name needs updating.

@nverke (Contributor, Author): Done! ✅

```python
    T.reinterpret(W[vi, T.ramp(0, 1, 128)], dtype="int32x32"),
    dtype="int32x32",
)
T.evaluate(
```
@adstraw (Contributor): You might add a comment here that this is necessary only for purposes of getting accurate timings. That is, we want to flush all async DMAs before we stop the clock to check perf.

@nverke (Contributor, Author) commented Oct 10, 2022: Good idea. Added ✅

```python
    number = 1
    repeat = 1
else:
    number = 100
```
@adstraw (Contributor): Magic number 100: Should this be a parameter (even if a single value) or a constant?

@nverke (Contributor, Author): This one seems a bit circular, since it would be

```python
timer_number = 100
timer_repeat = 100
if tvm.testing.utils.IS_IN_CI:
    # These are reduced for CI
    number = 1
    repeat = 1
else:
    number = timer_number
    repeat = timer_repeat

timer = module.time_evaluator(
    "__tvm_main__", hexagon_session.device, number=number, repeat=repeat
)
```

which makes for quite a bit of unnecessary code. Also, this is to some extent a magic number; I just chose it from experience, and it is not tied to any functionality. The other option is that I could do this:

```python
if tvm.testing.utils.IS_IN_CI:
    # These are reduced for CI
    timer = module.time_evaluator(
        "__tvm_main__", hexagon_session.device, number=1, repeat=1
    )
else:
    timer = module.time_evaluator(
        "__tvm_main__", hexagon_session.device, number=100, repeat=100
    )
```

But I am not sure that is better. Anyway, let me know what you think.

@adstraw (Contributor): I like option B.

@nverke (Contributor, Author): Updated! ✅


```python
@tvm.testing.fixture
def input_a(size_a):
    return default_rng().integers(0, 8, (size_a, 128), dtype="uint8")
```
@adstraw (Contributor): Magic number 128: Should this be a parameter (even if a single value) or a constant?

@nverke (Contributor, Author) commented Oct 10, 2022: Changed all occurrences of 128 and 32 to be constants. Makes it a little more verbose but also a little clearer. Good idea! ✅
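The constant-based version presumably ends up reading something like the following sketch; the constant names are illustrative (128 bytes is what one vrmpy instruction consumes per operand row, and 32 is the number of int32 accumulators it produces).

```python
from numpy.random import default_rng

import tvm.testing

# Hypothetical constant names standing in for the literals 128 and 32.
VRMPY_SIZE_B = 128      # bytes consumed per operand row by one vrmpy
VRMPY_SIZE_INT32 = 32   # int32 accumulators produced by one vrmpy


@tvm.testing.fixture
def input_a(size_a):
    return default_rng().integers(0, 8, (size_a, VRMPY_SIZE_B), dtype="uint8")
```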

```python
    T.func_attr({"global_symbol": "main", "tir.noalias": True})
    A = T.match_buffer(a, [size_a, 128], dtype="uint8", align=128, mem_scope="global")
    W = T.match_buffer(b, [size_w, 128], dtype="uint8", align=128, mem_scope="global")
    C = T.match_buffer(c, [size_a, 32], dtype="int32", align=128, mem_scope="global")
```
@adstraw (Contributor): Magic number 32: Should this be a parameter (even if a single value) or a constant?

@nverke (Contributor, Author): Changed where possible! T.ramp() does not allow variables. ✅

```python
sch = get_single_dma_schedule(size_a, size_w)
single_dma_runtime = evaluate(hexagon_session, sch, input_a, input_w, size_a, expected_output)

transfer_mb = round((2 * size_a * 128 + size_w * 128) / 1e6, 2)
```

@adstraw (Contributor): Some comments here about this math would be good.

```python
single_dma_runtime = evaluate(hexagon_session, sch, input_a, input_w, size_a, expected_output)

transfer_mb = round((2 * size_a * 128 + size_w * 128) / 1e6, 2)
complexity = round(size_a * size_w * (128 * 4) / 1e9, 3)
```

@adstraw (Contributor): Same as above.
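For readers following along, the arithmetic behind those two lines works out as in the commented sketch below; the byte counts follow from the buffer declarations above, and the `128 * 4` ops factor is the approximation behind the table's "Approximated complexity" column.

```python
# Bytes moved per run (every row is 128 bytes wide):
#   activation in : size_a rows * 128 uint8 = size_a * 128 bytes (DDR -> VTCM)
#   output out    : size_a rows * 32 int32  = size_a * 128 bytes (VTCM -> DDR)
#   weights in    : size_w rows * 128 uint8 = size_w * 128 bytes (DDR -> VTCM)
transfer_mb = round((2 * size_a * 128 + size_w * 128) / 1e6, 2)

# Each (activation row, weight row) pair drives one 128-byte vrmpy;
# 128 * 4 approximates the ops that one such instruction performs, so the
# total op count scales with size_a * size_w.
complexity = round(size_a * size_w * (128 * 4) / 1e9, 3)
```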


@adstraw (Contributor) left a comment:

LGTM!

@nverke nverke closed this by deleting the head repository Oct 17, 2022
@nverke nverke reopened this Oct 17, 2022
@tvm-bot (Collaborator) commented Oct 17, 2022:

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@masahi masahi merged commit 8ccc434 into apache:main Oct 17, 2022
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 10, 2022
* [Hexagon] Add tests to show how to properly utilize async dma pipelining on hexagon.

* Formatting updates.

* Update comments and reformatting.

* Skip long tests in CI.
Comment on lines +139 to +141:

```python
A_global_vtcm = T.alloc_buffer(a_shape, dtype="uint8", mem_scope="global.vtcm")
W_global_vtcm = T.alloc_buffer(w_shape, dtype="uint8", mem_scope="global.vtcm")
C_global_vtcm = T.alloc_buffer(out_shape, dtype="int32", mem_scope="global.vtcm")
```
A member left a comment:

@nverke just wanted to report an issue that I happened to detect using the new TVMScript parser.

`T.alloc_buffer` doesn't accept the parameter `mem_scope`; instead it uses `scope`. The current generation of the TVMScript parser doesn't report any error for such misuse (which is weird) but instead assumes `scope="global"`.

To debug what's exactly happening, you may print out the method using operator.show, which is:

```python
@T.prim_func
def func(a_buffer: T.Buffer[(1024, 128), "uint8"], w_buffer: T.Buffer[(1, 128), "uint8"], c_buffer: T.Buffer[(1024, 32), "int32"]):
    # function attr dict
    T.func_attr({"global_symbol": "main", "tir.noalias": True})
    # body
    # with T.block("root")
    a_global_vtcm = T.alloc_buffer([1024, 128], dtype="uint8")  # <==== note it's "global" rather than "global.vtcm"
    w_global_vtcm = T.alloc_buffer([1, 128], dtype="uint8")
    c_global_vtcm = T.alloc_buffer([1024, 32], dtype="int32")
    T.evaluate(T.tvm_call_packed("device_api.hexagon.mem_copy_DLTensor", T.tvm_stack_make_array(a_global_vtcm.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.tvm_stack_make_array(a_buffer.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.Cast("int32", 131072), dtype="int32"))
    T.evaluate(T.tvm_call_packed("device_api.hexagon.mem_copy_DLTensor", T.tvm_stack_make_array(w_global_vtcm.data, T.tvm_stack_make_shape(1, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.tvm_stack_make_array(w_buffer.data, T.tvm_stack_make_shape(1, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.Cast("int32", 128), dtype="int32"))
    for n, index_0 in T.grid(1024, 1):
        with T.block("c_buffer"):
            vn_index, vi_index = T.axis.remap("SR", [n, index_0])
            T.reads(a_global_vtcm[vn_index, 0 : 128], w_global_vtcm[vi_index, 0 : 128], c_global_vtcm[vn_index, 0 : 32])
            T.writes(c_global_vtcm[vn_index, 0 : 32])
            with T.init():
                for x in T.serial(32):
                    c_global_vtcm[vn_index, x] = 0
            c_global_vtcm[vn_index, 0:32] = c_global_vtcm[vn_index, 0:32] + T.call_llvm_intrin(3885, T.uint32(2), T.reinterpret(a_global_vtcm[vn_index, 0:128], dtype="int32x32"), T.reinterpret(w_global_vtcm[vi_index, 0:128], dtype="int32x32"), dtype="int32x32")
    T.evaluate(T.tvm_call_packed("device_api.hexagon.mem_copy_DLTensor", T.tvm_stack_make_array(c_buffer.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "int32", 0, dtype="handle"), T.tvm_stack_make_array(c_global_vtcm.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "int32", 0, dtype="handle"), T.Cast("int32", 131072), dtype="int32"))
```
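Given that diagnosis, the fix is presumably just to pass `scope` instead of `mem_scope` in the allocations flagged above (same code as the flagged hunk, corrected keyword only):

```python
# Inside the same prim_func; `scope` is the keyword T.alloc_buffer accepts.
A_global_vtcm = T.alloc_buffer(a_shape, dtype="uint8", scope="global.vtcm")
W_global_vtcm = T.alloc_buffer(w_shape, dtype="uint8", scope="global.vtcm")
C_global_vtcm = T.alloc_buffer(out_shape, dtype="int32", scope="global.vtcm")
```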

xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* [Hexagon] Add tests to show how to properly utilize async dma pipelining on hexagon.

* Formatting updates.

* Update comments and reformatting.

* Skip long tests in CI.