
[Hexagon] Async DMA pipelining test suite #13005

Merged: 4 commits merged into apache:main on Oct 17, 2022
Conversation

@nverke (Contributor) commented Oct 6, 2022:

Add tests to show how to properly utilize async DMA pipelining on Hexagon.

The purpose of this test suite is to show how to create a pipeline that uses async DMA copies to VTCM for a performance speedup. It compares performance results across several different schedules and should serve as a good starting point for others wanting to take advantage of the new features.

| Approximated Activation Shape | Approximated Weight Shape | Approximated complexity (GOPS) | Total Memory Transferred (MB) | N, N, N (ms) | S, S, S (ms) | B, B, B (ms) | A, B, B (ms) | B, B, A (ms) | A, B, A (ms) | Async DMA Speedup (vs. N, N, N) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (1, 32, 32, 128) | (1, 1, 1, 128) | 0.001 | 0.26 | 0.0547 | 0.0715 | 0.6492 | 0.6172 | 0.1398 | 0.1033 | 0.53x |
| (1, 32, 32, 128) | (1, 3, 3, 128) | 0.005 | 0.26 | 0.0678 | 0.085 | 0.6636 | 0.6328 | 0.1526 | 0.1201 | 0.56x |
| (1, 32, 32, 128) | (1, 7, 7, 128) | 0.026 | 0.27 | 0.1264 | 0.1463 | 0.7283 | 0.6952 | 0.2178 | 0.1843 | 0.69x |
| (1, 32, 32, 128) | (1, 9, 9, 128) | 0.042 | 0.27 | 0.1689 | 0.1954 | 0.7742 | 0.7426 | 0.2626 | 0.2317 | 0.73x |
| (1, 64, 64, 128) | (1, 1, 1, 128) | 0.002 | 1.05 | 0.2237 | 0.2812 | 2.6897 | 2.3175 | 0.4083 | 0.2715 | 0.82x |
| (1, 64, 64, 128) | (1, 3, 3, 128) | 0.019 | 1.05 | 0.375 | 0.3362 | 2.7396 | 2.3716 | 0.4621 | 0.3262 | 1.15x |
| (1, 64, 64, 128) | (1, 7, 7, 128) | 0.103 | 1.05 | 0.6882 | 0.6202 | 3.008 | 2.6313 | 0.7207 | 0.5853 | 1.18x |
| (1, 64, 64, 128) | (1, 9, 9, 128) | 0.17 | 1.06 | 0.879 | 0.8318 | 3.1913 | 2.8071 | 0.8981 | 0.7623 | 1.15x |
| (1, 128, 128, 128) | (1, 1, 1, 128) | 0.008 | 4.19 | 1.0594 | 1.4324 | 11.6076 | 9.2136 | 2.7771 | 0.9508 | 1.11x |
| (1, 128, 128, 128) | (1, 3, 3, 128) | 0.075 | 4.2 | 2.2994 | 2.4282 | 11.9346 | 9.4362 | 3.0186 | 1.1717 | 1.96x |
| (1, 128, 128, 128) | (1, 7, 7, 128) | 0.411 | 4.2 | 5.0044 | 5.1789 | 12.9821 | 10.4726 | 3.8554 | 2.2011 | 2.27x |
| (1, 128, 128, 128) | (1, 9, 9, 128) | 0.679 | 4.2 | 7.6593 | 7.0915 | 13.7597 | 11.1763 | 4.7059 | 2.9003 | 2.64x |

Each timing column specifies the data-copy method used for the Activation, Weight, and Output vectors, respectively, with the following options:

- N = No copying to VTCM
- S = Synchronous DMA copies to VTCM
- B = Basic/naive copies to VTCM
- A = Asynchronous DMA copies to VTCM

For example, "B, B, A" uses naive copies for the Activation and Weight input vectors and async DMA copies for the Output vector (VTCM -> DDR).
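For context on where the speedups come from: the async schedules software-pipeline the VTCM copies against the compute, so DMA transfers for one iteration overlap with the compute of the previous one. Below is a minimal sketch of the kind of TIR schedule annotations involved, assuming a `tir.Schedule` whose outer loop already contains separate copy-in and compute stages; the stage/order indices here are illustrative, not the test's exact values.

```python
import tvm
from tvm import tir


def annotate_async_pipeline(sch: tir.Schedule, outer: tir.schedule.LoopRV) -> None:
    """Mark the statements under `outer` as a two-stage software pipeline."""
    # Stage assignment: the VTCM copy-in (stage 0) runs one iteration ahead
    # of the compute that consumes it (stage 1).
    sch.annotate(outer, "software_pipeline_stage", [0, 1])
    sch.annotate(outer, "software_pipeline_order", [0, 1])
    # Stages listed here are lowered with async commit/wait scopes rather
    # than as blocking copies; on Hexagon those map to async DMA.
    sch.annotate(outer, "software_pipeline_async_stages", [0])
```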

cc @adstraw @tmoreau89

@nverke nverke changed the title [Hexagon] Add tests to show how to properly utilize async dma pipelin… [Hexagon] Async DMA pipelining test suite Oct 7, 2022
@github-actions github-actions bot requested a review from tmoreau89 October 7, 2022 16:06
@adstraw (Contributor) left a comment:

Looks good overall! Nice work. Requesting a few small changes. And, happy to work with you on the request to schedule the mem_copy rather than hard-code it.

```python
    )
    return expected_output


def get_single_dma_schedule(size_a, size_w):
```
@adstraw (Contributor): This schedule looks correct. But we also have to duplicate the compute statement here and hand-code the mem_copy intrinsics for synchronous DMA. Wondering if we can schedule the mem_copy intrinsics rather than hard-coding them. There is a TE example of how to do this here. Could we apply that example to this test?

@nverke (Contributor, Author): Attempted this and was unable to get the tensor desc to match because of some weirdness 😔.
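For reference, the scheduled-copy approach being discussed would look roughly like the sketch below, assuming a `tir.Schedule` over the test's PrimFunc; the block name "C" and the buffer indices are illustrative assumptions, not names from the patch.

```python
import tvm
from tvm import tir


def schedule_vtcm_copies(mod: tvm.IRModule) -> tir.Schedule:
    sch = tir.Schedule(mod)
    block = sch.get_block("C")  # hypothetical name of the compute block
    # cache_read/cache_write insert explicit copy stages into and out of
    # VTCM; the idea is to let later lowering turn these stages into DMA
    # intrinsics instead of hand-written mem_copy calls.
    sch.cache_read(block, 0, "global.vtcm")   # activation, DDR -> VTCM
    sch.cache_read(block, 1, "global.vtcm")   # weights, DDR -> VTCM
    sch.cache_write(block, 0, "global.vtcm")  # output, VTCM -> DDR
    return sch
```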

```python
    print()


class TestMatMulVec:
```
@adstraw (Contributor): Nit: Class name needs updating.

@nverke (Contributor, Author): Done! ✅

```python
    T.reinterpret(W[vi, T.ramp(0, 1, 128)], dtype="int32x32"),
    dtype="int32x32",
)
T.evaluate(
```
@adstraw (Contributor): You might add a comment here that this is necessary only for purposes of getting accurate timings. That is, we want to flush all async DMAs before we stop the clock to check perf.

@nverke (Contributor, Author) commented Oct 10, 2022: Good idea. Added ✅

```python
    number = 1
    repeat = 1
else:
    number = 100
```
@adstraw (Contributor): Magic number 100: Should this be a parameter (even if a single value) or a constant?

@nverke (Contributor, Author): This one seems a bit circular, since it would be

```python
timer_number = 100
timer_repeat = 100
if tvm.testing.utils.IS_IN_CI:
    # These are reduced for CI
    number = 1
    repeat = 1
else:
    number = timer_number
    repeat = timer_repeat

timer = module.time_evaluator(
    "__tvm_main__", hexagon_session.device, number=number, repeat=repeat
)
```

which makes for quite a bit of unnecessary code. Also, this is to some extent a magic number; I just chose it from experience, and it is not tied to any functionality. The other option is that I could do this:

```python
if tvm.testing.utils.IS_IN_CI:
    # These are reduced for CI
    timer = module.time_evaluator(
        "__tvm_main__", hexagon_session.device, number=1, repeat=1
    )
else:
    timer = module.time_evaluator(
        "__tvm_main__", hexagon_session.device, number=100, repeat=100
    )
```

But I am not sure that is better. Anyway, let me know what you think.

@adstraw (Contributor): I like option B.

@nverke (Contributor, Author): Updated! ✅


```python
@tvm.testing.fixture
def input_a(size_a):
    return default_rng().integers(0, 8, (size_a, 128), dtype="uint8")
```
@adstraw (Contributor): Magic number 128: Should this be a parameter (even if a single value) or a constant?

@nverke (Contributor, Author) commented Oct 10, 2022: Changed all occurrences of 128 and 32 to be constants. Makes it a little more verbose but also a little clearer. Good idea! ✅
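The constant-based version presumably ends up reading something like the following sketch; the constant names are illustrative (128 bytes is what one vrmpy instruction consumes per operand row, and 32 is the number of int32 accumulators it produces).

```python
from numpy.random import default_rng

import tvm.testing

# Hypothetical constant names standing in for the literals 128 and 32.
VRMPY_SIZE_B = 128      # bytes consumed per operand row by one vrmpy
VRMPY_SIZE_INT32 = 32   # int32 accumulators produced by one vrmpy


@tvm.testing.fixture
def input_a(size_a):
    return default_rng().integers(0, 8, (size_a, VRMPY_SIZE_B), dtype="uint8")
```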

```python
    T.func_attr({"global_symbol": "main", "tir.noalias": True})
    A = T.match_buffer(a, [size_a, 128], dtype="uint8", align=128, mem_scope="global")
    W = T.match_buffer(b, [size_w, 128], dtype="uint8", align=128, mem_scope="global")
    C = T.match_buffer(c, [size_a, 32], dtype="int32", align=128, mem_scope="global")
```
@adstraw (Contributor): Magic number 32: Should this be a parameter (even if a single value) or a constant?

@nverke (Contributor, Author): Changed where possible! T.ramp() does not allow variables. ✅

```python
sch = get_single_dma_schedule(size_a, size_w)
single_dma_runtime = evaluate(hexagon_session, sch, input_a, input_w, size_a, expected_output)

transfer_mb = round((2 * size_a * 128 + size_w * 128) / 1e6, 2)
```

@adstraw (Contributor): Some comments here about this math would be good.

```python
single_dma_runtime = evaluate(hexagon_session, sch, input_a, input_w, size_a, expected_output)

transfer_mb = round((2 * size_a * 128 + size_w * 128) / 1e6, 2)
complexity = round(size_a * size_w * (128 * 4) / 1e9, 3)
```

@adstraw (Contributor): Same as above.
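For readers following along, the arithmetic behind those two lines works out as in the commented sketch below; the byte counts follow from the buffer declarations above, and the `128 * 4` ops factor is the approximation behind the table's "Approximated complexity" column.

```python
# Bytes moved per run (every row is 128 bytes wide):
#   activation in : size_a rows * 128 uint8 = size_a * 128 bytes (DDR -> VTCM)
#   output out    : size_a rows * 32 int32  = size_a * 128 bytes (VTCM -> DDR)
#   weights in    : size_w rows * 128 uint8 = size_w * 128 bytes (DDR -> VTCM)
transfer_mb = round((2 * size_a * 128 + size_w * 128) / 1e6, 2)

# Each (activation row, weight row) pair drives one 128-byte vrmpy;
# 128 * 4 approximates the ops that one such instruction performs, so the
# total op count scales with size_a * size_w.
complexity = round(size_a * size_w * (128 * 4) / 1e9, 3)
```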


@adstraw (Contributor) left a comment:

LGTM!

@nverke nverke closed this by deleting the head repository Oct 17, 2022
@nverke nverke reopened this Oct 17, 2022
@tvm-bot (Collaborator) commented Oct 17, 2022:

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@masahi masahi merged commit 8ccc434 into apache:main Oct 17, 2022
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 10, 2022
* [Hexagon] Add tests to show how to properly utilize async dma pipelining on hexagon.

* Formatting updates.

* Update comments and reformatting.

* Skip long tests in CI.
Comment on lines +139 to +141:

```python
A_global_vtcm = T.alloc_buffer(a_shape, dtype="uint8", mem_scope="global.vtcm")
W_global_vtcm = T.alloc_buffer(w_shape, dtype="uint8", mem_scope="global.vtcm")
C_global_vtcm = T.alloc_buffer(out_shape, dtype="int32", mem_scope="global.vtcm")
```
A member left a comment:

@nverke just wanted to report an issue that I happened to detect using the new TVMScript parser.

`T.alloc_buffer` doesn't accept the parameter `mem_scope`; instead it uses `scope`. The current generation of the TVMScript parser doesn't report any error for such misuse (which is weird) but instead assumes `scope="global"`.

To debug what's exactly happening, you may print out the method using operator.show, which is:

```python
@T.prim_func
def func(a_buffer: T.Buffer[(1024, 128), "uint8"], w_buffer: T.Buffer[(1, 128), "uint8"], c_buffer: T.Buffer[(1024, 32), "int32"]):
    # function attr dict
    T.func_attr({"global_symbol": "main", "tir.noalias": True})
    # body
    # with T.block("root")
    a_global_vtcm = T.alloc_buffer([1024, 128], dtype="uint8")  # <==== note it's "global" rather than "global.vtcm"
    w_global_vtcm = T.alloc_buffer([1, 128], dtype="uint8")
    c_global_vtcm = T.alloc_buffer([1024, 32], dtype="int32")
    T.evaluate(T.tvm_call_packed("device_api.hexagon.mem_copy_DLTensor", T.tvm_stack_make_array(a_global_vtcm.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.tvm_stack_make_array(a_buffer.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.Cast("int32", 131072), dtype="int32"))
    T.evaluate(T.tvm_call_packed("device_api.hexagon.mem_copy_DLTensor", T.tvm_stack_make_array(w_global_vtcm.data, T.tvm_stack_make_shape(1, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.tvm_stack_make_array(w_buffer.data, T.tvm_stack_make_shape(1, 128, dtype="handle"), 0, 2, "uint8", 0, dtype="handle"), T.Cast("int32", 128), dtype="int32"))
    for n, index_0 in T.grid(1024, 1):
        with T.block("c_buffer"):
            vn_index, vi_index = T.axis.remap("SR", [n, index_0])
            T.reads(a_global_vtcm[vn_index, 0 : 128], w_global_vtcm[vi_index, 0 : 128], c_global_vtcm[vn_index, 0 : 32])
            T.writes(c_global_vtcm[vn_index, 0 : 32])
            with T.init():
                for x in T.serial(32):
                    c_global_vtcm[vn_index, x] = 0
            c_global_vtcm[vn_index, 0:32] = c_global_vtcm[vn_index, 0:32] + T.call_llvm_intrin(3885, T.uint32(2), T.reinterpret(a_global_vtcm[vn_index, 0:128], dtype="int32x32"), T.reinterpret(w_global_vtcm[vi_index, 0:128], dtype="int32x32"), dtype="int32x32")
    T.evaluate(T.tvm_call_packed("device_api.hexagon.mem_copy_DLTensor", T.tvm_stack_make_array(c_buffer.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "int32", 0, dtype="handle"), T.tvm_stack_make_array(c_global_vtcm.data, T.tvm_stack_make_shape(1024, 128, dtype="handle"), 0, 2, "int32", 0, dtype="handle"), T.Cast("int32", 131072), dtype="int32"))
```
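Given that diagnosis, the fix is presumably just to pass `scope` instead of `mem_scope` in the allocations flagged above (same code as the flagged hunk, corrected keyword only):

```python
# Inside the same prim_func; `scope` is the keyword T.alloc_buffer accepts.
A_global_vtcm = T.alloc_buffer(a_shape, dtype="uint8", scope="global.vtcm")
W_global_vtcm = T.alloc_buffer(w_shape, dtype="uint8", scope="global.vtcm")
C_global_vtcm = T.alloc_buffer(out_shape, dtype="int32", scope="global.vtcm")
```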

xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* [Hexagon] Add tests to show how to properly utilize async dma pipelining on hexagon.

* Formatting updates.

* Update comments and reformatting.

* Skip long tests in CI.