Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BugFix][TIR] Fix dynamic smem merge leaf alloc #16216

Merged
merged 4 commits into from
Dec 14, 2023

Conversation

nox-410
Copy link
Contributor

@nox-410 nox-410 commented Dec 8, 2023

The MergeDynamicSharedMemoryAllocations currently will first free and then allocate for each scope. However, when a buffer is allocated and freed within a leaf scope, it will run into a free before alloc bug.

This commit solves this bug by delaying the leaf free after the alloc is done.

A simple test case is also added

import tvm
import tvm.testing
from tvm.script import tir as T

class TestLeafAllocFree(tvm.testing.CompareBeforeAfter):
    transform = tvm.tir.transform.MergeDynamicSharedMemoryAllocations()

    def before(self):
        @T.prim_func
        def func():
            threadIdx_x = T.launch_thread("threadIdx.x", 128)
            A_sh_data = T.allocate([128], "float32", "shared.dyn")
            B_sh_data = T.allocate([128], "float32", "shared.dyn")
            A_sh = T.decl_buffer([128], "float32", data=A_sh_data, scope="shared.dyn")
            B_sh = T.decl_buffer([128], "float32", data=B_sh_data, scope="shared.dyn")
            B_sh[threadIdx_x] = A_sh[threadIdx_x]
        return func

    def expected(self):
        @T.prim_func
        def func():
            threadIdx_x = T.launch_thread("threadIdx.x", 128)
            buf_dyn_shmem = T.allocate([1024], "uint8", "shared.dyn")
            A_sh = T.decl_buffer((128,), data=buf_dyn_shmem, scope="shared.dyn")
            B_sh = T.decl_buffer((128,), data=buf_dyn_shmem, scope="shared.dyn")
            B_sh[threadIdx_x + 128] = A_sh[threadIdx_x]
        return func


if __name__ == "__main__":
    tvm.testing.main()

@nox-410 nox-410 marked this pull request as ready for review December 9, 2023 11:28
@nox-410 nox-410 marked this pull request as draft December 10, 2023 06:27
@nox-410 nox-410 marked this pull request as ready for review December 10, 2023 14:48
@Hzfengsy
Copy link
Member

cc @Lunderberg @jinhongyii. Would be great if you can help review.

Copy link
Contributor

@Lunderberg Lunderberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! It looks like this edge case occurs when (1) there's a dynamic memory allocation and (2) there is only a single statement that ever accesses that dynamic memory allocation. This is definitely a case that we should handle, and the implementation looks good.

That said, the existence of this edge case looks like we could provide even better. If a buffer is only ever accessed in a single location, it suggests that there could be additional opportunities to improve the behavior.

  • BufferLoad: The only access of the buffer is a read from it. This is a read to uninitialized memory, which is already undefined behavior. If the only access to a buffer is a BufferLoad, we can raise a compile-time error.

  • BufferStore: The only access of the buffer is a store into it. This is a write to memory that is never accessed, which has no impact on observable behavior. If the only access to a buffer is a BufferStore, we can replace the BufferStore(buf, indices, value) with Evaluate(value).

  • builtin::tvm_access_ptr(): This is the only case when a buffer accessed exactly once can be well-defined and have an observable effect. For example, making a data pointer to pass to an external function for use as a scratch space. We probably should track whether this access pointer is assigned to a variable, as the linear_seq_ is incorrect if the access pointer is defined once, but used multiple times.

(These aren't improvements that should be blocking for the current PR, as this PR is already an improvement on the current implementation, but pointing out potential further improvements. If you'd like to handle any of the follow-ups, ping me and I can review them.)

@nox-410
Copy link
Contributor Author

nox-410 commented Dec 14, 2023

Thank you for the detailed comments @Lunderberg , I found this bug when I try to allocate a workspace for an extern function (your 3rd case). I have not met the linear_seq_ problem you mentioned. I alloc a new workspace once I make an func call and the pass seems to reuse them correctly for now.

@tqchen tqchen merged commit c8bfdb2 into apache:main Dec 14, 2023
19 checks passed
@Lunderberg
Copy link
Contributor

I have not met the linear_seq_ problem you mentioned. I alloc a new workspace once I make an func call and the pass seems to reuse them correctly for now.

Sounds good, and so long as you're aware of the possibility. I wasn't able to make a test case that would trigger the issue, as attempting to make it resulted in other errors occurring while collecting the shared allocations. It's enough of an edge case that I'm not too worried about it.

@nox-410 nox-410 deleted the merge_dyn_leaf_alloc_free branch December 16, 2023 10:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants