
[Metal] Support dynamic memory allocation on GPU #1174

Closed
4 tasks done
k-ye opened this issue Jun 7, 2020 · 2 comments


k-ye commented Jun 7, 2020

Concisely describe the proposed feature

I will add support for dynamic memory allocation on the GPU for the Metal backend. In theory this approach should also work for OpenGL (@archibate), although I suspect it is harder there due to the limited expressiveness of GLSL.

Background

In the LLVM-based backends, the backbone of GPU memory management is ListManager:

struct ListManager {
  static constexpr std::size_t max_num_chunks = 1024;
  Ptr chunks[max_num_chunks];
  std::size_t element_size;
  std::size_t max_num_elements_per_chunk;
  i32 log2chunk_num_elements;
  i32 lock;
  i32 num_elements;
  LLVMRuntime *runtime;
  // ... (remaining members and methods omitted)
};

The data in this list is stored in chunks. Each chunk can hold up to max_num_elements_per_chunk elements, and chunks are allocated dynamically, hence the Ptr type. The total capacity of a ListManager is therefore max_num_elements_per_chunk * max_num_chunks (the latter being 1024).

When a new element is appended to this ListManager, it is given a globally unique index g_i within the list. g_i uniquely determines both a chunk index c_i and a slot index within that chunk (analogy: page directory/page table). When g_i maps to a chunk c_i that hasn't been allocated yet, i.e. chunks[c_i] == 0, the kernel requests memory for that chunk from the memory allocator.
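For illustration, here is a minimal sketch (not the actual runtime code) of how such a global index could be split into a chunk index and an in-chunk slot, assuming the per-chunk capacity is a power of two as |log2chunk_num_elements| suggests:

// Sketch only (host-side C++ for illustration): how a global element index
// g_i could map to a (chunk index, slot index) pair, assuming the per-chunk
// capacity is a power of two.
#include <cstdint>

struct ChunkSlot {
  int32_t chunk;  // index into chunks[]
  int32_t slot;   // element index within that chunk
};

inline ChunkSlot locate(int32_t g_i, int32_t log2chunk_num_elements) {
  ChunkSlot cs;
  cs.chunk = g_i >> log2chunk_num_elements;             // "page directory" part
  cs.slot = g_i & ((1 << log2chunk_num_elements) - 1);  // "page table" part
  return cs;
}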


The Metal backend has already implemented ListManager, but there is no dynamic memory allocation yet. We simply pre-allocate enough memory to hold the theoretical maximum number of elements, which wastes a huge amount of memory. Note that Metal's ListManager::mem_begin marks the beginning of the memory used for data storage.

struct ListManager {
  int32_t element_stride = 0;
  // Total number of this SNode in the hierarchy.
  // Same as |total_num_self_from_root| of this SNode.
  int32_t max_num_elems = 0;
  // Index to the next element in this list.
  // |next| can never go beyond |max_num_elems|.
  int32_t next = 0;
  // The data offset from the runtime memory beginning.
  int32_t mem_begin = 0;
};
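For reference, appending to this pre-allocated list boils down to an atomic bump of |next| plus a bounds check. A rough host-side sketch (std::atomic stands in for the Metal atomics used by the real kernel code, and mem_begin is shown as a pointer rather than an offset for simplicity):

// Sketch only: appending to the current, pre-allocated ListManager.
#include <atomic>
#include <cstdint>

struct ListManagerSketch {
  int32_t element_stride = 0;
  int32_t max_num_elems = 0;
  std::atomic<int32_t> next{0};
  uint8_t *mem_begin = nullptr;  // simplified: a pointer instead of an offset
};

// Returns a pointer to the new element's slot, or nullptr if the
// pre-allocated capacity is exhausted.
inline uint8_t *append(ListManagerSketch &list) {
  const int32_t idx = list.next.fetch_add(1, std::memory_order_relaxed);
  if (idx >= list.max_num_elems) {
    return nullptr;
  }
  return list.mem_begin + idx * list.element_stride;
}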

I will shift this implementation to match how the LLVM backend works.

Describe the solution you'd like (if any)

The GPU-side memory allocator itself is very simple. All we need is to pre-allocate a large Metal buffer as the memory pool. The head of this buffer stores the atomic next counter. E.g.

// Memory pool buffer

| next |<====== IN USE ======><- - - - - 0 0 0 .... 0 0 0 - - - - - >|
                              ^- next
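Here is a rough sketch of that allocator, using std::atomic as a stand-in for the Metal atomics (names are illustrative, not the actual runtime API):

// Sketch only: a bump allocator over a pre-allocated pool whose head holds
// an atomic |next| counter. Device code would use atomic_fetch_add on a
// device-memory counter instead of std::atomic.
#include <atomic>
#include <cstdint>
#include <vector>

struct MemPoolSketch {
  std::atomic<int32_t> next{0};  // conceptually lives at the head of the buffer
  std::vector<uint8_t> data;     // the rest of the pool

  explicit MemPoolSketch(std::size_t bytes) : data(bytes) {}

  // Returns the offset of the allocated block, or -1 on overflow.
  // Because the data region starts right after the 4-byte counter, any valid
  // offset is >= sizeof(int32_t) > 1, which the chunk synchronization
  // described below relies on.
  int32_t allocate(int32_t bytes) {
    const int32_t begin = next.fetch_add(bytes, std::memory_order_relaxed);
    if (begin + bytes > static_cast<int32_t>(data.size())) {
      return -1;  // overflow can only be detected, not prevented, on the GPU
    }
    return static_cast<int32_t>(sizeof(int32_t)) + begin;
  }
};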

What is difficult is making sure kernel threads can request a new chunk in a cooperative way.

Clearly, we need some sort of locking here to prevent multiple threads that discover chunks[c_i] == 0 from requesting memory at the same time. The LLVM backend (including CUDA) actually implements a spin lock for this. Unfortunately, that approach isn't applicable to Metal, due to its lack of a memory-order enforcement mechanism: we don't have an instruction equivalent to CUDA's __threadfence().

That said, we can synchronize on the chunk pointer itself. Here's the algorithm for a single kernel thread to either load or request a new chunk c_i safely:

  1. Read ch = chunks[c_i]. Ideally this would be an atomic read, but a non-atomic read doesn't break correctness.
  2. If ch == 0, try to atomically CAS chunks[c_i] from 0 to 1:
    1. If the CAS failed, go back to step 1 and read again.
    2. Otherwise, request memory from the allocator. Denoting the returned address as m, atomically store m to chunks[c_i] and return m.
  3. Else if ch > 1, return ch.
  4. Else (ch == 1), go back to step 1. This means another thread is allocating memory for this chunk at the same time. Eventually that thread will finish, so the loop exits via step 3.

The synchronization is done on chunks[c_i], whose value transitions from 0 -> 1 -> allocated memory address. So the only requirement here is that the memory allocator must return an address that is greater than 1.
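Here is a sketch of that loop, with std::atomic standing in for the corresponding Metal atomic operations; alloc_chunk is a placeholder for a request to the pool allocator sketched earlier and must return a value greater than 1:

// Sketch only: the load-or-request-chunk loop described above.
#include <atomic>
#include <cstdint>

constexpr int32_t kUnallocated = 0;  // chunk not allocated yet
constexpr int32_t kAllocating = 1;   // some thread is allocating right now

inline int32_t load_or_request_chunk(std::atomic<int32_t> &chunk,
                                     int32_t (*alloc_chunk)()) {
  while (true) {
    int32_t ch = chunk.load(std::memory_order_relaxed);  // step 1
    if (ch == kUnallocated) {                            // step 2
      int32_t expected = kUnallocated;
      if (chunk.compare_exchange_strong(expected, kAllocating)) {
        const int32_t m = alloc_chunk();  // step 2.2: this thread owns the allocation
        chunk.store(m);                   // publish the allocated chunk
        return m;
      }
      // Step 2.1: CAS failed, another thread won the race; re-read.
    } else if (ch > kAllocating) {        // step 3: already allocated
      return ch;
    }
    // ch == kAllocating (step 4): another thread is allocating; keep spinning.
  }
}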

Once ListManager can allocate memory in this way, the rest should be easy to implement. Here are the milestones:

  • Implement dynamic memory allocation in ListManager.
  • Implement NodeManager, which is backed by ListManager and supports GC.
  • Refactor the existing sparse runtime so that each SNode struct uses a Rep.
  • Support the pointer and dynamic SNodes.

One shortcoming of this design is that we have no way to prevent buffer overflow on the kernel side. So we might need to check that the memory pool hasn't overflowed every time the program synchronizes.
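For example (the names here are hypothetical, just to illustrate the idea), the host could read back the pool's next counter at every synchronization and compare it against the pool capacity:

// Sketch only: detecting memory-pool overflow at synchronization time.
#include <cstdint>
#include <stdexcept>
#include <string>

inline void check_memory_pool(int32_t used_bytes,  // e.g. read back from the pool's |next| counter
                              int32_t pool_size_bytes) {
  if (used_bytes > pool_size_bytes) {
    throw std::runtime_error("Metal memory pool overflowed: used " +
                             std::to_string(used_bytes) + " bytes, capacity " +
                             std::to_string(pool_size_bytes) + " bytes");
  }
}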

@k-ye added the feature request and metal labels on Jun 7, 2020
@k-ye self-assigned this on Jun 7, 2020
@k-ye added this to the v0.7.0 milestone on Jun 7, 2020
@yuanming-hu (Member) commented:

  • Read ch = chunks[c_i]. Ideally this would be an atomic read, but a non-atomic read doesn't break correctness.

  • If ch == 0, try to atomically CAS chunks[c_i] from 0 to 1:

    1. If the CAS failed, go back to step 1 and read again.
    2. Otherwise, request memory from the allocator. Denoting the returned address as m, atomically store m to chunks[c_i] and return m.
  • Else if ch > 1, return ch.

  • Else (ch == 1), go back to step 1. This means another thread is allocating memory for this chunk at the same time. Eventually that thread will finish, so the loop exits via step 3.

The algorithm sounds reasonable to me. Unfortunately, without a lock, the critical piece of memory is restricted to the maximum width of atomicCAS... Not sure how it behaves in practice though :-) We need some benchmarking here.


k-ye commented Jun 10, 2020

Unfortunately without a lock the critical piece of memory is restricted to the maximum width of atomicCAS...

That is true. I feel like there might be a way to extend beyond this limit by using a segment selector : 32-bit address scheme, but I haven't given it much thought yet...

Not sure how it behaves in practice though :-) We need some benchmarking here.

I have only tested taichi_sparse.py so far, and didn't see any improvement or regression. Hopefully we can push this to the point where we can run particle_renderer.py ASAP. I guess the chunk mechanism could also alleviate some of the write conflicts in the memory allocator itself...
