Concisely describe the proposed feature
I will add support for dynamic GPU memory allocation to the Metal backend. In theory this approach should also work for OpenGL @archibate, although I guess it's more difficult there due to the limited expressiveness of GLSL.
Background
In LLVM, the backbone of GPU memory management is ListManager (see taichi/runtime/llvm/runtime.cpp, lines 357 to 365 at fbf2cc7).
The data in this list is stored inside chunks. Each chunk can hold up to max_num_elements_per_chunk elements. These chunks are allocated dynamically, hence the Ptr type. The maximum number of elements in ListManager is therefore max_num_elements_per_chunk * max_num_chunks, where max_num_chunks is 1024.
When a new element is appended to this ListManager, it is given a globally unique index g_i within the list. g_i uniquely identifies both a chunk and the slot index within that chunk (analogy: page directory/page table). When g_i maps to a chunk c_i that hasn't been allocated yet, i.e. chunks[c_i] == 0, the kernel requests memory for that chunk from the memory allocator.
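To illustrate the chunk/slot decomposition, here is a minimal sketch; the names and the example chunk size are hypothetical, not taken from the actual runtime:

```cpp
#include <cstdint>

// Hypothetical constant; the real value comes from the SNode/runtime config.
constexpr int kLog2ElemsPerChunk = 12;  // log2(max_num_elements_per_chunk), example only

struct ChunkSlot {
  int32_t chunk;  // index into chunks[] ("page directory" entry)
  int32_t slot;   // element index within that chunk ("page table" entry)
};

// Splits a globally unique element index g_i into (chunk index, slot index),
// assuming the per-chunk capacity is a power of two.
inline ChunkSlot decompose(int32_t g_i) {
  return ChunkSlot{g_i >> kLog2ElemsPerChunk,
                   g_i & ((int32_t{1} << kLog2ElemsPerChunk) - 1)};
}
```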
Metal has already implemented ListManager, but it does not support dynamic memory allocation yet. So we just pre-allocate enough memory to hold the theoretical maximum number of elements, which is a huge waste of memory. Note that Metal's ListManager::mem_begin points to the beginning of the memory for data storage (see taichi/backends/metal/shaders/runtime_structs.metal.h, lines 40 to 50 at fbf2cc7):
struct ListManager {
  // ...(preceding fields elided)...
  // Same as |total_num_self_from_root| of this SNode.
  int32_t max_num_elems = 0;
  // Index to the next element in this list.
  // |next| can never go beyond |max_num_elems|.
  int32_t next = 0;
  // The data offset from the runtime memory beginning.
  int32_t mem_begin = 0;
};
I will rework this implementation to match how the LLVM backend works.
Describe the solution you'd like (if any)
The GPU-side memory allocator itself is very simple. All we need is to pre-allocate a large Metal buffer as the memory pool. The head of this buffer stores the atomic next counter, e.g.:
// Memory pool buffer
| next |<====== IN USE ======><- - - - - 0 0 0 .... 0 0 0 - - - - - >|
^- next
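A minimal sketch of this bump allocator in Metal Shading Language follows; the names (MemPool, mem_alloc) are hypothetical, not the actual runtime API:

```cpp
#include <metal_stdlib>
using namespace metal;

// Hypothetical pool header: the first bytes of the memory-pool buffer hold
// the atomic |next| counter; everything after it is raw storage.
struct MemPool {
  atomic_int next;  // byte offset of the first free byte in the pool buffer
};

// Bump allocation: atomically advance |next| by |size| bytes and return the
// previous value as a byte offset into the pool buffer. |next| is initialized
// past the header, so the returned offset is always > 1.
inline int32_t mem_alloc(device MemPool *pool, int32_t size) {
  return atomic_fetch_add_explicit(&pool->next, size, memory_order_relaxed);
}
```

Note that MSL atomics only accept memory_order_relaxed, which is consistent with the lack of stronger memory-ordering tools discussed below.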
What is difficult is how to make sure kernel threads can request a new chunk in a cooperative way.
Apparently, we need some sort of locking here to prevent multiple threads that discovered chunks[c_i] == 0 from requesting memory at the same time. The LLVM backend (including CUDA) has actually implemented a spin lock for this. Unfortunately, this approach isn't applicable on Metal, due to its lack of a memory-order enforcement mechanism: we don't have an instruction equivalent to CUDA's __threadfence().
That said, we can synchronize on the chunk pointer itself. Here's the algorithm for a single kernel thread to either load or request a new chunk c_i safely (a code sketch is given below):
1. Read ch = chunks[c_i]. Ideally this should be done atomically, but a non-atomic read doesn't break correctness.
2. If ch == 0, try to atomically CAS chunks[c_i] from 0 to 1.
   * If the CAS failed, go back to step 1 and read again.
   * Otherwise, request memory from the allocator. Denoting the returned address as m, atomically store m to chunks[c_i] and return m.
3. Else if ch > 1, return ch.
4. Else (ch == 1), go back to step 1. This means another thread is allocating memory at the same time. Eventually that thread will finish, so the loop will exit from step 3.
The synchronization is done on chunks[c_i], whose value transitions from 0 -> 1 -> allocated memory address. So the only requirement here is that the memory allocator must return an address that is greater than 1.
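Here's how that loop could look in Metal Shading Language. This is only a sketch: ensure_chunk and its parameters are hypothetical names, and it reuses the MemPool/mem_alloc sketch from above.

```cpp
// Sketch only; relies on the MemPool/mem_alloc sketch above.
// |chunks| holds 32-bit values interpreted as offsets into the memory pool:
// 0 = unallocated, 1 = allocation in progress, > 1 = allocated chunk offset.
inline int32_t ensure_chunk(device atomic_int *chunks, int c_i,
                            device MemPool *pool, int32_t chunk_bytes) {
  while (true) {
    // Step 1: read the current value.
    const int32_t ch = atomic_load_explicit(chunks + c_i, memory_order_relaxed);
    if (ch == 0) {
      // Step 2: try to claim the chunk by CAS-ing 0 -> 1.
      int32_t expected = 0;
      if (atomic_compare_exchange_weak_explicit(chunks + c_i, &expected,
                                                /*desired=*/1,
                                                memory_order_relaxed,
                                                memory_order_relaxed)) {
        // We won the race: allocate, publish the offset, and return it.
        const int32_t m = mem_alloc(pool, chunk_bytes);
        atomic_store_explicit(chunks + c_i, m, memory_order_relaxed);
        return m;
      }
      // CAS failed: another thread claimed the chunk; retry from step 1.
    } else if (ch > 1) {
      // Step 3: the chunk is already allocated.
      return ch;
    }
    // Step 4 (ch == 1): another thread is allocating; spin back to step 1.
  }
}
```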
Once ListManager can allocate memory in this way, the rest should be easy to implement. Here are the milestones:
1. Implement dynamic memory allocation in ListManager.
2. Implement NodeManager, which is backed by ListManager and supports GC.
3. Refactor the existing sparse runtime so that each SNode struct uses a Rep.
4. Support pointer and dynamic SNodes.
One shortcoming of this design is that we have no way to prevent buffer overflow on the kernel side. So we might need to check that the memory buffer hasn't overflowed every time the program synchronizes.
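A minimal host-side sketch of such a check, assuming the first four bytes of the pool buffer hold the |next| counter (the names and the hook point are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Run after each synchronize(): read back the pool's |next| counter from the
// Metal buffer's contents and verify it never passed the pre-allocated size.
inline void check_memory_pool(const int32_t *pool_head_host_ptr,
                              int32_t pool_size_bytes) {
  const int32_t next = *pool_head_host_ptr;  // first 4 bytes store |next|
  assert(next <= pool_size_bytes && "GPU memory pool overflowed");
}
```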
> 1. Read ch = chunks[c_i]. Ideally this should be done atomically, but a non-atomic read doesn't break correctness.
> 2. If ch == 0, try to atomically CAS chunks[c_i] from 0 to 1.
>    * If the CAS failed, go back to step 1 and read again.
>    * Otherwise, request memory from the allocator. Denoting the returned address as m, atomically store m to chunks[c_i] and return m.
> 3. Else if ch > 1, return ch.
> 4. Else (ch == 1), go back to step 1. This means another thread is allocating memory at the same time. Eventually that thread will finish, so the loop will exit from step 3.
The algorithm sounds reasonable to me. Unfortunately without a lock the critical piece of memory is restricted to the maximum width of atomicCAS... Not sure how it behaves in practice though :-) We need some benchmarking here.
> Unfortunately without a lock the critical piece of memory is restricted to the maximum width of atomicCAS...
That is true. I feel like there is a way to extend beyond this limit by using a segmentation-selector : 32-bit-address scheme, but I haven't given it much thought...
> Not sure how it behaves in practice though :-) We need some benchmarking here.
I only tested on taichi_sparse.py, but didn't see any improvement or regression. Hopefully we can push this to a point where we can run particle_renderer.py ASAP. I guess the chunk mechanism could also alleviate some of the write conflicts in the memory allocator itself.