
[Metal] Support dynamic memory allocation on GPU #1174

Closed
4 tasks done
k-ye opened this issue Jun 7, 2020 · 2 comments


k-ye commented Jun 7, 2020

Concisely describe the proposed feature

I will add support for dynamic memory allocation on the GPU for the Metal backend. In theory this approach should also work for OpenGL (@archibate), although I suspect it is harder there due to the limited expressiveness of GLSL.

Background

In the LLVM-based backends, the backbone of GPU memory management is ListManager:

struct ListManager {
  static constexpr std::size_t max_num_chunks = 1024;
  Ptr chunks[max_num_chunks];
  std::size_t element_size;
  std::size_t max_num_elements_per_chunk;
  i32 log2chunk_num_elements;
  i32 lock;
  i32 num_elements;
  LLVMRuntime *runtime;
  // ... (remaining members and methods omitted)
};

The data in this list is stored in chunks. Each chunk can hold up to max_num_elements_per_chunk elements, and chunks are allocated dynamically, hence the Ptr type. The total capacity of a ListManager is therefore max_num_elements_per_chunk * max_num_chunks (the latter being 1024).

When a new element is appended to this ListManager, it is given a globally unique index g_i within the list. g_i uniquely determines both a chunk index c_i and a slot index within that chunk (analogy: page directory/page table). When g_i maps to a chunk c_i that hasn't been allocated yet, i.e. chunks[c_i] == 0, the kernel requests memory for that chunk from the memory allocator.
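For illustration, here is a minimal sketch (not the actual runtime code) of how such a global index could be split into a chunk index and an in-chunk slot, assuming the per-chunk capacity is a power of two as |log2chunk_num_elements| suggests:

// Sketch only (host-side C++ for illustration): how a global element index
// g_i could map to a (chunk index, slot index) pair, assuming the per-chunk
// capacity is a power of two.
#include <cstdint>

struct ChunkSlot {
  int32_t chunk;  // index into chunks[]
  int32_t slot;   // element index within that chunk
};

inline ChunkSlot locate(int32_t g_i, int32_t log2chunk_num_elements) {
  ChunkSlot cs;
  cs.chunk = g_i >> log2chunk_num_elements;             // "page directory" part
  cs.slot = g_i & ((1 << log2chunk_num_elements) - 1);  // "page table" part
  return cs;
}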


The Metal backend has already implemented ListManager, but there is no dynamic memory allocation yet. We simply pre-allocate enough memory to hold the theoretical maximum number of elements, which wastes a huge amount of memory. Note that Metal's ListManager::mem_begin marks the beginning of the memory used for data storage.

struct ListManager {
  int32_t element_stride = 0;
  // Total number of this SNode in the hierarchy.
  // Same as |total_num_self_from_root| of this SNode.
  int32_t max_num_elems = 0;
  // Index to the next element in this list.
  // |next| can never go beyond |max_num_elems|.
  int32_t next = 0;
  // The data offset from the runtime memory beginning.
  int32_t mem_begin = 0;
};
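For reference, appending to this pre-allocated list boils down to an atomic bump of |next| plus a bounds check. A rough host-side sketch (std::atomic stands in for the Metal atomics used by the real kernel code, and mem_begin is shown as a pointer rather than an offset for simplicity):

// Sketch only: appending to the current, pre-allocated ListManager.
#include <atomic>
#include <cstdint>

struct ListManagerSketch {
  int32_t element_stride = 0;
  int32_t max_num_elems = 0;
  std::atomic<int32_t> next{0};
  uint8_t *mem_begin = nullptr;  // simplified: a pointer instead of an offset
};

// Returns a pointer to the new element's slot, or nullptr if the
// pre-allocated capacity is exhausted.
inline uint8_t *append(ListManagerSketch &list) {
  const int32_t idx = list.next.fetch_add(1, std::memory_order_relaxed);
  if (idx >= list.max_num_elems) {
    return nullptr;
  }
  return list.mem_begin + idx * list.element_stride;
}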

I will shift this implementation to match how the LLVM backend works.

Describe the solution you'd like (if any)

The GPU-side memory allocator itself is very simple. All we need is to pre-allocate a large Metal buffer as the memory pool. The head of this buffer stores the atomic next counter. E.g.

// Memory pool buffer

| next |<====== IN USE ======><- - - - - 0 0 0 .... 0 0 0 - - - - - >|
                              ^- next
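Here is a rough sketch of that allocator, using std::atomic as a stand-in for the Metal atomics (names are illustrative, not the actual runtime API):

// Sketch only: a bump allocator over a pre-allocated pool whose head holds
// an atomic |next| counter. Device code would use atomic_fetch_add on a
// device-memory counter instead of std::atomic.
#include <atomic>
#include <cstdint>
#include <vector>

struct MemPoolSketch {
  std::atomic<int32_t> next{0};  // conceptually lives at the head of the buffer
  std::vector<uint8_t> data;     // the rest of the pool

  explicit MemPoolSketch(std::size_t bytes) : data(bytes) {}

  // Returns the offset of the allocated block, or -1 on overflow.
  // Because the data region starts right after the 4-byte counter, any valid
  // offset is >= sizeof(int32_t) > 1, which the chunk synchronization
  // described below relies on.
  int32_t allocate(int32_t bytes) {
    const int32_t begin = next.fetch_add(bytes, std::memory_order_relaxed);
    if (begin + bytes > static_cast<int32_t>(data.size())) {
      return -1;  // overflow can only be detected, not prevented, on the GPU
    }
    return static_cast<int32_t>(sizeof(int32_t)) + begin;
  }
};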

What is difficult is making sure kernel threads can request a new chunk in a cooperative way.

Clearly, we need some sort of locking here to prevent multiple threads that discover chunks[c_i] == 0 from requesting memory at the same time. The LLVM backend (including CUDA) actually implements a spin lock for this. Unfortunately, that approach isn't applicable to Metal, due to its lack of a memory-order enforcement mechanism: we don't have an instruction equivalent to CUDA's __threadfence().

That said, we can synchronize on the chunk pointer itself. Here's the algorithm for a single kernel thread to either load or request a new chunk c_i safely:

  1. Read ch = chunks[c_i]. Ideally this would be an atomic read, but a non-atomic read doesn't break correctness.
  2. If ch == 0, try to atomically CAS chunks[c_i] from 0 to 1:
    1. If the CAS failed, go back to step 1 and read again.
    2. Otherwise, request memory from the allocator. Denoting the returned address as m, atomically store m to chunks[c_i] and return m.
  3. Else if ch > 1, return ch.
  4. Else (ch == 1), go back to step 1. This means another thread is allocating memory for this chunk at the same time. Eventually that thread will finish, so the loop exits via step 3.

The synchronization is done on chunks[c_i], whose value transitions from 0 -> 1 -> allocated memory address. So the only requirement here is that the memory allocator must return an address that is greater than 1.
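Here is a sketch of that loop, with std::atomic standing in for the corresponding Metal atomic operations; alloc_chunk is a placeholder for a request to the pool allocator sketched earlier and must return a value greater than 1:

// Sketch only: the load-or-request-chunk loop described above.
#include <atomic>
#include <cstdint>

constexpr int32_t kUnallocated = 0;  // chunk not allocated yet
constexpr int32_t kAllocating = 1;   // some thread is allocating right now

inline int32_t load_or_request_chunk(std::atomic<int32_t> &chunk,
                                     int32_t (*alloc_chunk)()) {
  while (true) {
    int32_t ch = chunk.load(std::memory_order_relaxed);  // step 1
    if (ch == kUnallocated) {                            // step 2
      int32_t expected = kUnallocated;
      if (chunk.compare_exchange_strong(expected, kAllocating)) {
        const int32_t m = alloc_chunk();  // step 2.2: this thread owns the allocation
        chunk.store(m);                   // publish the allocated chunk
        return m;
      }
      // Step 2.1: CAS failed, another thread won the race; re-read.
    } else if (ch > kAllocating) {        // step 3: already allocated
      return ch;
    }
    // ch == kAllocating (step 4): another thread is allocating; keep spinning.
  }
}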

Once ListManager can allocate memory in this way, the rest should be easy to implement. Here are the milestones:

  • Implement dynamic memory allocation in ListManager.
  • Implement NodeManager, which is backed by ListManager and supports GC.
  • Refactor the existing sparse runtime so that each SNode struct uses a Rep.
  • Support the pointer and dynamic SNodes.

One shortcoming of this design is that we have no way to prevent buffer overflow on the kernel side. So we might need to check that the memory pool hasn't overflowed every time the program synchronizes.
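For example (the names here are hypothetical, just to illustrate the idea), the host could read back the pool's next counter at every synchronization and compare it against the pool capacity:

// Sketch only: detecting memory-pool overflow at synchronization time.
#include <cstdint>
#include <stdexcept>
#include <string>

inline void check_memory_pool(int32_t used_bytes,  // e.g. read back from the pool's |next| counter
                              int32_t pool_size_bytes) {
  if (used_bytes > pool_size_bytes) {
    throw std::runtime_error("Metal memory pool overflowed: used " +
                             std::to_string(used_bytes) + " bytes, capacity " +
                             std::to_string(pool_size_bytes) + " bytes");
  }
}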

@k-ye added the feature request and metal labels on Jun 7, 2020
@k-ye self-assigned this on Jun 7, 2020
@k-ye added this to the v0.7.0 milestone on Jun 7, 2020
@yuanming-hu (Member) commented:

  • Read ch = chunks[c_i]. Ideally this would be an atomic read, but a non-atomic read doesn't break correctness.

  • If ch == 0, try to atomically CAS chunks[c_i] from 0 to 1:

    1. If the CAS failed, go back to step 1 and read again.
    2. Otherwise, request memory from the allocator. Denoting the returned address as m, atomically store m to chunks[c_i] and return m.
  • Else if ch > 1, return ch.

  • Else (ch == 1), go back to step 1. This means another thread is allocating memory for this chunk at the same time. Eventually that thread will finish, so the loop exits via step 3.

The algorithm sounds reasonable to me. Unfortunately, without a lock, the critical piece of memory is restricted to the maximum width of atomicCAS... Not sure how it behaves in practice though :-) We need some benchmarking here.


k-ye commented Jun 10, 2020

Unfortunately without a lock the critical piece of memory is restricted to the maximum width of atomicCAS...

That is true. I feel like there might be a way to extend beyond this limit by using a segment selector : 32-bit address scheme, but I haven't given it much thought yet...

Not sure how it behaves in practice though :-) We need some benchmarking here.

I have only tested taichi_sparse.py so far, and didn't see any improvement or regression. Hopefully we can push this to the point where we can run particle_renderer.py ASAP. I guess the chunk mechanism could also alleviate some of the write conflicts in the memory allocator itself...
