I am not sure we can implement this change for existing models without breaking mmap, since we need to modify the layout of the tensors. I think that maintaining backwards compatibility with models with split experts is important; we should not ask people to re-download 50GB models, but we may have to disable mmap for old models.
Currently, we store separate tensors for each expert:
llama.cpp/ggml.c, lines 4442 to 4455 at commit 3020327
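For context, the `ggml_mul_mat_id` operator at that revision took an array of per-expert weight tensors and recorded each one as a separate source of the result tensor. A rough sketch of the signature (paraphrased, not an exact copy of the permalinked code):

```c
// rough sketch of the ggml_mul_mat_id signature at that revision; each expert
// matrix in as[] ends up as its own src[] entry on the resulting tensor
struct ggml_tensor * ggml_mul_mat_id(
        struct ggml_context * ctx,
        struct ggml_tensor  * const as[], // one weight tensor per expert
        int                   n_as,       // number of experts
        struct ggml_tensor  * ids,        // which expert(s) each row selects
        int                   id,
        struct ggml_tensor  * b);         // input activations
```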
This leads to a large number of possible "source" tensors for the `_id` ops, which significantly increases the size of `struct ggml_tensor` on the stack:

llama.cpp/ggml.h, lines 573 to 576 at commit 3020327
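Concretely, the sources live in a fixed-size pointer array inside `struct ggml_tensor`, so the array has to be sized for the worst case. A sketch of the relevant field, assuming the constants of that era (`GGML_MAX_SRC` was bumped to 10 to fit `ids`, `b`, and up to 8 experts):

```c
// sketch of the relevant part of struct ggml_tensor: GGML_MAX_SRC must cover
// 2 + n_experts sources for the _id ops, so every tensor pays for the worst case
#define GGML_MAX_SRC 10

struct ggml_tensor {
    // ... type, ne[], nb[], op, and other fields omitted ...
    struct ggml_tensor * src[GGML_MAX_SRC]; // grows with the supported expert count
    // ...
};
```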
Additionally, the Metal implementation is currently hacked to support up to 8 experts, and extending it beyond that is not entirely obvious:

llama.cpp/ggml-metal.m, lines 1750 to 1759 at commit 3020327
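The pattern in question binds one kernel-argument slot per expert at a fixed base index, which is what bakes in the 8-expert cap. An illustrative C rendering of the idea (the real code uses Metal's `[encoder setBuffer:offset:atIndex:]`; `bind_buffer` below is a hypothetical stand-in):

```c
#define N_EXPERT_SLOTS 8 // hard-coded cap on the number of experts

// illustrative C sketch of the fixed-slot binding pattern, not the actual
// Objective-C; every slot must be bound, so experts repeat when n_as < 8,
// and more than 8 experts simply cannot be expressed
static void bind_expert_buffers(
        void (*bind_buffer)(int arg_idx, const void * buf),
        const void * experts[], int n_as, int base_idx) {
    for (int j = 0; j < N_EXPERT_SLOTS; ++j) {
        bind_buffer(base_idx + j, experts[j % n_as]);
    }
}
```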
We should improve this. One possible way is to store the data for all experts in a single tensor and address it with appropriate offsets.
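A minimal sketch of what that could look like, using the existing `ggml_view_2d` to slice one expert out of a single 3D tensor by byte offset (shapes and variable names here are illustrative, assuming `ctx`, `n_embd`, `n_ff`, `n_expert`, and the expert index `i` are in scope; this is not a committed design):

```c
// one tensor holding all experts, instead of n_expert separate tensors
struct ggml_tensor * experts = ggml_new_tensor_3d(
        ctx, GGML_TYPE_F16, n_embd, n_ff, n_expert);

// 2D view of expert i: same row layout, starting at byte offset i * nb[2]
struct ggml_tensor * expert_i = ggml_view_2d(
        ctx, experts,
        experts->ne[0], experts->ne[1], // n_embd x n_ff slice
        experts->nb[1],                 // row stride unchanged
        (size_t) i * experts->nb[2]);   // jump to the i-th expert's slab
```

An `_id` op built this way would take just the combined tensor plus `ids` as sources, so `GGML_MAX_SRC` would no longer need to scale with the expert count.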