I am not sure we can implement this change for existing models without breaking mmap, since we need to modify the layout of the tensors. I think that maintaining backwards compatibility with models with split experts is important; we should not ask people to re-download 50GB models, but we may have to disable mmap for old models.
Currently, we store separate tensors for each expert:
llama.cpp/ggml.c, lines 4442 to 4455 at commit 3020327
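For context, the `ggml_mul_mat_id` operator at that revision took an array of per-expert weight tensors and recorded each one as a separate source of the result tensor. A rough sketch of the signature (paraphrased, not an exact copy of the permalinked code):

```c
// rough sketch of the ggml_mul_mat_id signature at that revision; each expert
// matrix in as[] ends up as its own src[] entry on the resulting tensor
struct ggml_tensor * ggml_mul_mat_id(
        struct ggml_context * ctx,
        struct ggml_tensor  * const as[], // one weight tensor per expert
        int                   n_as,       // number of experts
        struct ggml_tensor  * ids,        // which expert(s) each row selects
        int                   id,
        struct ggml_tensor  * b);         // input activations
```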
This leads to a large number of possible "source" tensors for the `_id` ops, which significantly increases the size of `struct ggml_tensor` on the stack:

llama.cpp/ggml.h, lines 573 to 576 at commit 3020327
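Concretely, the sources live in a fixed-size pointer array inside `struct ggml_tensor`, so the array has to be sized for the worst case. A sketch of the relevant field, assuming the constants of that era (`GGML_MAX_SRC` was bumped to 10 to fit `ids`, `b`, and up to 8 experts):

```c
// sketch of the relevant part of struct ggml_tensor: GGML_MAX_SRC must cover
// 2 + n_experts sources for the _id ops, so every tensor pays for the worst case
#define GGML_MAX_SRC 10

struct ggml_tensor {
    // ... type, ne[], nb[], op, and other fields omitted ...
    struct ggml_tensor * src[GGML_MAX_SRC]; // grows with the supported expert count
    // ...
};
```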
Additionally, the Metal implementation is currently hacked to support up to 8 experts, and extending it beyond that is not entirely obvious:

llama.cpp/ggml-metal.m, lines 1750 to 1759 at commit 3020327
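The pattern in question binds one kernel-argument slot per expert at a fixed base index, which is what bakes in the 8-expert cap. An illustrative C rendering of the idea (the real code uses Metal's `[encoder setBuffer:offset:atIndex:]`; `bind_buffer` below is a hypothetical stand-in):

```c
#define N_EXPERT_SLOTS 8 // hard-coded cap on the number of experts

// illustrative C sketch of the fixed-slot binding pattern, not the actual
// Objective-C; every slot must be bound, so experts repeat when n_as < 8,
// and more than 8 experts simply cannot be expressed
static void bind_expert_buffers(
        void (*bind_buffer)(int arg_idx, const void * buf),
        const void * experts[], int n_as, int base_idx) {
    for (int j = 0; j < N_EXPERT_SLOTS; ++j) {
        bind_buffer(base_idx + j, experts[j % n_as]);
    }
}
```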
We should improve this. One possible way is to store the data for all experts in a single tensor and address it with appropriate offsets.
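A minimal sketch of what that could look like, using the existing `ggml_view_2d` to slice one expert out of a single 3D tensor by byte offset (shapes and variable names here are illustrative, assuming `ctx`, `n_embd`, `n_ff`, `n_expert`, and the expert index `i` are in scope; this is not a committed design):

```c
// one tensor holding all experts, instead of n_expert separate tensors
struct ggml_tensor * experts = ggml_new_tensor_3d(
        ctx, GGML_TYPE_F16, n_embd, n_ff, n_expert);

// 2D view of expert i: same row layout, starting at byte offset i * nb[2]
struct ggml_tensor * expert_i = ggml_view_2d(
        ctx, experts,
        experts->ne[0], experts->ne[1], // n_embd x n_ff slice
        experts->nb[1],                 // row stride unchanged
        (size_t) i * experts->nb[2]);   // jump to the i-th expert's slab
```

An `_id` op built this way would take just the combined tensor plus `ids` as sources, so `GGML_MAX_SRC` would no longer need to scale with the expert count.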