Possible of implementing mamba ssm #4353
Comments
Has anyone picked this up? I'm excited and would like to use it in my current llama.cpp stack.
I hope these Mamba models are actually going to scale well with bigger parameter counts. Still very exciting to see such fast-paced development in the AI space.
Would this be out of scope for llama.cpp? Perhaps it would be more beneficial to make a new library and add it as an example to ggml.
It would be good to be able to host and use Mamba models in GGUF format with all the quantization approaches that can come with it, and people happily build on llama.cpp. I'd be up for working on an implementation, but I need some confirmation that adding it here makes sense. Adding a new model type might be a big pain, though, so moving this to another project is possibly cleaner.
I don't think anything prevents adding this model to llama.cpp.
If anybody is picking this up, they can have a look at an implementation from @johnma2006: https://github.com/johnma2006/mamba-minimal/tree/master
https://github.com/LegallyCoder/mamba-hf |
Interested in this. Would like to use it in LM Studio.
Maybe, but it's not fully finished; it still needs development. Nearly all functions are supported now, except for a few that I'm trying to fix. Sure, if you tell me.
Regarding quantization, this is from the README:
The 2 implementations above only have the convolution mode (good for training) and lack the recurrent mode (good for inference). The original implementation has support for both modes. This is my fork that works on CPU (the original requires CUDA and a GPU). There is also an implementation for tinygrad that supports the recurrent mode.
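To make the difference between the two modes concrete, here is a hedged toy sketch: a single scalar state with hypothetical constant parameters `a`, `b`, `c`. The real Mamba selective scan uses input-dependent parameters, so this only shows the linear-SSM skeleton that both modes share:

```python
def ssm_recurrent(a, b, c, xs):
    """Recurrent mode: O(1) state per step, good for inference.
    h_t = a*h_{t-1} + b*x_t ;  y_t = c*h_t
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def ssm_convolution(a, b, c, xs):
    """Convolution mode: the same outputs computed via an explicit
    kernel K = (c*b, c*a*b, c*a^2*b, ...), good for parallel training."""
    n = len(xs)
    kernel = [c * (a ** i) * b for i in range(n)]
    return [sum(kernel[j] * xs[t - j] for j in range(t + 1)) for t in range(n)]
```

Because the convolution form unrolls the same recurrence into a fixed kernel, training can be parallelized across the sequence, while inference can stay constant-memory per step using the recurrent form.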
Forgive me if I misunderstood the whole Mamba paper, but recurrent mode means it works like an RNN (like RWKV), which also means it has constant memory usage and therefore cannot see "backwards". Which means:
I am not trying to bash the model; I think RNNs are great for many things. I'd just like to understand the limitations correctly.
@cztomsik The state spaces are effectively a compressed memory of the previous conversation history, so in practice you could store a snapshot of the state space at each message. The model can sort of see "backwards", but not like a transformer, which keeps the context in memory and calculates attention for every token. The results are promising, and we should expect to see a scaled-up Mamba quite soon. From discussions with people involved in it, a scaled Mamba is in the works, but not yet announced.
I don't believe this is necessarily the case. Think of a state space as an "embedding" of the previous conversation that contains the compressed information; if it receives a task at the end, it should in theory still have access to the information needed to complete that task. I.e., if it's trained on internet data and long-form conversations, it will learn to retain key facts across different sections of the content.
I also don't see why this would be the case, or how it's any different from raising the temperature in a transformer (which also makes it start to write nonsense text).
This is partly true, but also partly false. It does not see text the same way a transformer does; rather, it sees a compressed "embedding", or vector representation, of the context. Being trained on the internet, it would in theory learn to compress the semantics and meaning of the context in order to become more accurate at predicting the next token.
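The per-message snapshot idea mentioned above can be sketched as follows. This is a hypothetical toy class, not llama.cpp code; the update rule inside `feed_message` is made up and merely stands in for a real recurrent model. The point is that the model's entire memory is a fixed-size state vector, so saving and restoring it is cheap, unlike replaying a transformer's full token history:

```python
import copy

class RecurrentChatSession:
    """Toy sketch: snapshot a fixed-size recurrent state per message
    so a conversation can be rolled back without reprocessing tokens."""

    def __init__(self, state_size=4):
        self.state = [0.0] * state_size  # constant memory, regardless of history length
        self.snapshots = []

    def feed_message(self, tokens):
        # Save the state *before* the message, so rollback undoes it.
        self.snapshots.append(copy.deepcopy(self.state))
        for t in tokens:
            # Made-up decay/update rule standing in for the real model.
            self.state = [0.9 * h + 0.1 * t for h in self.state]

    def rollback(self):
        # Restore the state as it was before the last message.
        self.state = self.snapshots.pop()
```

Note that the snapshot cost is fixed by the state size, whereas a transformer KV-cache snapshot grows linearly with context length.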
Inference of Mamba models in pure C |
Super useful! I think this looks manageable, and we have pretty much all the ops already implemented. I think we just lack GPU kernels for
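For context on what such a kernel computes: the core of Mamba's recurrent mode is a scan whose discretization parameters depend on each input token. The sketch below (my own simplification to a single scalar state per channel, with a simple Euler-style discretization rather than the paper's exact ZOH formulas) shows why this cannot be precomputed as a fixed convolution kernel:

```python
import math

def selective_scan(xs, deltas, A, Bs, Cs):
    """Simplified sketch of a selective scan: the step size (delta) and
    the B, C projections vary per token, so the effective transition
    a_bar and input scale b_bar change at every step."""
    h = 0.0
    ys = []
    for x, dt, B, C in zip(xs, deltas, Bs, Cs):
        a_bar = math.exp(dt * A)  # discretized state transition
        b_bar = dt * B            # simplified (Euler-style) input discretization
        h = a_bar * h + b_bar * x
        ys.append(C * h)
    return ys
```

A GPU kernel would essentially fuse this loop across channels and state dimensions; the input dependence of `a_bar` and `b_bar` is what makes the scan "selective".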
I've been working on this during the past week (for the curious: https://github.com/compilade/llama.cpp/commits/support-mamba-ssm/). Today, I managed to make the recurrent mode work. I'll make a pull request in a few days. My branch is not yet ready for review; I still need to fix and clean up a few things.
Let's go!! 😄
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
llama.cpp is a frontier project in edge inference, and Mamba SSM could have a huge impact here if the quality of the architecture is as good as claimed. It would boost edge inference to another level.
Motivation
https://arxiv.org/abs/2312.00752