
Possible of implementing mamba ssm #4353

Closed
4 tasks done
tikikun opened this issue Dec 7, 2023 · 17 comments · Fixed by #5328
Labels
enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments

@tikikun
Contributor

tikikun commented Dec 7, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

For a frontier project in edge inference like llama.cpp, Mamba SSM could have a huge impact if the quality of the architecture is as good as claimed. It would boost edge inference to another level.

Motivation

https://arxiv.org/abs/2312.00752

tikikun added the enhancement (New feature or request) label on Dec 7, 2023
ggerganov added the help wanted (Extra attention is needed) label on Dec 7, 2023
@C0deMunk33

Has anyone picked this up? I'm excited and would like to use it in my current llama.cpp stack.

@paryska99

I hope these Mamba models are actually going to scale well with bigger parameter counts. Still very exciting to see such fast-paced development in the AI space.

@bachittle
Contributor

bachittle commented Dec 11, 2023

Would this be out of scope for llama.cpp? Perhaps it would be more beneficial to make a new library and add it as an example to ggml.

@ekg
Contributor

ekg commented Dec 13, 2023

It would be good to be able to host and use Mamba models in GGUF format with all the quantization approaches they can be combined with, and people happily build on llama.cpp. I'd be up for working on an implementation, but I need some confirmation that adding it here makes sense; adding a new model type might be a big pain, so moving this to another project is possibly cleaner.

@ggerganov
Owner

I don't think anything prevents us from adding this model to llama.cpp, though I don't have a good estimate of the amount of effort it would require. I suppose we will give it a shot sometime in the future, after dealing with some of the higher-priority things, unless of course there is help from the community in the meantime.

@rahuldshetty

If anybody is picking this up, they can have a look at an implementation from @johnma2006: https://github.com/johnma2006/mamba-minimal/tree/master

@LegallyCoder

My implementation is here: https://github.com/LegallyCoder/mamba-hf

@AlbertMarashi

AlbertMarashi commented Dec 27, 2023

Interested in this. Would like to use in LM Studio

@LegallyCoder

Interested in this. Would like to use in LM Studio

Maybe, but it's not fully finished; it still needs development. By now almost all functions are supported, except for a few that I'm still trying to fix. Sure, if you tell me.

@kroggen

kroggen commented Dec 31, 2023

Regarding quantization, this is from the README:

Precision

Our models were trained using PyTorch AMP for mixed precision. AMP keeps model parameters in float32 and casts to half precision when necessary. On the other hand, other frameworks like DeepSpeed store parameters in float16 and upcasts when necessary (e.g. for optimizer accumulation).

We've observed that higher precision for the main model parameters may be necessary, because SSMs are sensitive to their recurrent dynamics. If you are experiencing instabilities, as a first step please try a framework storing parameters in fp32 (such as AMP).
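For a llama.cpp port, one option (a hedged sketch only, with illustrative tensor-name filters rather than the actual llama.cpp API) would be to keep the small, precision-sensitive SSM tensors in F32 even when the large projection matrices are quantized:

```c
#include <string.h>
#include "ggml.h"

// Hypothetical sketch: when quantizing a Mamba model, force the small,
// precision-sensitive SSM tensors to stay in F32 while the large projection
// matrices use the requested quantization. The tensor-name substrings below
// are illustrative, not the actual llama.cpp names.
static enum ggml_type choose_tensor_type(const char * name, enum ggml_type qtype) {
    if (strstr(name, "ssm_a")  != NULL ||  // A (stored in log space)
        strstr(name, "ssm_d")  != NULL ||  // D skip connection
        strstr(name, "ssm_dt") != NULL ||  // delta (dt) projection
        strstr(name, "conv1d") != NULL) {  // short causal conv weights
        return GGML_TYPE_F32;              // keep the recurrence inputs in full precision
    }
    return qtype;                          // e.g. GGML_TYPE_Q4_K for the big matmuls
}
```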

@kroggen

kroggen commented Dec 31, 2023

The two implementations above only have the convolution mode (good for training) and lack the recurrent mode (good for inference).

The original implementation has support for both modes. You can pay attention to the inference_params and _decoding_cache used in the recurrent mode, as well as mamba_simple.py

This is my fork that works on CPU (the original requires CUDA and a GPU)

There is an implementation for tinygrad that also has support for the recurrent mode
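For reference, the per-token work of the recurrent mode is small; here is a minimal sketch roughly following the recurrence in mamba.c (names and memory layout are illustrative, and the real code also handles the conv1d state, projections, and SiLU):

```c
#include <math.h>

// One recurrent-mode step of the Mamba selective SSM for a single token.
//   d_inner: number of channels, d_state: SSM state size (16 for mamba-130m)
//   h:  [d_inner * d_state] recurrent state, updated in place
//   x:  [d_inner] input for this token (after in_proj + conv1d + SiLU)
//   dt: [d_inner] softplus(delta) for this token
//   A:  [d_inner * d_state] (negative) state matrix; B, C: [d_state]
//   D:  [d_inner] skip connection; y: [d_inner] output
static void ssm_step(float *h, float *y,
                     const float *x, const float *dt,
                     const float *A, const float *B, const float *C,
                     const float *D, int d_inner, int d_state) {
    for (int d = 0; d < d_inner; d++) {
        float acc = 0.0f;
        for (int n = 0; n < d_state; n++) {
            const int   i  = d * d_state + n;
            const float dA = expf(dt[d] * A[i]);  // discretized A
            const float dB = dt[d] * B[n];        // discretized B (input-dependent)
            h[i] = h[i] * dA + dB * x[d];         // state update: O(1) memory per token
            acc += C[n] * h[i];                   // readout
        }
        y[d] = acc + D[d] * x[d];
    }
}
```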

@cztomsik
Contributor

cztomsik commented Jan 9, 2024

Forgive me if I misunderstood the whole Mamba paper, but recurrent mode means it's like an RNN (like RWKV), which also means it has constant memory usage and therefore cannot see "backwards".

Which means:

  • you need to rephrase tasks in prefix way (not a big issue)
  • it's way more sensitive to sampling (randomness can help but it can also damage the state significantly)
  • it might derail chat conversation in a totally useless way, because no matter what you see on the screen, it might not be in the "state" anymore.

I am not trying to bash the model; I think RNNs are great for many things. I'd just like to understand the limitations correctly.


@AlbertMarashi

AlbertMarashi commented Jan 10, 2024

@cztomsik the state spaces are effectively a compressed memory of the previous conversation history, so in practice you could store a snapshot of the state-space at each message. The model can sort of see "backwards" but not like a transformer which keeps the context in memory and calculates attention for every token.

The results are promising and we should expect to see a scaled up Mamba quite soon

A scaled-up Mamba is in the works, based on discussions with people involved in it, but it has not yet been announced.

you need to rephrase tasks in prefix way (not a big issue)

I don't believe this is necessarily the case. Think of a state-space as an "embedding" of the previous conversation that contains the compressed information; if it receives a task at the end, it should in theory still have access to the information needed to complete that task.

I.e., if it's trained on internet data and long-form conversations, it will learn to retain key facts across different sections of the content.

it's way more sensitive to sampling (randomness can help but it can also damage the state significantly)

I also don't see why this would be the case, or how it's any different from raising the temperature in a transformer (which also makes it start to write nonsense text).

it might derail chat conversation in a totally useless way, because no matter what you see on the screen, it might not be in the "state" anymore.

This is partly true, but also partly false. It does not see text in the same way that a transformer does; rather, it sees a compressed "embedding" or vector space of the context. Being trained on the internet, it would in theory learn how to compress the semantics and meaning of the context in order to become more accurate at predicting the next token.
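To illustrate the snapshot idea: because the state has a fixed size, a frontend could in principle just deep-copy it per message. A hypothetical sketch (the struct fields and names are made up for the example, not an existing llama.cpp API):

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical sketch of snapshotting Mamba's fixed-size recurrent state so a
// chat frontend could roll back to an earlier message. Field names and sizes
// are illustrative; they do not correspond to an existing llama.cpp API.
typedef struct {
    float *conv_state;  // per layer: d_inner x (d_conv - 1) rolling conv window
    float *ssm_state;   // per layer: d_inner x d_state SSM hidden state
    size_t conv_elems;  // total number of floats in conv_state
    size_t ssm_elems;   // total number of floats in ssm_state
} mamba_state;

// Deep-copy the current state (e.g. once per chat message) so it can be
// restored later instead of re-processing the whole conversation.
static mamba_state mamba_state_snapshot(const mamba_state *src) {
    mamba_state dst = *src;
    dst.conv_state = malloc(src->conv_elems * sizeof(float));
    dst.ssm_state  = malloc(src->ssm_elems  * sizeof(float));
    memcpy(dst.conv_state, src->conv_state, src->conv_elems * sizeof(float));
    memcpy(dst.ssm_state,  src->ssm_state,  src->ssm_elems  * sizeof(float));
    return dst;
}
```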

@kroggen

kroggen commented Jan 14, 2024

Inference of Mamba models in pure C

https://github.com/kroggen/mamba.c

@ggerganov
Owner

ggerganov commented Jan 14, 2024

https://github.com/kroggen/mamba.c

Super useful! I think this looks manageable, and we've got pretty much all the ops already implemented. I think we only lack GPU kernels for out_prod, but an initial CPU implementation should already be possible. Anyone interested in giving this a try?
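For anyone picking this up, a CPU reference for a plain outer product is straightforward; this sketch only shows the per-element math a GPU kernel would need and does not claim to match ggml_out_prod's exact tensor layout or broadcasting rules:

```c
// Reference outer product y[i][j] = a[i] * b[j] for an n-vector a and m-vector b,
// writing into a row-major n x m output; sketch only, not the ggml tensor API.
static void out_prod_ref(float *y, const float *a, const float *b, int n, int m) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            y[i * m + j] = a[i] * b[j];
        }
    }
}
```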

@compilade
Collaborator

Anyone interested in giving this a try?

I've been working on this during the past week. (for the curious: https://github.com/compilade/llama.cpp/commits/support-mamba-ssm/)
Only two operators were missing: ggml_exp and ggml_soft_plus.

Today, I managed to make the recurrent mode (with --batch-size 1) work and get coherent text from mamba-130m!!! (Thank you @kroggen for laying out a clear implementation in mamba.c.) Currently, a batch size bigger than 1 is not supported; I'm planning to also implement the convolution mode for that.

I'll make a pull request in a few days. My branch is not yet ready for review; I still need to fix and clean up a few things.
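For the curious, the element-wise math behind those two operators (reference sketch only; the actual ggml ops operate on whole tensors):

```c
#include <math.h>

// Per-element reference for the two missing operators.
static float ref_exp(float x) {
    return expf(x);
}

// softplus(x) = log(1 + exp(x)); the max(x, 0) + log1p(exp(-|x|)) form avoids
// overflow for large positive x.
static float ref_soft_plus(float x) {
    return fmaxf(x, 0.0f) + log1pf(expf(-fabsf(x)));
}
```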

@ggerganov
Owner

Let's go !! 😄
