Possible of implementing mamba ssm #4353
Comments
Has anyone picked this up? I'm excited and would like to use it in my current llama.cpp stack.
I hope these Mamba models are actually going to scale well with bigger parameter counts. Still very exciting to see such fast-paced development in the AI space.
Would this be out of scope for llama.cpp? Perhaps it would be more beneficial to make a new library and add it as an example to ggml.
It would be good to be able to host and use Mamba models in GGUF format with all the quantization approaches that can come with it, and people happily build on llama.cpp. I'd be up for working on an implementation, but I need some confirmation that adding it here makes sense. Adding a new model type might be a big pain, though, so moving this to another project is possibly cleaner.
I don't think anything prevents adding this model to llama.cpp.
If anybody is picking this up, they can have a look at an implementation from @johnma2006: https://github.com/johnma2006/mamba-minimal/tree/master
https://github.com/LegallyCoder/mamba-hf |
Interested in this. Would like to use it in LM Studio.
Maybe, but it's not fully finished; it still needs development. Nearly all functions are supported now, except for a few that I'm trying to fix. Sure, if you tell me.
Regarding quantization, this is from the README:
The 2 implementations above only have the convolution mode (good for training) and lack the recurrent mode (good for inference). The original implementation has support for both modes. This is my fork that works on CPU (the original requires CUDA and a GPU). There is also an implementation for tinygrad that supports the recurrent mode.
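To make the difference between the two modes concrete, here is a hedged toy sketch: a single scalar state with hypothetical constant parameters `a`, `b`, `c`. The real Mamba selective scan uses input-dependent parameters, so this only shows the linear-SSM skeleton that both modes share:

```python
def ssm_recurrent(a, b, c, xs):
    """Recurrent mode: O(1) state per step, good for inference.
    h_t = a*h_{t-1} + b*x_t ;  y_t = c*h_t
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def ssm_convolution(a, b, c, xs):
    """Convolution mode: the same outputs computed via an explicit
    kernel K = (c*b, c*a*b, c*a^2*b, ...), good for parallel training."""
    n = len(xs)
    kernel = [c * (a ** i) * b for i in range(n)]
    return [sum(kernel[j] * xs[t - j] for j in range(t + 1)) for t in range(n)]
```

Because the convolution form unrolls the same recurrence into a fixed kernel, training can be parallelized across the sequence, while inference can stay constant-memory per step using the recurrent form.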
Forgive me if I misunderstood the whole Mamba paper, but recurrent mode means it works like an RNN (like RWKV), which also means it has constant memory usage and therefore cannot see "backwards". Which means:
I am not trying to bash the model; I think RNNs are great for many things. I'd just like to understand the limitations correctly.
@cztomsik The state spaces are effectively a compressed memory of the previous conversation history, so in practice you could store a snapshot of the state space at each message. The model can sort of see "backwards", but not like a transformer, which keeps the context in memory and calculates attention for every token. The results are promising, and we should expect to see a scaled-up Mamba quite soon. From discussions with people involved in it, a scaled Mamba is in the works, but not yet announced.
I don't believe this is necessarily the case. Think of a state space as an "embedding" of the previous conversation that contains the compressed information; if it receives a task at the end, it should in theory still have access to the information needed to complete that task. I.e., if it's trained on internet data and long-form conversations, it will learn to retain key facts across different sections of the content.
I also don't see why this would be the case, or how it's any different from raising the temperature in a transformer (which also makes it start to write nonsense text).
This is partly true, but also partly false. It does not see text the same way a transformer does; rather, it sees a compressed "embedding", or vector representation, of the context. Being trained on the internet, it would in theory learn to compress the semantics and meaning of the context in order to become more accurate at predicting the next token.
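The per-message snapshot idea mentioned above can be sketched as follows. This is a hypothetical toy class, not llama.cpp code; the update rule inside `feed_message` is made up and merely stands in for a real recurrent model. The point is that the model's entire memory is a fixed-size state vector, so saving and restoring it is cheap, unlike replaying a transformer's full token history:

```python
import copy

class RecurrentChatSession:
    """Toy sketch: snapshot a fixed-size recurrent state per message
    so a conversation can be rolled back without reprocessing tokens."""

    def __init__(self, state_size=4):
        self.state = [0.0] * state_size  # constant memory, regardless of history length
        self.snapshots = []

    def feed_message(self, tokens):
        # Save the state *before* the message, so rollback undoes it.
        self.snapshots.append(copy.deepcopy(self.state))
        for t in tokens:
            # Made-up decay/update rule standing in for the real model.
            self.state = [0.9 * h + 0.1 * t for h in self.state]

    def rollback(self):
        # Restore the state as it was before the last message.
        self.state = self.snapshots.pop()
```

Note that the snapshot cost is fixed by the state size, whereas a transformer KV-cache snapshot grows linearly with context length.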
Inference of Mamba models in pure C |
Super useful! I think this looks manageable, and we have pretty much all the ops already implemented. I think we just lack GPU kernels for
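For context on what such a kernel computes: the core of Mamba's recurrent mode is a scan whose discretization parameters depend on each input token. The sketch below (my own simplification to a single scalar state per channel, with a simple Euler-style discretization rather than the paper's exact ZOH formulas) shows why this cannot be precomputed as a fixed convolution kernel:

```python
import math

def selective_scan(xs, deltas, A, Bs, Cs):
    """Simplified sketch of a selective scan: the step size (delta) and
    the B, C projections vary per token, so the effective transition
    a_bar and input scale b_bar change at every step."""
    h = 0.0
    ys = []
    for x, dt, B, C in zip(xs, deltas, Bs, Cs):
        a_bar = math.exp(dt * A)  # discretized state transition
        b_bar = dt * B            # simplified (Euler-style) input discretization
        h = a_bar * h + b_bar * x
        ys.append(C * h)
    return ys
```

A GPU kernel would essentially fuse this loop across channels and state dimensions; the input dependence of `a_bar` and `b_bar` is what makes the scan "selective".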
I've been working on this during the past week (for the curious: https://github.com/compilade/llama.cpp/commits/support-mamba-ssm/). Today, I managed to make the recurrent mode work. I'll make a pull request in a few days. My branch is not yet ready for review; I still need to fix and clean up a few things.
Let's go!! 😄
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
llama.cpp is a frontier project in edge inference, and Mamba SSM could have a huge impact here if the quality of the architecture is as good as claimed. It would boost edge inference to another level.
Motivation
https://arxiv.org/abs/2312.00752