
llama : simplify Mamba with advanced batch splits #8526

Merged: 19 commits merged from compilade/batch-splits into master on Aug 21, 2024

Conversation

@compilade (Collaborator) commented on Jul 17, 2024

As promised in #7531 (comment), I've been extracting the advanced batch splits out of the Jamba PR (#7531).

I've also backported the contiguous allocation of recurrent state slots, which makes it possible to also include the changes from #7531 which simplify the ggml operators used specifically for Mamba. Hopefully this isn't too much at once.

See #7531 (comment) for an explanation of the batch splits.

Summary

  • ggml.c
    • Simplify ggml_ssm_conv and ggml_ssm_scan by assuming batched sequences have the same number of new tokens, and that the states are contiguous and ordered correctly.
    • Allow ggml_concat to work with a non-contiguous second argument.
      • The CPU implementation already supported this, but it was guarded with an assertion. Meanwhile, I think the CUDA implementation already supports this too, and does not prevent its usage (not totally sure), so I did not change it.
  • llama.cpp
    • Advanced batch splits handled with lctx.sbatch for persistent buffers
      • Refactor the "helpers for smoother batch API transition" by handling them in llama_sbatch, which avoids repeated allocations by re-using the same buffers.
      • Simple batch splits should be equivalent to the previous behavior and are made with lctx.sbatch.split_simple(n_tokens) to build a llama_ubatch with a max size of n_tokens.
      • Equal-sequence-length splits are made with lctx.sbatch.split_equal(n_tokens) and are used to simplify the operators of recurrent models (see the sketch after this summary).
      • Add llama_ubatch. Similar to llama_batch, but aware of equal-length sequences.
        • Make llama_set_inputs (and others) use llama_ubatch instead of llama_batch.
    • Make recurrent state slot allocation contiguous in llama_kv_cache_find_slot
    • Add llm_build_mamba to build a Mamba block; it is used for Mamba and will also be used for Jamba
    • Add llm_build_copy_mask_state (maybe not a good name) to abstract away the shuffling and masking of recurrent states. Used for Mamba, and it should be usable for other recurrent architectures too.
    • Simplify the sanity checks for qs.n_attention_wv in llama_model_quantize_internal to make it future proof for hybrid models.
    • Reorder the outputs when using advanced batch splits like split_equal in conjunction with llama_get_logits, because the API requires the outputs to be in the same order as in the user-provided batch, not in an order based on batch-split rules.
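To make the new split API concrete, here is a rough sketch of how a decode loop could drive it. This is based only on the names mentioned above; the initialization call (from_batch), the n_ubatch parameter, and the struct members are assumptions and may not match the actual llama.cpp code.

```cpp
// Sketch only (not the actual llama.cpp code): how llama_decode could drive
// the new split API. `from_batch`, `n_ubatch` and the members used here are
// assumptions based on the names in the summary above.
lctx.sbatch.from_batch(batch, n_embd, /*simple_split=*/!kv_self.recurrent, logits_all);

while (lctx.sbatch.n_tokens > 0) {
    // equal-length splits keep ggml_ssm_conv/ggml_ssm_scan simple;
    // simple splits match the previous behavior for non-recurrent models
    llama_ubatch ubatch = kv_self.recurrent
        ? lctx.sbatch.split_equal (n_ubatch)
        : lctx.sbatch.split_simple(n_ubatch);

    // ... find a KV/state slot for the ubatch, build the graph, set inputs, compute ...
}
```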

For simplicity, this does not include the separation of the KV cache and the recurrent state cache. Both still use the same buffers (lctx.kv_self.k_l and lctx.kv_self.v_l, as on master). The separation (necessary for hybrid models) will be introduced at the same time as Jamba.

TODO

  • Test the slot allocation of llama_kv_cache_find_slot with the --hellaswag benchmark in llama-perplexity with a Mamba model
    • This uses lots of parallel sequences in an unusual way, and so I think it's a great stress test.
  • Session file saving and reloading
    • Reloading needs to rebuild the tail metadata for recurrent states. (i.e. which cell is the end of which sequence)
    • The server tests need to pass
  • Make sure T5 still works
  • Make sure the pooled embeddings still work
    • tested bge-small with llama-embedding using parallel prompts with --pooling cls, --pooling last, and --pooling mean; results exactly match master.
  • Make sure Gemma's sliding window mask still works
  • Decide whether to rename llama_reorder_outputs to llama_output_reorder and move it close to llama_output_reserve.
    • renamed and moved

Future ideas

  • whole-sequence splits for embeddings
  • handle pooling types like cls and last within the ubatch.outputs when splitting a batch; inp_cls is redundant with inp_out_ids.

* llama : advanced batch splits

  This includes equal-sequence-length batch splits which are useful
  to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jul 17, 2024
compilade added the refactoring and Review Complexity : Medium labels on Jul 17, 2024
compilade marked this pull request as draft on July 17, 2024 01:54
* llama : logits_all has priority over batch->logits

  Otherwise, the server embeddings tests failed.
  This was likely an existing problem but was only detected here
  because of an additional assertion.
@ggerganov (Owner) left a comment

Tested t5-small and it currently segfaults - let me know if you need help with resolving it

github-actions bot added the testing label (everything test related) on Jul 17, 2024
ggerganov force-pushed the compilade/batch-splits branch from 345d590 to 7b7db0b on July 17, 2024 18:37
compilade and others added 2 commits July 17, 2024 14:48
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@ggerganov (Owner)

> Make sure Gemma's sliding window mask still works

The following command produces identical perplexity on master and this branch:

./llama-perplexity \
    -m models/gemma-2-9b/ggml-model-f16.gguf \
    -f build/wikitext-2-raw/wiki.test.raw \
    -ngl 99 -c 8192

Is this enough to confirm the SWA functionality?

@compilade (Collaborator, Author)

> Is this enough to confirm the SWA functionality?

I think so. It might also be relevant to test SWA with parallel sequences (I think this is what using a bigger -b (and -ub?) than -c does with llama-perplexity).

@hackey commented on Jul 23, 2024

Guys, is there any progress in supporting Mamba2 (I'm interested in the new mamba-codestral)?

@compilade (Collaborator, Author) commented on Jul 24, 2024

> Guys, is there any progress in supporting Mamba2 (I'm interested in the new mamba-codestral)?

Still waiting on some upstream changes (see https://huggingface.co/mistralai/mamba-codestral-7B-v0.1/discussions/1), but otherwise I'm beginning to investigate the conversion for Mamba2 models, at least to have some GGUFs (even with no inference support) to experiment with implementing it.

First thing I'm noticing is the lack of metadata in the config.json of Mamba2 models. No state size, no convolution kernel size, no time step rank, and in the case of mamba-codestral-7B-v0.1, no indication that it's a Mamba2 model, except from the tensor names and sizes.
For the state sizes, I guess these are hardcoded in the state-spaces/mamba implementation, in which case I'll hardcode them too and/or find what is used to calculate them.

I've also recently started to simplify the session file save & restore code in llama.cpp (but I'll likely open a separate PR, since I think that refactor is best tested on its own), because I'm noticing that it's often causing me problems to adapt it to changes to the KV cache structure, due to there being at least 4 places needing to be updated and/or considered for each change (read/write + seq read/write). So I'll be unifying these code paths to make them easier to maintain.
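To illustrate the kind of unification meant here (not part of this PR, and none of these names exist in llama.cpp; this is a hypothetical sketch): a single abstract I/O interface, implemented once for writing and once for reading, would let the whole-context and per-sequence paths share one traversal of the cache.

```cpp
#include <cstddef>

// Hypothetical sketch (none of these names exist in llama.cpp): one abstract
// I/O interface implemented twice, once for writing and once for reading,
// so that saving and restoring share a single traversal of the cache state.
struct llama_session_io {
    virtual ~llama_session_io() = default;

    // write or read `size` bytes, depending on the concrete implementation
    virtual void raw(void * data, size_t size) = 0;

    template <typename T> void value(T & v) { raw(&v, sizeof(v)); }
};

// The KV-cache traversal would then be written once against this interface
// and reused by both the whole-context and the per-sequence save/restore
// paths, instead of keeping 4 separate code paths in sync.
```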

@hackey commented on Jul 24, 2024

> Guys, is there any progress in supporting Mamba2 (I'm interested in the new mamba-codestral)?

> Still waiting on some upstream changes (see https://huggingface.co/mistralai/mamba-codestral-7B-v0.1/discussions/1), but otherwise I'm beginning to investigate the conversion for Mamba2 models, at least to have some GGUFs (even with no inference support) to experiment with implementing it.

> First thing I'm noticing is the lack of metadata in the config.json of Mamba2 models. No state size, no convolution kernel size, no time step rank, and in the case of mamba-codestral-7B-v0.1, no indication that it's a Mamba2 model, except from the tensor names and sizes. For the state sizes, I guess these are hardcoded in the state-spaces/mamba implementation, in which case I'll hardcode them too and/or find what is used to calculate them.

> I've also recently started to simplify the session file save & restore code in llama.cpp (but I'll likely open a separate PR, since I think that refactor is best tested on its own), because I'm noticing that it's often causing me problems to adapt it to changes to the KV cache structure, due to there being at least 4 places needing to be updated and/or considered for each change (read/write + seq read/write). So I'll be unifying these code paths to make them easier to maintain.

I also encountered difficulties running mamba-codestral. I tried to run the model with https://github.com/state-spaces/mamba, but there is no config.json in the model repository, and mamba-codestral includes a new tokenizer (v3). Although Mistral writes that the model can be run with state-spaces/mamba, nothing worked for me.

Please see the discussion here:
NVIDIA/TensorRT-LLM#1968
A few hours ago an example for running Mamba appeared:
https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mamba

Maybe this will help development.

@ggerganov (Owner)

> For the state sizes, I guess these are hardcoded in the state-spaces/mamba implementation, in which case I'll hardcode them too and/or find what is used to calculate them.

Yes, we can hardcode initially

> I've also recently started to simplify the session file save & restore code in llama.cpp (but I'll likely open a separate PR, since I think that refactor is best tested on its own), because I'm noticing that it's often causing me problems to adapt it to changes to the KV cache structure, due to there being at least 4 places needing to be updated and/or considered for each change (read/write + seq read/write). So I'll be unifying these code paths to make them easier to maintain.

Sounds good - a separate PR would be easier to review

Regarding Codestral: I want to highlight again the comment by the Mistral team about ngroups = 8: #8519 (comment). It seems important.

@awgr commented on Jul 29, 2024

> Guys, is there any progress in supporting Mamba2 (I'm interested in the new mamba-codestral)?

> Still waiting on some upstream changes (see https://huggingface.co/mistralai/mamba-codestral-7B-v0.1/discussions/1), but otherwise I'm beginning to investigate the conversion for Mamba2 models, at least to have some GGUFs (even with no inference support) to experiment with implementing it.

> First thing I'm noticing is the lack of metadata in the config.json of Mamba2 models. No state size, no convolution kernel size, no time step rank, and in the case of mamba-codestral-7B-v0.1, no indication that it's a Mamba2 model, except from the tensor names and sizes. For the state sizes, I guess these are hardcoded in the state-spaces/mamba implementation, in which case I'll hardcode them too and/or find what is used to calculate them.

> I've also recently started to simplify the session file save & restore code in llama.cpp (but I'll likely open a separate PR, since I think that refactor is best tested on its own), because I'm noticing that it's often causing me problems to adapt it to changes to the KV cache structure, due to there being at least 4 places needing to be updated and/or considered for each change (read/write + seq read/write). So I'll be unifying these code paths to make them easier to maintain.

https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba2_simple.py

This includes some details that may be interesting to you.

* llama : rename llama_reorder_outputs to llama_output_reorder

  Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs
compilade marked this pull request as ready for review on August 8, 2024 01:20
@ggerganov (Owner) left a comment

In the future we should refactor the KV cache using object-oriented design so that the implementation of non-recurrent, recurrent and other modes are better separated and easier to read.
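A minimal sketch of what such a separation could look like. This is purely illustrative; the class and method names below are hypothetical and are not existing llama.cpp symbols.

```cpp
// Hypothetical interface (not existing llama.cpp symbols): decode and graph
// building code would only see the abstract cache, while cell bookkeeping
// differs between the implementations.
struct llama_cache_i {
    virtual ~llama_cache_i() = default;

    // reserve space for the tokens of a ubatch; returns false if it does not fit
    virtual bool find_slot(const llama_ubatch & ubatch) = 0;
    virtual void seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) = 0;
    virtual void clear() = 0;
};

struct llama_cache_attn      : llama_cache_i { /* per-token KV cells     */ };
struct llama_cache_recurrent : llama_cache_i { /* contiguous state slots */ };
```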

@ggerganov (Owner) left an inline review comment

In a follow-up PR we can move the batch structs into llama-batch.h/.cpp and write some unit tests
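For example, such a unit test could look roughly like the sketch below. It assumes the batch structs end up exposed through a llama-batch.h header as suggested, and it guesses at member names like equal_seqs, n_seqs and n_seq_tokens, which may not match the final code.

```cpp
#include <cassert>

#include "llama.h"
#include "llama-batch.h" // hypothetical header from the suggested split

// helper: append one token belonging to a single sequence
static void add_token(llama_batch & batch, llama_token id, llama_pos pos, llama_seq_id seq_id) {
    const int i = batch.n_tokens;
    batch.token   [i]    = id;
    batch.pos     [i]    = pos;
    batch.n_seq_id[i]    = 1;
    batch.seq_id  [i][0] = seq_id;
    batch.logits  [i]    = false;
    batch.n_tokens++;
}

int main() {
    // two sequences of different lengths: 3 tokens in seq 0, 5 tokens in seq 1
    llama_batch batch = llama_batch_init(/*n_tokens=*/8, /*embd=*/0, /*n_seq_max=*/1);
    for (int i = 0; i < 3; ++i) add_token(batch, 1, i, 0);
    for (int i = 0; i < 5; ++i) add_token(batch, 1, i, 1);

    llama_sbatch sbatch;
    sbatch.from_batch(batch, /*n_embd=*/0, /*simple_split=*/false, /*logits_all=*/false);

    // split_equal must give every sequence in the ubatch the same number of new tokens
    llama_ubatch ubatch = sbatch.split_equal(/*n_ubatch=*/8);
    assert(ubatch.equal_seqs);
    assert(ubatch.n_seqs * ubatch.n_seq_tokens == ubatch.n_tokens);

    llama_batch_free(batch);
    return 0;
}
```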

@compilade (Collaborator, Author) commented on Aug 19, 2024

I'll be re-running a few tests before merging this in hopefully less than 2 days. There are now both Mamba-2 and RWKV-v6, which kind of need this to simplify their implementations.

Still, I don't want to accidentally have broken something with the batch splits, so I'll try to convince myself that there is no problem by running more tests.

@compilade (Collaborator, Author) commented on Aug 21, 2024

I've run some tests, and there's a problem: pooled embeddings with Mamba can't work with multiple sequences anymore.

This is because lctx.embd_seq is overwritten at each ubatch, which means it only works if everything fits in a single ubatch. That is not the case when sequences don't have the same length and are split to make them all equal in Mamba's ubatches.

This could be fixed by letting causal embeddings be split over multiple ubatch. I'll try to find a way to do this cleanly.
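To make the failure mode concrete, here is a small standalone illustration of the overwrite described above. It is illustrative only, not the actual llama.cpp code; embd_seq below merely stands in for lctx.embd_seq.

```cpp
#include <cstdio>
#include <map>
#include <vector>

// stand-in for lctx.embd_seq: one pooled embedding per sequence id
static std::map<int, std::vector<float>> embd_seq;

// stand-in for processing one ubatch: the map is rebuilt from scratch
static void process_ubatch(const std::vector<int> & seq_ids) {
    embd_seq.clear(); // <-- wipes the results of previous ubatches of the same batch
    for (int s : seq_ids) {
        embd_seq[s] = std::vector<float>(4, 0.0f); // placeholder pooled embedding
    }
}

int main() {
    // with split_equal, sequences of different lengths land in different ubatches
    process_ubatch({0}); // first ubatch contains sequence 0
    process_ubatch({1}); // second ubatch contains sequence 1

    // only sequence 1 survives; sequence 0's pooled embedding was dropped
    printf("sequences with embeddings: %zu\n", embd_seq.size()); // prints 1, not 2
    return 0;
}
```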

Where the checkbox is checked, it means the behavior is the same as on master or better.

  • perplexity
    • v0-mamba-100k
      • 1 chunk per batch
      • 4 chunks per batch
      • 4 batches per chunk
      • 4 ubatches per batch
    • v0-llama2-100k
      • 1 chunk per batch
      • 4 chunks per batch
      • 4 batches per chunk
      • 4 ubatches per batch
  • llama-embedding
    • t5-small
      • fails with GGML_ASSERT(strcmp(embd->name, "result_norm") == 0 && "missing result_output tensor") as on master.
    • mamba-130M
      • ❌ Does NOT work with more than one sequence anymore
    • bge-small
  • llama-parallel -c 1024 -np 5 -ns 7 --seed 42 --temp 1
    • t5-small
      • fails to decode, as on master (fixed segfault in 652e9b0)
    • stories-MoE
      • works with -c 1024, but segfaults otherwise (as on master)
    • mamba-130M
      • works without problem
  • perplexity --hellaswag (parallel sequences of uneven length, also the only test with batches having more than one seq_id per token)
    • v0-mamba-100k
    • v0-llama2-100k
  • save load state
    • Mamba-130M
      • works (unlike on master), but needs -np 2 for the sequence load test.
    • v0-llama2-100k
    • t5-small
      • fails, as on master
  • quantization (because an assertion was changed over there)
    • OpenELM-270M
    • t5-small
    • Mamba-370M
  • Gemma2 sliding window
    • Gemma2-2B-it perplexity with -c 5120 (its sliding window is 4096), first chunk generates the same perplexity.
    • parallel sliding windows (on more than one sequence)
      • not sure how to test that

* llama : fix Mamba pooled embeddings with multiple sequences

  Until the pooled embeddings are refactored to allow splitting
  across ubatches for causal embeddings,
  recurrent models can only process a single sequence per ubatch
  when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

  This will make it easier to more cleanly support RWKV-v6 and Mamba-2.
@compilade (Collaborator, Author)

I've fixed the pooled embeddings problem with Mamba in b264edd by making it only process a single sequence per ubatch. When the sequences are short, this is slightly slower than processing them all at once, unfortunately.

In the future, the pooled embeddings will be refactored to allow causal embeddings to be split across ubatches. It should also be possible to remove inp_cls, because it's redundant with inp_out_ids. LLAMA_POOLING_TYPE_CLS and LLAMA_POOLING_TYPE_LAST could be handled directly when splitting batches, since they only affect which tokens get their output selected. LLAMA_POOLING_TYPE_MEAN will be a bit harder to allow splitting, but since the total number of tokens per sequence per batch is known in advance, there might still be a way.
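As a rough sketch of that idea (hypothetical helper, not existing llama.cpp code): for CLS- and LAST-style pooling, the splitter only needs to flag the right token of each sequence as an output, which is the same information inp_out_ids carries.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: decide per-sequence output tokens while splitting,
// instead of building a separate inp_cls tensor later.
enum class pooling_sel { none, cls, last };

// `output` has one flag per token of the ubatch; `seq_start`/`seq_len`
// describe where one sequence's tokens sit inside it.
static void mark_seq_output(std::vector<int8_t> & output, size_t seq_start, size_t seq_len, pooling_sel p) {
    switch (p) {
        case pooling_sel::cls:  output[seq_start]               = 1; break; // first token of the sequence
        case pooling_sel::last: output[seq_start + seq_len - 1] = 1; break; // last token of the sequence
        case pooling_sel::none: /* outputs are whatever the user requested */ break;
    }
}
```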

I'm postponing that pooled embeddings refactor to another PR. I consider this ready.

compilade mentioned this pull request on Aug 21, 2024
compilade added the merge ready label on Aug 21, 2024
@ggerganov (Owner) left a comment

Great work as always!

Let's merge and resolve any remaining issues from master. I'll follow up with the initial SSM Metal kernels shortly after (#8546)

compilade merged commit a1631e5 into master on Aug 21, 2024
53 checks passed
@awgr commented on Aug 22, 2024 via email

@mann1x commented on Aug 24, 2024

@compilade I get this error when quantizing deepseek2 since the merge of this PR: #9155

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 27, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024