
Data preparation methods #155

Closed
RonanKMcGovern opened this issue Feb 2, 2024 · 11 comments

@RonanKMcGovern

I was somewhat surprised to see that Mamba's context length is finite.

I assume this means Mamba was trained on 2k-token sequences, with the state h reset in some way (e.g. to zero) at the start of each sequence.

Would it be better to feed the value of h at the end of one 2k sequence in as the starting value for the next sequence (which would appear in the next batch)?

In principle, this should make the model capable of handling an almost unbounded context length. Or am I misunderstanding something?
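
Concretely, I'm imagining a training loop like the following (just a sketch, assuming a hypothetical `model(inputs, initial_state=...)` API that returns `(logits, final_state)` — I don't know whether the current code exposes anything like this):

```python
import torch
import torch.nn.functional as F

def train_on_long_document(model, optimizer, token_chunks):
    """Truncated-BPTT-style loop: carry the final SSM state of one 2k chunk
    into the next chunk of the same document, detaching it between chunks so
    the autograd graph does not grow without bound."""
    state = None  # None -> model starts from its default (zero) state
    for inputs, targets in token_chunks:  # consecutive 2k-token slices of one document
        logits, state = model(inputs, initial_state=state)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        state = state.detach()  # keep the value, drop the gradient history
    return state
```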

@tridao
Collaborator

tridao commented Feb 2, 2024

That'd be interesting to try! Transformer-XL did something similar.
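
(For reference, the Transformer-XL-style recurrence looks roughly like this: cache the previous segment's activations with gradients stopped and let the current segment attend over them. This is a simplified sketch with illustrative names, omitting relative position encodings and causal masking.)

```python
import torch
import torch.nn as nn

class SegmentRecurrentAttention(nn.Module):
    """Simplified segment-level recurrence: the current segment attends over
    [cached previous segment || current segment]."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, memory=None):
        # x: (batch, seg_len, d_model); memory: (batch, mem_len, d_model) or None
        context = x if memory is None else torch.cat([memory, x], dim=1)
        out, _ = self.attn(query=x, key=context, value=context)
        new_memory = x.detach()  # cache current activations; no gradient flows back
        return out, new_memory
```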

@radarFudan

This stateful approach seems to be common in RNN models (https://arxiv.org/abs/1803.00144).

@radarFudan

For example, the well-known AWD-LSTM language model from Salesforce uses this idea: https://github.com/salesforce/awd-lstm-lm/blob/master/model.py
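
The pattern there is roughly the following (a sketch of the idea, not the repo's actual code): the hidden state from one batch is detached and reused as the initial state for the next batch, so information flows forward while gradients are truncated at batch boundaries.

```python
import torch
import torch.nn as nn

def repackage_hidden(h):
    """Detach hidden states from their history (handles LSTM (h, c) tuples)."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
hidden = None
for batch in torch.randn(10, 8, 32, 64):  # 10 consecutive batches of shape (8, 32, 64)
    output, hidden = lstm(batch, hidden)
    # ... compute loss on `output`, backprop, optimizer step ...
    hidden = repackage_hidden(hidden)
```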

@jzhang38

jzhang38 commented Feb 3, 2024

I had similar ideas. Unfortunately, Mamba currently does not support setting the initial hidden state during training (see #146), though Tri Dao said it will be implemented soon.
I plan to try it as soon as that functionality is ready.
@RonanKMcGovern @tridao @radarFudan or anyone else want to work together on this? My Discord handle is peiyuan007.
(By the way, I have already scaled the Mamba fine-tuning length to 16384, and the test perplexity keeps going down up to ~40K tokens on Proof Pile. I may write a note/blog soon about my findings so far. I'm optimistic that we can realize effectively infinite context length, and this may also lead to something publishable.)

@radarFudan

radarFudan commented Feb 3, 2024

I am also interested in this! I have a slow workaround for setting the hidden state during training (#51).

@jzhang38

jzhang38 commented Feb 3, 2024

Yeah, the simplest way is to substitute selective_scan with a pure PyTorch implementation, which is what I am currently doing.
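
In case it helps, a slow reference version of the scan that accepts an initial state looks roughly like this (a sketch only; the shapes follow what I believe is the kernel's convention, and the real selective_scan_fn has extra options such as delta softplus and z gating):

```python
import torch

def selective_scan_ref(u, delta, A, B, C, D=None, h0=None):
    """Sequential (slow) selective scan with an optional initial state.
    u, delta: (batch, dim, seqlen); A: (dim, dstate);
    B, C: (batch, dstate, seqlen); D: (dim,) or None; h0: (batch, dim, dstate) or None.
    Returns y: (batch, dim, seqlen) and the final state h."""
    batch, dim, seqlen = u.shape
    dstate = A.shape[1]
    h = u.new_zeros(batch, dim, dstate) if h0 is None else h0
    ys = []
    for t in range(seqlen):
        # ZOH-style discretization: h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * u_t
        dA = torch.exp(delta[:, :, t].unsqueeze(-1) * A)
        dBu = delta[:, :, t].unsqueeze(-1) * B[:, :, t].unsqueeze(1) * u[:, :, t].unsqueeze(-1)
        h = dA * h + dBu
        ys.append(torch.einsum("bdn,bn->bd", h, C[:, :, t]))  # y_t = C_t h_t
    y = torch.stack(ys, dim=-1)
    if D is not None:
        y = y + D.unsqueeze(0).unsqueeze(-1) * u  # skip connection
    return y, h
```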

@RonanKMcGovern
Author

@jzhang38 that's cool. How much memory did you need to train at 16k context? I tried on an A100 (80 GB) and ran out of memory. Did you also train all modules or freeze some?

@CompRhys

CompRhys commented Feb 6, 2024

+1 for setting the initial state; it is likely to be important when using the model in industrial control settings.

@jzhang38

jzhang38 commented Feb 9, 2024

@RonanKMcGovern
Author

@jzhang38 that is so good, thanks

@JulienSiems

A related paper just came out: https://arxiv.org/pdf/2406.02080v1
