Data preparation methods #155
Comments
That'd be interesting to try! Transformer-XL did something similar.
This stateful approach seems to be common in RNN models (https://arxiv.org/abs/1803.00144).
For example, the well-known AWD-LSTM language model from Salesforce uses this idea: https://github.com/salesforce/awd-lstm-lm/blob/master/model.py
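For readers following along, the pattern those links describe looks roughly like the sketch below. This is a minimal PyTorch illustration of carrying a recurrent state across consecutive chunks of one long sequence, not code from either repository; the LSTM, the chunk sizes, and the loss are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

# Pretend this is one long sequence processed in consecutive 8-step chunks.
long_sequence = torch.randn(1, 64, 16)
chunks = long_sequence.split(8, dim=1)

hidden = None  # no carried state at the start of the sequence
for chunk in chunks:
    if hidden is not None:
        # Detach so gradients stop at the chunk boundary (truncated BPTT),
        # while the state values still carry context forward.
        hidden = tuple(h.detach() for h in hidden)
    output, hidden = model(chunk, hidden)
    loss = output.pow(2).mean()  # placeholder loss
    loss.backward()
```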
I had similar ideas. Unfortunately, Mamba currently does not support setting the initial hidden state during training.
I am also interested in this! I have a slow workaround for setting hidden states during training (#51).
Yeah, the simplest way is to just substitute selective_scan with a pure PyTorch implementation, which is what I am currently doing.
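To make that workaround concrete, here is one way a pure-PyTorch selective scan that accepts an initial state could look. This is a slow sequential sketch of the recurrence, not the library's fused kernel or its exact API; the argument names and shapes are illustrative, and the D skip connection and z gating are omitted for brevity.

```python
import torch

def selective_scan_with_state(u, delta, A, B, C, h0=None):
    """Naive sequential selective scan in pure PyTorch.

    Illustrative shapes:
      u:     (batch, dim, length)   input
      delta: (batch, dim, length)   step sizes
      A:     (dim, state)           continuous-time state matrix
      B, C:  (batch, state, length) input/output projections
      h0:    (batch, dim, state)    optional initial hidden state
    Returns (y, h_last) so the final state can seed the next chunk.
    """
    batch, dim, length = u.shape
    state = A.shape[1]
    h = A.new_zeros(batch, dim, state) if h0 is None else h0
    ys = []
    for t in range(length):
        dA = torch.exp(delta[:, :, t, None] * A)                        # (batch, dim, state)
        dBu = delta[:, :, t, None] * B[:, None, :, t] * u[:, :, t, None]
        h = dA * h + dBu                                                 # state update
        ys.append(torch.einsum("bds,bs->bd", h, C[:, :, t]))            # readout
    return torch.stack(ys, dim=-1), h                                    # y: (batch, dim, length)
```

Because every step runs in Python, this is far slower than the CUDA kernel, but it is differentiable and lets gradients flow into h0.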
@jzhang38 That's cool. How much memory did you need to train with the 16k context? I tried on an A100 (80 GB) and ran OOM. Did you also train all modules, or did you freeze some?
+1 for setting the initial state; it is likely to be important when using the model in industrial control settings.
@jzhang38 That is so good, thanks.
A related paper just came out: https://arxiv.org/pdf/2406.02080v1
I was somewhat surprised to see that Mamba's context length is finite.
I assume this means Mamba was trained on 2k-token batches, with the state h reset in some way between them.
Would it be better to feed the value of h at the end of one 2k-token sequence in as the starting value for the next sequence (which would appear in the next batch)?
In principle, this should make the model capable of handling almost unlimited context length. Or am I misunderstanding something?
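If someone wants to experiment with this, the data side is the part that needs care: the final state of one chunk is only a valid initial state for the next chunk if each batch row continues the same token stream. A rough sketch of that layout (a hypothetical helper, not part of this repository's data pipeline):

```python
import torch

def contiguous_chunks(token_ids, batch_size, chunk_len):
    """Split one long token stream so that row b of chunk i+1 is the direct
    continuation of row b of chunk i. Only with this layout does it make sense
    to feed chunk i's final state h in as chunk i+1's initial state."""
    n = (len(token_ids) // (batch_size * chunk_len)) * batch_size * chunk_len
    stream = token_ids[:n].view(batch_size, -1)   # each row is a contiguous slice
    return stream.split(chunk_len, dim=1)         # list of (batch_size, chunk_len) chunks

tokens = torch.arange(100_000)                    # stand-in for a tokenized corpus
for chunk in contiguous_chunks(tokens, batch_size=4, chunk_len=2048):
    # forward pass goes here, passing the previous chunk's final h as the initial state
    pass
```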