
Data preparation methods #155

Closed
RonanKMcGovern opened this issue Feb 2, 2024 · 11 comments

@RonanKMcGovern

I was somewhat surprised to see that Mamba's context length is finite.

I assume this means Mamba was trained on 2k-token sequences, with the state h reset in some way (e.g. to zero) at the start of each sequence.

Would it be better to feed the value of h at the end of one 2k sequence in as the starting value for the next sequence (which would appear in the next batch)?

In principle, this should make the model capable of handling an almost unbounded context length. Or am I misunderstanding something?
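
Concretely, I'm imagining a training loop like the following (just a sketch, assuming a hypothetical `model(inputs, initial_state=...)` API that returns `(logits, final_state)` — I don't know whether the current code exposes anything like this):

```python
import torch
import torch.nn.functional as F

def train_on_long_document(model, optimizer, token_chunks):
    """Truncated-BPTT-style loop: carry the final SSM state of one 2k chunk
    into the next chunk of the same document, detaching it between chunks so
    the autograd graph does not grow without bound."""
    state = None  # None -> model starts from its default (zero) state
    for inputs, targets in token_chunks:  # consecutive 2k-token slices of one document
        logits, state = model(inputs, initial_state=state)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        state = state.detach()  # keep the value, drop the gradient history
    return state
```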

@tridao
Collaborator

tridao commented Feb 2, 2024

That'd be interesting to try! Transformer-XL did something similar.
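
(For reference, the Transformer-XL-style recurrence looks roughly like this: cache the previous segment's activations with gradients stopped and let the current segment attend over them. This is a simplified sketch with illustrative names, omitting relative position encodings and causal masking.)

```python
import torch
import torch.nn as nn

class SegmentRecurrentAttention(nn.Module):
    """Simplified segment-level recurrence: the current segment attends over
    [cached previous segment || current segment]."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, memory=None):
        # x: (batch, seg_len, d_model); memory: (batch, mem_len, d_model) or None
        context = x if memory is None else torch.cat([memory, x], dim=1)
        out, _ = self.attn(query=x, key=context, value=context)
        new_memory = x.detach()  # cache current activations; no gradient flows back
        return out, new_memory
```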

@radarFudan

This stateful approach seems to be common in RNN models (https://arxiv.org/abs/1803.00144).

@radarFudan

For example, the well-known AWD-LSTM language model from Salesforce uses this idea: https://github.com/salesforce/awd-lstm-lm/blob/master/model.py
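
The pattern there is roughly the following (a sketch of the idea, not the repo's actual code): the hidden state from one batch is detached and reused as the initial state for the next batch, so information flows forward while gradients are truncated at batch boundaries.

```python
import torch
import torch.nn as nn

def repackage_hidden(h):
    """Detach hidden states from their history (handles LSTM (h, c) tuples)."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
hidden = None
for batch in torch.randn(10, 8, 32, 64):  # 10 consecutive batches of shape (8, 32, 64)
    output, hidden = lstm(batch, hidden)
    # ... compute loss on `output`, backprop, optimizer step ...
    hidden = repackage_hidden(hidden)
```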

@jzhang38

jzhang38 commented Feb 3, 2024

I had similar ideas. Unfortunately, Mamba currently does not support setting the initial hidden state during training (see #146), though Tri Dao said it will be implemented soon.
I plan to try it as soon as that functionality is ready.
@RonanKMcGovern @tridao @radarFudan or anyone else want to work together on this? My Discord handle is peiyuan007.
(By the way, I have already scaled the Mamba fine-tuning length to 16384, and the test perplexity keeps going down up to ~40K tokens on Proof Pile. I may write a note/blog soon about my findings so far. I'm optimistic that we can realize effectively infinite context length, and this may also lead to something publishable.)

@radarFudan

radarFudan commented Feb 3, 2024

I am also interested in this! I have a slow workaround for setting the hidden state during training (#51).

@jzhang38

jzhang38 commented Feb 3, 2024

Yeah, the simplest way is to substitute selective_scan with a pure PyTorch implementation, which is what I am currently doing.
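
In case it helps, a slow reference version of the scan that accepts an initial state looks roughly like this (a sketch only; the shapes follow what I believe is the kernel's convention, and the real selective_scan_fn has extra options such as delta softplus and z gating):

```python
import torch

def selective_scan_ref(u, delta, A, B, C, D=None, h0=None):
    """Sequential (slow) selective scan with an optional initial state.
    u, delta: (batch, dim, seqlen); A: (dim, dstate);
    B, C: (batch, dstate, seqlen); D: (dim,) or None; h0: (batch, dim, dstate) or None.
    Returns y: (batch, dim, seqlen) and the final state h."""
    batch, dim, seqlen = u.shape
    dstate = A.shape[1]
    h = u.new_zeros(batch, dim, dstate) if h0 is None else h0
    ys = []
    for t in range(seqlen):
        # ZOH-style discretization: h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * u_t
        dA = torch.exp(delta[:, :, t].unsqueeze(-1) * A)
        dBu = delta[:, :, t].unsqueeze(-1) * B[:, :, t].unsqueeze(1) * u[:, :, t].unsqueeze(-1)
        h = dA * h + dBu
        ys.append(torch.einsum("bdn,bn->bd", h, C[:, :, t]))  # y_t = C_t h_t
    y = torch.stack(ys, dim=-1)
    if D is not None:
        y = y + D.unsqueeze(0).unsqueeze(-1) * u  # skip connection
    return y, h
```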

@RonanKMcGovern
Author

@jzhang38 that's cool. How much memory did you need to train at 16k context? I tried on an A100 (80 GB) and ran out of memory. Did you also train all modules or freeze some?

@CompRhys

CompRhys commented Feb 6, 2024

+1 for setting the initial state; it is likely to be important when using the model in industrial control settings.

@jzhang38

jzhang38 commented Feb 9, 2024

@RonanKMcGovern
Author

@jzhang38 that is so good, thanks

@JulienSiems

A related paper just came out: https://arxiv.org/pdf/2406.02080v1
