Reformer #3351

Conversation
Force-pushed from 60e5d9c to 2f3afad
Force-pushed from a4b0cce to 19d6f70
Force-pushed from feca999 to 5037574
Experiment
I tested training the Reformer model on 0.5M tokens per sample on the novel "Crime and Punishment" using conventional LM training. I essentially translated the official trax notebook (https://colab.research.google.com/github/google/trax/blob/master/trax/models/reformer/text_generation.ipynb) into Hugging Face code: https://colab.research.google.com/drive/1jR6hA2CQXDbucJXdiDXhmxmyoQmM2Pws The only differences from the official notebook are:
Results
My training starts similarly, around 6.2, and goes down smoothly in the beginning. [Plots: loss, accuracy, learning rate (cosine scheduler).] When lowering the learning rate further, e.g. to 0.0005, the loss keeps going down but only reaches around 2.3 in the end.
Comparison
The training in the official trax notebook is very smooth.
Analysis
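For concreteness, here is a minimal, hypothetical sketch of the kind of training setup described above (AdamW plus a cosine learning-rate schedule). The model configuration, learning rate, warmup steps, and step count are placeholders, not the values used in the linked notebook.

```python
# Hypothetical sketch of conventional LM training with a cosine LR schedule; all values are placeholders.
import torch
from transformers import ReformerConfig, ReformerModelWithLMHead, get_cosine_schedule_with_warmup

config = ReformerConfig(is_decoder=True)   # defaults give axial_pos_shape=(64, 64) -> 4096-token samples
model = ReformerModelWithLMHead(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)   # placeholder learning rate
num_training_steps = 1000                                    # placeholder
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=num_training_steps
)

# One training step on a random fixed-length batch (stand-in for a tokenized text sample).
input_ids = torch.randint(0, config.vocab_size, (1, 4096))
loss = model(input_ids, labels=input_ids)[0]
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```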
Tried to train the model over a longer time, but I'm getting an error.
P: Fixed the typo. I will switch the model to half precision soon so that the memory will be sufficient :-)
I get some good results with the following parameters: https://gist.github.com/flozi00/b491b41a9865733e5f8bb4032c313540 The best eval loss is about 1.654, but it is now increasing again, the same as yours.
Awesome, that's already much better than what I got! If you manage to get it under 1 (loss) / above 75% (accuracy), that would be great. Also feel free to change the hyper-parameters as you wish, especially the Adam betas and co. I also added support for fp16, so the notebook now only needs 8GB of RAM. (You might have to reset the environment and re-install the GitHub branch, though.)
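To illustrate the fp16 switch mentioned above, here is a minimal, hypothetical Trainer sketch. The model, the dummy dataset, and every hyper-parameter value are placeholders rather than the notebook's actual settings, and `fp16=True` requires a CUDA GPU (with apex installed, at the time of this PR).

```python
# Hypothetical sketch: enabling half-precision training via TrainingArguments(fp16=True).
import torch
from transformers import ReformerConfig, ReformerModelWithLMHead, Trainer, TrainingArguments

config = ReformerConfig(is_decoder=True)   # defaults: axial_pos_shape=(64, 64) -> 4096-token samples
model = ReformerModelWithLMHead(config)

class DummyLMDataset(torch.utils.data.Dataset):
    """Tiny random dataset of fixed-length samples, only to keep the sketch self-contained."""
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        ids = torch.randint(0, config.vocab_size, (4096,))
        return {"input_ids": ids, "labels": ids}

args = TrainingArguments(
    output_dir="reformer_fp16_sketch",   # placeholder
    fp16=True,                           # half precision roughly halves activation memory
    learning_rate=5e-4,                  # placeholder
    max_steps=10,                        # placeholder
)
Trainer(model=model, args=args, train_dataset=DummyLMDataset()).train()
```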
Sounds great. I read that 4 hashes are good and 8 gives the best quality. I've trained with some configurations now, and every time the loss goes down to ~1 but then increases to 4 very fast and stays there for at least 1000 steps.
My guess is that since it's such a small dataset (0.5M tokens is tiny), the model needs very well-calibrated hyperparameters. When the learning rate is low enough, this actually does not happen anymore, but then the loss also only gets to about ~2. I ran very few experiments, though, and didn't do any hyperparameter search. I will check that the gradients are correct in the next days and should hopefully be ready soon.
@patrickvonplaten I'm excited to see a lot of progress here! The loss curves above could be due to poor hyperparameter choice, but they're also very similar to what you see when the reverse pass of the network doesn't match the forward pass. For example, failing to cache hash bucket assignments (for exact re-use in the backward pass) leads to a failure mode with loss rebounds very similar to the figures you posted above. I also once had a bug where the wrong random seed was used for dropout in the backward pass, which IIRC manifested itself in the same way.
Thanks for taking a look @nkitaev. I just found a bug in the …
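For readers following this exchange, here is a minimal, hypothetical sketch (plain PyTorch, not the PR's actual implementation) of the consistency requirement @nkitaev describes: a reversible residual block must cache its RNG seeds so that dropout behaves identically when the inputs are reconstructed from the outputs during the backward pass.

```python
import torch

class ReversibleBlock(torch.nn.Module):
    """Reversible residual block that re-uses cached seeds for exact reconstruction."""
    def __init__(self, f, g):
        super().__init__()
        self.f = f   # e.g. an attention sub-layer (contains dropout)
        self.g = g   # e.g. a feed-forward sub-layer (contains dropout)
        self._f_seed = None
        self._g_seed = None

    @staticmethod
    def _run_with_seed(fn, x, seed):
        torch.manual_seed(seed)   # identical dropout mask on every re-execution
        return fn(x)

    def forward(self, x1, x2):
        # Draw and cache fresh seeds so the backward reconstruction sees the same dropout.
        self._f_seed = int(torch.randint(0, 2**31 - 1, (1,)))
        self._g_seed = int(torch.randint(0, 2**31 - 1, (1,)))
        y1 = x1 + self._run_with_seed(self.f, x2, self._f_seed)
        y2 = x2 + self._run_with_seed(self.g, y1, self._g_seed)
        return y1, y2

    def backward_reconstruct(self, y1, y2):
        # Recover the inputs without stored activations; forgetting to re-use the seeds
        # (or, for LSH attention, the hash buckets) breaks this equality and training degrades.
        with torch.no_grad():
            x2 = y2 - self._run_with_seed(self.g, y1, self._g_seed)
            x1 = y1 - self._run_with_seed(self.f, x2, self._f_seed)
        return x1, x2

# Sanity check: reconstruction matches the original inputs.
f = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Dropout(0.1))
g = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Dropout(0.1))
block = ReversibleBlock(f, g)
x1, x2 = torch.randn(2, 16), torch.randn(2, 16)
y1, y2 = block(x1, x2)
r1, r2 = block.backward_reconstruct(y1, y2)
assert torch.allclose(r1, x1, atol=1e-5) and torch.allclose(r2, x2, atol=1e-5)
```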
Force-pushed from 0335001 to 669ce9c
Force-pushed from 204bc21 to 2551eba
Codecov Report
@@           Coverage Diff            @@
##           master    #3351    +/-   ##
========================================
  Coverage   79.13%   79.13%
========================================
  Files         117      117
  Lines       19517    19517
========================================
  Hits        15444    15444
  Misses       4073     4073
Force-pushed from 22aed4a to 2fddd44
Force-pushed from 9ea0f33 to 4e7252a
@patrickvonplaten Based on your merge, it seems like the input size for each batch is fixed in order to match the product of the axial position embedding sizes? Am I correct?
For training, yes, that's correct. For inference, the input_size can also be smaller. Also check out: https://huggingface.co/transformers/model_doc/reformer.html
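To make the constraint concrete, here is a minimal, hypothetical sketch with `ReformerConfig`; the shapes below are illustrative (the library defaults as I understand them), not a recommendation.

```python
# Hypothetical sketch: during training, the sequence length must equal the product of
# config.axial_pos_shape; at inference time shorter inputs are allowed. Shapes are illustrative.
import torch
from transformers import ReformerConfig, ReformerModelWithLMHead

config = ReformerConfig(
    is_decoder=True,
    axial_pos_shape=(64, 64),       # product = 4096 = required training sequence length
    axial_pos_embds_dim=(64, 192),  # the two factors must sum to hidden_size (256 here)
)
model = ReformerModelWithLMHead(config)

input_ids = torch.randint(0, config.vocab_size, (1, 64 * 64))   # fixed length for training
loss = model(input_ids, labels=input_ids)[0]
```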
* first copy & past commit from Bert and morgans LSH code * add easy way to compare to trax original code * translate most of function * make trax lsh self attention deterministic with numpy seed + copy paste code * add same config * add same config * make layer init work * implemented hash_vectors function for lsh attention * continue reformer translation * hf LSHSelfAttentionLayer gives same output as trax layer * refactor code * refactor code * refactor code * refactor * refactor + add reformer config * delete bogus file * split reformer attention layer into two layers * save intermediate step * save intermediate step * make test work * add complete reformer block layer * finish reformer layer * implement causal and self mask * clean reformer test and refactor code * fix merge conflicts * fix merge conflicts * update init * fix device for GPU * fix chunk length init for tests * include morgans optimization * improve memory a bit * improve comment * factorize num_buckets * better testing parameters * make whole model work * make lm model work * add t5 copy paste tokenizer * add chunking feed forward * clean config * add improved assert statements * make tokenizer work * improve test * correct typo * extend config * add complexer test * add new axial position embeddings * add local block attention layer * clean tests * refactor * better testing * save intermediate progress * clean test file * make shorter input length work for model * allow variable input length * refactor * make forward pass for pretrained model work * add generation possibility * finish dropout and init * make style * refactor * add first version of RevNet Layers * make forward pass work and add convert file * make uploaded model forward pass work * make uploaded model forward pass work * refactor code * add namedtuples and cache buckets * correct head masks * refactor * made reformer more flexible * make style * remove set max length * add attention masks * fix up tests * fix lsh attention mask * make random seed optional for the moment * improve memory in reformer * add tests * make style * make sure masks work correctly * detach gradients * save intermediate * correct backprob through gather * make style * change back num hashes * rename to labels * fix rotation shape * fix detach * update * fix trainer * fix backward dropout * make reformer more flexible * fix conflict * fix * fix * add tests for fixed seed in reformer layer * fix trainer typo * fix typo in activations * add fp16 tests * add fp16 training * support fp16 * correct gradient bug in reformer * add fast gelu * re-add dropout for embedding dropout * better naming * better naming * renaming * finalize test branch * finalize tests * add more tests * finish tests * fix * fix type trainer * fix fp16 tests * fix tests * fix tests * fix tests * fix issue with dropout * fix dropout seeds * correct random seed on gpu * finalize random seed for dropout * finalize random seed for dropout * remove duplicate line * correct half precision bug * make style * refactor * refactor * docstring * remove sinusoidal position encodings for reformer * move chunking to modeling_utils * make style * clean config * make style * fix tests * fix auto tests * pretrained models * fix docstring * update conversion file * Update pretrained_models.rst * fix rst * fix rst * update copyright * fix test path * fix test path * fix small issue in test * include reformer in generation tests * add docs for axial position encoding * finish docs * Update 
convert_reformer_trax_checkpoint_to_pytorch.py * remove isort * include sams comments * remove wrong comment in utils * correct typos * fix typo * Update reformer.rst * applied morgans optimization * make style * make gpu compatible * remove bogus file * big test refactor * add example for chunking * fix typo * add to README
@patrickvonplaten, I wanted to train a Reformer language model on a custom dataset.
Hi @prajwal-PHAI, there are a lot of community notebooks covering T5 fine-tuning.
Thanks @LysandreJik
Hey, thanks for your amazing work! The problem is that it doesn't recognize the apex package:
ImportError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/transformers/trainer.py in train(self, model_path)
ImportError: Please install apex from https://www.github.com/nvidia/apex to use fp16 training.
...though I installed it. Does anyone know what to do?
Linking a related GitHub issue: #16972. cc @patrickvonplaten
Add the Reformer
Paper: https://arxiv.org/pdf/2001.04451.pdf
First steps to take:
- Forward pass: get 1-to-1 identical outputs to the original Trax code for the forward pass (`predict_mem_len` had to be adapted to make the functions equal)
- Backpropagation
- Tokenizer
- Optimize time and memory efficiency
- Pretrained models
  - Check whether the pretrained model on C4 will be added soon: google/trax@b1f0c17
- Add Reformer / BERT in trax
Useful code resources:
Useful blog/paper resources:
Previous Discussions:
Update
The code is clean and ready for review now.
Small ToDos before merging:
Review
I added quite a few docstrings to explain the new methods introduced by the Reformer (Axial Position Encoding, LSH Attention, Local Attention, Feed Forward chunking), so it might be best to go through the docstrings first. The docstrings are easier to read when switching to this branch and building the docs locally.
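Of the methods listed above, feed-forward chunking is the easiest to show in isolation. Below is a minimal, hypothetical sketch in plain PyTorch (not the PR's actual implementation): because the feed-forward layer acts position-wise, applying it to the sequence in chunks yields the same output while lowering peak memory, trading a bit of compute time for it.

```python
import torch

def chunked_feed_forward(ff, hidden_states, chunk_size):
    # hidden_states: (batch, seq_len, hidden). The feed-forward layer is position-wise,
    # so processing seq_len in chunks gives identical results with smaller intermediate tensors.
    if chunk_size <= 0:
        return ff(hidden_states)
    chunks = hidden_states.split(chunk_size, dim=1)
    return torch.cat([ff(chunk) for chunk in chunks], dim=1)

# Sanity check on a random batch.
ff = torch.nn.Sequential(torch.nn.Linear(256, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 256))
x = torch.randn(1, 4096, 256)
assert torch.allclose(chunked_feed_forward(ff, x, chunk_size=512), ff(x), atol=1e-5)
```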