OOM error with new 774M model when running in Colab #108
It is likely not possible to finetune 774M. Discussion here: https://news.ycombinator.com/item?id=20749037 I need to run tests to determine how well it works; if it's not possible, I'll add a bespoke assert to prevent finetuning on it.
Would using FP16 help?
Maybe it could be run on one of these: https://cloud.google.com/compute/all-pricing#gpus ?
There is no magic switch for FP16 in TensorFlow [yet], and the 16 GB VRAM offered by cloud GPUs still isn't enough. If there are any workarounds, I would be interested in them.
Do you need to recompile TensorFlow to use FP16? I have some experience with this; I can tell you how to do it without any particular difficulty.
For reference, this was @AdamDanielKing's answer on Hacker News:
I won't say that I understand everything, though.
@woctezuma That comment only explains how to deploy a trained model, which requires much less GPU memory than training because the gradients aren't stored. @minimaxir is probably right that for training you won't fit a full batch of 774M training samples in the K80 GPU that Colab gives you. @minimaxir You can work around this by training with a smaller batch size but accumulating gradients over several iterations before applying an update to the weights. That achieves a larger effective batch size than can fit in the GPU. This page might be helpful.
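For illustration, a minimal sketch of what gradient accumulation looks like in TF1-style code. The toy graph and names are placeholders, not the repo's actual implementation:

```python
import tensorflow as tf

# Toy stand-in for the training graph; in gpt-2-simple the loss comes from
# the GPT-2 model itself.
x = tf.compat.v1.placeholder(tf.float32, [None, 4])
w = tf.compat.v1.get_variable('w', [4, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

opt = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-4)
params = tf.compat.v1.trainable_variables()
grads = tf.gradients(loss, params)

# One non-trainable accumulator per parameter.
accum = [tf.Variable(tf.zeros_like(p), trainable=False) for p in params]
zero_ops = [a.assign(tf.zeros_like(a)) for a in accum]
accum_ops = [a.assign_add(g) for a, g in zip(accum, grads)]

# Run zero_ops, then accum_ops for `accumulate_steps` mini-batches, then
# apply_op once: the effective batch is several times what fits on the GPU.
accumulate_steps = 4
apply_op = opt.apply_gradients(
    [(a / accumulate_steps, p) for a, p in zip(accum, params)])
```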
The 345M finetuning workflow already uses a batch size of 1 with accumulated gradients. That is the workflow 774M should be using now, with apparently no success.
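For reference, those knobs are exposed on gpt2.finetune(); a usage sketch (the dataset path and step count are placeholder values):

```python
import gpt_2_simple as gpt2

# Assumes gpt2.download_gpt2(model_name='774M') has already been run.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset='corpus.txt',        # placeholder path
              model_name='774M',
              batch_size=1,                # one sample per step on the GPU
              accumulate_gradients=5,      # larger effective batch via accumulation
              steps=1000)
```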
Ah, I see. That's surprising. I know OpenAI uses gradient checkpointing for some of their other work, so I'd bet they use it in their training code for GPT-2 as well. See https://github.com/cybertronai/gradient-checkpointing. Instead of storing all the layer activations at once, this stores a subset of them and recomputes the rest during the backward pass, which significantly reduces memory usage. In my experience it's pretty easy to get that library working, and if you do, it should be effective. Another workaround is to only train with sequences significantly shorter than the maximum of 1024 tokens.
Maybe we should go back and try to implement this using TPUs or multiple GPUs? In your opinion, which option would be preferable so that we don't run into this issue again in the future (when the 1558M-parameter model is released)? (I assume Colab may not be enough for this, of course.)
@AdamDanielKing, this repo took a good chunk of nshepperd's codebase, as @minimaxir has said in the past. This means this repo automatically does gradient checkpointing for anything that is not 117M (see gpt_2.py). @saippuakauppias I already tried it on all GPU/RAM/CPU configurations on GCP. Only after 10-15 failed attempts did I realize it was an issue, which prompted me to go to HN and find @minimaxir and @AdamDanielKing's discussion. And @saippuakauppias, someone has already tried a TPU: [Colab using TPU](https://colab.research.google.com/github/shawwn/gpt-2/blob/tpu/Training_GPT_2_Using_TPUs.ipynb). So far I haven't gotten the best results, although I still have to do some data preprocessing to see what the exact issue is. I'm also getting a pretty high loss on it. @saippuakauppias can you help me recompile TF to only use FP16?
@dantuluri Thanks for pointing this out. It looks like the code only uses one gradient checkpoint, at layer 10: gpt-2-simple/gpt_2_simple/src/model.py Lines 195 to 196 in 4c36ea7
The code is using memory_saving_gradients in 'collection' mode, so it doesn't automatically add any other checkpoints. 774M has 36 layers, so this means the activations of at least 26 layers will be in memory at the same time. I'd suggest adding many more checkpoints or trying the other modes.
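To make the 'collection' mechanism concrete, here is a toy sketch of registering more checkpoints and letting memory_saving_gradients pick them up. This is a standalone toy graph, not model.py, and the import path of the bundled memory_saving_gradients module is an assumption:

```python
import tensorflow as tf
from gpt_2_simple.src import memory_saving_gradients  # assumed bundled module path

# Toy stack of layers standing in for the 36 transformer blocks.
x = tf.compat.v1.placeholder(tf.float32, [None, 16])
h = x
for layer in range(8):
    h = tf.compat.v1.layers.dense(h, 16, activation=tf.nn.relu, name='h%d' % layer)
    if layer % 2 == 0:
        # Register a checkpoint every couple of layers instead of only one
        # fixed layer; 'collection' mode keeps exactly these tensors and
        # recomputes the activations in between during the backward pass.
        tf.compat.v1.add_to_collection('checkpoints', h)
loss = tf.reduce_mean(tf.square(h))

params = tf.compat.v1.trainable_variables()
grads = memory_saving_gradients.gradients(loss, params, checkpoints='collection')
```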
@dantuluri, I misunderstood the discussion on Hacker News (recompilation is not needed).
Can anyone check whether this helps on Colab or on Cloud Run? PS: if you do end up needing to recompile TF, here is the easiest way: https://github.com/yaroslavvb/tensorflow-community-wheels/pull/121/files
If FP16 is indeed in TensorFlow 1.14 via pip, I'll give it a test.
Looking into the code, it seems @minimaxir used https://github.com/cybertronai/gradient-checkpointing for gradient checkpointing. I used the variations:
Here are the definitions of each variation, just for reference:
I think FP16 is probably the way to go if it works, as @minimaxir said.
@saippuakauppias do you know how to use FP16? I'm not too familiar with how to start using it.
@dantuluri No, but now I'm trying to figure out how to use it. An example of how to enable FP16: https://colab.research.google.com/github/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/docs/amp/notebook_v1.14/auto_mixed_precision_demo_cifar10.ipynb
You just need to wrap the optimizer in
Documentation:
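A minimal sketch of what wrapping the optimizer looks like with TF 1.14's automatic mixed precision graph rewrite, presumably along the lines of the linked NVIDIA notebook. The toy graph and values are placeholders, not the repo's code:

```python
import tensorflow as tf

# Toy stand-in graph; in gpt-2-simple the loss comes from the GPT-2 model.
x = tf.compat.v1.placeholder(tf.float32, [None, 8])
y = tf.compat.v1.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(y))

# Wrapping the optimizer lets TF 1.14's automatic mixed precision rewrite
# the graph to run eligible ops in FP16, with loss scaling handled for you.
opt = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-4)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
train_op = opt.minimize(loss)
```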
There should definitely still be more gradient checkpoints. I just ran some tests on a K80, setting accumulate_gradients to 1 and seeing how many samples can fit in a batch without running out of memory.
The only code change is removing the
Unfortunately I'm still struggling to fit 1 sample of 774M into memory, mainly because the
Edit: By the way, adding more checkpoints doesn't have a performance hit, because its only effect is that the checkpointed layer isn't deallocated and recomputed. So you just want to choose the checkpoints in a way that minimizes the peak memory usage.
How does nshepperd's fork deal with this?
@dantuluri He also has the
Interesting.
Update: Currently running on a V100. Will try on lower-VRAM GPUs. Edit: Can't vouch for the quality of training; I just saw it training and figured it works.
@dantuluri Perfect! This is with 774M, right? How many samples fit if you set accumulate_gradients to 1 and vary batch_size? Can you get more than one? I think the boundary between not fitting a sample and fitting one is between 12 GB and 16 GB. I wasn't able to get one of the K80s that Google offers (12 GB) to work, so it seems we still can't train for free on Colab. Edit: One place in particular that seemed helpful to add a checkpoint was at the model's output: gpt-2-simple/gpt_2_simple/gpt_2.py Line 170 in 4c36ea7
output = model.model(hparams=hparams, X=context)
I suggest experimenting with adding
tf.compat.v1.add_to_collection('checkpoints', output['logits'])
right after it and seeing if that increases the number of samples you can fit on the GPU. Edit Sept 21: The line above had a bug but should work now. I'm still not certain that it lowers the overall memory usage, but it's worth trying. Memory peaks around there, and checkpointing seemed to bring the peak usage down while I was playing with the K80.
Updated my code suggestion one last time. ^
At the moment it's training, but I think my input may be wrong, because it's structured like this:
and so on.
Because I'm a bit more familiar with nshepperd's code, do you know where he did the checkpointing (layer == 10)? I got better results training 345M using his code. Otherwise, any help on getting this code to work with my data would be much appreciated.
In regards to GPU memory usage:
In regards to loss:
Quality of results:
In regards to your suggestion:
Maybe @nshepperd has already tried to solve this problem too?
@saippuakauppias That line I gave had a bug. What I meant was to try adding
@AdamDanielKing thanks for the quick reply! Unfortunately, memory consumption has not decreased: Adam = 16.7 GB, SGD = 8.5 GB.
@saippuakauppias In some cases adding a checkpoint will speed things up slightly, but I was mainly trying for a memory decrease. The outcomes of each step will not be changed by adding a checkpoint. Are those numbers the peak memory usage? Memory usage varies significantly throughout each training step, going up and down more than once. I was using a script that showed every (de)allocation and the total memory usage at each point, but I can't seem to remember where I got it.
@AdamDanielKing, no, these are not peak usage. I ran tests with the SM3 optimizer (with
Rechecked Adam and SGD without removing
Currently there is only one way out: use SGD and remove
So I have found a quick-and-dirty way to fine-tune 774M under Colab without OOMs, thanks to this tweet by @basedblue: https://twitter.com/BasedBlue/status/1169601983046672385?s=20 Under this line in train.py
add this line:
This is just tuning h35; according to his tweet here, you can train h15-h35: https://twitter.com/BasedBlue/status/1169600560535916544?s=20 I can't attest to the results yet - running my own tests as we speak - but I can attest that this is the first time I've been able to do any sort of tuning of 774M under Colab. EDIT: I can confirm I was able to do up to 336 layers ("-336" instead of "-12" above) before getting OOMs. I do still get them intermittently at lower numbers, which seems to be a Colab quirk. The results of my finetuning on a set I'm using for benchmarking were markedly better than the 345M model in both validation and test. Also, if you're doing this, make sure to pass the FULL list of vars to your Saver, not the shortened tuning list, or you won't be able to restore into the full model.
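A hedged sketch of the kind of change being described (restricting the trainable-variable list to its last N entries so only the top blocks are finetuned, while the Saver still covers every variable). The toy graph and values are placeholders, not the lines from the tweet or train.py:

```python
import tensorflow as tf

# Toy stand-in for the model graph; in train.py the loss is the GPT-2 loss
# and the variables live under the 'model' scope.
with tf.compat.v1.variable_scope('model'):
    x = tf.compat.v1.placeholder(tf.float32, [None, 8])
    h = x
    for i in range(4):
        h = tf.compat.v1.layers.dense(h, 8, name='h%d' % i)
    loss = tf.reduce_mean(tf.square(h))

all_vars = [v for v in tf.compat.v1.trainable_variables() if 'model' in v.name]
train_vars = all_vars[-2:]   # in train.py this would be e.g. [-12:] or [-336:]

opt = tf.compat.v1.train.AdamOptimizer(learning_rate=2e-5)
train_op = opt.minimize(loss, var_list=train_vars)   # gradients only for the top vars

# Pass the FULL variable list to the Saver so the checkpoint can still be
# restored into the full model.
saver = tf.compat.v1.train.Saver(var_list=all_vars)
```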
Here's my notebook that demonstrates fine-tuning 774M with reduced parameters in Colab: https://github.com/jkraybill/gpt-2/blob/finetuning/GPT2-finetuning2-774M.ipynb You may need to adjust the "layers_to_train" parameter; I've had success with numbers up to 336, but still get intermittent OOMs before training starts with values between 200 and 336. 200 or less has always worked so far, but the results I got with 336 seemed more coherent for text generation tuned on a large corpus than results at 200 and below. On a very non-scientific experimental benchmark, which is a randomly selected set of 100 hard trivia questions, after fine-tuning on a large trivia question set, 117M scored 6/100, 345M scored 11/100, and 774M scored 19/100 and 20/100 in two different runs. (I'm decent at trivia and my score was 43/100.) Interestingly, these scores scale closely with model size, so at this rate the full-size model (if it can be fine-tuned) will score around what I scored.
That's a nice approach in some ways. It may even produce better results, because the risk of overfitting is lower when training fewer parameters.
I got pretty poor results when training on just 12 vars, but with 336 the results seemed superior to 345M for the tests I ran. As the number goes lower, unconditional output seems to deviate from the training set format more than when doing "full stack" training on 345M. This is very preliminary and I have a lot more testing to do!
Checked memory usage on a V100 (32 GB):
PS: I watched memory usage only for the first 100 steps (as before). I don't know why the memory figures coincided; this is not a mistake. PPS: An interesting thing from the Mixed Precision log:
SGD:
SM3:
I forgot to mention an interesting detail. When I started testing the V100, I accidentally started tuning without using the GPU. The server had a weak 8-core processor and 64 GB of memory, but the training worked (Adam optimizer, no changes). Very slow, but it runs on the CPU using less than 32 GB of RAM.
Can anyone adapt the Adafactor optimizer for this repository?
https://arxiv.org/abs/1804.04235 https://github.com/ConnorJL/GPT2/blob/master/optimizers.py
My quick and stupid implementation of the Adafactor optimizer + removing
I will now check the memory usage on the V100 (32 GB) and write the results here.
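Not the actual patch, but a rough sketch of how an additional optimizer choice could be wired into the finetuning code; the 'adafactor' branch and the AdafactorOptimizer class are assumptions (e.g. a local copy of the linked ConnorJL/GPT2 optimizers.py):

```python
import tensorflow as tf

def build_optimizer(name, learning_rate):
    # 'adam' and 'sgd' mirror the choices discussed in this thread; the
    # 'adafactor' branch assumes an AdafactorOptimizer class is available.
    if name == 'adam':
        return tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)
    if name == 'sgd':
        return tf.compat.v1.train.GradientDescentOptimizer(learning_rate=learning_rate)
    if name == 'adafactor':
        from optimizers import AdafactorOptimizer  # assumed local copy of optimizers.py
        return AdafactorOptimizer(learning_rate=learning_rate)
    raise ValueError('unknown optimizer: %s' % name)
```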
A few other tests, for example:
Can anyone recommend what other settings (like turning off
Nice work. I'm currently experimenting with adafactor, train_vars numbers, and also checkpointing less frequently than every single layer. Checkpointing seems to have resulted in a 5-10x slowdown in training in exchange for better memory usage, so I'm trying to find a happy medium. The good news is that it's clear that combinations of adafactor, train_vars, and layer checkpointing definitively allow finetuning of 774M! I don't have any empirical results from adafactor on my trivia set yet, but will post them in an edit here once I have them. I have heard that using non-Adam optimizers can greatly hurt the learning process, so we'll see what comes out.
This progress looks good. :) Though I still think it would be more informative to look at peak memory usage rather than what nvidia-smi is giving. The memory usage varies substantially over a single training step and it's really the peak usage that decides if the program will crash. One method is shown here with the
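One way to read the peak GPU memory in TF 1.x (an assumption; not necessarily the method referenced above) is the contrib memory_stats op:

```python
import tensorflow as tf

# Reports the GPU allocator's high-water mark, i.e. peak bytes in use so far.
peak_bytes_op = tf.contrib.memory_stats.MaxBytesInUse()

# Inside the existing training loop, after running a few steps:
# peak_gb = sess.run(peak_bytes_op) / 1e9
# print('Peak GPU memory: %.2f GB' % peak_gb)
```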
@jkraybill I think most of that slowdown must be coming from somewhere else. Checkpointing should result in each operation being computed at most twice, so I don't think it should be possible to get more than a 2x slowdown, especially since only the forward step is recomputed. The gradient checkpointing repo suggests a 1.2x slowdown.
@AdamDanielKing thanks for the tip -- I need to spend a lot more time with the code to determine the cause. I updated my notebook codebase from a quite old @Tenoke fork to the most recent @nshepperd fork to use adafactor etc., so it's quite likely the cause is from other changes in the codebase. The most glaring change was the checkpointing, but I agree with you that 5-10x doesn't agree with what should be happening. I'll do a bunch more testing and try to pin it down. I've done a little cleanup and the code is less slow, but still maybe 4-5x slower than before. I'll figure it out and share results here. BTW, thank you so much for TalkToTransformer -- it's by far been the most useful tool for getting non-AI people around me interested in this stuff, and is a great implementation!
@saippuakauppias Thanks so much for sharing. I noticed that the
I delete this line with a bash script on initial setup (check your package installation path in
PYTHON_VERSION=`python3 -c 'import sys; print("%s.%s" % (sys.version_info.major, sys.version_info.minor))'`
GPT2PY_PATH="/usr/local/lib/python${PYTHON_VERSION}/dist-packages/gpt_2_simple/gpt_2.py"
NEED_FIX=`cat ${GPT2PY_PATH} | grep 'cannot finetune the 774M' | wc -l`
if [ ${NEED_FIX} -ne 0 ]
then
mv ${GPT2PY_PATH} gpt_2.py
grep -vwE 'cannot finetune the 774M' gpt_2.py > ${GPT2PY_PATH}
if [ $? -ne 0 ]
then
echo -e "FAILED grep"
exit 1
fi
rm gpt_2.py
fi
For the benefit of people after me, I can confirm that the
If there's a PR, I can test and merge it (as long as there aren't any other preconditions/hacks outside of the package required to get it working).
@minimaxir, all that is needed now is to add additional checkpoints and use the SGD optimizer. What do you think about adding the SM3 and Adafactor optimizers?
Can anyone help me implement the RectifiedAdam optimizer (https://arxiv.org/pdf/1908.03265v1.pdf)? Or the Lookahead optimizer (https://arxiv.org/abs/1907.08610v1)? Found them here. I'm trying to check RAdam here: https://colab.research.google.com/drive/1waCfxIgrrY-s4gZm7R3uEZDAk2hdCKVK (install gpt-2-simple from my fork)
Does anyone know why nonsense is generated after training with the SGD optimizer? Here are two of my Colabs with examples for comparing the generation results (at the end of each Colab). SGD: https://colab.research.google.com/drive/1LjZ8Z7oIjQVdgDM1FsWLusjbMXW925eV
Will it work in Colab Pro?
I have a similar issue with the 355M model in a Colab notebook (even in Colab Pro) with a small training dataset file: #277 Please help; I wanted to show this process to students, and until last month everything worked (I had been using this Colab notebook for a long time without any issues). I am trying to reproduce a training session that worked perfectly in August 2021, and now it doesn't work.
Recently, Google made it so that you generally only get K80s on the free plan, which with 12 GB VRAM didn't work well with the 355M model, so that might explain it.
Thank you! That explains it indeed - 124M works without issues. Still, I'm using Colab Pro and wonder why 355M doesn't work for me... But 124M is also fine for my experiments (at least it's working) :)
I made 355M work on Colab Pro, but used gpt-2-simple==0.7.2.
I did this so I could use
which are not available in tensorflow>2. I have no idea if this is the right approach, though. Here's the caveat:
When running the sess command, I'm getting an OOM issue. Not sure if the new large model is too large for Colab?
WARNING: Logging before flag parsing goes to stderr.
W0820 16:58:18.137592 140704259733376 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/sample.py:17: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
ResourceExhaustedError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1355 try:
-> 1356 return fn(*args)
1357 except errors.OpError as e:
7 frames
ResourceExhaustedError: OOM when allocating tensor with shape[50257,1280] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/wte/Initializer/random_normal/RandomStandardNormal}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
ResourceExhaustedError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1368 pass
1369 message = error_interpolation.interpolate(message, self._graph)
-> 1370 raise type(e)(node_def, op, message)
1371
1372 def _extend_graph(self):
ResourceExhaustedError: OOM when allocating tensor with shape[50257,1280] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/wte/Initializer/random_normal/RandomStandardNormal (defined at /usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/model.py:185) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Original stack trace for 'model/wte/Initializer/random_normal/RandomStandardNormal':
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in
app.launch_new_instance()
File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 477, in start
ioloop.IOLoop.instance().start()
File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 888, in start
handler_func(fd_obj, events)
File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
self._handle_recv()
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
self._run_callback(callback, msg)
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
handler(stream, idents, msg)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes
if self.run_code(code, result):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 12, in
save_every=500
File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/gpt_2.py", line 170, in finetune
output = model.model(hparams=hparams, X=context)
File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/model.py", line 185, in model
initializer=tf.compat.v1.random_normal_initializer(stddev=0.02))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1496, in get_variable
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1239, in get_variable
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 562, in get_variable
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 514, in _true_getter
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 929, in _get_single_variable
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 259, in __call__
return cls._variable_v1_call(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 220, in _variable_v1_call
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 198, in
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 2511, in default_variable_creator
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 263, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 1568, in __init__
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 1698, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 901, in
partition_info=partition_info)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops.py", line 323, in call
shape, self.mean, self.stddev, dtype, seed=self.seed)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/random_ops.py", line 79, in random_normal
shape_tensor, dtype, seed=seed1, seed2=seed2)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 728, in random_standard_normal
seed2=seed2, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()