Hello, I am training with the PyTorch engine on an NVIDIA T4 GPU, using a model with residual blocks and batch normalization layers, and I found that GPU memory just keeps increasing. In my program I need to store the training data in a new `ArrayDataset` object on each iteration. I want to release the resources used at the end of each iteration, including every batch drawn from the `ArrayDataset`, so I close the `Trainer`, `Predictor`, and `Model` at the end of each iteration. Besides that, I train with `EasyTrain.fit`, which closes every batch used during training.
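For context, here is a minimal sketch of the per-iteration pattern I described (the block, shapes, loss, and batch size are simplified placeholders, not my actual code):

```java
import ai.djl.Model;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;
import ai.djl.nn.Block;
import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.dataset.ArrayDataset;
import ai.djl.training.loss.Loss;

void runIteration(NDManager rootManager, Block block, float[][] features, float[][] labels)
        throws Exception {
    // Sub-manager so the NDArrays backing this iteration's data are freed with it
    try (NDManager manager = rootManager.newSubManager()) {
        // A new ArrayDataset is built on every iteration
        ArrayDataset dataset = new ArrayDataset.Builder()
                .setData(manager.create(features))
                .optLabels(manager.create(labels))
                .setSampling(32, true, true) // batchSize, shuffle, dropLast
                .build();

        // Model and Trainer are closed at the end of every iteration
        try (Model model = Model.newInstance("my-model")) {
            model.setBlock(block);
            try (Trainer trainer = model.newTrainer(new DefaultTrainingConfig(Loss.l2Loss()))) {
                trainer.initialize(new Shape(32, features[0].length));
                // EasyTrain.fit closes every Batch it draws from the dataset
                EasyTrain.fit(trainer, 1, dataset, null);
            }
        }
    }
}
```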
However, if I'm not mistaken, when I enable `dropLast` in the sampling the GPU memory keeps increasing. I think this happens because the last (dropped) batch is never closed, so its memory is never freed. Is there any way to explicitly close the last dropped batch, or is there another possible solution to this problem?
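For reference, the kind of explicit batch handling I have in mind is something like the loop below (a sketch that replaces `EasyTrain.fit` with manual iteration so every `Batch` is closed via try-with-resources; I'm not sure it even applies to a dropped batch, since that batch may never be materialized by the sampler):

```java
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.dataset.Batch;
import ai.djl.training.dataset.Dataset;

void trainOneEpoch(Trainer trainer, Dataset dataset) throws Exception {
    for (Batch batch : trainer.iterateDataset(dataset)) {
        try (Batch b = batch) { // close each batch explicitly, even on exceptions
            EasyTrain.trainBatch(trainer, b);
            trainer.step();
        }
    }
}
```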
Edit: I think it is not because of `dropLast`; somehow the model still holds some overhead even after I close it. I have verified that the `NDManager`, `Trainer`, `Predictor`, and `Model` have all been closed, and printing the resources with `NDManager.debugDump()` reports 0 used resources. In reality, though, the GPU memory just keeps growing after training and is only released when the program is killed. Also, I didn't implement any abstract block; I only use the concise implementations of DJL `Block`. Is there any other idea why the CUDA memory can't be cleared?
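For completeness, this is roughly how I check the manager state at the end of an iteration (assuming the `debugDump(int)` method on `BaseNDManager`; the cast is needed because the `NDManager` interface does not declare it, and the exact signature may differ between DJL versions):

```java
import ai.djl.ndarray.BaseNDManager;
import ai.djl.ndarray.NDManager;

void dumpResources(NDManager manager) {
    if (manager instanceof BaseNDManager) {
        // Prints the manager's attached resources; this reports 0 used resources for me
        ((BaseNDManager) manager).debugDump(0);
    }
}
```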