Hello, I am training with the PyTorch engine on an NVIDIA T4 GPU, using a model with residual blocks and batch normalization layers, and I found that GPU memory just keeps increasing. In my program I need to store the training data in a new `ArrayDataset` object on each iteration. I want to release the resources used at the end of each iteration, including every batch drawn from the `ArrayDataset`, so I close the `Trainer`, `Predictor`, and `Model` at the end of each iteration. Besides that, I train with `EasyTrain.fit`, which closes every batch used during training.
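For context, here is a minimal sketch of the per-iteration pattern I described (the block, shapes, loss, and batch size are simplified placeholders, not my actual code):

```java
import ai.djl.Model;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;
import ai.djl.nn.Block;
import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.dataset.ArrayDataset;
import ai.djl.training.loss.Loss;

void runIteration(NDManager rootManager, Block block, float[][] features, float[][] labels)
        throws Exception {
    // Sub-manager so the NDArrays backing this iteration's data are freed with it
    try (NDManager manager = rootManager.newSubManager()) {
        // A new ArrayDataset is built on every iteration
        ArrayDataset dataset = new ArrayDataset.Builder()
                .setData(manager.create(features))
                .optLabels(manager.create(labels))
                .setSampling(32, true, true) // batchSize, shuffle, dropLast
                .build();

        // Model and Trainer are closed at the end of every iteration
        try (Model model = Model.newInstance("my-model")) {
            model.setBlock(block);
            try (Trainer trainer = model.newTrainer(new DefaultTrainingConfig(Loss.l2Loss()))) {
                trainer.initialize(new Shape(32, features[0].length));
                // EasyTrain.fit closes every Batch it draws from the dataset
                EasyTrain.fit(trainer, 1, dataset, null);
            }
        }
    }
}
```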
However, if I'm not mistaken, when I enable `dropLast` in the sampling the GPU memory keeps increasing. I think this happens because the last (dropped) batch is never closed, so its memory is never freed. Is there any way to explicitly close the last dropped batch, or is there another possible solution to this problem?
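For reference, the kind of explicit batch handling I have in mind is something like the loop below (a sketch that replaces `EasyTrain.fit` with manual iteration so every `Batch` is closed via try-with-resources; I'm not sure it even applies to a dropped batch, since that batch may never be materialized by the sampler):

```java
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.dataset.Batch;
import ai.djl.training.dataset.Dataset;

void trainOneEpoch(Trainer trainer, Dataset dataset) throws Exception {
    for (Batch batch : trainer.iterateDataset(dataset)) {
        try (Batch b = batch) { // close each batch explicitly, even on exceptions
            EasyTrain.trainBatch(trainer, b);
            trainer.step();
        }
    }
}
```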
Edit: I think it is not because of `dropLast`; somehow the model still holds some overhead even after I close it. I have verified that the `NDManager`, `Trainer`, `Predictor`, and `Model` have all been closed, and printing the resources with `NDManager.debugDump()` reports 0 used resources. In reality, though, the GPU memory just keeps growing after training and is only released when the program is killed. Also, I didn't implement any abstract block; I only use the concise implementations of DJL `Block`. Is there any other idea why the CUDA memory can't be cleared?
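For completeness, this is roughly how I check the manager state at the end of an iteration (assuming the `debugDump(int)` method on `BaseNDManager`; the cast is needed because the `NDManager` interface does not declare it, and the exact signature may differ between DJL versions):

```java
import ai.djl.ndarray.BaseNDManager;
import ai.djl.ndarray.NDManager;

void dumpResources(NDManager manager) {
    if (manager instanceof BaseNDManager) {
        // Prints the manager's attached resources; this reports 0 used resources for me
        ((BaseNDManager) manager).debugDump(0);
    }
}
```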