Dataloader workers created subprocess more and more after each epoch #60
Comments
I think it's because of the ShardLoader structure we have. We'll take a look; thanks for bringing this to our attention. I think we can solve the memory leak within one epoch, but I'm not sure about the sub-processes created per epoch. Also, if you're trying to reproduce our results for mLSTM training, you can use one of our earlier releases from before we merged it with our transformer code.
Thanks for the reply. Do you have any idea how to solve the memory leak within one epoch?
Sorry for the late reply. We were able to track down the multi-epoch process bomb you experienced. It's due to the creation of a ShardLoader manager at every epoch: https://github.com/NVIDIA/sentiment-discovery/blob/master/data_utils/loaders.py#L207. We need to add better garbage collection. The best way to get around this is to have one iterator that you use across multiple epochs, instead of creating a new iterator every epoch.
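For illustration, here is a minimal sketch of that workaround in the style of a standard PyTorch training loop; `train_loader`, `model`, `optimizer`, and `total_steps` are placeholders rather than names from this repo:

```python
# Instead of the usual pattern
#
#     for epoch in range(num_epochs):
#         for batch in train_loader:   # iter() is called here every epoch,
#             ...                      # spawning a fresh set of workers
#
# create the iterator once and drive training by step count, so the worker
# sub-processes (and any per-iterator manager) are only created a single time.
data_iter = iter(train_loader)
for step in range(total_steps):
    try:
        batch = next(data_iter)
    except StopIteration:
        # One full pass over the data is done; if more steps are needed, this
        # is the single place a new iterator would be created.
        break
    loss = model(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```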
The main issue is the sub-process bomb. I do observe the memory leak within one epoch, but it grows very slowly and the memory used does not reach the upper limit until the current epoch ends; after that it starts to increase because of the growing number of sub-processes. With the sub-process issue fixed, memory usage is now stable. Thanks for your help.
Dataloader workers create more and more subprocesses every epoch, which leads to severe memory loss after each epoch ends. Eventually my CPU memory was used up and the program crashed.
The only workaround I have found is to run a single epoch with more iterations. But the dataloader also seems to leak memory slowly, and I don't know whether that is due to the PyTorch DataLoader or the ShardLoader.
Thank you very much.
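For reference, a small diagnostic sketch (not part of this repo, and assuming `psutil` is installed) that can confirm the symptom by counting the training process's children at the end of each epoch; `num_epochs`, `train_one_epoch`, and `train_loader` are placeholders:

```python
import os
import psutil

proc = psutil.Process(os.getpid())

for epoch in range(num_epochs):
    train_one_epoch(train_loader)
    children = proc.children(recursive=True)
    rss_gb = proc.memory_info().rss / 1e9
    # If worker sub-processes leak, this count grows every epoch.
    print(f"epoch {epoch}: {len(children)} child processes, {rss_gb:.2f} GB RSS")
```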