
DataLoader workers create more and more subprocesses after each epoch #60

Open
hjldegit opened this issue Jun 16, 2019 · 4 comments

@hjldegit

The DataLoader workers spawn more and more subprocesses with every epoch, which causes severe memory growth after each epoch ends. Eventually my CPU memory was exhausted and the program crashed.

The only workaround I have found is to train for a single epoch with more iterations. But the dataloader still seems to leak memory slowly, and I do not know whether this is due to the PyTorch DataLoader or the ShardLoader.

Thank you very much.

@raulpuric
Contributor

I think it's because of the ShardLoader structure we have. We'll take a look, thanks for bringing this to our attention. I think we can solve the memory leak for 1 epoch, but I don't know about the sub-processes per epoch.

Also if you're trying to reproduce our results for mLSTM training you can use one of our earlier releases from before we merged it with our transformer code.

@hjldegit
Author

> I think it's because of the ShardLoader structure we have. We'll take a look, thanks for bringing this to our attention. I think we can solve the memory leak for 1 epoch, but I don't know about the sub-processes per epoch.
> Also if you're trying to reproduce our results for mLSTM training you can use one of our earlier releases from before we merged it with our transformer code.

Thanks for the reply. Do you have any idea how to solve the memory leak for 1 epoch?

@raulpuric
Contributor

raulpuric commented Jun 25, 2019

Sorry for the late reply.
We did some digging around and weren't able to find any memory leaks for 1 epoch. Did you see the memory leak in GPU or CPU memory?

We were able to track down the multi-epoch process bomb you experienced. It's due to the creation of a ShardLoader manager at every epoch: https://github.com/NVIDIA/sentiment-discovery/blob/master/data_utils/loaders.py#L207. We need to add better garbage collection.
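For reference, here is a minimal diagnostic sketch (plain PyTorch with a toy dataset, not the repo's ShardLoader code) that counts live child processes at each epoch boundary; a count that keeps climbing from epoch to epoch is the process bomb described above:

```python
import multiprocessing as mp

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(1024, 8))  # toy dataset for illustration
    for epoch in range(3):
        # A fresh loader/iterator per epoch spawns a new set of worker
        # subprocesses. If the old loader manager is never garbage-collected,
        # the child-process count printed below keeps growing each epoch.
        loader = DataLoader(dataset, batch_size=32, num_workers=4)
        for batch in loader:
            pass
        print(f"epoch {epoch}: live child processes = {len(mp.active_children())}")

if __name__ == "__main__":
    main()
```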

The best way to get around this is to have one iterator you use across multiple epochs, instead of creating a new iterator (for batch in dataloader) every epoch. However, this is quite similar to your proposed solution of having 1 epoch with more iterations.
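A minimal sketch of that workaround in plain PyTorch (the dataset, `num_epochs`, and `iters_per_epoch` values are just illustrative): keep one worker-backed iterator alive and draw a fixed number of batches per "epoch", restarting it only when the data is exhausted, instead of writing `for batch in dataloader` every epoch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(1024, 8))  # toy dataset for illustration
    loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

    num_epochs = 10
    iters_per_epoch = 16          # hypothetical schedule

    data_iter = iter(loader)      # worker subprocesses are spawned once, here
    for epoch in range(num_epochs):
        for _ in range(iters_per_epoch):
            try:
                batch = next(data_iter)
            except StopIteration:
                # Dataset exhausted: restart the iterator. Workers are
                # respawned once per pass over the data, not once per epoch.
                data_iter = iter(loader)
                batch = next(data_iter)
            # ... training step on `batch` goes here ...

if __name__ == "__main__":
    main()
```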

@hjldegit
Copy link
Author

The main issue is the sub-process bomb.

I also observe a memory leak within a single epoch, but it grows very slowly and memory usage does not reach the upper limit before the current epoch ends; after that it starts to climb because of the growing number of subprocesses.

Now memory usage is stable after fixing the subprocess issue. Thanks for your help.
