Dataloader workers created subprocess more and more after each epoch #60
Comments
I think it's because of the ShardLoader structure we have. We'll take a look; thanks for bringing this to our attention. I think we can solve the memory leak within one epoch, but I'm not sure about the sub-processes created per epoch. Also, if you're trying to reproduce our results for mLSTM training, you can use one of our earlier releases from before we merged it with our transformer code.
Thanks for the reply. Do you have any idea how to solve the memory leak within one epoch?
Sorry for the late reply. We were able to track down the multi-epoch process bomb you experienced. It's due to the creation of a ShardLoader manager at every epoch: https://github.com/NVIDIA/sentiment-discovery/blob/master/data_utils/loaders.py#L207. We need to add better garbage collection. The best way to get around this is to have one iterator that you use across multiple epochs, instead of creating a new iterator every epoch.
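For illustration, here is a minimal sketch of that workaround in the style of a standard PyTorch training loop; `train_loader`, `model`, `optimizer`, and `total_steps` are placeholders rather than names from this repo:

```python
# Instead of the usual pattern
#
#     for epoch in range(num_epochs):
#         for batch in train_loader:   # iter() is called here every epoch,
#             ...                      # spawning a fresh set of workers
#
# create the iterator once and drive training by step count, so the worker
# sub-processes (and any per-iterator manager) are only created a single time.
data_iter = iter(train_loader)
for step in range(total_steps):
    try:
        batch = next(data_iter)
    except StopIteration:
        # One full pass over the data is done; if more steps are needed, this
        # is the single place a new iterator would be created.
        break
    loss = model(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```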
The main issue is the sub-process bomb. I do observe the memory leak within one epoch, but it grows very slowly and the memory used does not reach the upper limit until the current epoch ends; after that it starts to increase because of the growing number of sub-processes. With the sub-process issue fixed, memory usage is now stable. Thanks for your help.
Dataloader workers create more and more subprocesses every epoch, which leads to severe memory loss after each epoch ends. Eventually my CPU memory was used up and the program crashed.
The only workaround I have found is to run a single epoch with more iterations. But the dataloader also seems to leak memory slowly, and I don't know whether that is due to the PyTorch DataLoader or the ShardLoader.
Thank you very much.
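For reference, a small diagnostic sketch (not part of this repo, and assuming `psutil` is installed) that can confirm the symptom by counting the training process's children at the end of each epoch; `num_epochs`, `train_one_epoch`, and `train_loader` are placeholders:

```python
import os
import psutil

proc = psutil.Process(os.getpid())

for epoch in range(num_epochs):
    train_one_epoch(train_loader)
    children = proc.children(recursive=True)
    rss_gb = proc.memory_info().rss / 1e9
    # If worker sub-processes leak, this count grows every epoch.
    print(f"epoch {epoch}: {len(children)} child processes, {rss_gb:.2f} GB RSS")
```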