RuntimeError: CUDA out of memory after training 1 epoch #8

Open
balag59 opened this issue Jun 10, 2020 · 10 comments

Comments


balag59 commented Jun 10, 2020

@mattiadg
I'm currently training on a very large dataset with 4 GPUs, and I get a CUDA out-of-memory error after the first training epoch completes. Once training is done and validation starts, it runs out of memory.
Here is the exact message:
Tried to allocate 7.93 GiB (GPU 2; 22.38 GiB total capacity; 11.55 GiB already allocated; 3.53 GiB free; 6.75 GiB cached)
Is this a memory leak? Is there an issue with emptying the cache, or do I just need to reduce the batch size / max tokens? (I have already tried reducing the batch size by half and the same error occurs.)
Thanks!
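For context, here is a minimal sketch of the kind of validation loop change one might try if memory really is not being released between training and validation. The names `model`, `valid_loader`, and `criterion` are placeholders, not this repository's actual code:

```python
import torch

def validate(model, valid_loader, criterion, device):
    # Release cached training allocations before validation starts
    # (this only helps if the stale tensors are no longer referenced).
    torch.cuda.empty_cache()

    model.eval()
    total_loss, n_batches = 0.0, 0
    # no_grad avoids building the autograd graph, which is often the
    # largest source of extra memory during evaluation.
    with torch.no_grad():
        for batch in valid_loader:
            src = batch["source"].to(device)
            tgt = batch["target"].to(device)
            loss = criterion(model(src), tgt)
            total_loss += loss.item()
            n_batches += 1
    return total_loss / max(n_batches, 1)
```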

mattiadg (Owner) commented Jun 10, 2020 via email

Hi, try to use the validation set for both training and validation. Do you get the same error during training this way?

balag59 (Author) commented Jun 11, 2020

@mattiadg I haven't tried using the same set yet. Training runs fine and completes the entire epoch. The issue begins only after training of one epoch is complete, when it tries to perform validation, which leads me to believe that memory is not being released correctly (though I'm not sure).

mattiadg (Owner) commented Jun 11, 2020 via email

I think it is possible that there are samples in the validation set that are too large. Let's exclude all the possibilities that are easy to solve before thinking about a memory leak, which is more difficult to detect. I have trained on datasets with a few million samples and never had such a problem.
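If it helps, a quick way to test this hypothesis is to look at the length distribution of the validation split. This is only a sketch: it assumes you can extract per-sample source lengths (e.g. audio frame counts) from your preprocessed data, which is not shown in this thread:

```python
import numpy as np

def report_length_stats(lengths, max_tokens=12000):
    """Summarize per-sample source lengths for one data split.

    `lengths` is a list of source lengths (e.g. audio frame counts);
    `max_tokens` is the per-batch budget passed to training.
    """
    lengths = np.asarray(lengths)
    print(f"samples:              {len(lengths)}")
    print(f"longest sample:       {lengths.max()}")
    print(f"95th percentile:      {np.percentile(lengths, 95):.0f}")
    # A single sample longer than the token budget can never fit in a
    # batch and is a likely trigger for an OOM at validation time.
    print(f"samples > max_tokens: {(lengths > max_tokens).sum()}")
```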

balag59 (Author) commented Jun 11, 2020

Thanks! I restored the batch size to 512 but reduced max-tokens from 12k to 6k, and it seems to be working fine now. How does the max-tokens parameter affect the time to convergence or performance (if it affects them at all)?

mattiadg (Owner) commented Jun 11, 2020 via email

--max-tokens sets a maximum length for the (source) segments; those longer than the parameter are removed from the sets. A lower value means fewer and shorter samples, so it speeds up an epoch a bit. I have never noticed significant differences in convergence.

balag59 (Author) commented Jun 11, 2020

Thank you so much! This helps!

balag59 (Author) commented Jun 11, 2020

I'm sorry, but doesn't max-tokens stand for the maximum number of audio frames that can be loaded onto a single GPU in each iteration? I thought it did.

mattiadg (Owner) commented Jun 11, 2020 via email

Oops, yes, my mistake; sorry, I haven't been using this code for a while now. The problem is that it can load more segments than the ones actually used in a single iteration: if it loads more segments than --max-sentences allows, then when max-tokens is too high the extra segments just occupy GPU memory.
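To make the interaction between the two limits concrete, here is a rough sketch of how a token budget and a sentence cap might jointly bound a batch. It is illustrative only, not this repository's actual batching code, and it treats the lengths as source frame counts:

```python
def make_batches(sample_lengths, max_tokens=6000, max_sentences=512):
    """Group sample indices into batches bounded by both a token budget
    (total source frames) and a sentence cap (simplified sketch)."""
    batches, batch, batch_tokens = [], [], 0
    for idx, length in enumerate(sample_lengths):
        # Close the current batch if adding this sample would exceed the
        # frame budget, or if the sentence cap has been reached.
        if batch and (batch_tokens + length > max_tokens
                      or len(batch) >= max_sentences):
            batches.append(batch)
            batch, batch_tokens = [], 0
        batch.append(idx)
        batch_tokens += length
    if batch:
        batches.append(batch)
    return batches

# Example with the 6k-frame budget mentioned above.
print(make_batches([1500, 2000, 2600, 900, 3100], max_tokens=6000))
# -> [[0, 1], [2, 3], [4]]
```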

balag59 (Author) commented Jun 11, 2020

Thanks, that makes sense! Speaking of the code, is there a possibility that you will be releasing the code from your latest paper, which brings in improvements like knowledge distillation?

balag59 (Author) commented Jun 20, 2020

@mattiadg Any updates on the possibility of releasing code from the latest paper?

Labels: None yet
Projects: None yet
Development: No branches or pull requests

2 participants