RuntimeError: CUDA out of memory after training 1 epoch #8
Comments
Hi,
Try using the validation set as both the training set and the validation set. Do you get the same error during training that way?
|
@mattiadg I haven't tried using the same set yet. Training runs fine through the entire epoch. The issue begins only after the first training epoch is complete, when it tries to perform validation, which leads me to believe that memory is not being released correctly (I'm not sure, though). |
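For reference, the usual PyTorch pattern that keeps validation from holding on to training-time memory is to run the eval pass under torch.no_grad() (optionally releasing cached blocks first); whether the training script here already does this would need checking. A generic sketch, assuming a dataloader that yields (inputs, targets) pairs, and not this repository's actual validation loop:

```python
import torch

def validate(model, valid_loader, criterion, device="cuda"):
    """Generic validation loop: no autograd graph is built, so activations
    from the last training step are not kept alive while validating."""
    model.eval()
    torch.cuda.empty_cache()  # release cached blocks left over from training
    total_loss, total_items = 0.0, 0
    with torch.no_grad():
        for inputs, targets in valid_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            loss = criterion(model(inputs), targets)
            total_loss += loss.item() * targets.size(0)
            total_items += targets.size(0)
    model.train()
    return total_loss / max(total_items, 1)
```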
I think it is possible that the validation set contains some samples that are too large. Let's rule out the possibilities that are easy to check before thinking about a memory leak, which would be much harder to track down. I have trained on datasets with a few million samples and never had this problem.
|
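A quick way to rule that out is to list the longest utterances in the validation split. A minimal sketch, assuming the extracted features are stored as one NumPy array of shape (num_frames, feat_dim) per utterance; the directory layout below is hypothetical, so adapt it to however the data was preprocessed:

```python
import glob
import numpy as np

# Hypothetical layout: one (num_frames, feat_dim) array per utterance.
lengths = []
for path in glob.glob("data/valid/*.npy"):
    feats = np.load(path, mmap_mode="r")  # memory-mapped, so nothing big is read
    lengths.append((feats.shape[0], path))

# Print the 20 longest validation utterances; outliers here are the samples
# most likely to blow up GPU memory during the eval pass.
for num_frames, path in sorted(lengths, reverse=True)[:20]:
    print(num_frames, path)
```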
Thanks! I restored the batch size to 512 but reduced max-tokens from 12k to 6k, and it seems to be working fine now. How does the max-tokens parameter affect time to convergence or performance (if it affects them at all)? |
--max-tokens sets a maximum length for the (source) segments. Segments longer than this value are removed from the sets. A lower value means fewer and shorter samples, so it speeds up an epoch a bit. I have never noticed significant differences in convergence.
|
Thank you so much! This helps! |
I'm sorry, but doesn't max-tokens stand for the maximum number of audio frames that can be loaded on a single GPU in each iteration? I thought it did. |
Oops, yes, my mistake. Sorry, I haven't used this code for a while now. The problem is that it can load more segments than are actually used in a single iteration: if it loads more segments than --max-sentences allows, a value that is too high just occupies GPU memory.
|
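To make the interaction concrete, here is a toy illustration of how a frame budget (--max-tokens) and a sentence cap (--max-sentences) can interact in length-based batching. This is not the batch sampler actually used by this code; it only shows why a very large --max-tokens mainly translates into more padded frames sitting in GPU memory once --max-sentences also limits the batch:

```python
def make_batches(num_frames, max_tokens, max_sentences):
    """Group sample indices into batches, closing a batch as soon as adding
    the next sample would exceed the frame budget or the sentence cap."""
    order = sorted(range(len(num_frames)), key=num_frames.__getitem__)
    batches, batch, longest = [], [], 0
    for idx in order:
        new_longest = max(longest, num_frames[idx])
        padded_frames = (len(batch) + 1) * new_longest  # padded cost if added
        if batch and (padded_frames > max_tokens or len(batch) == max_sentences):
            batches.append(batch)
            batch, longest = [], 0
            new_longest = num_frames[idx]
        batch.append(idx)
        longest = new_longest
    if batch:
        batches.append(batch)
    return batches

# With a 6000-frame budget the long utterances end up almost alone in a batch;
# a 12000-frame budget would pack far more padded segments onto the GPU at once.
print(make_batches([500, 800, 1200, 3000, 5500], max_tokens=6000, max_sentences=4))
# -> [[0, 1, 2], [3], [4]]
```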
Thanks, that makes sense! Speaking of the code, is there any possibility that you will release the code from your latest paper, which brings in improvements like knowledge distillation? |
@mattiadg Any updates on the possibility of releasing code from the latest paper? |
@mattiadg
I'm currently training on a very large dataset with 4 GPUs, and I get a CUDA out of memory error after the completion of the first training epoch. Training completes, but when validation starts, it runs out of memory.
Here is the exact message:
Tried to allocate 7.93 GiB (GPU 2; 22.38 GiB total capacity; 11.55 GiB already allocated; 3.53 GiB free; 6.75 GiB cached)
Is this a memory leak? Is there an issue with emptying the cache, or do I just need to reduce the batch size / max tokens? (I already tried halving the batch size and the same error occurs.)
Thanks!