Assorted Issues #1331
Thanks @vedantroy for reporting these, we'll take a look at each carefully. For #5 above, you should see something in the logs. Let us know if that's not the case!
I do not see that 😔, which is why I added it to the list of issues.
Sounds good, tagging @mvpatel2000 for auto grad accum. Curious about the use case for logging the loss within the microbatch. Was this to debug an error? Ideally, microbatching is just an implementation detail, and for convergence purposes you would just look at the batch loss.
@hanlint

```python
def loss(self, out, micro_batch):
    mse_loss, vb_loss = self.diffusion.training_losses(
        out.model_out, x_0=out.x_0, x_t=out.x_t, t=out.t, noise=out.noise
    )
    return mse_loss + vb_loss
```

I want to log both the `mse_loss` and the `vb_loss`, and have Composer log both of the losses. This feature would be very useful for debugging purposes, since I need to see how both of my losses change. Right now, I can't do this, so to work around the issue, I do the following: [...]

However, [...]. Also, another bug; see: [...]
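To make the request above concrete, here is a sketch of the kind of interface being asked for, returning named losses so each one is logged separately. This is the feature request, not an API Composer supported at the time of this thread:

```python
def loss(self, out, micro_batch):
    mse_loss, vb_loss = self.diffusion.training_losses(
        out.model_out, x_0=out.x_0, x_t=out.x_t, t=out.t, noise=out.noise
    )
    # Hypothetical behavior: Composer would log "mse" and "vb" as separate
    # metrics and optimize their sum. Returning a dict like this is the ask,
    # not something the library supported when this issue was filed.
    return {"mse": mse_loss, "vb": vb_loss}
```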
Hi there! Thanks for raising these issues -- appreciate it!
Can you elaborate? If I understand correctly, you would like to be able to access the loss from each microbatch?
Noted, we'll see if we can allow custom types (i.e. dictionaries) for the loss output. We do support a tuple for the loss function output. In the meantime, you can do something like:

```python
def loss(self, out, micro_batch):
    mse_loss, vb_loss = self.diffusion.training_losses(
        out.model_out, x_0=out.x_0, x_t=out.x_t, t=out.t, noise=out.noise
    )
    return (mse_loss, vb_loss)
```
This is likely because the optimization step (rather than the microbatch step) is given to [...]
Can you share the stack trace? CC @mvpatel2000
Noted, we'll make sure the print statement is always visible (currently it requires the log level to be configured correctly). CC @mvpatel2000.
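In the meantime, raising the Python log level before building the Trainer should surface those messages. A minimal sketch, assuming Composer emits them through the standard `logging` module under a `composer` logger namespace:

```python
import logging

# Emit INFO-level records (e.g., auto grad_accum adjustment notices) to the console.
logging.basicConfig(level=logging.INFO)
# Assumption: Composer's loggers live under the "composer" namespace.
logging.getLogger("composer").setLevel(logging.INFO)
```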
Noted, thanks! We'll take a look.
Thanks for bringing this up! Feedback is super helpful.
We just merged in a PR resolving this by raising the log level, so you should always see it in stdout. Please let me know if this doesn't work!
If you could send the logs and trace from this, that'd be awesome. I can take a look and fix any issues.
Hey @vedantroy, I'm having a bit of trouble reproducing the LR issue. Could you try running it again and printing the logs to console via [...]? I'm trying to figure out if there is something missing in my testing, or if it's just WandB's logging of float values that is truncated... In the early stage of a cosine decay, the true value might be 0.9999912, but maybe it's just getting displayed as 1.0.
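For reference, the cosine-annealing multiplier is 0.5 * (1 + cos(pi * t / t_max)), which stays extremely close to 1 early in a long run. A quick illustration with made-up step counts (not numbers from this report):

```python
import math

# Early in a cosine decay the LR multiplier is ~1 but not exactly 1.
t, t_max = 100, 100_000  # illustrative step and horizon
alpha = 0.5 * (1 + math.cos(math.pi * t / t_max))
print(alpha)  # ~0.9999975 -- easily displayed as 1.0 after rounding
```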
@abhi-mosaic, I turned on [...]
Ahhh, I can't reproduce the bug anymore, as the codebase has evolved and I don't know which version caused that particular bug. :/
I don't care about logging things at a micro-batch level. I was saying that the parameter name should be `micro_batch` instead of `batch`.
@mvpatel2000 for the auto grad accum issue, it may be that we need to do a CUDA cache clear after the grad accum adjustment, otherwise the repeated restarts can cause memory fragmentation on the GPU. This can be triggered by increasing the image resolution so that a known small batch size fits into memory, then greatly increasing the batch size to force multiple grad accum adjustments.
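For context, here is a minimal sketch of the pattern being discussed: catch the OOM, double the number of microbatches, and clear the CUDA cache before retrying. This is an illustration of the idea, not Composer's actual implementation, and `run_microbatches` is a hypothetical callable:

```python
import torch

def fit_batch_with_auto_grad_accum(run_microbatches, batch, grad_accum=1, max_grad_accum=128):
    """Retry the batch with progressively more microbatches on CUDA OOM."""
    while True:
        try:
            return run_microbatches(batch, grad_accum)
        except RuntimeError as e:
            if "out of memory" not in str(e) or grad_accum >= max_grad_accum:
                raise
            grad_accum *= 2  # split the batch into more, smaller microbatches
            # Release cached blocks so repeated retries don't fragment GPU memory.
            torch.cuda.empty_cache()
```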
Noting here that we merged in cache clearing and are seeing far lower rates of cache fragmentation. It looks like this has basically resolved the auto grad accum issues, but please let us know if you are still seeing problems.
Closing this because it seems resolved -- please feel free to reopen if you feel any point was not addressed.
**Environment**

- The `loss` method should specify `micro_batch` instead of `batch`
- Training fails with CUDA OOM when `grad_accum="auto"` is set, even though `batch_size=1` w/ no `grad_accum` works