
Assorted Issues #1331

Closed
vedantroy opened this issue Jul 30, 2022 · 13 comments

Labels: bug (Something isn't working)

@vedantroy

**Environment**

  • OS: Ubuntu 22.04 LTS
  • GPU:
  *-display                 
       description: VGA compatible controller
       product: GP102 [GeForce GTX 1080 Ti]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:58 memory:fa000000-faffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:e000(size=128) memory:c0000-dffff
  *-graphics
       product: EFI VGA
       physical id: 2
       logical name: /dev/fb0
       capabilities: fb
       configuration: depth=32 resolution=1024,768
  • Cuda: 11.7
  • Composer: 0.8.2
**Issues**

  1. loss should specify micro_batch instead of batch
  2. Cannot log multiple losses; allow me to return a dictionary of losses
  3. Cannot log things at the batch level (not the micro-batch level) inside the loss method
  4. (Bug) grad_accum fails with CUDA OOM even though batch_size=1 with no grad_accum works (rough setup sketched below)
  5. (Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set
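
For reference, a rough sketch of the setup that hits points 4 and 5 (the model and dataloader names are placeholders for my actual diffusion model and dataset; only the arguments I think matter are shown):

    from composer.trainer import Trainer

    # Rough sketch, not my full script: the same model and dataloader train
    # fine with batch_size=1 and no gradient accumulation.
    trainer = Trainer(
        model=composer_model,        # placeholder: my ComposerModel wrapper
        train_dataloader=train_dl,   # placeholder: my diffusion dataloader
        max_duration="1ep",
        device="gpu",
        grad_accum="auto",           # automatic gradient accumulation
    )
    trainer.fit()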
@hanlint
Contributor

hanlint commented Jul 31, 2022

Thanks @vedantroy for reporting these; we'll take a look at each carefully.

For #5 above, you should see something in the logs (e.g.):

    INFO:composer.trainer.trainer:CUDA out of memory detected.
    Gradient Accumulation increased from 1 -> 2, and the batch
    will be retrained.

Let us know if that's not the case!
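
If it isn't showing up, it may just be the Python log level; something like this (a minimal sketch, assuming the root logger is still at the default WARNING level) should surface the INFO message above:

    import logging

    # Make sure a handler exists and raise verbosity for Composer's loggers so
    # INFO-level messages (like the grad accum adjustment above) reach the console.
    logging.basicConfig()
    logging.getLogger("composer").setLevel(logging.INFO)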

@vedantroy
Author

> Thanks @vedantroy for reporting these; we'll take a look at each carefully.
>
> For #5 above, you should see something in the logs (e.g.):
>
>     INFO:composer.trainer.trainer:CUDA out of memory detected.
>     Gradient Accumulation increased from 1 -> 2, and the batch
>     will be retrained.
>
> Let us know if that's not the case!

I do not see that 😔, hence why I added it to the list of issues.

@hanlint
Contributor

hanlint commented Jul 31, 2022

Sounds good, tagging @mvpatel2000 for auto grad accum.

Curious about the use case for logging the loss within the microbatch. Was this to debug an error? Ideally, microbatching is just an implementation detail, and for convergence purposes you can just look at the batch loss.

@vedantroy
Author

vedantroy commented Jul 31, 2022

@hanlint
My trainer has this:

    def loss(self, out, micro_batch):
        mse_loss, vb_loss = self.diffusion.training_losses(
            out.model_out, x_0=out.x_0, x_t=out.x_t, t=out.t, noise=out.noise
        )
        return mse_loss + vb_loss

I want to log both mse_loss and vb_loss; right now composer only logs the final sum.
Ideally, I could return a dictionary like this:

    return {"mse_loss": mse_loss, "vb_loss": vb_loss}

and composer would log both of the losses. This feature would be very useful for debugging, since I need to see how both of my losses change.

Right now, I can't do this. So to work around the issue, I do the following:

        self.logger.data_batch({"mse_loss": mse_loss, "vb_loss": vb_loss})
        return mse_loss + vb_loss

However, logger.data_batch logs once per batch (instead of every micro-batch). That's totally fine by me, except it looks like the logger only keeps the data point from the last microbatch instead of doing some sort of reduction.

Also, another bug: the CosineAnnealingScheduler does not decrease the learning rate. However, the CosineAnnealingWithWarmupScheduler does.

See:

[screenshot]

However, my learning rate stays flat:

[screenshot: logged learning rate staying constant]

@ravi-mosaicml
Contributor

Hi there! Thanks for raising these issues -- appreciate it!

> loss should specify micro_batch instead of batch

> Cannot log things at the batch level (not the micro-batch level) inside the loss method

Can you elaborate? If I understand correctly, you would like to be able to access the loss from each microbatch?

> Cannot log multiple losses; allow me to return a dictionary of losses

Noted, we'll see if we can allow custom types (i.e. dictionaries) for the loss output. We do support a tuple for the loss function output. In the meantime, you can do something like:

    def loss(self, out, micro_batch):
        mse_loss, vb_loss = self.diffusion.training_losses(
            out.model_out, x_0=out.x_0, x_t=out.x_t, t=out.t, noise=out.noise
        )
        return (mse_loss, vb_loss)

> However, logger.data_batch logs once per batch (instead of every micro-batch). That's totally fine by me, except it looks like the logger only keeps the data point from the last microbatch instead of doing some sort of reduction.

This is likely because the optimization step (rather than the microbatch step) is given to wandb.log as the step argument (https://docs.wandb.ai/guides/track/log#stepwise-and-incremental-logging). We do not support metric logging at the microbatch level, as the number of microbatches should not affect convergence. As a workaround, you could log each microbatch to a separate plot.
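
For example (a sketch of that workaround; microbatch_idx is hypothetical bookkeeping you would maintain yourself inside loss()):

    # Hypothetical workaround: give each microbatch its own metric key, so each
    # one lands in its own wandb plot instead of all microbatches writing the
    # same key at the same optimization step (where only the last value shows).
    self.logger.data_batch({
        f"mse_loss/microbatch_{microbatch_idx}": mse_loss,
        f"vb_loss/microbatch_{microbatch_idx}": vb_loss,
    })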

> (Bug) grad_accum fails with CUDA OOM even though batch_size=1 with no grad_accum works

Can you share the stack trace? CC @mvpatel2000

> (Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set

Noted, we'll make sure the print statement is always visible (currently it requires the log level to be configured correctly). CC @mvpatel2000.

> Also, another bug: the CosineAnnealingScheduler does not decrease the learning rate. However, the CosineAnnealingWithWarmupScheduler does.

Noted, thanks! We'll take a look.

@mvpatel2000
Contributor

Thanks for bringing this up! Feedback is super helpful.

> (Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set

We just merged a PR resolving this by raising the log level, so you should always see it in stdout. Please let me know if this doesn't work!

> (Bug) grad_accum fails with CUDA OOM even though batch_size=1 with no grad_accum works

If you could send the logs and trace from this, that'd be awesome. I can take a look and fix any issues.

@abhi-mosaic
Contributor

abhi-mosaic commented Aug 2, 2022

Hey @vedantroy, I'm having a bit of trouble reproducing the LR issue. Could you try running it again, printing the logs to the console via ProgressBarLogger(progress_bar=False, log_to_console=True, log_level='BATCH'), and report back what the raw float values for lr are?

I'm trying to figure out if there is something missing in my testing, or if it's just WandB's display of float values that is truncated. For example, in the early stage of a cosine decay, the true value might be 0.9999912... but maybe it's just getting displayed as 1.0.
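
For concreteness, here is roughly how I'd wire that in (a sketch, assuming you construct the Trainer yourself and pass the logger through its loggers argument):

    from composer.loggers import ProgressBarLogger
    from composer.trainer import Trainer

    # Print raw metric values (including the learning rate) to the console
    # once per batch instead of rendering the progress bar.
    console_logger = ProgressBarLogger(
        progress_bar=False,
        log_to_console=True,
        log_level="BATCH",
    )

    trainer = Trainer(
        model=composer_model,        # placeholder: your existing model
        train_dataloader=train_dl,   # placeholder: your existing dataloader
        max_duration="1ep",
        loggers=[console_logger],
    )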

@vedantroy
Author

@abhi-mosaic, I turned on ProgressBarLogger, but I did not find any difference in the console output.

@vedantroy
Author

> Thanks for bringing this up! Feedback is super helpful.
>
> > (Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set
>
> We just merged a PR resolving this by raising the log level, so you should always see it in stdout. Please let me know if this doesn't work!
>
> > (Bug) grad_accum fails with CUDA OOM even though batch_size=1 with no grad_accum works
>
> If you could send the logs and trace from this, that'd be awesome. I can take a look and fix any issues.

Ahhh, I can't reproduce the bug anymore, as the codebase has evolved and I don't know which version of the codebase caused that particular bug. :/

@vedantroy
Author

vedantroy commented Aug 5, 2022

I don't care about logging things at the micro-batch level. I was saying that the parameter name batch in the loss method should be changed to micro_batch, to be more descriptive of what it actually is.
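
Concretely, just the rename, shown against the loss signature as I understand it (the exact base-class signature may differ):

    # Today the second argument is named as if it were the full batch:
    def loss(self, outputs, batch):
        ...

    # When grad accumulation is on it is really a microbatch, so I'd expect:
    def loss(self, outputs, micro_batch):
        ...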

@ravi-mosaicml

@hanlint
Contributor

hanlint commented Aug 5, 2022

@mvpatel2000 for the auto grad accum issue, it may be that we need to do a CUDA cache clear after the grad accum adjustment; otherwise the repeated restarts can cause memory fragmentation on the GPU. This can be triggered by increasing the image resolution such that only a known small batch size fits into memory, then greatly increasing the batch size to force multiple grad accum adjustments.
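
Roughly the idea (a sketch only, not the actual Trainer internals; handle_cuda_oom and the grad_accum attribute here are illustrative):

    import torch

    def handle_cuda_oom(state):
        # Double gradient accumulation so each microbatch is half the size...
        state.grad_accum *= 2
        # ...and release cached blocks so the retried forward/backward does not
        # inherit a fragmented allocator from the attempts that just OOM'd.
        torch.cuda.empty_cache()
        # The current batch is then retrained with the new grad_accum setting.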

@mvpatel2000
Contributor

Noting here that we merged in cache clearing and are seeing far lower rates of memory fragmentation. It looks like this has basically resolved the auto grad accum issues, but please let us know if you are still seeing problems.

@mvpatel2000
Contributor

Closing this because it seems resolved. Please feel free to reopen if you feel any point was not addressed.
