how to properly skip samples that cause inf/nan gradients/loss #4956

Closed
levhaikin opened this issue Dec 3, 2020 · 21 comments
Labels
feature (Is an improvement or enhancement), question (Further information is requested), won't fix (This will not be worked on)

Comments

@levhaikin

tl;dr

Does the approach in the code snippet below look OK, or is there a better alternative for automatically skipping the few "bad" samples in the data that cause inf/nan gradients/loss? (Is this a good practice at all?)

details

Sometimes there is a small percentage (but annoyingly large in absolute terms) of "dirty" samples in the data that cause the loss to become nan, even though the neural-network architecture itself is numerically stable.
One approach is to automatically stop training (using terminate_on_nan), then somehow isolate all these samples and remove them from the data permanently. But...
Sometimes we simply want to automatically skip these samples as if they never existed (perhaps with a warning) and continue training.
I couldn't find any documentation on how to do that, nor anyone who has asked this question, so I decided to ask and offer the solution I found, for others who might need it as well.
In the end, I came up with the following approach: override the on_after_backward method in my LightningModule with the following code:

code

    def on_after_backward(self) -> None:
        # scan every parameter gradient for nan/inf values
        valid_gradients = True
        for name, param in self.named_parameters():
            if param.grad is not None:
                valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
                if not valid_gradients:
                    break

        if not valid_gradients:
            # zeroing the gradients turns the upcoming optimizer step into a no-op,
            # which effectively skips the parameter update for this batch
            log.warning('detected inf or nan values in gradients. not updating model parameters')
            self.zero_grad()

pros

  • this code successfully identifies nan/inf gradients and skips the parameter update for the specific batch by zeroing the gradients
  • supports multi-GPU (at least DDP, which I tested). By detecting inf/nan gradients instead of an inf/nan loss, we avoid losing synchronization between processes: typically only one process produces an inf loss while the others don't, and if we stop just that one process from running its backward pass, the others wait on it indefinitely and training stalls. The gradient check runs after the gradients in all processes have already been affected by the bad inf loss, so all processes stay in sync and skip the update together.

cons

  • can't catch the bad samples themselves this way; that needs more work
  • might not be future-proof
  • clutters the LightningModule code (it is essentially architecture-agnostic boilerplate)
  • perhaps there is a better way

final question

is it worth having such functionality integrated into Lightning as a simple command-line switch/parameter?

levhaikin added the question label Dec 3, 2020
github-actions bot commented Dec 3, 2020

Hi! Thanks for your contribution, great first issue!

justusschock added the feature label Dec 3, 2020
@justusschock (Member)

@levhaikin Thanks for the proposal. I think it should be fine to use. However, I am not sure we want to have this as an option within Lightning.

Thoughts @tchaton @Borda ?

@carmocca (Contributor) commented Dec 3, 2020

In the case of invalid losses, you can return None in your training_step to skip it.
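
A minimal sketch of that pattern (the model, loss, and batch layout here are placeholders, not something prescribed by Lightning or by this thread):

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch  # assumes (input, target) batches
        loss = F.mse_loss(self.layer(x), y)
        # returning None tells Lightning to skip the optimization step for this batch
        if not torch.isfinite(loss):
            self.print(f"skipping batch {batch_idx}: non-finite loss")
            return None
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)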

@levhaikin (Author)

thanks @carmocca, this is definitely a much simpler way!

  • is it expected to work with ddp?
  • is it documented somewhere? if not, I guess it would be nice to have that documented explicitly, to save some effort for people like me :)

@carmocca (Contributor) commented Dec 4, 2020

  • is it expected to work with ddp?

I'm not sure. We don't have a test for it using DDP. I'll try it and report back.

If it doesn't, this could be fixed with #3325 cc @rohan-varma

  • is it documented somewhere? if not, I guess it would be nice to have that documented explicitly, to save some effort for people like me :)

See the documented return values of training_step: https://pytorch-lightning.readthedocs.io/en/stable/lightning_module.html#training-step

Do you still want to support skipping invalid gradients or is skipping losses enough?

@levhaikin (Author)

if it works with DDP then I guess it should be enough.
thanks for pointing to the docs, I didn't notice that on my own.
I'll probably test it as well once my current training finishes (I don't want to interrupt it)

stale bot added the won't fix label Jan 3, 2021
Lightning-AI deleted a comment from stale bot Jan 3, 2021
stale bot removed the won't fix label Jan 3, 2021
stale bot commented Feb 3, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@jiaruipeng1994 commented Mar 22, 2022

In the case of invalid losses, you can return None in your training_step to skip it.

Is there a way to change the result of training_step from NaN to None in the Callbacks?

@carmocca (Contributor)

No. The optimization procedure is completely managed by the loops calling the LightningModule hooks and Callbacks have no access to it.

@yozhikoff

It seems returning None in training_step is a bad idea when using AMP. I am using the native backend, and for me it causes gradient scaler issues.
Interestingly, it fails only on GPU; on CPU everything works just fine.

Should I submit a bug report?

@carmocca (Contributor) commented May 2, 2022

No. Returning None from training_step is not supported with AMP


@ashesh-0

The code presented in the first comment does not work for me. I'm using mixed precision with pytorch_lightning 1.5.5. The self.zero_grad() call in the on_after_backward function above is somehow causing the loss to explode. When I comment self.zero_grad() out, the training converges even though I get inf gradients. I know the gradients are inf because I added a print statement there.

    def on_after_backward(self) -> None:
        """
        Skipping updates in case of unstable gradients
        https://github.com/Lightning-AI/lightning/issues/4956
        """
        valid_gradients = True
        for name, param in self.named_parameters():
            if param.grad is not None:
                valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
                if not valid_gradients:
                    break
        if not valid_gradients:
            print(f'detected inf or nan values in gradients. not updating model parameters')
            # self.zero_grad()

However, with self.zero_grad() present, the loss diverges to larger values.

My guess is that with the float16 datatype, gradient values that would be finite in float32 can already overflow and register as inf. However, the gradient value itself is what gets used when updating the weights, so the inf looks symbolic in that sense. Please correct me if I'm wrong anywhere.

I'm using gradient_clip_val in pl.Trainer to stabilize the training.
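
For reference, a minimal sketch of the Trainer setup being described here (the clip value and single-GPU choice are illustrative, not recommendations from this thread):

import pytorch_lightning as pl

# mixed precision plus norm-based gradient clipping
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,
    gradient_clip_val=0.5,
    gradient_clip_algorithm="norm",  # the default; "value" clips element-wise instead
)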

@arlofaria

FWIW, I encountered a similar problem and it seems to have been resolved by switching from Trainer(precision=16) to Trainer(precision="bf16"), if you have a suitable device for that floating-point type.
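
In Trainer terms, the switch looks like this (assuming hardware with bfloat16 support, e.g. an Ampere-or-newer GPU):

import pytorch_lightning as pl

# fp16 mixed precision relies on a gradient scaler and overflows to inf more easily:
# trainer = pl.Trainer(accelerator="gpu", devices=1, precision=16)

# bf16 keeps the fp32 exponent range, so no gradient scaler is needed
trainer = pl.Trainer(accelerator="gpu", devices=1, precision="bf16")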

@YooSungHyun commented Mar 15, 2023

@ashesh-0 I use it like this:

def optimizer_step(
    self,
    epoch,
    batch_idx,
    optimizer,
    optimizer_idx,
    optimizer_closure,
    on_tpu=False,
    using_lbfgs=False,
):
    """
    Skipping updates in case of unstable gradients
    https://github.com/Lightning-AI/lightning/issues/4956
    """
    # only nan is checked here; inf is assumed to be handled by gradient clipping (see below)
    valid_gradients = True
    for name, param in self.named_parameters():
        if param.grad is not None:
            # valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
            valid_gradients = not (torch.isnan(param.grad).any())
            if not valid_gradients:
                break
    if not valid_gradients:
        print("detected inf or nan values in gradients. not updating model parameters")
        self.zero_grad()
    # the closure must still be passed through so the training loop stays in sync
    optimizer.step(closure=optimizer_closure)

and precision=16, gradient_clip_val=1.0

This way I only have to deal with the nan-gradient problem,
because gradient clipping is done before optimizer_step (so I think the inf problem cannot happen).
How about this?

@unlugi commented Apr 23, 2023

@YooSungHyun can I put this code in self.training_step, or do I need to override optimizer_step so it runs after training_step?

@YooSungHyun

@unlugi I just override optimizer_step; it is called at each global step in the training loop.
please check this:
https://github.com/YooSungHyun/lightning-U2/blob/main/models/u2/lightningmodule.py#L135

@DanTremonti

Hi, @YooSungHyun! When you mentioned -

This way I only have to deal with the nan-gradient problem, because gradient clipping is done before optimizer_step (so I think the inf problem cannot happen). How about this?

do you mean that it is still possible for nan gradients not to be handled by the modified optimizer_step that you shared?
Sorry, I can't quite follow the case you are trying to describe, could you please explain?

@DanTremonti commented May 10, 2023

(quoting @YooSungHyun's optimizer_step override and settings above)

FYI, the suggested optimizer_step override throws

TypeError: optimizer_step() missing 1 required positional argument: 'optimizer_closure'

for me in Lightning 2.0. Removing optimizer_idx from the arguments works for me. Ref: #16539 and the docs.
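
For Lightning 2.0, the same override might look roughly like this (a sketch to be placed inside the LightningModule, based on the 2.0 hook signature; the nan check mirrors the code above, and whether zeroing gradients here fully skips the update still depends on your precision/accumulation setup):

import torch

def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure=None):
    # same nan check as above, without the 1.x-only arguments
    # (optimizer_idx, on_tpu, using_lbfgs were removed in 2.0)
    valid_gradients = all(
        not torch.isnan(param.grad).any()
        for param in self.parameters()
        if param.grad is not None
    )
    if not valid_gradients:
        print("detected nan values in gradients. not updating model parameters")
        self.zero_grad()
    # always pass the closure through so the training loop stays in sync
    optimizer.step(closure=optimizer_closure)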

@znb899 commented Sep 17, 2023

(quoting @YooSungHyun's optimizer_step override and settings above)

I had "MisconfigurationException: When optimizer.step(closure) is called, the closure should be callable"
To avoid any problems with not "reimplementing" the method the correct way, I did:

def optimizer_step(self, *args, **kwargs):
    """
    Skipping updates in case of unstable gradients
    https://github.com/Lightning-AI/lightning/issues/4956
    """
    valid_gradients = True
    for name, param in self.named_parameters():
        if param.grad is not None:
            # valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
            valid_gradients = not (torch.isnan(param.grad).any())
            if not valid_gradients:
                break
    if not valid_gradients:
        print("detected inf or nan values in gradients. not updating model parameters")
        self.zero_grad()

    # delegate to the parent implementation so the closure is handled correctly
    pl.LightningModule.optimizer_step(self, *args, **kwargs)

I'm not sure if this is enough when using gradient accumulation + ddp + amp.
