how to properly skip samples that cause inf/nan gradients/loss #4956

Closed
levhaikin opened this issue Dec 3, 2020 · 21 comments
Labels
feature (Is an improvement or enhancement), question (Further information is requested), won't fix (This will not be worked on)

Comments

@levhaikin

tl;dr

Does the approach in the code snippet below look OK, or is there a better alternative for automatically skipping the few "bad" samples in the data that cause inf/nan gradients/loss? (Is this a good practice at all?)

details

Sometimes there is a small percentage (but annoyingly large in absolute terms) of "dirty" samples in the data that cause the loss to become nan, even though the neural-network architecture itself is numerically stable.
One approach is to automatically stop training (using terminate_on_nan), then somehow isolate all these samples and remove them from the data permanently. But...
Sometimes we simply want to automatically skip these samples as if they never existed (perhaps with a warning) and continue training.
I couldn't find any documentation on how to do that, nor anyone who has asked this question, so I decided to ask and offer the solution I found, for others who might need it as well.
In the end, I came up with the following approach: override the on_after_backward method in my LightningModule with the following code:

code

    def on_after_backward(self) -> None:
        # scan every parameter gradient for nan/inf values
        valid_gradients = True
        for name, param in self.named_parameters():
            if param.grad is not None:
                valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
                if not valid_gradients:
                    break

        if not valid_gradients:
            # zeroing the gradients turns the upcoming optimizer step into a no-op,
            # which effectively skips the parameter update for this batch
            log.warning('detected inf or nan values in gradients. not updating model parameters')
            self.zero_grad()

pros

  • this code successfully identifies nan/inf gradients and skips the parameter update for the specific batch by zeroing the gradients
  • supports multi-GPU (at least DDP, which I tested). By detecting inf/nan gradients instead of an inf/nan loss, we avoid losing synchronization between processes: typically only one process produces an inf loss while the others don't, and if we stop just that one process from running its backward pass, the others wait on it indefinitely and training stalls. The gradient check runs after the gradients in all processes have already been affected by the bad inf loss, so all processes stay in sync and skip the update together.

cons

  • can't catch the bad samples themselves this way; that needs more work
  • might not be future-proof
  • clutters the LightningModule code (it is essentially architecture-agnostic boilerplate)
  • perhaps there is a better way

final question

is it worth having such functionality integrated into Lightning as a simple command-line switch/parameter?

levhaikin added the question label Dec 3, 2020
github-actions bot commented Dec 3, 2020

Hi! Thanks for your contribution, great first issue!

justusschock added the feature label Dec 3, 2020
@justusschock (Member)

@levhaikin Thanks for the proposal. I think it should be fine to use. However, I am not sure we want to have this as an option within Lightning.

Thoughts @tchaton @Borda ?

@carmocca (Contributor) commented Dec 3, 2020

In the case of invalid losses, you can return None in your training_step to skip it.
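
A minimal sketch of that pattern (the model, loss, and batch layout here are placeholders, not something prescribed by Lightning or by this thread):

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch  # assumes (input, target) batches
        loss = F.mse_loss(self.layer(x), y)
        # returning None tells Lightning to skip the optimization step for this batch
        if not torch.isfinite(loss):
            self.print(f"skipping batch {batch_idx}: non-finite loss")
            return None
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)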

@levhaikin (Author)

thanks @carmocca, this is definitely a much simpler way!

  • is it expected to work with ddp?
  • is it documented somewhere? if not, I guess it would be nice to have that documented explicitly, to save some effort for people like me :)

@carmocca (Contributor) commented Dec 4, 2020

  • is it expected to work with ddp?

I'm not sure. We don't have a test for it using DDP. I'll try it and report back.

If it doesn't, this could be fixed with #3325 cc @rohan-varma

  • is it documented somewhere? if not, I guess it would be nice to have that documented explicitly, to save some effort for people like me :)

See the documented return values of training_step: https://pytorch-lightning.readthedocs.io/en/stable/lightning_module.html#training-step

Do you still want to support skipping invalid gradients or is skipping losses enough?

@levhaikin (Author)

if it works with DDP then I guess it should be enough.
thanks for pointing to the docs, I didn't notice that on my own.
I'll probably test it as well once my current training finishes (I don't want to interrupt it)

stale bot added the won't fix label Jan 3, 2021
Lightning-AI deleted a comment from stale bot Jan 3, 2021
stale bot removed the won't fix label Jan 3, 2021
stale bot commented Feb 3, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@jiaruipeng1994 commented Mar 22, 2022

In the case of invalid losses, you can return None in your training_step to skip it.

Is there a way to change the result of training_step from NaN to None in the Callbacks?

@carmocca (Contributor)

No. The optimization procedure is completely managed by the loops calling the LightningModule hooks and Callbacks have no access to it.

@yozhikoff

It seems returning None in training_step is a bad idea when using AMP. I am using the native backend, and for me it causes gradient scaler issues.
Interestingly, it fails only on GPU; on CPU everything works just fine.

Should I submit a bug report?

@carmocca (Contributor) commented May 2, 2022

No. Returning None from training_step is not supported with AMP


@ashesh-0

The code presented in the first comment does not work for me. I'm using mixed precision with pytorch_lightning 1.5.5. The self.zero_grad() call in the on_after_backward function above is somehow causing the loss to explode. When I comment self.zero_grad() out, the training converges even though I get inf gradients. I know the gradients are inf because I added a print statement there.

    def on_after_backward(self) -> None:
        """
        Skipping updates in case of unstable gradients
        https://github.com/Lightning-AI/lightning/issues/4956
        """
        valid_gradients = True
        for name, param in self.named_parameters():
            if param.grad is not None:
                valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
                if not valid_gradients:
                    break
        if not valid_gradients:
            print(f'detected inf or nan values in gradients. not updating model parameters')
            # self.zero_grad()

However, with self.zero_grad() present, the loss diverges to larger values.

My guess is that with the float16 datatype, gradient values that would be finite in float32 can already overflow and register as inf. However, the gradient value itself is what gets used when updating the weights, so the inf looks symbolic in that sense. Please correct me if I'm wrong anywhere.

I'm using gradient_clip_val in pl.Trainer to stabilize the training.
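
For reference, a minimal sketch of the Trainer setup being described here (the clip value and single-GPU choice are illustrative, not recommendations from this thread):

import pytorch_lightning as pl

# mixed precision plus norm-based gradient clipping
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,
    gradient_clip_val=0.5,
    gradient_clip_algorithm="norm",  # the default; "value" clips element-wise instead
)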

@arlofaria

FWIW, I encountered a similar problem and it seems to have been resolved by switching from Trainer(precision=16) to Trainer(precision="bf16"), if you have a suitable device for that floating-point type.
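
In Trainer terms, the switch looks like this (assuming hardware with bfloat16 support, e.g. an Ampere-or-newer GPU):

import pytorch_lightning as pl

# fp16 mixed precision relies on a gradient scaler and overflows to inf more easily:
# trainer = pl.Trainer(accelerator="gpu", devices=1, precision=16)

# bf16 keeps the fp32 exponent range, so no gradient scaler is needed
trainer = pl.Trainer(accelerator="gpu", devices=1, precision="bf16")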

@YooSungHyun commented Mar 15, 2023

@ashesh-0 I use it like this:

def optimizer_step(
    self,
    epoch,
    batch_idx,
    optimizer,
    optimizer_idx,
    optimizer_closure,
    on_tpu=False,
    using_lbfgs=False,
):
    """
    Skipping updates in case of unstable gradients
    https://github.com/Lightning-AI/lightning/issues/4956
    """
    # only nan is checked here; inf is assumed to be handled by gradient clipping (see below)
    valid_gradients = True
    for name, param in self.named_parameters():
        if param.grad is not None:
            # valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
            valid_gradients = not (torch.isnan(param.grad).any())
            if not valid_gradients:
                break
    if not valid_gradients:
        print("detected inf or nan values in gradients. not updating model parameters")
        self.zero_grad()
    # the closure must still be passed through so the training loop stays in sync
    optimizer.step(closure=optimizer_closure)

and precision=16, gradient_clip_val=1.0

This way I only have to deal with the nan-gradient problem,
because gradient clipping is done before optimizer_step (so I think the inf problem cannot happen).
How about this?

@unlugi commented Apr 23, 2023

@YooSungHyun can I put this code in self.training_step, or do I need to override optimizer_step so it runs after training_step?

@YooSungHyun

@unlugi I just override optimizer_step; it is called at each global step in the training loop.
please check this:
https://github.com/YooSungHyun/lightning-U2/blob/main/models/u2/lightningmodule.py#L135

@DanTremonti

Hi, @YooSungHyun! When you mentioned -

This way I only have to deal with the nan-gradient problem, because gradient clipping is done before optimizer_step (so I think the inf problem cannot happen). How about this?

do you mean that it is still possible for nan gradients not to be handled by the modified optimizer_step that you shared?
Sorry, I can't quite follow the case you are trying to describe, could you please explain?

@DanTremonti commented May 10, 2023

(quoting @YooSungHyun's optimizer_step override and settings above)

FYI, the suggested optimizer_step override throws

TypeError: optimizer_step() missing 1 required positional argument: 'optimizer_closure'

for me in Lightning 2.0. Removing optimizer_idx from the arguments works for me. Ref: #16539 and the docs.
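
For Lightning 2.0, the same override might look roughly like this (a sketch to be placed inside the LightningModule, based on the 2.0 hook signature; the nan check mirrors the code above, and whether zeroing gradients here fully skips the update still depends on your precision/accumulation setup):

import torch

def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure=None):
    # same nan check as above, without the 1.x-only arguments
    # (optimizer_idx, on_tpu, using_lbfgs were removed in 2.0)
    valid_gradients = all(
        not torch.isnan(param.grad).any()
        for param in self.parameters()
        if param.grad is not None
    )
    if not valid_gradients:
        print("detected nan values in gradients. not updating model parameters")
        self.zero_grad()
    # always pass the closure through so the training loop stays in sync
    optimizer.step(closure=optimizer_closure)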

@znb899 commented Sep 17, 2023

(quoting @YooSungHyun's optimizer_step override and settings above)

I had "MisconfigurationException: When optimizer.step(closure) is called, the closure should be callable"
To avoid any problems with not "reimplementing" the method the correct way, I did:

def optimizer_step(self, *args, **kwargs):
    """
    Skipping updates in case of unstable gradients
    https://github.com/Lightning-AI/lightning/issues/4956
    """
    valid_gradients = True
    for name, param in self.named_parameters():
        if param.grad is not None:
            # valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
            valid_gradients = not (torch.isnan(param.grad).any())
            if not valid_gradients:
                break
    if not valid_gradients:
        print("detected inf or nan values in gradients. not updating model parameters")
        self.zero_grad()

    # delegate to the parent implementation so the closure is handled correctly
    pl.LightningModule.optimizer_step(self, *args, **kwargs)

I'm not sure if this is enough when using gradient accumulation + ddp + amp.
