
Is the warning emitted by self.log-ing an integer intentional? #18739

Closed
awaelchli opened this issue Oct 6, 2023 · 8 comments · Fixed by #18847
Labels
logging Related to the `LoggerConnector` and `log()` question Further information is requested ver: 1.9.x ver: 2.0.x ver: 2.1.x

Comments

@awaelchli
Contributor

awaelchli commented Oct 6, 2023

Bug description

When you call

self.log("integer", 1)

you get the warning:

/Users/adrian/repositories/lightning/src/lightning/pytorch/trainer/connectors/logger_connector/result.py:232: UserWarning: You called self.log('integer', ...) in your training_step but the value needs to be floating point. Converting it to torch.float32.

Is this intentional?

What version are you seeing the problem on?

v1.9, v2.0, master

How to reproduce the bug

import torch
from lightning.pytorch import LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("integer", 1)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
model = BoringModel()
trainer = Trainer(max_steps=1)
trainer.fit(model, train_data)

Error messages and logs

/Users/adrian/repositories/lightning/src/lightning/pytorch/trainer/connectors/logger_connector/result.py:232: UserWarning: You called self.log('integer', ...) in your training_step but the value needs to be floating point. Converting it to torch.float32.

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0): 1.9+
#- Lightning App Version (e.g., 0.5.2): -
#- PyTorch Version (e.g., 2.0): 2.1
#- Python version (e.g., 3.9): 3.11
#- OS (e.g., Linux): macOS
#- CUDA/cuDNN version: -
#- GPU models and configuration: -
#- How you installed Lightning(`conda`, `pip`, source): source
#- Running environment of LightningApp (e.g. local, cloud): -

More info

The question was raised in #18723 (comment); I'm not sure there is a good reason for it.

cc @carmocca @Blaizzy @stas00

@awaelchli awaelchli added bug Something isn't working needs triage Waiting to be triaged by maintainers logging Related to the `LoggerConnector` and `log()` and removed needs triage Waiting to be triaged by maintainers labels Oct 6, 2023
@carmocca
Contributor

carmocca commented Oct 9, 2023

I don't understand the question of whether it's intentional or not. The fact that there is a warning shows that it's intentional.

Logging is meant for floating point types, as the reduction logic will not preserve the integer types. If you log epoch=0 and epoch=1, then the mean will be 0.5 which is clearly not right if you want to track epochs. The warning aims to tell the user about this limitation.

For an example like this, we would expect that the user fixes it with:
self.log("epoch", float(trainer.current_epoch), reduce_fx=max)

Perhaps the message could be changed to make the implications of converting integers to floats clearer.
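Carlos's epoch example can be reproduced without Lightning at all. The sketch below uses the stdlib mean and max as stand-ins for Lightning's reduce_fx (an illustration of the reduction pitfall, not Lightning's actual internals):

```python
# Why mean-reduction mangles integer counters: the values below stand in
# for "epoch" logged once per epoch; mean/max play the role of reduce_fx
# (an assumption for illustration, not Lightning's real reduction code).
from statistics import mean

logged_epochs = [0, 1]

print(mean(logged_epochs))  # 0.5 -- not a meaningful "epoch"
print(max(logged_epochs))   # 1   -- what reduce_fx=max would report
```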

@stas00
Contributor

stas00 commented Oct 9, 2023

Logging is meant for floating point types

I'm not sure why you said that, Carlos, since not all logged values are `mean`-ed.

Some logged values are aggregates or counters, for example global_step or consumed_samples. These can't be floats; they are ints: there is no such thing as 3.4 steps or 22234.8 samples.

@carmocca
Contributor

carmocca commented Oct 9, 2023

@stas00 There's an unfortunate overload of nomenclature in lightning that is "self.logging" and "logging to a Logger". It's a source of confusion for new users.

With my comment, I meant specifically self.logging. As you point out, "logging to a Logger" integer types is perfectly normal.

self.log is a mechanism that supports aggregating data across steps/epochs and reducing across ranks. That mechanism is currently designed to work with floating-point types, which is the reason for this warning. After all of this happens, PL ends up calling self.trainer.logger.log_metrics(what_was_self_logged)

For cases when one does not want the aggregation/reduction, the user should skip self.log and simply call trainer.logger.log_metrics(whatever) themselves.

Sorry for the confusion, happy to hear suggestions about this, but changes to names are impossible as we want to avoid annoying deprecations.
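The two paths Carlos contrasts can be sketched with a toy model (an assumption for illustration only, not Lightning's real internals): self.log-ged values are accumulated and reduced before reaching the Logger, while logger.log_metrics passes values straight through.

```python
# Toy model of the two logging paths. ToySelfLog imitates the self.log
# path (accumulate, float-convert, reduce); ToyLogger imitates
# trainer.logger.log_metrics (values received verbatim).
from statistics import mean

class ToySelfLog:
    """Accumulates values and reduces them, like the self.log path."""
    def __init__(self, reduce_fx=mean):
        self.values = []
        self.reduce_fx = reduce_fx
    def log(self, value):
        self.values.append(float(value))  # ints converted, hence the warning
    def flush(self):
        return self.reduce_fx(self.values)

class ToyLogger:
    """Receives metrics verbatim, like trainer.logger.log_metrics."""
    def __init__(self):
        self.metrics = []
    def log_metrics(self, metrics):
        self.metrics.append(metrics)

aggregated = ToySelfLog()
direct = ToyLogger()
for step in (10, 11):
    aggregated.log(step)                       # goes through reduction
    direct.log_metrics({"global_step": step})  # bypasses it

print(aggregated.flush())  # 10.5 -- the mean, an unhelpful "step"
print(direct.metrics[-1])  # {'global_step': 11} -- ints preserved
```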

@awaelchli
Copy link
Contributor Author

@carmocca I just can't find anything in the history explaining why the warning needs to be there, as opposed to converting the value automatically. If we look at the PR where it was introduced, #10076, the motivation seems to be different, and there is no clear explanation of why we couldn't just work with floats internally. Why does the user need to be informed?

@awaelchli awaelchli added question Further information is requested and removed bug Something isn't working labels Oct 9, 2023
@carmocca
Contributor

carmocca commented Oct 9, 2023

We do work internally with floats: https://github.com/Lightning-AI/lightning/blob/master/src/lightning/pytorch/trainer/connectors/logger_connector/result.py#L216

The warning just aims to let the user know about the potential implications of this, such as that your epoch could become 0.5 as I described in an example above.

One option is to change the warning so that it also suggests logger.log_metrics({"epoch": epoch}) as an alternative to calling self.log("epoch", epoch.float())

@stas00
Contributor

stas00 commented Oct 9, 2023

Thank you for explaining, Carlos. Yes, the naming is indeed unfortunately not self-documenting. log_reduce or something similar would have been more intuitive.

The incorrect use is then happening in NeMo's integration of PTL (https://github.com/NVIDIA/nemo):

$ grep '\.log(' | grep global
nemo/collections/nlp/models/language_modeling/megatron_finetune_model.py: self.log('global_step', self.trainer.global_step, prog_bar=True, rank_zero_only=True, batch_size=1)
[...]
nemo/collections/multimodal/speech_cv/models/visual_ctc_models.py: self.log('global_step', torch.tensor(self.trainer.global_step, dtype=torch.float32))
[...]
$ grep '\.log(' | grep global | grep -v float | wc -l
10

so sometimes it is converted to float, which reads oddly in the output: global_step 3.0

edit: I filed an Issue there: NVIDIA/NeMo#7665

One option is that the warning is changed to also suggest logger.log_metrics({"epoch": epoch}) as an alternative to calling self.log("epoch", epoch.float())

That would make for a better warning, Carlos.

Also, it's not just about float vs. integer: it is also a wasted cross-rank reduction of data that doesn't need to be reduced, since all ranks should already hold the same counters deterministically.

@LWprogramming

I'm confused about using global_step for anything besides logging, then. If we shouldn't use self.log, how are we supposed to log it while also keeping it accessible for e.g. checkpoint filenames (checkpoint-100, checkpoint-200, etc.)?

trainer.logger.log_metrics(whatever) doesn't work because the key won't be available to the ModelCheckpoint callback.

@carmocca
Copy link
Contributor

@LWprogramming You can do trainer.callback_metrics["global_step"] = ...
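A minimal illustration of why this works, under the assumption that ModelCheckpoint resolves the {placeholders} in its filename template from trainer.callback_metrics (the plain dict below is a stand-in for that store):

```python
# Hypothetical stand-in for trainer.callback_metrics: ModelCheckpoint
# fills {placeholders} in its filename from this metrics mapping, so
# manually inserting a key makes it available for checkpoint naming.
callback_metrics = {}
callback_metrics["global_step"] = 200  # the manual assignment suggested above

filename_template = "checkpoint-{global_step}"
print(filename_template.format(**callback_metrics))  # checkpoint-200
```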
