
Metric with ddp spawn causes script to hang after Trainer.fit() #331

Closed
awaelchli opened this issue Jun 29, 2021 · 8 comments
Labels
bug / fix Something isn't working distributed DDP, etc. help wanted Extra attention is needed Lightning Priority Critical task/issue
Milestone

Comments

awaelchli (Contributor) commented Jun 29, 2021

🐛 Bug

Adding a torchmetrics metric as an attribute to the model causes processes to hang when launching with ddp spawn.

To Reproduce

  1. Install torchmetrics 0.4 and PyTorch Lightning 1.3.7
  2. Run the script below

Code sample

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.metrics import Accuracy


class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.metric = Accuracy()  # add this to break it all

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        max_epochs=1,
        weights_summary=None,
        accelerator="ddp_spawn",
        gpus=2,
    )
    trainer.fit(model, train_dataloader=train_data)


if __name__ == '__main__':
    run()

A few important notes:

  • Simply assigning self.metric = Accuracy() is enough to trigger the hang; you don't have to use the metric at all, just assign it.
  • Installing torchmetrics < 0.4 avoids the problem
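The symptom can be sketched without torch at all. The snippet below is illustrative only (threads stand in for the spawned ranks; it is not what Lightning or torchmetrics actually do): one "rank" blocks forever on a step its peer never reaches, so waiting for it never returns, which is exactly how the script appears to hang after `Trainer.fit()`.

```python
import threading

# Illustrative stand-in for the hang: one "rank" blocks on a step
# (here, an Event that is never set) that its peer never reaches.
blocker = threading.Event()  # never set -> the peer never arrives

worker = threading.Thread(target=blocker.wait, daemon=True)
worker.start()

worker.join(timeout=1.0)     # like waiting for Trainer.fit() to return
hung = worker.is_alive()     # True: the worker is still stuck
print("worker hung:", hung)  # -> worker hung: True
```

A join with a timeout like this is also a cheap way to detect the hang in a test harness instead of blocking CI forever.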

Expected behavior

No hang.

Environment

  • PyTorch Version (e.g., 1.0): 1.8.0
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.8
  • CUDA/cuDNN version: whatever ships with torch 1.8.0
  • GPU models and configuration: our grid cluster
  • Any other relevant information: no

Additional context

trying to prepare PL patch release: Lightning-AI/pytorch-lightning#8198

@awaelchli awaelchli added bug / fix Something isn't working help wanted Extra attention is needed labels Jun 29, 2021
github-actions commented
Hi! Thanks for your contribution, great first issue!

@kaushikb11 kaushikb11 added the Priority Critical task/issue label Jun 29, 2021
awaelchli (Contributor Author) commented

I will try to git bisect against torchmetrics master to find the offending commit.

awaelchli (Contributor Author) commented

Bisecting between 0.3.2 and 0.4.0 I found that commit fc3333b is the problematic one. cc @tchaton

hudeven commented Jun 30, 2021

Thanks @awaelchli for bringing it up! I'm using Lightning + torchmetrics and encountered the same issue with DDP. cc: @tchaton, @maximsch2

SkafteNicki (Member) commented

@awaelchli could you try the fix in PR #338 and see if that fixes the problem?

awaelchli (Contributor Author) commented Jul 1, 2021

It was fixed on the Lightning side in Lightning-AI/pytorch-lightning#8218, which accounts for the changes in torchmetrics state_dict saving and loading.
Your branch also seems to fix it when I test against PL 1.3.7.
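The class of incompatibility being fixed here can be sketched in plain Python. The dicts below are mock checkpoints with hypothetical key names, not real Lightning or torchmetrics output: once metrics start contributing entries to the model's state_dict, a consumer written against the old key set sees keys it never planned for.

```python
# Mock checkpoints (hypothetical keys, for illustration only): the second one
# simulates torchmetrics 0.4 adding metric state to the model's state_dict.
ckpt_old = {"layer.weight": ..., "layer.bias": ...}
ckpt_new = {"layer.weight": ..., "layer.bias": ...,
            "metric.correct": ..., "metric.total": ...}

def unexpected_keys(checkpoint, expected):
    """Keys the consumer did not plan for: the shape of the incompatibility."""
    return sorted(set(checkpoint) - set(expected))

print(unexpected_keys(ckpt_new, ckpt_old))
# -> ['metric.correct', 'metric.total']
```

Whether such a mismatch surfaces as an error, a warning, or a hang depends on where the disagreement happens; in this issue it showed up as a hang in ddp spawn.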

awaelchli (Contributor Author) commented

By the way, #339 is reworking the state_dict logic in parallel, which also resolves the problem here.

@Borda Borda added this to the v0.4 milestone Jul 2, 2021
SkafteNicki (Member) commented

Closing as #339 solved this.

@Borda Borda added the distributed DDP, etc. label Aug 8, 2021