
[fix] Add barrier to accelerator's teardown #6814

Merged
merged 9 commits into Lightning-AI:master from fix-teardown-barrier on Apr 26, 2021

Conversation

ananthsub
Contributor

@ananthsub ananthsub commented Apr 4, 2021

What does this PR do?

There doesn't seem to be any synchronization before we return from the main trainer API entry points (fit/test/validate/predict). Such a barrier was previously added in #4323 to handle cases like this:

trainer.fit(module)
trainer.test()  # <-- best checkpoint path might not be available for all ranks

or this:

module = MyLightningModule(...)
trainer.fit(module)
module.cpu() 
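
For context, a minimal sketch of the idea behind the fix (a toy stand-in for illustration, not the actual diff in pytorch_lightning/accelerators/accelerator.py):

from typing import Optional

import torch.distributed as dist


class Accelerator:
    """Toy stand-in for pytorch_lightning.accelerators.Accelerator."""

    def barrier(self, name: Optional[str] = None) -> None:
        # In Lightning this delegates to the training type plugin; here we call
        # torch.distributed directly when a process group has been initialized.
        if dist.is_available() and dist.is_initialized():
            dist.barrier()

    def teardown(self) -> None:
        # Wait for every rank before handing control back to the user, so that
        # e.g. the best checkpoint written by rank 0 is visible on all ranks.
        self.barrier("teardown")

Because fit/test/validate/predict all funnel through teardown, a barrier there covers every entry point at once.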

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@ananthsub ananthsub added the bug label Apr 4, 2021
@SeanNaren
Contributor

We should add a test to make sure this case isn't forgotten again; however, I can't seem to reproduce this. I took the reproduction script from the original issue and modified it a bit since the API has changed, but it runs fine:

# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# --------------------------------------------
# --------------------------------------------
# --------------------------------------------
# USE THIS MODEL TO REPRODUCE A BUG YOU REPORT
# --------------------------------------------
# --------------------------------------------
# --------------------------------------------
import glob
import os
import shutil

import torch
from pytorch_lightning import Trainer, LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import Dataset


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        """
        Testing PL Module

        Use as follows:
        - subclass
        - modify the behavior for what you want

        class TestModel(BaseTestModel):
            def training_step(...):
                # do your own thing

        or:

        model = BaseTestModel()
        model.training_epoch_end = None

        """
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self.layer(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('loss', loss)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('x', loss, sync_dist=True)

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('y', loss, sync_dist=True)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


def run_test():
    class TestModel(BoringModel):
        def validation_step(self, batch, batch_idx):
            output = self.layer(batch)
            loss = self.loss(batch, output)
            self.log('x', loss)

    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    # model
    model = TestModel()
    tmp_dir = 'temp/'
    if os.path.exists(tmp_dir):
        # remove leftover checkpoints from a previous run
        # (os.rmdir would fail on a non-empty directory)
        shutil.rmtree(tmp_dir)
    checkpoint = ModelCheckpoint(
        dirpath=tmp_dir,
        monitor='x',
        mode='min',
        save_top_k=1
    )
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        max_epochs=2,
        accelerator='ddp',
        gpus=2,
        callbacks=[checkpoint]
    )
    trainer.fit(model, train_data, val_data)

    checkpoints = list(sorted(glob.glob(os.path.join(tmp_dir, "*.ckpt"), recursive=True)))
    print("checkpoints", checkpoints)
    print(checkpoint.best_model_path)
    assert os.path.exists(
        checkpoint.best_model_path), f'Could not find checkpoint at rank {trainer.global_rank}'


if __name__ == '__main__':
    run_test()

@ananthsub
Contributor Author

@SeanNaren n00b question: what's the recommended way for writing a multi-gpu/cpu test? are these special tests? using the call_training_script utility?

Contributor

@tchaton tchaton left a comment


LGTM!

pytorch_lightning/accelerators/accelerator.py
@tchaton
Contributor

tchaton commented Apr 6, 2021

@SeanNaren n00b question: what's the recommended way for writing a multi-gpu/cpu test? are these special tests? using the call_training_script utility?

Hey @ananthsub,

If you use ddp_spawn or ddp_cpu, you can use a plain pytest test function directly and rely on any PL helpers while building your test.
If you use ddp, you need to decorate the test with RunIf(special=True).
The test will then run via tests/special_tests.sh, so we can gather coverage and prevent hangs caused by subprocesses (see the sketch below).

Best,
T.C
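
For illustration, a rough sketch of what such a special multi-GPU test could look like (the import paths, RunIf arguments, and test body are assumptions based on the test helpers of that era, not code from this PR):

from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel  # PL's minimal test model
from tests.helpers.runif import RunIf


@RunIf(min_gpus=2, special=True)
def test_teardown_barrier_ddp(tmpdir):
    # Runs under tests/special_tests.sh because of special=True.
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=tmpdir,
        accelerator="ddp",
        gpus=2,
        fast_dev_run=True,
    )
    trainer.fit(model)
    # If the teardown barrier works, every rank reaches this point together
    # after fit() returns.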

@ananthsub ananthsub changed the title from "[WIP] Add barrier to accelerator's teardown" to "[fix] Add barrier to accelerator's teardown" Apr 7, 2021
@ananthsub ananthsub added this to the 1.2.x milestone Apr 7, 2021
@ananthsub ananthsub marked this pull request as ready for review April 7, 2021 17:48
@ananthsub
Contributor Author

ananthsub commented Apr 7, 2021

@SeanNaren as for why it's hard to reproduce: the existing synchronization happens when we load checkpoint weights: https://github.com/PyTorchLightning/pytorch-lightning/blob/19e67d18c472c3a03dec4dd9bfcef031e9ca8719/pytorch_lightning/trainer/trainer.py#L992-L995

But I think that is the wrong spot to apply it. We should synchronize before we return control to the user, because they could do anything afterward, not just call back into trainer.test. In either case, the barrier in teardown should be completely safe, and we can figure out separately whether the synchronization around trainer.test() is still needed.
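
To make the placement question concrete, a hypothetical illustration of the trainer-side alternative (the helper name and attribute access are assumptions for illustration, not the actual code behind that link):

import os


def resolve_best_ckpt(trainer, checkpoint_callback):
    # Hypothetical: rank 0 writes the checkpoint file, so wait until every
    # rank can see it before attempting to load weights from it.
    trainer.accelerator.barrier("resolve_best_ckpt")
    ckpt_path = checkpoint_callback.best_model_path
    assert os.path.exists(ckpt_path), (
        f"Could not find checkpoint at rank {trainer.global_rank}"
    )
    return ckpt_path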

Contributor

@tchaton tchaton left a comment


LGTM!

@ananthsub ananthsub enabled auto-merge (squash) April 7, 2021 18:37
@kaushikb11
Contributor

kaushikb11 commented Apr 7, 2021

@ananthsub There's a failing test. Link

@ananthsub
Contributor Author

@kaushikb11 @tchaton do the Azure pipelines need any different filesystem access? Or is https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=5034&view=logs&j=3afc50db-e620-5b81-6016-870a6976ad29&t=d9f671c5-a304-5675-5394-961fd7f98b9b a real failure that we need to debug, meaning this PR isn't fixing the issue?

@kaushikb11
Contributor

@ananthsub It's strange. Is the test passing on your machine?

Contributor

@carmocca carmocca left a comment


But I think this is the wrong spot to apply it.
Should we remove that test barrier then?

Comment on lines 147 to 148:

    trainer.fit(module)
    trainer.test() <-- best checkpoint path needs to be available on all ranks

nit: no indentation

Suggested change:
-    trainer.fit(module)
-    trainer.test() <-- best checkpoint path needs to be available on all ranks
+trainer.fit(module)
+trainer.test() <-- best checkpoint path needs to be available on all ranks

@codecov

codecov bot commented Apr 13, 2021

Codecov Report

Merging #6814 (71d993b) into master (d1529c2) will increase coverage by 43%.
The diff coverage is 100%.

❗ Current head 71d993b differs from pull request most recent head b10bd99. Consider uploading reports for the commit b10bd99 to get more accurate results

@@           Coverage Diff            @@
##           master   #6814     +/-   ##
========================================
+ Coverage      44%     87%    +43%     
========================================
  Files         197     194      -3     
  Lines       12585   12395    -190     
========================================
+ Hits         5501   10776   +5275     
+ Misses       7084    1619   -5465     

@ananthsub ananthsub added the ready label Apr 14, 2021
@ananthsub ananthsub disabled auto-merge April 14, 2021 00:28
@ananthsub ananthsub enabled auto-merge (squash) April 14, 2021 00:28
CHANGELOG.md Outdated
@ananthsub ananthsub force-pushed the fix-teardown-barrier branch 2 times, most recently from 9c0022c to ffa65f7 April 14, 2021 20:03
@carmocca carmocca disabled auto-merge April 15, 2021 14:45
@carmocca carmocca enabled auto-merge (squash) April 15, 2021 14:45
@awaelchli
Contributor

Looks like one of the special tests is timing out, meaning not every process is entering the barrier? That seems bizarre.

@Borda Borda modified the milestones: 1.2.x, 1.3 Apr 18, 2021
@ananthsub
Contributor Author

@awaelchli @Borda @SeanNaren do we need to look for free ports across tests?

@mergify mergify bot removed the has conflicts label Apr 19, 2021
@awaelchli
Contributor

I'm not sure; some tests have

tests.helpers.utils.set_random_master_port()
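
For reference, a rough guess at what that helper does (this is an assumption, not PL's actual implementation):

import os
import random


def set_random_master_port() -> None:
    # Export a random MASTER_PORT so concurrent DDP tests don't all bind the
    # same default port and hang waiting for each other.
    os.environ["MASTER_PORT"] = str(random.randint(12000, 19000))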

@carmocca carmocca merged commit bc3f08b into Lightning-AI:master Apr 26, 2021
@ananthsub ananthsub deleted the fix-teardown-barrier branch April 26, 2021 09:47