
[fix] Add barrier to accelerator's teardown #6814

Merged
merged 9 commits into Lightning-AI:master from fix-teardown-barrier on Apr 26, 2021

Conversation

ananthsub
Contributor

@ananthsub ananthsub commented Apr 4, 2021

What does this PR do?

There doesn't seem to be any synchronization before we return from the main trainer API entry points (fit/test/validate/predict). Such a barrier was previously added in #4323 to handle cases like this:

trainer.fit(module)
trainer.test()  # <-- best checkpoint path might not be available for all ranks

or this:

module = MyLightningModule(...)
trainer.fit(module)
module.cpu() 
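
For context, a minimal sketch of the idea behind the fix (a toy stand-in for illustration, not the actual diff in pytorch_lightning/accelerators/accelerator.py):

from typing import Optional

import torch.distributed as dist


class Accelerator:
    """Toy stand-in for pytorch_lightning.accelerators.Accelerator."""

    def barrier(self, name: Optional[str] = None) -> None:
        # In Lightning this delegates to the training type plugin; here we call
        # torch.distributed directly when a process group has been initialized.
        if dist.is_available() and dist.is_initialized():
            dist.barrier()

    def teardown(self) -> None:
        # Wait for every rank before handing control back to the user, so that
        # e.g. the best checkpoint written by rank 0 is visible on all ranks.
        self.barrier("teardown")

Because fit/test/validate/predict all funnel through teardown, a barrier there covers every entry point at once.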

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@ananthsub ananthsub added the bug label Apr 4, 2021
@SeanNaren
Contributor

We should add a test to make sure this case isn't forgotten again; however, I can't seem to reproduce this. I took the reproduction script from the original issue and modified it a bit since the API has changed, but it runs fine:

# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# --------------------------------------------
# --------------------------------------------
# --------------------------------------------
# USE THIS MODEL TO REPRODUCE A BUG YOU REPORT
# --------------------------------------------
# --------------------------------------------
# --------------------------------------------
import glob
import os
import shutil

import torch
from pytorch_lightning import Trainer, LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import Dataset


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        """
        Testing PL Module

        Use as follows:
        - subclass
        - modify the behavior for what you want

        class TestModel(BaseTestModel):
            def training_step(...):
                # do your own thing

        or:

        model = BaseTestModel()
        model.training_epoch_end = None

        """
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self.layer(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('loss', loss)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('x', loss, sync_dist=True)

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('y', loss, sync_dist=True)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


def run_test():
    class TestModel(BoringModel):
        def validation_step(self, batch, batch_idx):
            output = self.layer(batch)
            loss = self.loss(batch, output)
            self.log('x', loss)

    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    # model
    model = TestModel()
    tmp_dir = 'temp/'
    if os.path.exists(tmp_dir):
        # remove leftover checkpoints from a previous run
        # (os.rmdir would fail on a non-empty directory)
        shutil.rmtree(tmp_dir)
    checkpoint = ModelCheckpoint(
        dirpath=tmp_dir,
        monitor='x',
        mode='min',
        save_top_k=1
    )
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        max_epochs=2,
        accelerator='ddp',
        gpus=2,
        callbacks=[checkpoint]
    )
    trainer.fit(model, train_data, val_data)

    checkpoints = list(sorted(glob.glob(os.path.join(tmp_dir, "*.ckpt"), recursive=True)))
    print("checkpoints", checkpoints)
    print(checkpoint.best_model_path)
    assert os.path.exists(
        checkpoint.best_model_path), f'Could not find checkpoint at rank {trainer.global_rank}'


if __name__ == '__main__':
    run_test()

@ananthsub
Contributor Author

@SeanNaren n00b question: what's the recommended way for writing a multi-gpu/cpu test? are these special tests? using the call_training_script utility?

Contributor

@tchaton tchaton left a comment


LGTM!

pytorch_lightning/accelerators/accelerator.py
@tchaton
Contributor

tchaton commented Apr 6, 2021

@SeanNaren n00b question: what's the recommended way for writing a multi-gpu/cpu test? are these special tests? using the call_training_script utility?

Hey @ananthsub,

If you use ddp_spawn or ddp_cpu, you can use a plain pytest test function directly and rely on any PL helpers while building your test.
If you use ddp, you need to decorate the test with RunIf(special=True).
The test will then run via tests/special_tests.sh, so we can gather coverage and prevent hangs caused by subprocesses (see the sketch below).

Best,
T.C
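
For illustration, a rough sketch of what such a special multi-GPU test could look like (the import paths, RunIf arguments, and test body are assumptions based on the test helpers of that era, not code from this PR):

from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel  # PL's minimal test model
from tests.helpers.runif import RunIf


@RunIf(min_gpus=2, special=True)
def test_teardown_barrier_ddp(tmpdir):
    # Runs under tests/special_tests.sh because of special=True.
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=tmpdir,
        accelerator="ddp",
        gpus=2,
        fast_dev_run=True,
    )
    trainer.fit(model)
    # If the teardown barrier works, every rank reaches this point together
    # after fit() returns.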

@ananthsub ananthsub changed the title from "[WIP] Add barrier to accelerator's teardown" to "[fix] Add barrier to accelerator's teardown" Apr 7, 2021
@ananthsub ananthsub added this to the 1.2.x milestone Apr 7, 2021
@ananthsub ananthsub marked this pull request as ready for review April 7, 2021 17:48
@ananthsub
Contributor Author

ananthsub commented Apr 7, 2021

@SeanNaren as for why it's hard to reproduce: the existing synchronization happens when we load checkpoint weights: https://github.com/PyTorchLightning/pytorch-lightning/blob/19e67d18c472c3a03dec4dd9bfcef031e9ca8719/pytorch_lightning/trainer/trainer.py#L992-L995

But I think that is the wrong spot to apply it. We should synchronize before we return control to the user, because they could do anything afterward, not just call back into trainer.test. In either case, the barrier in teardown should be completely safe, and we can figure out separately whether the synchronization around trainer.test() is still needed.
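
To make the placement question concrete, a hypothetical illustration of the trainer-side alternative (the helper name and attribute access are assumptions for illustration, not the actual code behind that link):

import os


def resolve_best_ckpt(trainer, checkpoint_callback):
    # Hypothetical: rank 0 writes the checkpoint file, so wait until every
    # rank can see it before attempting to load weights from it.
    trainer.accelerator.barrier("resolve_best_ckpt")
    ckpt_path = checkpoint_callback.best_model_path
    assert os.path.exists(ckpt_path), (
        f"Could not find checkpoint at rank {trainer.global_rank}"
    )
    return ckpt_path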

Contributor

@tchaton tchaton left a comment


LGTM!

@ananthsub ananthsub enabled auto-merge (squash) April 7, 2021 18:37
@kaushikb11
Contributor

kaushikb11 commented Apr 7, 2021

@ananthsub There's a failing test. Link

@ananthsub
Contributor Author

@kaushikb11 @tchaton do the Azure pipelines need any different filesystem access? Or is https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=5034&view=logs&j=3afc50db-e620-5b81-6016-870a6976ad29&t=d9f671c5-a304-5675-5394-961fd7f98b9b a real failure that we need to debug, meaning this PR isn't fixing the issue?

@kaushikb11
Contributor

@ananthsub It's strange. Is the test passing on your machine?

Contributor

@carmocca carmocca left a comment


But I think this is the wrong spot to apply it.
Should we remove that test barrier then?

Comment on lines 147 to 148:

    trainer.fit(module)
    trainer.test() <-- best checkpoint path needs to be available on all ranks

nit: no indentation

Suggested change:
-    trainer.fit(module)
-    trainer.test() <-- best checkpoint path needs to be available on all ranks
+trainer.fit(module)
+trainer.test() <-- best checkpoint path needs to be available on all ranks

@codecov

codecov bot commented Apr 13, 2021

Codecov Report

Merging #6814 (71d993b) into master (d1529c2) will increase coverage by 43%.
The diff coverage is 100%.

❗ Current head 71d993b differs from pull request most recent head b10bd99. Consider uploading reports for the commit b10bd99 to get more accurate results

@@           Coverage Diff            @@
##           master   #6814     +/-   ##
========================================
+ Coverage      44%     87%    +43%     
========================================
  Files         197     194      -3     
  Lines       12585   12395    -190     
========================================
+ Hits         5501   10776   +5275     
+ Misses       7084    1619   -5465     

@ananthsub ananthsub added the ready label Apr 14, 2021
@ananthsub ananthsub disabled auto-merge April 14, 2021 00:28
@ananthsub ananthsub enabled auto-merge (squash) April 14, 2021 00:28
CHANGELOG.md Outdated
@ananthsub ananthsub force-pushed the fix-teardown-barrier branch 2 times, most recently from 9c0022c to ffa65f7 April 14, 2021 20:03
@carmocca carmocca disabled auto-merge April 15, 2021 14:45
@carmocca carmocca enabled auto-merge (squash) April 15, 2021 14:45
@awaelchli
Contributor

Looks like one of the special tests is timing out, meaning not every process is entering the barrier? That seems bizarre.

@Borda Borda modified the milestones: 1.2.x, 1.3 Apr 18, 2021
@ananthsub
Contributor Author

@awaelchli @Borda @SeanNaren do we need to look for free ports across tests?

@mergify mergify bot removed the has conflicts label Apr 19, 2021
@awaelchli
Contributor

I'm not sure; some tests have

tests.helpers.utils.set_random_master_port()
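
For reference, a rough guess at what that helper does (this is an assumption, not PL's actual implementation):

import os
import random


def set_random_master_port() -> None:
    # Export a random MASTER_PORT so concurrent DDP tests don't all bind the
    # same default port and hang waiting for each other.
    os.environ["MASTER_PORT"] = str(random.randint(12000, 19000))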

@carmocca carmocca merged commit bc3f08b into Lightning-AI:master Apr 26, 2021
@ananthsub ananthsub deleted the fix-teardown-barrier branch April 26, 2021 09:47