Merge branch 'master' into fix-teardown-barrier
Sean Naren authored Apr 15, 2021
2 parents 6e4d7c9 + f6f81f0 commit 71d993b
Showing 21 changed files with 233 additions and 250 deletions.
9 changes: 8 additions & 1 deletion CHANGELOG.md
@@ -9,6 +9,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

### Added

- Added a `teardown` hook to `ClusterEnvironment` ([#6942](https://github.com/PyTorchLightning/pytorch-lightning/pull/6942))


- Added utils for NaN/Inf detection for gradients and parameters ([#6834](https://github.com/PyTorchLightning/pytorch-lightning/pull/6834/))


@@ -196,9 +199,13 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

### Fixed


- Added a barrier in the accelerator `teardown` to synchronize processes before execution finishes ([#6814](https://github.com/PyTorchLightning/pytorch-lightning/pull/6814))


- Fixed incorrect removal of `WORLD_SIZE` environment variable in DDP training when launching with torch distributed/torchelastic ([#6942](https://github.com/PyTorchLightning/pytorch-lightning/pull/6942))


- Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic:
* Support SLURM and torchelastic global rank environment variables ([#5715](https://github.com/PyTorchLightning/pytorch-lightning/pull/5715))
* Remove hardcoding of local rank in accelerator connector ([#6878](https://github.com/PyTorchLightning/pytorch-lightning/pull/6878))
@@ -246,7 +253,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed `--gpus` default for parser returned by `Trainer.add_argparse_args` ([#6898](https://github.com/PyTorchLightning/pytorch-lightning/pull/6898))


- Fixed pickle error checker to now check for `pickle.PickleError` to catch all pickle errors ([#6917](https://github.com/PyTorchLightning/pytorch-lightning/pull/6917))


- Fixed `AttributeError` for `require_backward_grad_sync` when running manual optimization with sharded plugin ([#6915](https://github.com/PyTorchLightning/pytorch-lightning/pull/6915))
4 changes: 2 additions & 2 deletions azure-pipelines.yml
@@ -116,8 +116,8 @@ jobs:
set -e
python -m pytest pl_examples -v --maxfail=2 --durations=0
pip install . --user --quiet
bash pl_examples/run_examples-args.sh --gpus 1 --max_epochs 1 --batch_size 64 --limit_train_batches 5 --limit_val_batches 3
bash pl_examples/run_ddp-examples.sh --max_epochs 1 --batch_size 32 --limit_train_batches 2 --limit_val_batches 2
bash pl_examples/run_examples-args.sh --trainer.gpus 1 --trainer.max_epochs 1 --data.batch_size 64 --trainer.limit_train_batches 5 --trainer.limit_val_batches 3
bash pl_examples/run_ddp-examples.sh --trainer.max_epochs 1 --data.batch_size 32 --trainer.limit_train_batches 2 --trainer.limit_val_batches 2
# cd pl_examples/basic_examples
# bash submit_ddp_job.sh
# bash submit_ddp2_job.sh
14 changes: 1 addition & 13 deletions docs/source/advanced/multi_gpu.rst
@@ -813,12 +813,6 @@ Below we describe how to enable all of these to see benefit. **With all these im

Also please have a look at our :ref:`deepspeed-zero-stage-3-tips` which contains a lot of helpful information when configuring your own models.

.. note::
Currently we only support non-elastic checkpointing. This means saving the model across GPUs will save shards of the model on all processes, which will then require the same number of GPUs to load.
This additionally means that for inference you must use the ``Trainer.test`` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly.

This limitation is actively being worked on and will be resolved in the near future.

.. code-block:: python
from pytorch_lightning import Trainer
@@ -848,12 +842,6 @@ We expose a hook so that layers initialized within it are sharded instantly

This reduces the time taken to initialize very large models and ensures we do not run out of memory when instantiating larger models. For more information you can refer to the DeepSpeed docs for `Constructing Massive Models <https://deepspeed.readthedocs.io/en/latest/zero3.html>`_.
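
A minimal sketch of how such a hook can look (the module name and layer sizes are hypothetical; the hook name follows ``configure_sharded_model`` as referenced in the note below):

.. code-block:: python

    import torch.nn as nn

    import pytorch_lightning as pl


    class MyModel(pl.LightningModule):

        def configure_sharded_model(self):
            # Layers created inside this hook are sharded as soon as they are
            # instantiated, rather than first being materialized in full on every GPU.
            self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU())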

.. note::
When using the ``configure_sharded_model`` hook to shard models, note that ``LightningModule.load_from_checkpoint`` may not work for loading saved checkpoints. If you've trained on one GPU, you can manually instantiate the model and call the hook;
however, when using multiple GPUs this will not work, as ``LightningModule.load_from_checkpoint`` doesn't support sharded checkpoints.

We recommend using ``Trainer.test`` or ``Trainer.predict`` for inference.

.. code-block:: python
from pytorch_lightning import Trainer
@@ -950,7 +938,7 @@ Here is some helpful information when setting up DeepSpeed ZeRO Stage 3 with Lightning
* If you're using Adam or AdamW, be sure to use FusedAdam or DeepSpeedCPUAdam (for CPU offloading) rather than the default torch optimizers, as they come with large speed benefits
* Treat your GPU/CPU memory as one large pool. In some cases, you may not want to offload certain things (like activations) to provide even more space to offload model parameters
* When offloading to the CPU, make sure to bump up the batch size as GPU memory will be freed

* We also support sharded checkpointing. By passing ``save_full_weights=False`` to the ``DeepSpeedPlugin``, we'll save shards of the model, which allows you to save extremely large models. However, to load the model and run test/validation/predict you must use the Trainer object; see the sketch below.
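
A minimal sketch of that workflow, assuming the ``DeepSpeedPlugin`` import path and ``stage=3`` argument shown here, and an already-constructed ``model``:

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import DeepSpeedPlugin

    # model: an already-constructed LightningModule (assumed, not shown here).
    # Each process saves its own shard of the weights instead of one consolidated checkpoint.
    trainer = Trainer(gpus=4, precision=16, plugins=DeepSpeedPlugin(stage=3, save_full_weights=False))
    trainer.fit(model)

    # Test/validation/predict on the sharded checkpoint must go through the Trainer.
    trainer.test(model)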

Custom DeepSpeed Config
"""""""""""""""""""""""
2 changes: 1 addition & 1 deletion docs/source/common/lightning_cli.rst
@@ -224,7 +224,7 @@ datamodule class. However, there are many cases in which the objective is to eas
multiple models and datasets. For these cases the tool can be configured such that a model and/or a datamodule is
specified by an import path and init arguments. For example, with a tool implemented as:

.. testcode::
.. code-block:: python
from pytorch_lightning.utilities.cli import LightningCLI
12 changes: 6 additions & 6 deletions pl_examples/basic_examples/README.md
@@ -8,10 +8,10 @@ Trains MNIST where the model is defined inside the `LightningModule`.
python simple_image_classifier.py

# gpus (any number)
python simple_image_classifier.py --gpus 2
python simple_image_classifier.py --trainer.gpus 2

# dataparallel
python simple_image_classifier.py --gpus 2 --distributed_backend 'dp'
python simple_image_classifier.py --trainer.gpus 2 --trainer.accelerator 'dp'
```

---
@@ -30,10 +30,10 @@ Generic image classifier with an arbitrary backbone (i.e. a simple system)
python backbone_image_classifier.py

# gpus (any number)
python backbone_image_classifier.py --gpus 2
python backbone_image_classifier.py --trainer.gpus 2

# dataparallel
python backbone_image_classifier.py --gpus 2 --distributed_backend 'dp'
python backbone_image_classifier.py --trainer.gpus 2 --trainer.accelerator 'dp'
```

---
@@ -44,10 +44,10 @@ Showing the power of a system... arbitrarily complex training loops
python autoencoder.py

# gpus (any number)
python autoencoder.py --gpus 2
python autoencoder.py --trainer.gpus 2

# dataparallel
python autoencoder.py --gpus 2 --distributed_backend 'dp'
python autoencoder.py --trainer.gpus 2 --trainer.accelerator 'dp'
```
---
# Multi-node example
68 changes: 30 additions & 38 deletions pl_examples/basic_examples/autoencoder.py
@@ -11,8 +11,12 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
MNIST autoencoder example.
from argparse import ArgumentParser
To run:
python autoencoder.py --trainer.max_epochs=50
"""

import torch
import torch.nn.functional as F
@@ -21,6 +25,7 @@

import pytorch_lightning as pl
from pl_examples import _DATASETS_PATH, _TORCHVISION_MNIST_AVAILABLE, cli_lightning_logo
from pytorch_lightning.utilities.cli import LightningCLI
from pytorch_lightning.utilities.imports import _TORCHVISION_AVAILABLE

if _TORCHVISION_AVAILABLE:
@@ -87,44 +92,31 @@ def configure_optimizers(self):
return optimizer


class MyDataModule(pl.LightningDataModule):

def __init__(
self,
batch_size: int = 32,
):
super().__init__()
dataset = MNIST(_DATASETS_PATH, train=True, download=True, transform=transforms.ToTensor())
self.mnist_test = MNIST(_DATASETS_PATH, train=False, download=True, transform=transforms.ToTensor())
self.mnist_train, self.mnist_val = random_split(dataset, [55000, 5000])
self.batch_size = batch_size

def train_dataloader(self):
return DataLoader(self.mnist_train, batch_size=self.batch_size)

def val_dataloader(self):
return DataLoader(self.mnist_val, batch_size=self.batch_size)

def test_dataloader(self):
return DataLoader(self.mnist_test, batch_size=self.batch_size)


def cli_main():
pl.seed_everything(1234)

# ------------
# args
# ------------
parser = ArgumentParser()
parser.add_argument('--batch_size', default=32, type=int)
parser.add_argument('--hidden_dim', type=int, default=64)
parser = pl.Trainer.add_argparse_args(parser)
args = parser.parse_args()

# ------------
# data
# ------------
dataset = MNIST(_DATASETS_PATH, train=True, download=True, transform=transforms.ToTensor())
mnist_test = MNIST(_DATASETS_PATH, train=False, download=True, transform=transforms.ToTensor())
mnist_train, mnist_val = random_split(dataset, [55000, 5000])

train_loader = DataLoader(mnist_train, batch_size=args.batch_size)
val_loader = DataLoader(mnist_val, batch_size=args.batch_size)
test_loader = DataLoader(mnist_test, batch_size=args.batch_size)

# ------------
# model
# ------------
model = LitAutoEncoder(args.hidden_dim)

# ------------
# training
# ------------
trainer = pl.Trainer.from_argparse_args(args)
trainer.fit(model, train_loader, val_loader)

# ------------
# testing
# ------------
result = trainer.test(test_dataloaders=test_loader)
cli = LightningCLI(LitAutoEncoder, MyDataModule, seed_everything_default=1234)
result = cli.trainer.test(cli.model, datamodule=cli.datamodule)
print(result)


89 changes: 44 additions & 45 deletions pl_examples/basic_examples/backbone_image_classifier.py
@@ -11,15 +11,20 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
MNIST backbone image classifier example.
from argparse import ArgumentParser
To run:
python backbone_image_classifier.py --trainer.max_epochs=50
"""

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split

import pytorch_lightning as pl
from pl_examples import _DATASETS_PATH, _TORCHVISION_MNIST_AVAILABLE, cli_lightning_logo
from pytorch_lightning.utilities.cli import LightningCLI
from pytorch_lightning.utilities.imports import _TORCHVISION_AVAILABLE

if _TORCHVISION_AVAILABLE:
@@ -59,7 +64,11 @@ class LitClassifier(pl.LightningModule):
)
"""

def __init__(self, backbone, learning_rate=1e-3):
def __init__(
self,
backbone,
learning_rate: float = 0.0001,
):
super().__init__()
self.save_hyperparameters()
self.backbone = backbone
@@ -92,52 +101,42 @@ def configure_optimizers(self):
# self.hparams available because we called self.save_hyperparameters()
return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

@staticmethod
def add_model_specific_args(parent_parser):
parser = parent_parser.add_argument_group("LitClassifier")
parser.add_argument('--learning_rate', type=float, default=0.0001)
return parent_parser

class MyDataModule(pl.LightningDataModule):

def __init__(
self,
batch_size: int = 32,
):
super().__init__()
dataset = MNIST(_DATASETS_PATH, train=True, download=True, transform=transforms.ToTensor())
self.mnist_test = MNIST(_DATASETS_PATH, train=False, download=True, transform=transforms.ToTensor())
self.mnist_train, self.mnist_val = random_split(dataset, [55000, 5000])
self.batch_size = batch_size

def train_dataloader(self):
return DataLoader(self.mnist_train, batch_size=self.batch_size)

def val_dataloader(self):
return DataLoader(self.mnist_val, batch_size=self.batch_size)

def test_dataloader(self):
return DataLoader(self.mnist_test, batch_size=self.batch_size)


class MyLightningCLI(LightningCLI):

def add_arguments_to_parser(self, parser):
parser.add_class_arguments(Backbone, 'model.backbone')

def instantiate_model(self):
self.config_init['model']['backbone'] = Backbone(**self.config['model']['backbone'])
super().instantiate_model()


def cli_main():
pl.seed_everything(1234)

# ------------
# args
# ------------
parser = ArgumentParser()
parser.add_argument('--batch_size', default=32, type=int)
parser.add_argument('--hidden_dim', type=int, default=128)
parser = pl.Trainer.add_argparse_args(parser)
parser = LitClassifier.add_model_specific_args(parser)
args = parser.parse_args()

# ------------
# data
# ------------
dataset = MNIST(_DATASETS_PATH, train=True, download=True, transform=transforms.ToTensor())
mnist_test = MNIST(_DATASETS_PATH, train=False, download=True, transform=transforms.ToTensor())
mnist_train, mnist_val = random_split(dataset, [55000, 5000])

train_loader = DataLoader(mnist_train, batch_size=args.batch_size)
val_loader = DataLoader(mnist_val, batch_size=args.batch_size)
test_loader = DataLoader(mnist_test, batch_size=args.batch_size)

# ------------
# model
# ------------
model = LitClassifier(Backbone(hidden_dim=args.hidden_dim), args.learning_rate)

# ------------
# training
# ------------
trainer = pl.Trainer.from_argparse_args(args)
trainer.fit(model, train_loader, val_loader)

# ------------
# testing
# ------------
result = trainer.test(test_dataloaders=test_loader)
cli = MyLightningCLI(LitClassifier, MyDataModule, seed_everything_default=1234)
result = cli.trainer.test(cli.model, datamodule=cli.datamodule)
print(result)

