
DataModules: run all data augmentation on the GPU #992

Merged: 108 commits into main on Jan 23, 2023
Conversation

@adamjstewart (Collaborator) commented Dec 30, 2022:

This PR overhauls all of our data modules to improve uniformity. This includes the following changes:

  • Add GeoDataModule and NonGeoDataModule base classes to reduce code duplication
  • Only instantiate the datasets that are needed for a particular stage
  • Replace torchvision with kornia (better support for MSI, GPU, inverse)
  • Replace dataset transforms with on_after_batch_transfer (CPU -> GPU, sample -> batch, faster)
  • Remove instance methods for preprocessing (fixes #886: "Trainers: num_workers > 0 results in pickling error on macOS/Windows")
  • Fix bug where train/val/test split would differ for each stage and images would leak between sets
  • Deprecate torchgeo.transforms.AugmentationSequential (use kornia.augmentation.AugmentationSequential instead)
  • Deprecate torchgeo.datamodules.utils.dataset_split (use torch.utils.data.random_split instead)

In a future PR, I'm planning on extending this to the rest of our transforms:

  • Rewrite all index transforms to be compatible with Kornia (Convert all index transforms to Kornia #999)
  • Update tutorials to use Kornia with our transforms
  • Upstream and remove our custom transforms and AugmentationSequential hacks

Fixes #619
Fixes #337
Fixes #336
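[Editor's note] The core idea behind moving augmentation into on_after_batch_transfer can be sketched in plain Python. All names below are illustrative stand-ins, not the actual torchgeo/Lightning classes:

```python
# Minimal structural sketch of the on_after_batch_transfer pattern adopted in
# this PR: augmentations run once per *batch*, after Lightning has moved the
# batch to the device, instead of once per *sample* in Dataset.__getitem__.
class DataModuleSketch:
    def __init__(self, train_aug=None, aug=None):
        self.train_aug = train_aug  # stage-specific augmentation
        self.aug = aug              # shared fallback for val/test/predict
        self.training = False       # stand-in for self.trainer.training

    def on_after_batch_transfer(self, batch, dataloader_idx):
        # `batch` already lives on the target device here, so a Kornia
        # AugmentationSequential applied at this point would run on the GPU.
        aug = self.train_aug if self.training else self.aug
        return aug(batch) if aug is not None else batch


dm = DataModuleSketch(
    train_aug=lambda b: {**b, "image": [x * 2 for x in b["image"]]},
)
dm.training = True
out = dm.on_after_batch_transfer({"image": [1, 2, 3]}, 0)
# out["image"] == [2, 4, 6]
```

Because the hook fires after the batch reaches the device, the same augmentation object serves CPU and GPU runs without any dataset-level changes.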

@adamjstewart added this to the 0.4.0 milestone on Dec 30, 2022
@github-actions bot added the datamodules, datasets, transforms, testing, and documentation labels on Dec 30, 2022
@adamjstewart added the backwards-incompatible label on Jan 2, 2023
@github-actions bot added the trainers label on Jan 2, 2023
torchgeo/datamodules/geo.py:

    """
    dataset = self.val_dataset or self.dataset
    if dataset is not None:
        if hasattr(dataset, "plot"):

@adamjstewart (Collaborator, author):

If we enforce that all datasets must have a plot method, we could remove this check. Currently the only ones lacking one are VHR-10 (WIP) and our point datasets (GBIF, iNaturalist, EDDMapS).

    self.train_aug: Optional[Transform] = None
    self.val_aug: Optional[Transform] = None
    self.test_aug: Optional[Transform] = None
    self.predict_aug: Optional[Transform] = None

@adamjstewart (Collaborator, author):

The idea for all of these is that you can either define a single attribute (self.foo) or a different attribute for each stage (self.train_foo, self.val_foo, etc.).
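[Editor's note] This stage-specific fallback can be sketched with a hypothetical helper (not torchgeo code):

```python
def aug_for_split(module, split):
    """Hypothetical helper: prefer `module.{split}_aug` when it is set,
    otherwise fall back to the shared `module.aug` attribute."""
    return getattr(module, f"{split}_aug", None) or getattr(module, "aug", None)


class Module:
    aug = "shared-default"        # used by every stage without its own aug
    train_aug = "train-specific"  # overrides the default for training only


m = Module()
# aug_for_split(m, "train") -> "train-specific"
# aug_for_split(m, "val")   -> "shared-default"
```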

    stage: Either 'fit', 'validate', 'test', or 'predict'.
    """
    if stage in ["fit"]:
        self.train_dataset = self.dataset_class(  # type: ignore[call-arg]

@adamjstewart (Collaborator, author):

We need to ignore the type warning here because not all datasets accept a split argument.

    MisconfigurationException: If :meth:`setup` does not define a
        'train_dataset'.
    """
    dataset = self.train_dataset or self.dataset

@adamjstewart (Collaborator, author):

Note that datasets/samplers with length 0 also evaluate to False, so this may lead to red herrings.
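[Editor's note] The red herring in question: a length-0 dataset is falsy, so `or` skips it even though it is not None. A tiny illustration:

```python
class EmptyDataset:
    def __len__(self) -> int:
        return 0


empty = EmptyDataset()
fallback = "some-other-dataset"

# `empty or fallback` picks the fallback even though `empty` is not None,
# because an object with __len__() == 0 evaluates to False...
chosen = empty or fallback
# ...whereas an explicit None check keeps the empty dataset.
explicit = empty if empty is not None else fallback
```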

"""
# Non-Tensor values cannot be moved to a device
del batch["crs"]
del batch["bbox"]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may decide to remove these completely but didn't do that for this PR since it requires a ton of unrelated changes. Will reconsider when working on #985.

@adamjstewart adamjstewart marked this pull request as ready for review January 4, 2023 16:40
@adamjstewart (Collaborator, author):

I'm having trouble understanding the failing BYOL tests. The same datamodules work fine with SemanticSegmentationTask. If anyone can figure out how to fix these, we can see how much coverage we lack and fix that.

@adamjstewart (Collaborator, author):

The last remaining test failure is due to discrepancies between Chesapeake and all other datasets. All of our datasets return a mask of shape "h w" but Chesapeake returns "c h w". This breaks everything. Can't wait until PEP 646 is supported in PyTorch...

How should we handle this? It doesn't seem like Chesapeake can be changed to "h w" since the prior labels are c=4. We could change every other dataset to be "c h w" like Kornia expects, but that sounds like a lot more work. We could also write a custom AugmentationSequential just for Chesapeake that handles things properly. Our AugmentationSequential only works for batches, not samples.
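[Editor's note] One way to reconcile the two conventions, sketched as a shape-only helper (hypothetical; the real fix would have to live in an AugmentationSequential wrapper and operate on tensors):

```python
def to_channelled_mask_shape(shape: tuple) -> tuple:
    """Hypothetical sketch: Kornia expects masks with an explicit channel
    dimension, so an "h w" mask gains a singleton channel while a
    Chesapeake-style "c h w" mask (c=4 prior labels) passes through."""
    if len(shape) == 2:        # "h w"   -> "1 h w"
        return (1, *shape)
    if len(shape) == 3:        # already "c h w"
        return shape
    raise ValueError(f"unexpected mask shape: {shape}")
```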

@adamjstewart adamjstewart force-pushed the datamodules/gpu branch 2 times, most recently from bf2bbb7 to 3d27e5d Compare January 15, 2023 18:49
@adamjstewart (Collaborator, author):

@ashnair1 @nilsleh how would you propose to solve the ExtractTensorPatches OOM issue?

Easiest fix is to use RandomNCrop instead (this is what we do during train). I'm happy to do this. However, it doesn't really make sense during val/test, and isn't useful at all during predict.

Alternative fix would be to introduce a GridNonGeoSampler that works similarly to GridGeoSampler and allows a single image to span multiple batches. However, this would require all NonGeoDatasets to change from __getitem__(idx: int) to __getitem__(idx: int, rows: slice, cols: slice) which would be a ton of work.

Any other ideas for how to handle this?

@ashnair1 (Collaborator):

> @ashnair1 @nilsleh how would you propose to solve the ExtractTensorPatches OOM issue? […]


A trivial solution to the OOM issue would be to run inference iteratively whenever we use _ExtractTensorPatches.

For example, in the validation_step of torchgeo/trainers/segmentation.py, we iterate over the batch instead of passing in the entire batch:

 def validation_step(self, *args: Any, **kwargs: Any) -> None:
     """Compute validation loss and log example predictions.

     Args:
         batch: the output of your DataLoader
         batch_idx: the index of this batch
     """
     batch = args[0]
     batch_idx = args[1]
     x = batch["image"]
     y = batch["mask"]
-    y_hat = self(x)
+    from einops import rearrange
+
+    # Run inference one image at a time to stay within memory, then restack.
+    y_hat_list = []
+    for i in range(x.shape[0]):
+        out = rearrange(x[i], "c h w -> 1 c h w")
+        out = self(out)
+        y_hat_list.append(out)
+    # einops accepts a list of tensors, treating the list as the leading axis.
+    y_hat = rearrange(y_hat_list, "b 1 c h w -> b c h w")
     y_hat_hard = y_hat.argmax(dim=1)

     loss = self.loss(y_hat, y)

@adamjstewart (Collaborator, author):

> run inference in an iterative manner whenever we use _ExtractTensorPatches

I don't think there's an easy way to tell whether or not we are using _ExtractTensorPatches. We could say "if batch size > X run iteratively" but I'm not sure what X should be.

@ashnair1 (Collaborator):

How about making this the default behaviour for the val, test, and predict steps?

X depends on the hardware's available memory, which can only be determined by the user, so I don't think we can go down that route.

@adamjstewart (Collaborator, author):

I think switching every val/test/predict step from batched processing to a for-loop over the batch would be extremely bad for speed.

@nilsleh (Collaborator) commented Jan 17, 2023:

> We could say "if batch size > X run iteratively" but I'm not sure what X should be.

I don't know if it is a desired approach, but could you wrap y_hat = self(x) in a try/except block checking for OOM, and do the looping in the except block?

@adamjstewart (Collaborator, author):

If I recall correctly, it isn't possible to check for OOM; the program crashes without raising an error. But correct me if I'm wrong. Also, it would be difficult to get test coverage of that branch.

@nilsleh (Collaborator) commented Jan 17, 2023:

The PyTorch docs seem to suggest a possible way: OOM is raised as a RuntimeError that can be caught. But, as you said, that branch would be difficult to cover in tests.
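[Editor's note] If the RuntimeError route were pursued, the fallback might look roughly like this. The model and batch below are plain-Python stand-ins, not torchgeo code; PyTorch reports CUDA OOM as a RuntimeError whose message contains "out of memory":

```python
def forward_with_oom_fallback(model, batch):
    """Sketch: try full-batch inference, and only on an out-of-memory
    RuntimeError fall back to looping over individual samples."""
    try:
        return model(batch)
    except RuntimeError as err:
        if "out of memory" not in str(err):
            raise  # unrelated errors still propagate
        # Per-sample fallback: much slower, but fits in memory.
        return [model([sample])[0] for sample in batch]


def tiny_model(batch):
    # Stand-in model that "OOMs" on batches larger than one sample.
    if len(batch) > 1:
        raise RuntimeError("CUDA out of memory")
    return [x + 1 for x in batch]


preds = forward_with_oom_fallback(tiny_model, [1, 2, 3])
# preds == [2, 3, 4]
```

As noted above, exercising the except branch in CI (where there is no real GPU to exhaust) is the hard part.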

@ashnair1 (Collaborator) left a comment:

Currently, training fails during the sanity-checking (validation) step at fig = datamodule.plot(sample): fig is None because self.val_dataset (of type torch.utils.data.Subset) does not have a plot method.
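[Editor's note] One possible fix is to unwrap Subset before looking up plot. A sketch with a stand-in Subset class (not the actual patch):

```python
class Subset:
    """Stand-in for torch.utils.data.Subset: wraps a dataset plus indices."""
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices


class PlottableDataset:
    def plot(self, sample):
        return f"figure for {sample}"


def resolve_plot(dataset):
    """Walk through Subset wrappers to the underlying dataset's plot method,
    returning None when no wrapped dataset defines one."""
    while isinstance(dataset, Subset):
        dataset = dataset.dataset
    return getattr(dataset, "plot", None)


plot = resolve_plot(Subset(PlottableDataset(), indices=[0, 1]))
# plot is not None, and plot("x") == "figure for x"
```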

"""
output: Dict[str, Any] = {}
output["image"] = torch.stack([sample["image"] for sample in batch])
output["boxes"] = [sample["boxes"] for sample in batch]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@calebrob6 starting a thread here so we can discuss how to handle NASA Marine Debris.

I think the solution in this collate_fn is actually correct, we want a list of tensors, not a single tensor: kornia/kornia#1497

Note that the name is wrong, we should be used bbox_xyxy or bbox_xywh instead of bbox (boxes is translated to bbox internally in our AugmentationSequential wrapper) but we can fix that when working on #985.
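[Editor's note] Why a list rather than a stack: samples can contain different numbers of boxes, so the box tensors cannot be stacked into one tensor. A sketch with plain lists standing in for tensors:

```python
def collate_detection(batch):
    """Sketch of the collate_fn quoted above: images share a shape and could
    be stacked (torch.stack in the real code), but boxes stay a per-sample
    list because each sample may contain a different number of boxes."""
    return {
        "image": [sample["image"] for sample in batch],
        "boxes": [sample["boxes"] for sample in batch],
    }


out = collate_detection([
    {"image": "img0", "boxes": [[0, 0, 1, 1]]},
    {"image": "img1", "boxes": [[0, 0, 2, 2], [1, 1, 3, 3], [2, 2, 4, 4]]},
])
# len(out["boxes"][0]) == 1 and len(out["boxes"][1]) == 3
```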

@adamjstewart (Collaborator, author):

Okay, see if the last commit fixes this.

@adamjstewart (Collaborator, author):

@calebrob6 I believe I've addressed all review comments, let me know if you find anything else to fix.

@calebrob6 (Member):

Yep that did it, nice job!

@adamjstewart adamjstewart merged commit 55f74da into main Jan 23, 2023
@adamjstewart adamjstewart deleted the datamodules/gpu branch January 23, 2023 22:08
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* DataModules: run all data augmentation on the GPU

* Passing tests

* Update BigEarthNet

* Break ChesapeakeCVPR

* Update COWC

* Update Cyclone

* Update ETCI2021

* mypy fixes

* Update FAIR1M

* Update Inria

* Update LandCoverAI

* Update LoveDA

* Update NAIP

* Update NASA

* Update OSCD

* Update RESISC45

* Update SEN12MS

* Update So2Sat

* Update SpaceNet

* Update UCMerced

* Update USAVars

* Update xview

* Remove seed

* mypy fixes

* OSCD hacks

* Add NonGeoDataModule base class

* Fixes

* Add base class to docs

* mypy fixes

* Fix several tests

* Fix Normalize

* Syntax error

* Fix bigearthnet

* Fix dtype

* Consistent kornia import

* Get regression datasets working

* Fix detection tests

* Fix some chesapeake bugs

* Fix several segmentation issues

* isort fixes

* Undo breaking change

* Remove more code duplication, standardize docstrings

* mypy fixes

* Add default augmentation

* Augmentations can be any callable

* Fix datasets tests

* Fix datamodule tests

* Fix more datamodules

* Typo fix

* Set up val_dataset even when fit

* Fix classification tests

* Fix ETCI2021

* Fix SEN12MS

* Add GeoDataModule base class

* Fix several chesapeake bugs

* Fix dtype and shape

* Fix crs/bbox issue

* Fix test dtype

* Fix unequal size stacking error

* flake8 fix

* Better checks on sampler

* Fix bug introduced in NAIP dm

* Fix chesapeake dimensions

* Add one to mask

* Fix missing imports

* Fix batch size

* Simplify augmentations

* Don't run test or predict without datasets

* Fix tests

* Allow shared dataset

* One more try

* Fix typo

* Fix another typo

* Fix Chesapeake dimensions

* Apply augmentations during sanity check too

* Don't reuse fixtures

* Increase coverage

* Fix ETCI tests

* Test predict_step

* Test all loss methods

* Simplify validation plotting

* Document new classes

* Fix plotting

* Plotting should be robust in case dataset does not contain RGB bands

* Fix flake8

* 100% coverage of trainers

* Add lightning-lite dependency

* Revert "Add lightning-lite dependency"

This reverts commit 1df7291.

* Define our own MisconfigurationException

* Properly test new data module base classes

* Fix mistake in setup call

* ExtractTensorPatches runs into OOM errors

* Test both fast_dev_run True and False

* Fix plot methods

* Fix OSCD tests

* Fix bug with inconsistent train/val/test splits between stages

* Fix issues with images of different sizes

* Fix OSCD tests

* Fix OSCD tests

* Bad rebase

* No trainer for OSCD so no need for config

* Bad rebase

* plot: only works during validation

* Fix collation of NASA Marine Debris dataset

* flake8 fix

* Quick test

* Revert "Quick test"

This reverts commit f465efc.

* 56 workers is a bit excessive

Co-authored-by: Caleb Robinson <calebrob6@gmail.com>