Add MMFlood dataset #2450

lccol · 2024-12-05T16:54:07Z

This PR adds the MMFlood dataset from the paper "MMFlood: A Multimodal Dataset for Flood Delineation From Satellite Imagery". This is a Sentinel-1 + DEM dataset for Image Segmentation.

The original tif files vary in size: the maximum height is 2147 pixels and the maximum width is 2313 pixels (these are the values reported in the docs). The dataset also includes hydrography information, but it is not available for all acquisitions, so the class as currently implemented does not read those tif files.

Example with False Color representation

lccol · 2024-12-05T16:55:51Z

@microsoft-github-policy-service agree

nilsleh

Thanks so much for the contribution, this is great! I have made a first pass of comments below. If you have questions about anything feel free to comment.


adamjstewart

Mostly looks good now, thanks for the hard work!

Only other comment I would make is that the recommended approach for these "curated" geospatial datasets (RasterDatasets containing both images and masks) is to create a dummy dataset for the images, a dummy dataset for the masks, and an IntersectionDataset that combines them. This usually lets you completely skip the __init__ and __getitem__ since it will inherit from RasterDataset. See L7Irish and L8Biome for examples of these. Up to you whether or not you want to do this since you're almost done, but it could make the code a bit cleaner.

adamjstewart · 2024-12-11T23:46:50Z

tests/data/mmflood/data.py

+    MAX_VALUE = 1000.0
+    MIN_VALUE = 0.0
+    RANGE = MAX_VALUE - MIN_VALUE
+    FOLDERS = ['s1_raw', 'DEM', 'mask']


Lowercase would be better for local variables. Just note that range is a builtin and this would shadow it, so maybe name it something else or just use (max_value - min_value) everywhere.
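A minimal sketch of the renaming suggested above, assuming the constants are only used to scale random test values. The function name `random_pixels` and its parameters are hypothetical and not taken from the PR.

```python
import random


def random_pixels(count: int, seed: int = 0) -> list[float]:
    """Return `count` pseudo-random values in [min_value, max_value]."""
    # Lowercase names for local variables; `value_range` avoids shadowing
    # the builtin `range`, which is still needed in the comprehension below.
    max_value = 1000.0
    min_value = 0.0
    value_range = max_value - min_value

    rng = random.Random(seed)
    return [min_value + rng.random() * value_range for _ in range(count)]
```

Keeping the builtin usable matters here because the same function iterates with `range(count)`.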

adamjstewart · 2024-12-11T23:53:17Z

docs/api/datasets/geo_datasets.csv

@@ -20,6 +20,7 @@ Dataset,Type,Source,License,Size (px),Resolution (m)
 `L8 Biome`_,"Imagery, Masks",Landsat,"CC0-1.0","8,900x8,900","15, 30"
 `LandCover.ai Geo`_,"Imagery, Masks",Aerial,"CC-BY-NC-SA-4.0","4,200--9,500",0.25--0.5
 `Landsat`_,Imagery,Landsat,"public domain","8,900x8,900",30
+`MMFlood`_,"Imagery,DEM,Masks","Sentinel, MapZen/TileZen, OpenStreetMap",CC-BY-4.0,"2,147x2,313",20


The paper is CC-BY-4.0, but the data is MIT, so I would use MIT here.

adamjstewart · 2024-12-11T23:55:42Z

torchgeo/datasets/mmflood.py

+    def __init__(
+        self,
+        root: Path = 'data',
+        crs: CRS | None = None,


I would also add a res parameter to match all other GeoDatasets

adamjstewart · 2024-12-11T23:58:26Z

torchgeo/datasets/mmflood.py

+class MMFlood(RasterDataset):
+    """MMFlood dataset.
+
+    `MMFlood <https://huggingface.co/datasets/links-ads/mmflood>`__ dataset is a multimodal flood delineation dataset.


This line is longer than 88 characters, can you add some newlines?

adamjstewart · 2024-12-11T23:58:40Z

torchgeo/datasets/mmflood.py

+            check_folders: if True, verifies pairings of all s1, dem and mask data across all the folders
+            load_all: if True, loads all tif files contained in the "activations" folder in the root folder specified. Otherwise, only acquisitions for the given split are loaded.


Lines too long, wrap at 88 chars
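For illustration, the two argument descriptions could be wrapped as follows. The function `example` is a hypothetical stand-in used only to carry the docstring; the wording of the descriptions is unchanged.

```python
def example(check_folders: bool = False, load_all: bool = False) -> None:
    """Hypothetical stand-in used only to show the wrapped docstring.

    Args:
        check_folders: if True, verifies pairings of all s1, dem and mask data
            across all the folders
        load_all: if True, loads all tif files contained in the "activations"
            folder in the root folder specified. Otherwise, only acquisitions
            for the given split are loaded.
    """
```

Continuation lines are indented one extra level, which leaves room for the deeper indentation of a method docstring while staying under 88 characters.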

adamjstewart · 2024-12-12T00:00:58Z

torchgeo/datamodules/mmflood.py

+    mean = torch.tensor([0.1785585, 0.03574104, 168.45529])
+    median = torch.tensor([0.116051525, 0.025692634, 86.0])
+    std = torch.tensor([2.405442, 0.22719479, 242.74359])


Do these work with both include_dem True and False? It seems like the length should change.
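One way to make the statistics follow the flag is to trim them to the loaded channels. This is only a sketch: it assumes the channel order is two Sentinel-1 channels followed by the DEM, uses plain tuples instead of tensors to stay dependency-free, and `channel_stats` is a hypothetical helper.

```python
# Per-channel statistics quoted in the diff above. The order (two Sentinel-1
# channels, then one DEM channel) is an assumption.
MEAN = (0.1785585, 0.03574104, 168.45529)
STD = (2.405442, 0.22719479, 242.74359)


def channel_stats(
    include_dem: bool,
) -> tuple[tuple[float, ...], tuple[float, ...]]:
    """Return (mean, std) trimmed to the channels that are actually loaded."""
    n_channels = 3 if include_dem else 2
    return MEAN[:n_channels], STD[:n_channels]
```

With tensors, the same slicing would apply to the first dimension of `mean` and `std`.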

nilsleh · 2024-12-12T09:14:59Z

torchgeo/datasets/mmflood.py

+            the merged image
+        """
+        image = self._load_tif(index, modality='s1_raw', query=query).float()
+        if self.include_dem:


In the docstring you mentioned that the DEM is not available for all regions, but is this also explicitly handled? What happens if the DEM is not available? Do I get a tensor with a zero-padded channel dimension?

Or maybe as an additional check, if you have the full dataset downloaded, can you instantiate a datamodule or dataloader over the dataset and iterate through the length of the dataset without errors?

lccol

DEM is available for all the regions. Hydrography maps are missing for many of the tiles (738 missing). They are downloaded but not handled by this class. Should I remove them from the docstring? Or introduce a flag include_hydro in the constructor and load only a portion of the dataset (1012 tiles out of 1748 total)?

nilsleh

Ah sorry, I misread. If you think that hydrography would be useful to users, then I think an include_hydro flag with default False would be a good thing. Then just add to the docstring, like you mentioned here, that anyone who wants to use hydro will only have a subset of the data available.
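The flag discussed above could behave roughly like this. It is a sketch, not the PR's implementation: `Tile`, `has_hydro`, and `select_tiles` are hypothetical names, and how the class actually discovers hydrography files is not shown in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """Hypothetical record describing which files exist for one tile."""

    name: str
    has_hydro: bool


def select_tiles(tiles: list[Tile], include_hydro: bool = False) -> list[Tile]:
    """Keep every tile by default; keep only tiles with hydrography on request."""
    if not include_hydro:
        return list(tiles)
    return [tile for tile in tiles if tile.has_hydro]
```

Defaulting to False keeps the full dataset available unless the user explicitly opts in to the smaller hydrography subset.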

lccol · 2024-12-19T14:29:59Z

Thank you @adamjstewart for your comments. I have addressed all of them, including the conversion of MMFlood to a RasterDataset class, similarly to L7Irish and L8Biome. I have just two questions:

1. I noticed that a few entries have missing values, leading to NaNs. I was thinking of either setting the pixel values and mask values to 0 (option 1) or adding a new entry, missing, to the dict returned by __getitem__ (option 2). It would contain a Tensor of the same shape as the mask, with True where the image has missing values and False otherwise. I checked but haven't found any other dataset with NaN values in it. This last option is probably unusual compared to the other datasets in the library. What do you think?
2. From some tests, I found that some of the tiles partially overlap, so different tif files are merged when iterating over the entire dataset. This should be fine, I guess, since both images and masks are merged in a consistent manner (same order for both Sentinel-1 and mask data in the reverse painter's algorithm). However, the dataset uses the tags in the tif files to store the timestamp of each tile. Is there a way to create the RTree with the temporal information stored in the tags? From my understanding, RasterDataset parses dates directly from the filename.
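The two options in the first question can be sketched together. This assumes a single-channel image stored as nested lists to stay dependency-free, and `fill_missing` is a hypothetical helper; on tensors, `torch.isnan` and `torch.nan_to_num` would play the same roles.

```python
import math


def fill_missing(
    image: list[list[float]], mask: list[list[int]]
) -> tuple[list[list[float]], list[list[int]], list[list[bool]]]:
    """Zero out NaN pixels (option 1) and report where they were (option 2)."""
    # True wherever the image has no valid value.
    missing = [[math.isnan(value) for value in row] for row in image]
    clean_image = [
        [0.0 if gone else value for value, gone in zip(row, flags)]
        for row, flags in zip(image, missing)
    ]
    # The mask is zeroed at the same positions so image and label stay aligned.
    clean_mask = [
        [0 if gone else label for label, gone in zip(row, flags)]
        for row, flags in zip(mask, missing)
    ]
    return clean_image, clean_mask, missing
```

Returning the boolean grid alongside the cleaned data shows what option 2 would add; option 1 alone would simply discard it.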

adamjstewart · 2024-12-20T21:45:49Z

1. I definitely prefer option 1. Our trainers support ignore_index, which can be used to ignore these values during performance computation.
2. You mean the timestamp isn't stored in the filename, it's stored in some kind of metadata? How do you access this metadata? It's possible to override __init__ and extract the appropriate metadata yourself, it's just really ugly.
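Extracting the timestamp yourself might look roughly like the sketch below. The tag name `acquisition_date` and the ISO 8601 format are assumptions, since the thread does not say how the tag is stored, and `time_bounds` is a hypothetical helper. With rasterio, the tags dict would typically come from calling `tags()` on an open dataset.

```python
from datetime import datetime, timezone


def time_bounds(
    tags: dict[str, str], key: str = 'acquisition_date'
) -> tuple[float, float]:
    """Turn a timestamp tag into (mint, maxt) in POSIX seconds."""
    # `key` and the ISO 8601 format are assumptions about the tag contents.
    stamp = datetime.fromisoformat(tags[key]).replace(tzinfo=timezone.utc)
    seconds = stamp.timestamp()
    # A single acquisition instant: minimum and maximum time coincide.
    return seconds, seconds
```

The resulting pair could then be used as the temporal bounds when inserting each file into the index inside an overridden __init__.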

Add MMFlood dataset

76a6941

github-actions bot added the documentation, datasets, testing, and datamodules labels Dec 5, 2024

adamjstewart added this to the 0.7.0 milestone Dec 5, 2024

lccol and others added 2 commits December 6, 2024 11:11

Added tests for MMFloodDataModule

19ee181

Merge branch 'main' into main

344115c

nilsleh requested changes Dec 9, 2024


lccol and others added 3 commits December 10, 2024 12:00

added uncompressed test data folder and datamodule test. added versio…

41c118d

…nadded and fixed _verify

fix assertion

9d0f76f

Merge branch 'main' into main

d9bb5ef

lccol requested a review from nilsleh December 10, 2024 12:51

updated docstring

37ac4ab

adamjstewart reviewed Dec 12, 2024


nilsleh reviewed Dec 12, 2024
