Add MMFlood dataset #2450

Open
wants to merge 7 commits into base: main

lccol commented Dec 5, 2024

This PR adds the MMFlood dataset from the paper "MMFlood: A Multimodal Dataset for Flood Delineation From Satellite Imagery". This is a Sentinel-1 + DEM dataset for Image Segmentation.

The original tif files have variable dimensions: the maximum height is 2147 pixels and the maximum width is 2313 pixels (these are the values reported in the docs). The dataset also includes hydrography information, but it is not available for all acquisitions, so the implemented class currently does not read those tif files.

Example with false-color representation:
[image]

github-actions bot added the documentation, datasets, testing, and datamodules labels Dec 5, 2024
Author

lccol commented Dec 5, 2024

@microsoft-github-policy-service agree

adamjstewart added this to the 0.7.0 milestone Dec 5, 2024
Collaborator

nilsleh left a comment

Thanks so much for the contribution, this is great! I have made a first pass of comments below. If you have questions about anything feel free to comment.

tests/datamodules/test_mmflood.py (outdated, resolved)
torchgeo/datamodules/mmflood.py (outdated, resolved)
torchgeo/datasets/mmflood.py (resolved)
torchgeo/datasets/mmflood.py (outdated, resolved)
torchgeo/datasets/mmflood.py (outdated, resolved)
torchgeo/datamodules/mmflood.py (outdated, resolved)
torchgeo/datasets/mmflood.py (resolved)
lccol requested a review from nilsleh December 10, 2024 12:51
Collaborator

adamjstewart left a comment

Mostly looks good now, thanks for the hard work!

Only other comment I would make is that the recommended approach for these "curated" geospatial datasets (RasterDatasets containing both images and masks) is to create a dummy dataset for the images, a dummy dataset for the masks, and an IntersectionDataset that combines them. This usually lets you completely skip the __init__ and __getitem__ since it will inherit from RasterDataset. See L7Irish and L8Biome for examples of these. Up to you whether or not you want to do this since you're almost done, but it could make the code a bit cleaner.

Comment on lines +15 to +18
MAX_VALUE = 1000.0
MIN_VALUE = 0.0
RANGE = MAX_VALUE - MIN_VALUE
FOLDERS = ['s1_raw', 'DEM', 'mask']
Collaborator

Lowercase would be better for local variables. Note that range is a builtin and a lowercase name would shadow it, so maybe name it something else or just use (max_value - min_value) everywhere.
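A minimal sketch of the renaming suggested in this comment. Only the two constant values come from the diff; the helper `rescale`, the name `value_range`, and the idea that the constants are used for min-max scaling are assumptions made for illustration.

```python
# Hypothetical example: only the values 1000.0 and 0.0 come from the diff.
max_value = 1000.0
min_value = 0.0
# Named `value_range` rather than `range`, which would shadow the builtin.
value_range = max_value - min_value


def rescale(values: list[float]) -> list[float]:
    """Map values from [min_value, max_value] to [0, 1]."""
    return [(v - min_value) / value_range for v in values]


# The builtin `range` is still usable because nothing rebinds that name.
scaled = rescale([float(v) for v in range(0, 1001, 250)])
print(scaled)
```

Inlining `(max_value - min_value)` at each use site, as the comment also suggests, avoids the extra name entirely.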

@@ -20,6 +20,7 @@ Dataset,Type,Source,License,Size (px),Resolution (m)
`L8 Biome`_,"Imagery, Masks",Landsat,"CC0-1.0","8,900x8,900","15, 30"
`LandCover.ai Geo`_,"Imagery, Masks",Aerial,"CC-BY-NC-SA-4.0","4,200--9,500",0.25--0.5
`Landsat`_,Imagery,Landsat,"public domain","8,900x8,900",30
`MMFlood`_,"Imagery,DEM,Masks","Sentinel, MapZen/TileZen, OpenStreetMap",CC-BY-4.0,"2,147x2,313",20
Collaborator

The paper is CC-BY-4.0, but the data is MIT, so I would use MIT here.

def __init__(
    self,
    root: Path = 'data',
    crs: CRS | None = None,
Collaborator

I would also add a res parameter to match all other GeoDatasets

class MMFlood(RasterDataset):
    """MMFlood dataset.

    `MMFlood <https://huggingface.co/datasets/links-ads/mmflood>`__ dataset is a multimodal flood delineation dataset.
Collaborator

This line is longer than 88 characters. Can you add some newlines?

Comment on lines +204 to +205
check_folders: if True, verifies pairings of all s1, dem and mask data across all the folders
load_all: if True, loads all tif files contained in the "activations" folder in the root folder specified. Otherwise, only acquisitions for the given split are loaded.
Collaborator

Lines too long, wrap at 88 chars
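One way to produce the wrapped text is the standard-library `textwrap` module. The sketch below rewraps the `load_all` description from the diff at 88 characters. The indentation widths (12 and 16 spaces) are assumptions about how the docstring is nested, not values taken from the PR.

```python
import textwrap

# Argument description copied from the diff above.
text = (
    'load_all: if True, loads all tif files contained in the "activations" '
    'folder in the root folder specified. Otherwise, only acquisitions for '
    'the given split are loaded.'
)

# Indentation widths are assumed; adjust them to the docstring's nesting.
wrapped = textwrap.fill(
    text,
    width=88,
    initial_indent=' ' * 12,
    subsequent_indent=' ' * 16,
)
print(wrapped)
```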

Comment on lines +25 to +27
mean = torch.tensor([0.1785585, 0.03574104, 168.45529])
median = torch.tensor([0.116051525, 0.025692634, 86.0])
std = torch.tensor([2.405442, 0.22719479, 242.74359])
Collaborator

Do these work for both include_dem=True and include_dem=False? It seems like the length should change.
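A minimal sketch of one way to make the statistics follow `include_dem`. The values are copied from the diff; the channel order (two Sentinel-1 bands followed by the DEM) and the helper name `select_stats` are assumptions for illustration, and plain tuples stand in for the tensors.

```python
# Values copied from the diff above. The channel order (two Sentinel-1 bands,
# then the DEM) is an assumption made for this sketch.
MEAN = (0.1785585, 0.03574104, 168.45529)
STD = (2.405442, 0.22719479, 242.74359)


def select_stats(
    include_dem: bool,
) -> tuple[tuple[float, ...], tuple[float, ...]]:
    """Return (mean, std) with one entry per channel that is actually loaded."""
    n_channels = 3 if include_dem else 2
    return MEAN[:n_channels], STD[:n_channels]


mean, std = select_stats(include_dem=False)
print(len(mean), len(std))
```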

        the merged image
    """
    image = self._load_tif(index, modality='s1_raw', query=query).float()
    if self.include_dem:
Collaborator

In the docstring you mentioned that the DEM is not available for all regions, but is this also explicitly handled? What happens if the DEM is not available? Do I get a tensor with a zero-padded channel dimension?

Or maybe as an additional check, if you have the full dataset downloaded, can you instantiate a datamodule or dataloader over the dataset and iterate through the length of the dataset without errors?

Author

DEM is available for all the regions. Hydrography maps are missing for many of the tiles (738 missing). They are downloaded but not handled by this class. Should I remove them from the docstring, or introduce an include_hydro flag in the constructor and load only a portion of the dataset (1012 tiles out of 1748 total)?

Collaborator

Ah sorry, I misread. If you think hydrography would be useful to users, then an include_hydro flag with default False would be a good addition. Then just note in the docstring, as you mentioned here, that anyone who wants to use hydrography will only have a subset of the data available.
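A rough sketch of how such a flag could restrict the dataset to tiles that have hydrography. The `Tile` record and `select_tiles` helper are hypothetical and only illustrate the filtering; they are not the PR's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """Hypothetical record describing one acquisition tile."""

    name: str
    has_hydro: bool


def select_tiles(tiles: list[Tile], include_hydro: bool = False) -> list[Tile]:
    """Keep every tile by default; keep only tiles with hydrography on request."""
    if not include_hydro:
        return list(tiles)
    return [tile for tile in tiles if tile.has_hydro]


tiles = [Tile('a', True), Tile('b', False), Tile('c', True)]
print(len(select_tiles(tiles)), len(select_tiles(tiles, include_hydro=True)))
```

With the default of False, existing users keep the full dataset, and the smaller subset only applies when hydrography is explicitly requested.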

Author

lccol commented Dec 19, 2024

Thank you @adamjstewart for your comments. I have addressed all of them, including the conversion of MMFlood to a RasterDataset class, similar to L7Irish and L8Biome. I have just two questions:

  • I noticed that a few entries have missing values, leading to NaNs. I was thinking of either setting all pixel values and mask values to 0 (option 1) or adding a new entry, missing, to the dict returned by __getitem__ (option 2). It would contain a Tensor of the same shape as the mask, with True where the image has missing values and False otherwise. I checked but could not find any other dataset with NaN values. This last option is probably unusual compared to the other datasets in the library. What do you think?
  • From some tests, I found that some of the tiles partially overlap, so different tif files are merged when iterating over the entire dataset. This should be fine, I guess, since both images and masks are merged in a consistent manner (same order for both Sentinel-1 and mask data in the reverse painter algorithm). However, the dataset uses the tags in the tif files to store the timestamp of each tile. Is there a way to create the RTree with the temporal information stored in the tags? From my understanding, RasterDataset parses dates directly from the filename...

Collaborator

adamjstewart commented

  • I definitely prefer option 1. Our trainers support ignore_index which can be used to ignore these values during performance computation.
  • You mean the timestamp isn't stored in the filename, it's stored in some kind of metadata? How do you access this metadata? It's possible to override __init__ and extract the appropriate metadata yourself, it's just really ugly.


4 participants