Add MMEarth dataset #2202

nilsleh · 2024-07-31T09:10:10Z

This PR adds the MMEarth dataset. Implementation is tested with the 100k version under the assumption that the dataset format is identical for the other versions.

Some changes to the dataset are still expected, so this is a draft for now. I am opening the PR already to discuss a few points:

As a multimodal dataset, it has various configuration and selection schemes. The dataset takes arguments for modality selection and for band selection within modalities (the current implementation is one proposal; I am open to cleaner suggestions).
For the same reason, we don't have a classic "image" and "label" sample dictionary, because the dataset can be used for classification, segmentation, and pixelwise regression. Is it reasonable to expect people to write transform functions that define the sample as needed for their models?
Normalization statistics are provided with the dataset. Normally we do standardization in the DataModule; however, given the number of modalities and the two normalization schemes (z-score and min-max), I think the dataset is more useful if it supplies this functionality itself.
The most complex part is handling the different modalities, because some of them require extra handling, as seen here.
Additionally, we need to decide what to do with the NaN values and clearly document that.
Currently this is just an implementation to support a non-geo dataset with standard index sampling. For other indexing schemes, we can write an inherited class that also comes with a script that generates the needed parquet file, and download that file from HF.
There are probably things I am missing, so feel free to add to this, @adamjstewart @ando-shah.
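
To make the normalization point concrete, here is a minimal sketch of the two schemes mentioned above (z-score and min-max) applied per band. The function name `normalize`, the layout of the `stats` dict, and the toy values are assumptions for illustration only; they are not the format in which the dataset ships its statistics.

```python
import numpy as np


def normalize(band_data, stats, scheme="z-score"):
    """Normalize an array of shape [C, H, W] with per-band statistics.

    `stats` is a hypothetical dict with per-band lists under the keys
    "mean", "std", "min" and "max"; the real statistics may be stored
    differently.
    """
    data = np.asarray(band_data, dtype=np.float64)

    def per_band(key):
        # Reshape to [C, 1, 1] so the statistic broadcasts over H and W.
        return np.asarray(stats[key], dtype=np.float64).reshape(-1, 1, 1)

    if scheme == "z-score":
        return (data - per_band("mean")) / per_band("std")
    if scheme == "min-max":
        lo, hi = per_band("min"), per_band("max")
        return (data - lo) / (hi - lo)
    raise ValueError(f"Unknown normalization scheme: {scheme!r}")


# Toy example: two bands of a 2x2 tile with made-up statistics.
tile = np.array([[[0.0, 2.0], [4.0, 6.0]], [[10.0, 20.0], [30.0, 40.0]]])
stats = {
    "mean": [3.0, 25.0],
    "std": [2.0, 10.0],
    "min": [0.0, 10.0],
    "max": [6.0, 40.0],
}
z = normalize(tile, stats, "z-score")
mm = normalize(tile, stats, "min-max")
```

Keeping both schemes behind one `scheme` argument is only one possible design; a per-modality choice of scheme would work the same way with one call per modality.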

nilsleh · 2024-08-20T07:46:09Z

@adamjstewart and @ando-shah I think as a basic Dataset this does the job. For any more specific requirements we can subclass this to support other model specific needs.

adamjstewart

I added a new test to ruff

adamjstewart · 2024-08-21T10:06:21Z

Planning to get a working draft of SatlasPretrain first and then we can modify this PR to ensure it matches and meets @ando-shah's needs. These will be some of our first multi-modal datasets, so want to make sure we're doing things consistently (at least at an external user-facing API level).

adamjstewart · 2024-08-27T14:34:20Z

torchgeo/datasets/mmearth.py

+            if modality not in self.all_modalities:
+                raise ValueError(f"'{modality}' is an invalid modality name.")
+
+    def _validate_modality_bands(self, modality_bands: dict[str, list[str]]) -> None:


Validation is good, but not if it makes the code too complex.
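
One way to keep such validation compact is a single pass over the requested selection. This is a sketch under assumptions: `ALL_MODALITY_BANDS` and `validate_selection` are hypothetical names, and the band lists are an illustrative subset rather than the dataset's real table.

```python
# Hypothetical, abbreviated table of valid bands per modality.
ALL_MODALITY_BANDS = {
    "sentinel2": ["B2", "B3", "B4", "B8"],
    "sentinel1_asc": ["VV", "VH", "HH", "HV"],
}


def validate_selection(modality_bands, all_modality_bands=ALL_MODALITY_BANDS):
    """Raise ValueError for unknown modalities or unknown bands."""
    for modality, bands in modality_bands.items():
        if modality not in all_modality_bands:
            raise ValueError(f"'{modality}' is an invalid modality name.")
        # Set difference finds every requested band that is not valid.
        invalid = sorted(set(bands) - set(all_modality_bands[modality]))
        if invalid:
            raise ValueError(f"Invalid bands for '{modality}': {invalid}")


# A valid selection passes silently.
validate_selection({"sentinel1_asc": ["VV", "VH"]})
```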

nilsleh · 2024-10-02T16:48:10Z

Ando requested that Sentinel-1 be separated into ascending and descending. That turned out to be a bit tricky, since MMEarth only stores "sentinel1" as a concatenation of ascending and descending, but I think I implemented it correctly to support band selection, normalization, etc.

nilsleh · 2024-10-04T13:29:49Z

Another point regarding Sentinel-1:

Sentinel-1 ascending and descending are stored separately. However, in the HDF5 file there is just a single sentinel1 key that always has a shape of [8, 128, 128], so if one of ascending or descending, or a subset of bands within the two, is not available, each missing band is still stored as a [1, 128, 128] array of NaNs in the dataset.
In order not to return NaNs from __getitem__, only non-NaN data is now returned. For example, if I request {'sentinel1_asc': ['VV', 'VH', 'HH', 'HV']} in the dataset init but only {'sentinel1_asc': ['VV', 'VH']} is available for a particular tile, then a [2, 128, 128] tensor is returned.
Additionally, the sample contains a dict with the sensors and bands actually found in that particular sample.
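
The behaviour described above can be sketched roughly as follows. This is not the PR's code: `drop_nan_bands` is a hypothetical helper, it uses NumPy arrays instead of tensors, and it assumes a missing band is entirely NaN.

```python
import numpy as np


def drop_nan_bands(data, band_names):
    """Keep only bands of a [C, H, W] array that contain real data.

    Returns the filtered array and the names of the bands that were kept.
    Assumption: an unavailable band is stored as all-NaN.
    """
    data = np.asarray(data)
    # True for each band that is not NaN everywhere.
    keep = ~np.isnan(data).all(axis=(1, 2))
    available = [name for name, kept in zip(band_names, keep) if kept]
    return data[keep], available


# Four requested bands, of which only VV and VH exist for this tile.
requested = ["VV", "VH", "HH", "HV"]
tile = np.full((4, 128, 128), np.nan)
tile[0] = 1.0
tile[1] = 2.0
filtered, available = drop_nan_bands(tile, requested)
```

Returning the list of available bands alongside the data mirrors the extra dict in the sample, so downstream code can tell which channels of the smaller tensor correspond to which bands.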

adamjstewart

Data loader seems much more complicated than it needs to be. Does it satisfy all of Ando's requirements?

nilsleh · 2024-10-08T08:54:36Z

Data loader seems much more complicated than it needs to be. Does it satisfy all of Ando's requirements?

Yes, I have incorporated all the feedback I have gotten from him. I agree it is complicated, but it is also loading data from more than 10 modalities, so it is not a typical single image/target dataset. He said he will try it now and pass on any further feedback.

nilsleh · 2024-10-10T12:10:58Z

@ando-shah found a bug in my implementation. The available band names differed from the sample-specific band names. For example, in era5 the band names include the current or previous date of the sample, so they are not general names, and therefore the data was not retrieved correctly.

adamjstewart · 2024-10-10T12:10:53Z

torchgeo/datasets/mmearth.py

+            root: root directory where dataset can be found
+            subset: one of "MMEarth", "MMEarth64", or "MMEarth100k"
+            modalities: list of modalities to load
+            modality_bands: dictionary of modality bands, see `all_modality_bands`


At the moment, all_modality_bands isn't documented. Should it be?

I meant to point to the class variable all_modality_bands

Yes, but all_modality_bands isn't documented: https://torchgeo--2202.org.readthedocs.build/en/2202/api/datasets.html#torchgeo.datasets.MMEarth

I know you and I would just look at the source code instead of the docs, but most users will only ever look at the docs, and currently this references a variable that isn't documented.

If you want to document it, you can use something like:

#: List of valid modality bands
all_modality_bands = ...

Then you could link to it like:

modality_bands: dictionary of modality bands, see :attr:`all_modality_bands`

torchgeo/datasets/mmearth.py

nilsleh · 2024-10-10T13:45:13Z

@adamjstewart @ando-shah while trying to write a subclass based on a metadata file, I found the image_{} and mask_{} naming of the sample keys annoying, because you need to remove the prefixes again to match the metadata naming of the modalities. Since there isn't a torchgeo trainer that works out of the box on this dataset anyway, maybe we can just return the modality name itself? I am wondering whether a fixed set of prefixes creates more confusion, given that some modalities could be used as either an image or a mask.

adamjstewart · 2024-10-10T14:47:35Z

maybe we can just return the modality name itself?

Then kornia.augmentation.AugmentationSequential will no longer work.
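
A possible middle ground, sketched under assumptions: keep the prefixed keys in the sample so augmentation pipelines that rely on them keep working, and strip the prefix only where a subclass has to match metadata names. `strip_prefix` and the sample keys below are hypothetical and purely illustrative.

```python
# Prefixes used for sample keys, as discussed in this thread.
PREFIXES = ("image_", "mask_")


def strip_prefix(key):
    """Return the bare modality name for a prefixed sample key."""
    for prefix in PREFIXES:
        if key.startswith(prefix):
            return key[len(prefix):]
    # Keys without a known prefix are returned unchanged.
    return key


# Illustrative sample; the values stand in for tensors.
sample = {"image_sentinel2": "s2", "mask_dynamic_world": "dw", "lat": 1.0}
by_modality = {strip_prefix(key): value for key, value in sample.items()}
```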

nilsleh added 2 commits July 31, 2024 08:10

more tests

5f3e257

more comments

b1d472b

nilsleh marked this pull request as draft July 31, 2024 09:10

github-actions bot added the documentation, datasets, testing, and datamodules labels Jul 31, 2024

adamjstewart added this to the 0.6.0 milestone Jul 31, 2024

nilsleh and others added 7 commits July 31, 2024 09:13

files for dm

c9f1408

lazy import

b7ce2fb

metadata in raw format

64cac20

mypy

d5eb7fd

mypy another one

376d4ab

Merge branch 'main' into mmearth

d570081

Merge branch 'main' into mmearth

62fe6e1

nilsleh marked this pull request as ready for review August 20, 2024 07:44

adamjstewart reviewed Aug 20, 2024

nilsleh added 3 commits August 20, 2024 07:51

ruff

b452a4d

merge main

eecc5c7

class var

50229f3

adamjstewart requested changes Aug 27, 2024

adamjstewart modified the milestones: 0.6.0, 0.7.0 Aug 29, 2024

nilsleh added 2 commits October 2, 2024 07:38

merge main

df1251b

requested changes

10d2a99

github-actions bot removed the datamodules label Oct 2, 2024

nilsleh added 2 commits October 2, 2024 09:14

ds_version -> subset

d32f624

separate Sentinel 1 ascending and descending

c95c0e8

nilsleh added 3 commits October 2, 2024 16:55

remove mmearth from datamodule docs

f84af05

separate reading item for subclasses

f5b3fa2

sentinel1 only return available data

f360b9c

nilsleh added 2 commits October 4, 2024 13:32

remove split from dataset

0f47444

fix tests

e37027f

adamjstewart reviewed Oct 8, 2024

nilsleh added 4 commits October 8, 2024 09:03

requests

46d6753

requests

c810be5

resolution

9ea0960

more band logic

be756ad

adamjstewart reviewed Oct 10, 2024

review

420f143

typo

fe01e88

adamjstewart reviewed Oct 10, 2024

Update torchgeo/datasets/mmearth.py

adf3413

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
