Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MMEarth dataset #2202

Open
wants to merge 28 commits into
base: main
Choose a base branch
from
Open

Add MMEarth dataset #2202

wants to merge 28 commits into from

Conversation

nilsleh
Copy link
Collaborator

@nilsleh nilsleh commented Jul 31, 2024

This PR adds the MMEarth dataset. Implementation is tested with the 100k version under the assumption that the dataset format is identical for the other versions.

There are some expected changes to the dataset so still a draft. But opening a PR already for some discussion points:

  • As a multi modal dataset there are various configuration and selection schemes. The dataset has arguments for modality and band selection within modalities (the current implementation is a suggestion, but open for cleaner suggestions)
  • For that same reason, we don't have a classic "image" and "label" sample dictionary because there are various ways of using this dataset across classification, segmentation and pixelwise regression. Is it reasonable to expect people to write transform functions that would define the sample as needed for their models?
  • Normalization statistics are provided with the dataset. Normally we do standardization in the DataModule, however, given the complexity of various modalities and also two normalization schemes (z-score and min-max) I think it makes the dataset more useful when supplying this functionality in the dataset already
  • The most complex part is handling the different modalities because some of them require some extra handling as seen here
  • Additionally, need to decide what to do with the nan-values and clearly comment that
  • currently this is just an implementation to support a non-geo dataset with standard index sampling, for other indexing schemes we can write an inherited class that also has a script that generates the needed parquet file and download it from HF
  • probably things I am missing so feel free to add @adamjstewart @ando-shah

@nilsleh nilsleh marked this pull request as draft July 31, 2024 09:10
@github-actions github-actions bot added documentation Improvements or additions to documentation datasets Geospatial or benchmark datasets testing Continuous integration testing datamodules PyTorch Lightning datamodules labels Jul 31, 2024
@adamjstewart adamjstewart added this to the 0.6.0 milestone Jul 31, 2024
@nilsleh nilsleh marked this pull request as ready for review August 20, 2024 07:44
@nilsleh
Copy link
Collaborator Author

nilsleh commented Aug 20, 2024

@adamjstewart and @ando-shah I think as a basic Dataset this does the job. For any more specific requirements we can subclass this to support other model specific needs.

Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new test to ruff

torchgeo/datasets/mmearth.py Outdated Show resolved Hide resolved
torchgeo/datasets/mmearth.py Outdated Show resolved Hide resolved
@adamjstewart
Copy link
Collaborator

Planning to get a working draft of SatlasPretrain first and then we can modify this PR to ensure it matches and meets @ando-shah's needs. These will be some of our first multi-modal datasets, so want to make sure we're doing things consistently (at least at an external user-facing API level).

docs/api/non_geo_datasets.csv Outdated Show resolved Hide resolved
tests/datasets/test_mmearth.py Outdated Show resolved Hide resolved
torchgeo/datasets/mmearth.py Outdated Show resolved Hide resolved
torchgeo/datamodules/mmearth.py Outdated Show resolved Hide resolved
torchgeo/datamodules/mmearth.py Outdated Show resolved Hide resolved
torchgeo/datasets/mmearth.py Outdated Show resolved Hide resolved
torchgeo/datasets/mmearth.py Show resolved Hide resolved
if modality not in self.all_modalities:
raise ValueError(f"'{modality}' is an invalid modality name.")

def _validate_modality_bands(self, modality_bands: dict[str, list[str]]) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation is good, but not if it makes the code too complex.

torchgeo/datasets/mmearth.py Outdated Show resolved Hide resolved
torchgeo/datasets/mmearth.py Outdated Show resolved Hide resolved
@adamjstewart adamjstewart modified the milestones: 0.6.0, 0.7.0 Aug 29, 2024
@github-actions github-actions bot removed the datamodules PyTorch Lightning datamodules label Oct 2, 2024
@nilsleh
Copy link
Collaborator Author

nilsleh commented Oct 2, 2024

Ando requested that Sentinel 1 should be separated into ascending and descending. That turned out to be a bit tricky, since MMEarth only stores "sentinel1" as a concatenation of ascending and descending, but I think I implemented it correctly to support band selection and normalization etc

@nilsleh
Copy link
Collaborator Author

nilsleh commented Oct 4, 2024

As another point regarding Sentinel1:

  • Sentinel 1 ascending and descending are stored separately, however, in the hdf5 file, there is just a single sentinel1 key that always has a shape of [8, 128, 128] so if one of ascending or descending or a subset of bands within the two is not available it is still stored as [1, 128,128] of nans in the dataset
  • in order to not return Nans from the __getitem__ , only non-nan data is now returned, i.e. if I request {'sentinel1_asc': ['VV', 'VH', 'HH', 'HV']} in the dataset init but for the particular tile only {'sentinel1_asc': ['VV', 'VH']} is available, then a [2, 128, 128] tensor will be returned
  • additionally, the sample contains a dict with the sensor and bands actually found in this particular sample

Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data loader seems much more complicated than it needs to be. Does it satisfy all of Ando's requirements?

docs/api/datasets/non_geo_datasets.csv Outdated Show resolved Hide resolved
torchgeo/datasets/mmearth.py Show resolved Hide resolved
torchgeo/datasets/mmearth.py Outdated Show resolved Hide resolved
torchgeo/datasets/mmearth.py Outdated Show resolved Hide resolved
torchgeo/datasets/mmearth.py Outdated Show resolved Hide resolved
@nilsleh
Copy link
Collaborator Author

nilsleh commented Oct 8, 2024

Data loader seems much more complicated than it needs to be. Does it satisfy all of Ando's requirements?

Yeah all the feedback I have gotten from him, I have incorporated. I agree it is complicated but it's also loading data from >10 modalities so not a typical single image/target dataset. He said he will try it now and give any other feedback.

@nilsleh
Copy link
Collaborator Author

nilsleh commented Oct 10, 2024

@ando-shah found a bug in my implementation. The available band names were different from the sample specific band name, for example in era5, they have the current or previous date of the sample included in the band name, so not a general name, and therefore it was not correctly retrieved.

torchgeo/datasets/mmearth.py Outdated Show resolved Hide resolved
root: root directory where dataset can be found
subset: one of "MMEarth", "MMEarth64", or "MMEarth100k"
modalities: list of modalities to load
modality_bands: dictionary of modality bands, see `all_modality_bands`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, all_modality_bands isn't documented. Should it be?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to point to the class variable all_modality_bands

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but all_modality_bands isn't documented: https://torchgeo--2202.org.readthedocs.build/en/2202/api/datasets.html#torchgeo.datasets.MMEarth

I know you and I would just look at the source code instead of the docs, but most users will only ever look at the docs, and currently this references a variable that isn't documented.

If you want to document it, you can use something like:

#: List of valid modality bands
all_modality_bands = ...

Then you could link to it like:

modality_bands: dictionary of modality bands, see :attr:`all_modality_bands`

torchgeo/datasets/mmearth.py Outdated Show resolved Hide resolved
@nilsleh
Copy link
Collaborator Author

nilsleh commented Oct 10, 2024

@adamjstewart @ando-shah while trying to write a subclass based on a metadata file, an annoying part is the image_{} and mask_{} naming for the sample keys, because you need to remove them again to match the meta data naming of the modalities. So since there isn't a torchgeo trainer that works out of the box on this dataset anyways, maybe we can just return the modality name itself? Just wondering if it potentially creates more confusion to have a chosen set of prefixes for the modalities, when some could be either etc.

@adamjstewart
Copy link
Collaborator

maybe we can just return the modality name itself?

Then kornia.augmentation.AugmentationSequential will no longer work.

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants