Add splitting utilities for GeoDatasets #866

Merged
merged 51 commits into microsoft:main on Feb 21, 2023

Conversation

pmandiola
Contributor

This PR adds the ability to create train/val/test RasterDatasets from the same underlying raster files, inspired by functionality in the rastervision package.

An extent_crop parameter is added to RasterDataset to indicate the portion of each raster file to be skipped. This way, one could create a training RasterDataset using only the top 80% of each file by specifying extent_crop=(0.2, 0, 0, 0), and a test RasterDataset using the remaining 20% with extent_crop=(0, 0, 0.8, 0).

I got a bit stuck on how to implement a test for this, but before figuring that out I wanted to check whether this approach looks OK to you.
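
A minimal sketch of the intended usage (MyRasterDataset is a hypothetical RasterDataset subclass pointed at the same files; the (top, left, bottom, right) ordering of the fractions follows the example above):

# Hypothetical subclass of RasterDataset pointing at the same raster files
train_ds = MyRasterDataset(root="data/", extent_crop=(0.2, 0, 0, 0))  # keep top 80%
test_ds = MyRasterDataset(root="data/", extent_crop=(0, 0, 0.8, 0))   # keep bottom 20%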

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Oct 24, 2022
@adamjstewart
Collaborator

This feature serves the same purpose as GeoSampler's roi parameter. The only difference is that roi has to be in geospatial coordinates.

The reason I originally chose to put this parameter in the sampler instead of the dataset was to save time. Instantiating a dataset is fairly slow (you need to recursively search the filesystem for files), so instantiating 1 dataset and 3 samplers is much faster than instantiating 3 datasets.
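
Roughly, the existing sampler-level approach looks like this (a sketch only; the coordinates and sampler arguments below are made up for illustration):

from torchgeo.datasets import BoundingBox
from torchgeo.samplers import RandomGeoSampler

# One dataset, several samplers restricted to different geospatial regions
train_roi = BoundingBox(0, 80, 0, 100, 0, 1)   # hypothetical coordinates
test_roi = BoundingBox(80, 100, 0, 100, 0, 1)

train_sampler = RandomGeoSampler(dataset, size=256, length=1000, roi=train_roi)
test_sampler = RandomGeoSampler(dataset, size=256, length=200, roi=test_roi)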

That doesn't mean our current design is ideal, and I agree we at least need an easier way to split a dataset without knowledge of geospatial metadata. In addition to the split strategy you propose here, I would also love a way to overlay a grid over a dataset and randomly distribute cells to different splits (see Figure 2 in https://arxiv.org/abs/1805.02855 for an example of what I mean).

I feel like we should add a torchgeo.utils directory with utilities for splitting and joining datasets. This somewhat relates to #30. I think we'll find that there are too many ways to split a dataset and a single parameter in either the dataset or sampler won't cut it. What do you think?

@pmandiola
Contributor Author

pmandiola commented Oct 24, 2022

One big difference with the GeoSampler's roi arises when the dataset has non-contiguous rasters. In our specific problem, we have one raster per city, so all the raster files in the RasterDataset are non-contiguous. Using one roi for the full dataset through the GeoSampler doesn't make sense in this case, or at least I wouldn't know how to make it work.

With this method, when the RasterDataset is loaded, each file is cropped by reducing the bounds that get added to the index. So for our specific case with cities, we can easily divide each raster/city into train/val/test datasets. Also, not needing the CRS beforehand is a big advantage.

I don't know where it would be best to put all these splitting utilities, but I think doing it at the dataset level gives you more flexibility. The GeoSampler's roi parameter could be added to the RasterDataset, but I don't think it's possible to do the same as what we are doing here.

@pmandiola
Contributor Author

pmandiola commented Oct 26, 2022

@adamjstewart I used the same original logic for splitting and turned it into a function that could go into utils. What do you think of this approach? It might be possible to do the same with a GeoSampler by splitting its underlying dataset.

from copy import deepcopy
from typing import Optional, Tuple

import numpy as np
from rtree.index import Index, Property

from torchgeo.datasets import BoundingBox, GeoDataset


def train_test_split(
    dataset: GeoDataset,
    test_size: float = 0.25,
    random_seed: Optional[int] = None,
) -> Tuple[GeoDataset, GeoDataset]:

    assert 0 < test_size < 1

    np.random.seed(random_seed)

    # Build a separate spatial index for each split
    index_train = Index(interleaved=False, properties=Property(dimension=3))
    index_test = Index(interleaved=False, properties=Property(dimension=3))

    for i, hit in enumerate(
        dataset.index.intersection(dataset.index.bounds, objects=True)
    ):
        box = BoundingBox(*hit.bounds)
        # Randomly choose the split axis and which side becomes train/test
        horizontal, flip = np.random.randint(2, size=2)
        if flip:
            box_train, box_test = box.split(1 - test_size, horizontal)
        else:
            box_test, box_train = box.split(test_size, horizontal)
        index_train.insert(i, tuple(box_train))
        index_test.insert(i, tuple(box_test))

    # Copy the dataset and swap in the reduced indexes
    dataset_train = deepcopy(dataset)
    dataset_train.index = index_train
    dataset_test = deepcopy(dataset)
    dataset_test.index = index_test

    return dataset_train, dataset_test
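
Usage would look something like this (the proportion and seed are arbitrary):

train_ds, test_ds = train_test_split(dataset, test_size=0.2, random_seed=0)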

Also added a split() method to BoundingBox:

    def split(
        self,
        proportion: float,
        horizontal: bool = True,
    ) -> Tuple["BoundingBox", "BoundingBox"]:
        """Split BoundingBox in two.

        Args:
            proportion: split proportion in the range (0, 1)
            horizontal: whether the split is horizontal (True) or
                vertical (False)

        Returns:
            A tuple with the two resulting BoundingBoxes
        """
        if horizontal:
            w = self.maxx - self.minx
            splitx = self.minx + w * proportion
            bbox1 = BoundingBox(self.minx, splitx, self.miny, self.maxy, self.mint, self.maxt)
            bbox2 = BoundingBox(splitx, self.maxx, self.miny, self.maxy, self.mint, self.maxt)
        else:
            h = self.maxy - self.miny
            splity = self.miny + h * proportion
            bbox1 = BoundingBox(self.minx, self.maxx, self.miny, splity, self.mint, self.maxt)
            bbox2 = BoundingBox(self.minx, self.maxx, splity, self.maxy, self.mint, self.maxt)

        return bbox1, bbox2

@adamjstewart
Collaborator

I'm currently preparing for a conference, but I'll try to look at this in more detail next week. I'll have to decide whether splitting should happen at the dataset or sampler level.

@adamjstewart adamjstewart added the utilities Utilities for working with geospatial data label Oct 27, 2022
@adamjstewart adamjstewart added this to the 0.4.0 milestone Oct 27, 2022
@pmandiola
Contributor Author

Hi @adamjstewart, any thoughts on this dataset/sampler splitting strategy? Let me know what you think so I can finalize it or adapt it to work for samplers instead.

One pro of splitting at the dataset level is that later one could use different sampling strategies for train/val/test, like using a RandomGeoSampler for training and a GridGeoSampler for validation and testing.
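
For illustration, a minimal sketch of that workflow (assuming the train_test_split function from this PR and torchgeo's existing samplers; the size/length/stride values are arbitrary):

from torch.utils.data import DataLoader

from torchgeo.datasets import stack_samples
from torchgeo.samplers import GridGeoSampler, RandomGeoSampler

# Split the dataset first, then sample each split differently
train_ds, val_ds = train_test_split(dataset, test_size=0.2, random_seed=0)

train_sampler = RandomGeoSampler(train_ds, size=256, length=1000)
val_sampler = GridGeoSampler(val_ds, size=256, stride=256)

train_loader = DataLoader(train_ds, sampler=train_sampler, collate_fn=stack_samples)
val_loader = DataLoader(val_ds, sampler=val_sampler, collate_fn=stack_samples)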

@adamjstewart
Collaborator

Apologies for taking so long to get back to this. Yes, this implementation is looking much better now!

Note that we also have some splitting utilities in torchgeo.datamodules.utils. We may want to consider moving both of these to a shared file, which could be torchgeo.datasets.splits, torchgeo.utils.splits, or wherever. That would help us avoid that nasty circular import you discovered. We'll also need to be careful to ensure that names are specific enough to avoid overlap with future splitting utilities.

At this point, I think the exact implementation details are less of a blocker than the naming and file location. I think the best path forward would be to sit down and brainstorm all potential ways that someone might want to split our datasets, and compare them to the splitting utilities in other libraries. I'm just trying to avoid adding functions here and then moving or renaming them later. We should also consider whether we want to merge the sampler's roi functionality into this. I've seen a lot of users find sampler splitting less intuitive than dataset splitting, simply because torchvision has no samplers and everything must be done at the dataset level.

Sorry this isn't a quick and straightforward addition. This is how it always is when you want to make a big change. Adding a new dataset is routine, but adding a new way of interacting with datasets requires a lot of thought and careful design.

@pmandiola
Contributor Author

No worries! I've been kind of busy, but now I have some time to put into this.

I hadn't seen the dataset_split function in torchgeo.datamodules.utils before, thanks for pointing that out. I think it makes sense to move both of these to something like torchgeo.datasets.splits.

What do you think about merging both into one random_split function and using the dataset type to decide how to do the split? We could use the same function naming as torch's random_split (or maybe that's not a good idea, not sure), and we could also use similar logic for specifying the length/proportion of each dataset.

For the sampler's roi functionality, we could add another function in the same file and call it something like roi_split. The inputs could be a dataset and a list of ROIs (instead of proportions), and the function would return the resulting datasets.

Let me know what you think.

@adamjstewart
Collaborator

What do you think about merging both into one random_split function and use the dataset type to decide how to do the split?

I'm worried about this because in the case of GeoDataset there are several different ways you might want to split the data. So I don't think having a single function will work. That's why we need to enumerate all possible splitting strategies so we can decide on function names.

@pmandiola
Contributor Author

pmandiola commented Dec 19, 2022

Ok, I see your point. Let's list them then and see how we can move forward. I'll start with some we have discussed here:

  1. Random split for NonGeoDatasets. This is the usual way of splitting a torch dataset, currently implemented in the dataset_split function in torchgeo.datamodules.utils.

  2. Randomly split each bounding box in a GeoDataset's index. This is one way of randomly splitting a GeoDataset, as implemented in this PR.

  3. Randomly assign each bounding box in a GeoDataset's index to different splits. A very simple way of randomly splitting a GeoDataset; not implemented.

  4. Overlay a grid over each bounding box in a GeoDataset's index and randomly distribute cells to different splits. As described in https://arxiv.org/abs/1805.02855; not implemented.

  5. Split a GeoDataset by defining the ROI for each new dataset. Similar to what is now implemented with the GeoSampler's roi parameter, but in a separate function allowing multiple ROIs to be defined for multiple resulting GeoDatasets.

  6. Split a GeoDataset by its time dimension. Not sure if time series forecasting is already implemented in torchgeo, but this would be useful for doing it.

@pmandiola pmandiola changed the title Reuse raster files in different train/val/test RasterDatasets Add splitting utilities for GeoDatasets Dec 19, 2022
@pmandiola
Contributor Author

I went ahead and rewrote the definitions in a more functional form and included a proposal for the function names. I think it would make a lot of sense to have these in a new torchgeo.datasets.splits file.

What are your thoughts @adamjstewart? Any other strategy you would add at this stage?

Regular splits:

random_nongeo_split: randomly splits a NonGeoDataset into non-overlapping new NonGeoDatasets.

Spatial splits:

random_bbox_assignment: randomly assigns each BoundingBox in a GeoDataset's index to new GeoDatasets.

random_bbox_splitting: splits each BoundingBox in a GeoDataset's index and randomly assigns each part to new GeoDatasets.

random_grid_cell_assignment: overlays a grid over each BoundingBox in a GeoDataset's index and randomly assigns each cell to new GeoDatasets.

roi_split: intersects a ROI with a GeoDataset for each desired new GeoDataset.

Time splits:

time_series_split: splits a GeoDataset along the time dimension into new GeoDatasets.
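
As a rough sketch of how these might be called (the signatures are still tentative at this point; the fractions and ROIs below are only illustrative):

from torchgeo.datasets import BoundingBox

# Tentative: assign whole bounding boxes to splits
train_ds, val_ds, test_ds = random_bbox_assignment(dataset, lengths=[0.7, 0.2, 0.1])

# Tentative: carve each bounding box into per-split pieces
train_ds, test_ds = random_bbox_splitting(dataset, fractions=[0.8, 0.2])

# Tentative: split by explicit regions of interest (hypothetical coordinates)
train_roi = BoundingBox(0, 80, 0, 100, 0, 1)
test_roi = BoundingBox(80, 100, 0, 100, 0, 1)
train_ds, test_ds = roi_split(dataset, rois=[train_roi, test_roi])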

@adamjstewart
Collaborator

Those names sound good to me. How do you propose these splitters work? Would they accept a dataset as input and create multiple datasets? Or would they be arguments to a sampler?

@pmandiola
Contributor Author

I'm thinking of a function that takes a dataset as an argument and returns multiple datasets, like the train_test_split function I wrote in this PR.

I'm not sure whether to ask for val_pct and train_pct parameters like in the current torchgeo.datamodules.utils.dataset_split implementation, or to move to a list of lengths like in torch's random_split.

@adamjstewart
Collaborator

The latest version of random_split accepts both a list of lengths and a list of percentages. I would like to do the same, both for consistency with PyTorch and because it's more useful this way.
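
For reference, a quick sketch of what that looks like with torch.utils.data.random_split (fractional lengths are supported in recent PyTorch releases):

from torch.utils.data import random_split

# Fractions that sum to 1
train, val, test = random_split(range(100), [0.7, 0.2, 0.1])

# Or absolute lengths
train, val, test = random_split(range(100), [70, 20, 10])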

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 15, 2023
Collaborator

@adamjstewart adamjstewart left a comment


There's a ton of tedious, error-prone code required here (a couple of releases back we fixed a ton of floating point rounding issues for stuff like this), but your unit tests are so thorough that they give me confidence things are likely bug-free. Nice job!

@calebrob6 is planning on drawing up some pictures to illustrate what each splitter does in a follow-up PR, should help with the docstrings.

Most of my review comments are clarification questions or type hint suggestions; I couldn't actually find any bugs or missing features.

@adamjstewart adamjstewart added this to the 0.5.0 milestone Feb 19, 2023
Collaborator

@nilsleh nilsleh left a comment


I think this is a super convenient utility, thanks a lot @pmandiola! Looking ahead to the WIP in #877, there are likely to be some other changes/adaptations, since we have not settled on an implementation for time series support. However, I think this is a great starting point, and it makes it convenient to create time series splits from raster data, which is harder to inspect than a dataframe you can simply slice. Another case important for spatio-temporal data, looking ahead, would be creating splits across both time and geospatial location.

With respect to your question about time series analysis and rolling windows, @adamjstewart, I have made some comments in the time series support PR, so I think we can discuss that there.

@pmandiola
Contributor Author

Thanks @adamjstewart and @nilsleh for your reviews! I think I addressed most of your comments, but I'm happy to change anything else if needed.

@adamjstewart adamjstewart merged commit ceeec81 into microsoft:main Feb 21, 2023
@adamjstewart
Collaborator

Thanks again for the hard work on this! Sorry it took so long to review!

@pmandiola
Contributor Author

No worries, happy to help!

yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* add extent_crop to BoundingBox

* add extent_crop param to RasterDataset

* train_test_split function

* minor changes

* fix circular import

* remove extent_crop

* move existing functions to new file

* refactor random_nongeo_split

* refactor random_bbox_splitting

* add roi_split

* add random_bbox_assignment

* add input checks

* fix input type

* minor reorder

* add tests

* add non-overlapping test

* more tests

* fix tests

* additional tests

* check overlapping rois

* add time_series_split with tests

* fix random_nongeo_split to work with fractions in torch 1.9

* modify random_nongeo_split test for coverage

* add random_grid_cell_assignment with tests

* add test

* insert object into new indexes

* check grid_size

* better tests

* small type fix

* fix again

* rm .DS_Store

* fix typo

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* bump version added

* add to __init__

* add to datasets.rst

* use accumulate from itertools

* clarify grid_size

* remove random_nongeo_split

* remove _create_geodataset_like

* black reformatting

* Update tests/datasets/test_splits.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* change import

* docstrings

* undo intersection change

* use microsecond

* use isclose

* black

* fix typing

* add comments

---------

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
@adamjstewart
Collaborator

Some of our users have commented that the documentation isn't clear enough about the differences between some of these splitting utilities, and that we don't provide recommendations on which splitting utility to use in which situation.

@calebrob6 I know you're passionate about this. Any interest/time in making diagrams to explain the differences between these functions and adding some text to explain how to choose which one is right for you?
