Add splitting utilities for GeoDatasets #866

Merged
merged 51 commits into microsoft:main on Feb 21, 2023

Conversation

pmandiola
Contributor

This PR adds the ability to create train/val/test RasterDatasets from the same underlying raster files, inspired by functionality in the rastervision package.

An extent_crop parameter is added to RasterDataset to indicate the portion of each raster file to be skipped. This way, one could create a training RasterDataset using only the top 80% of each file by specifying extent_crop=(0.2, 0, 0, 0), and a test RasterDataset using the remaining 20% with extent_crop=(0, 0, 0.8, 0).

I got a bit stuck on how to implement a test for this, but before figuring that out I wanted to check whether this approach looks OK to you.
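
A minimal sketch of the intended usage (MyRasterDataset is a hypothetical RasterDataset subclass pointed at the same files; the (top, left, bottom, right) ordering of the fractions follows the example above):

# Hypothetical subclass of RasterDataset pointing at the same raster files
train_ds = MyRasterDataset(root="data/", extent_crop=(0.2, 0, 0, 0))  # keep top 80%
test_ds = MyRasterDataset(root="data/", extent_crop=(0, 0, 0.8, 0))   # keep bottom 20%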

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Oct 24, 2022
@adamjstewart
Collaborator

This feature serves the same purpose as GeoSampler's roi parameter. The only difference is that roi has to be in geospatial coordinates.

The reason I originally chose to put this parameter in the sampler instead of the dataset was to save time. Instantiating a dataset is fairly slow (you need to recursively search the filesystem for files), so instantiating 1 dataset and 3 samplers is much faster than instantiating 3 datasets.
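
Roughly, the existing sampler-level approach looks like this (a sketch only; the coordinates and sampler arguments below are made up for illustration):

from torchgeo.datasets import BoundingBox
from torchgeo.samplers import RandomGeoSampler

# One dataset, several samplers restricted to different geospatial regions
train_roi = BoundingBox(0, 80, 0, 100, 0, 1)   # hypothetical coordinates
test_roi = BoundingBox(80, 100, 0, 100, 0, 1)

train_sampler = RandomGeoSampler(dataset, size=256, length=1000, roi=train_roi)
test_sampler = RandomGeoSampler(dataset, size=256, length=200, roi=test_roi)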

That doesn't mean our current design is ideal, and I agree we at least need an easier way to split a dataset without knowledge of geospatial metadata. In addition to the split strategy you propose here, I would also love a way to overlay a grid over a dataset and randomly distribute cells to different splits (see Figure 2 in https://arxiv.org/abs/1805.02855 for an example of what I mean).

I feel like we should add a torchgeo.utils directory with utilities for splitting and joining datasets. This somewhat relates to #30. I think we'll find that there are too many ways to split a dataset and a single parameter in either the dataset or sampler won't cut it. What do you think?

@pmandiola
Contributor Author

pmandiola commented Oct 24, 2022

One big difference with the GeoSampler's roi arises when the dataset has non-contiguous rasters. In our specific problem, we have one raster per city, so all the raster files in the RasterDataset are non-contiguous. Using one roi for the full dataset through the GeoSampler doesn't make sense in this case, or at least I wouldn't know how to make it work.

With this method, when the RasterDataset is loaded, each file is cropped by reducing the bounds that get added to the index. So for our specific case with cities, we can easily divide each raster/city into train/val/test datasets. Also, not needing the CRS beforehand is a big advantage.

I don't know where it would be best to put all these splitting utilities, but I think doing it at the dataset level gives you more flexibility. The GeoSampler's roi parameter could be added to the RasterDataset, but I don't think it's possible to do the same as what we are doing here.

@pmandiola
Contributor Author

pmandiola commented Oct 26, 2022

@adamjstewart I used the same original logic for splitting and turned it into a function that could go into utils. What do you think of this approach? It might be possible to do the same with a GeoSampler by splitting its underlying dataset.

from copy import deepcopy
from typing import Optional, Tuple

import numpy as np
from rtree.index import Index, Property

from torchgeo.datasets import BoundingBox, GeoDataset


def train_test_split(
    dataset: GeoDataset,
    test_size: float = 0.25,
    random_seed: Optional[int] = None,
) -> Tuple[GeoDataset, GeoDataset]:

    assert 0 < test_size < 1

    np.random.seed(random_seed)

    # Build a separate spatial index for each split
    index_train = Index(interleaved=False, properties=Property(dimension=3))
    index_test = Index(interleaved=False, properties=Property(dimension=3))

    for i, hit in enumerate(
        dataset.index.intersection(dataset.index.bounds, objects=True)
    ):
        box = BoundingBox(*hit.bounds)
        # Randomly choose the split axis and which side becomes train/test
        horizontal, flip = np.random.randint(2, size=2)
        if flip:
            box_train, box_test = box.split(1 - test_size, horizontal)
        else:
            box_test, box_train = box.split(test_size, horizontal)
        index_train.insert(i, tuple(box_train))
        index_test.insert(i, tuple(box_test))

    # Copy the dataset and swap in the reduced indexes
    dataset_train = deepcopy(dataset)
    dataset_train.index = index_train
    dataset_test = deepcopy(dataset)
    dataset_test.index = index_test

    return dataset_train, dataset_test
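
Usage would look something like this (the proportion and seed are arbitrary):

train_ds, test_ds = train_test_split(dataset, test_size=0.2, random_seed=0)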

Also added a split() method to BoundingBox:

    def split(
        self,
        proportion: float,
        horizontal: bool = True,
    ) -> Tuple["BoundingBox", "BoundingBox"]:
        """Split BoundingBox in two.

        Args:
            proportion: split proportion in the range (0, 1)
            horizontal: whether the split is horizontal (True) or
                vertical (False)

        Returns:
            A tuple with the two resulting BoundingBoxes
        """
        if horizontal:
            w = self.maxx - self.minx
            splitx = self.minx + w * proportion
            bbox1 = BoundingBox(self.minx, splitx, self.miny, self.maxy, self.mint, self.maxt)
            bbox2 = BoundingBox(splitx, self.maxx, self.miny, self.maxy, self.mint, self.maxt)
        else:
            h = self.maxy - self.miny
            splity = self.miny + h * proportion
            bbox1 = BoundingBox(self.minx, self.maxx, self.miny, splity, self.mint, self.maxt)
            bbox2 = BoundingBox(self.minx, self.maxx, splity, self.maxy, self.mint, self.maxt)

        return bbox1, bbox2

@adamjstewart
Collaborator

I'm currently preparing for a conference, but I'll try to look at this in more detail next week. I'll have to decide whether splitting should happen at the dataset or sampler level.

@adamjstewart adamjstewart added the utilities Utilities for working with geospatial data label Oct 27, 2022
@adamjstewart adamjstewart added this to the 0.4.0 milestone Oct 27, 2022
@pmandiola
Contributor Author

Hi @adamjstewart, any thoughts on this dataset/sampler splitting strategy? Let me know what you think so I can finalize it or adapt it to work for samplers instead.

One pro of splitting at the dataset level is that later one could use different sampling strategies for train/val/test, like using a RandomGeoSampler for training and a GridGeoSampler for validation and testing.
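
For illustration, a minimal sketch of that workflow (assuming the train_test_split function from this PR and torchgeo's existing samplers; the size/length/stride values are arbitrary):

from torch.utils.data import DataLoader

from torchgeo.datasets import stack_samples
from torchgeo.samplers import GridGeoSampler, RandomGeoSampler

# Split the dataset first, then sample each split differently
train_ds, val_ds = train_test_split(dataset, test_size=0.2, random_seed=0)

train_sampler = RandomGeoSampler(train_ds, size=256, length=1000)
val_sampler = GridGeoSampler(val_ds, size=256, stride=256)

train_loader = DataLoader(train_ds, sampler=train_sampler, collate_fn=stack_samples)
val_loader = DataLoader(val_ds, sampler=val_sampler, collate_fn=stack_samples)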

@adamjstewart
Collaborator

Apologies for taking so long to get back to this. Yes, this implementation is looking much better now!

Note that we also have some splitting utilities in torchgeo.datamodules.utils. We may want to consider moving both of these to a shared file, which could be torchgeo.datasets.splits, torchgeo.utils.splits, or wherever. That would help us avoid that nasty circular import you discovered. We'll also need to be careful to ensure that names are specific enough to avoid overlap with future splitting utilities.

At this point, I think the exact implementation details are less of a blocker than the naming and file location. I think the best path forward would be to sit down and brainstorm all potential ways that someone might want to split our datasets, and compare them to the splitting utilities in other libraries. I'm just trying to avoid adding functions here and then moving or renaming them later. We should also consider whether we want to merge the sampler's roi functionality into this. I've seen a lot of users find sampler splitting less intuitive than dataset splitting, simply because torchvision has no samplers and everything must be done at the dataset level.

Sorry this isn't a quick and straightforward addition. This is how it always is when you want to make a big change. Adding a new dataset is routine, but adding a new way of interacting with datasets requires a lot of thought and careful design.

@pmandiola
Contributor Author

No worries! I've been kind of busy, but now I have some time to put into this.

I hadn't seen the dataset_split function in torchgeo.datamodules.utils before, thanks for pointing that out. I think it makes sense to move both of these to something like torchgeo.datasets.splits.

What do you think about merging both into one random_split function and using the dataset type to decide how to do the split? We could use the same function naming as torch's random_split (or maybe that's not a good idea, not sure), and we could also use similar logic for specifying the length/proportion of each dataset.

For the sampler's roi functionality, we could add another function in the same file and call it something like roi_split. The inputs could be a dataset and a list of ROIs (instead of proportions), and the function would return the resulting datasets.

Let me know what you think.

@adamjstewart
Collaborator

What do you think about merging both into one random_split function and use the dataset type to decide how to do the split?

I'm worried about this because in the case of GeoDataset there are several different ways you might want to split the data. So I don't think having a single function will work. That's why we need to enumerate all possible splitting strategies so we can decide on function names.

@pmandiola
Contributor Author

pmandiola commented Dec 19, 2022

Ok, I see your point. Let's list them then and see how we can move forward. I'll start with some we have discussed here:

  1. Random split for NonGeoDatasets. This is the usual way of splitting a torch dataset, currently implemented in the dataset_split function in torchgeo.datamodules.utils.

  2. Randomly split each bounding box in a GeoDataset's index. This is one way of randomly splitting a GeoDataset, as implemented in this PR.

  3. Randomly assign each bounding box in a GeoDataset's index to different splits. A very simple way of randomly splitting a GeoDataset; not implemented.

  4. Overlay a grid over each bounding box in a GeoDataset's index and randomly distribute cells to different splits. As described in https://arxiv.org/abs/1805.02855; not implemented.

  5. Split a GeoDataset by defining the ROI for each new dataset. Similar to what is now implemented with the GeoSampler's roi parameter, but in a separate function allowing multiple ROIs to be defined for multiple resulting GeoDatasets.

  6. Split a GeoDataset by its time dimension. Not sure if time series forecasting is already implemented in torchgeo, but this would be useful for doing it.

@pmandiola pmandiola changed the title Reuse raster files in different train/val/test RasterDatasets Add splitting utilities for GeoDatasets Dec 19, 2022
@pmandiola
Contributor Author

I went ahead and rewrote the definitions in a more functional form and included a proposal for the function names. I think it would make a lot of sense to have these in a new torchgeo.datasets.splits file.

What are your thoughts @adamjstewart? Any other strategy you would add at this stage?

Regular splits:

random_nongeo_split: randomly splits a NonGeoDataset into non-overlapping new NonGeoDatasets.

Spatial splits:

random_bbox_assignment: randomly assigns each BoundingBox in a GeoDataset's index to new GeoDatasets.

random_bbox_splitting: splits each BoundingBox in a GeoDataset's index and randomly assigns each part to new GeoDatasets.

random_grid_cell_assignment: overlays a grid over each BoundingBox in a GeoDataset's index and randomly assigns each cell to new GeoDatasets.

roi_split: intersects a ROI with a GeoDataset for each desired new GeoDataset.

Time splits:

time_series_split: splits a GeoDataset along the time dimension into new GeoDatasets.
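
As a rough sketch of how these might be called (the signatures are still tentative at this point; the fractions and ROIs below are only illustrative):

from torchgeo.datasets import BoundingBox

# Tentative: assign whole bounding boxes to splits
train_ds, val_ds, test_ds = random_bbox_assignment(dataset, lengths=[0.7, 0.2, 0.1])

# Tentative: carve each bounding box into per-split pieces
train_ds, test_ds = random_bbox_splitting(dataset, fractions=[0.8, 0.2])

# Tentative: split by explicit regions of interest (hypothetical coordinates)
train_roi = BoundingBox(0, 80, 0, 100, 0, 1)
test_roi = BoundingBox(80, 100, 0, 100, 0, 1)
train_ds, test_ds = roi_split(dataset, rois=[train_roi, test_roi])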

@adamjstewart
Collaborator

Those names sound good to me. How do you propose these splitters work? Would they accept a dataset as input and create multiple datasets? Or would they be arguments to a sampler?

@pmandiola
Contributor Author

I'm thinking of a function that takes a dataset as an argument and returns multiple datasets, like the train_test_split function I wrote in this PR.

I'm not sure whether to ask for val_pct and train_pct parameters like in the current torchgeo.datamodules.utils.dataset_split implementation, or to move to a list of lengths like in torch's random_split.

@adamjstewart
Collaborator

The latest version of random_split accepts both a list of lengths and a list of percentages. I would like to do the same, both for consistency with PyTorch and because it's more useful this way.
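
For reference, a quick sketch of what that looks like with torch.utils.data.random_split (fractional lengths are supported in recent PyTorch releases):

from torch.utils.data import random_split

# Fractions that sum to 1
train, val, test = random_split(range(100), [0.7, 0.2, 0.1])

# Or absolute lengths
train, val, test = random_split(range(100), [70, 20, 10])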

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 15, 2023
Collaborator

@adamjstewart adamjstewart left a comment


There's a ton of tedious, error-prone code required here (a couple of releases back we fixed a ton of floating point rounding issues for stuff like this), but your unit tests are so thorough that they give me confidence things are likely bug-free. Nice job!

@calebrob6 is planning on drawing up some pictures to illustrate what each splitter does in a follow-up PR, should help with the docstrings.

Most of my review comments are clarification questions or type hint suggestions; I couldn't actually find any bugs or missing features.

@adamjstewart adamjstewart added this to the 0.5.0 milestone Feb 19, 2023
Collaborator

@nilsleh nilsleh left a comment


I think this is a super convenient utility, thanks a lot @pmandiola! Looking ahead to the WIP in #877, there are likely to be some other changes/adaptations, since we have not settled on an implementation for time series support. However, I think this is a great starting point, and it makes it convenient to create time series splits from raster data, which is harder to inspect than a dataframe you can simply slice. Another case important for spatio-temporal data, looking ahead, would be creating splits across both time and geospatial location.

With respect to your question about time series analysis and rolling windows, @adamjstewart, I have made some comments in the time series support PR, so I think we can discuss that there.

@pmandiola
Contributor Author

Thanks @adamjstewart and @nilsleh for your reviews! I think I addressed most of your comments, but I'm happy to change anything else if needed.

@adamjstewart adamjstewart merged commit ceeec81 into microsoft:main Feb 21, 2023
@adamjstewart
Collaborator

Thanks again for the hard work on this! Sorry it took so long to review!

@pmandiola
Contributor Author

No worries, happy to help!

yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* add extent_crop to BoundingBox

* add extent_crop param to RasterDataset

* train_test_split function

* minor changes

* fix circular import

* remove extent_crop

* move existing functions to new file

* refactor random_nongeo_split

* refactor random_bbox_splitting

* add roi_split

* add random_bbox_assignment

* add input checks

* fix input type

* minor reorder

* add tests

* add non-overlapping test

* more tests

* fix tests

* additional tests

* check overlapping rois

* add time_series_split with tests

* fix random_nongeo_split to work with fractions in torch 1.9

* modify random_nongeo_split test for coverage

* add random_grid_cell_assignment with tests

* add test

* insert object into new indexes

* check grid_size

* better tests

* small type fix

* fix again

* rm .DS_Store

* fix typo

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* bump version added

* add to __init__

* add to datasets.rst

* use accumulate from itertools

* clarify grid_size

* remove random_nongeo_split

* remove _create_geodataset_like

* black reformatting

* Update tests/datasets/test_splits.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* change import

* docstrings

* undo intersection change

* use microsecond

* use isclose

* black

* fix typing

* add comments

---------

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
@adamjstewart
Collaborator

Some of our users have commented that the documentation isn't clear enough about the differences between some of these splitting utilities, and that we don't provide recommendations on which splitting utility to use in which situation.

@calebrob6 I know you're passionate about this. Any interest/time in making diagrams to explain the differences between these functions and adding some text to explain how to choose which one is right for you?
