Allow GeoDataset to list files in VSI path(s) #1399

Open
wants to merge 70 commits into base: main
Changes from 11 commits
72702ef
Make RasterDataset accept list of files
Jun 22, 2023
b68dd5a
Fix check if str
adriantre Jun 22, 2023
e90d01c
Use isdir and isfile
adriantre Jun 23, 2023
dfad079
Add kwarg vsi to RasterDataset to support GDAL VSI
adriantre Jun 5, 2023
7291f3e
Fix formatting
adriantre Jun 5, 2023
6b41f18
Add type hints and docstring to method in utils
adriantre Jun 5, 2023
2b2be02
Fix missing import List
adriantre Jun 5, 2023
3f91e97
Fix type hints
adriantre Jun 5, 2023
f0d9475
Refactor with respect to other branch
adriantre Jun 22, 2023
7ca3cb7
Make try-catch more targeted
adriantre Jun 23, 2023
d519061
Remove redundant iglob usage
adriantre Jun 23, 2023
a841fb7
Merge main
adriantre Aug 6, 2024
2ee3c85
Remove unused imports
adriantre Aug 6, 2024
5c6e444
Add wildcard for directories
adriantre Aug 8, 2024
2da9c4a
Allow vsi files to not exist
adriantre Aug 8, 2024
e79061f
Make protected method public
adriantre Aug 8, 2024
00594de
Add zipped dataset to test vsi listdir
adriantre Aug 8, 2024
9ef7669
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 8, 2024
8a61108
Update fiona version in min-reqs.old
adriantre Aug 8, 2024
e8367ca
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 12, 2024
014e5f5
Update docstring of list_directory_recursive
adriantre Aug 12, 2024
0bf5d68
Set fiona version in min-reqs.old
adriantre Aug 12, 2024
41e3dfb
Remove redundant path exists
adriantre Aug 12, 2024
56b1c76
Remove duplicated import
adriantre Aug 12, 2024
e96cfa9
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 12, 2024
d03120d
Create temp archive for test
adriantre Aug 12, 2024
dfc571e
Add docstring to listdir_vsi_recursive
adriantre Aug 12, 2024
aaea729
Change list_directory_recursive return type
adriantre Aug 12, 2024
740dcec
Bump fiona version in pyproject.toml
adriantre Aug 12, 2024
ba014af
Introduce fixture temp_archive for reuse in tests
adriantre Aug 12, 2024
e00b356
Make GeoDataset.files warn if VSI does not exist
adriantre Aug 12, 2024
65ef978
Replace failing test due to new behaviour of vsi
adriantre Aug 12, 2024
5192acd
Fix breaking test due to os.path.join not working on zip parent dir (…
adriantre Aug 12, 2024
9f800f1
Collect test on zip-archive in test class
adriantre Aug 12, 2024
93de375
Update versionadded in dataset/utils.py
adriantre Aug 12, 2024
37c4980
Remove patch from fiona version
adriantre Aug 12, 2024
395b4bf
Add .DS_Store to gitignore
adriantre Aug 13, 2024
6f041ee
Add test for https/curl files
adriantre Aug 13, 2024
dc334d5
Check filname_glob before adding to files property
adriantre Aug 13, 2024
0b80c12
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 13, 2024
dbbcec7
Define should_warn outside if
adriantre Aug 13, 2024
0cc7e15
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 14, 2024
c990591
Merge branch 'refs/heads/main' into feature/support_gdal_virtual_file…
adriantre Aug 26, 2024
ccdbe9b
Try to support windows for vsi tests
adriantre Aug 26, 2024
a2dd0d6
Revert windows-specific path format
adriantre Aug 26, 2024
ecce011
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 26, 2024
08fde98
Skip TestVirtualFilesystems if platform is windows
adriantre Aug 27, 2024
26920ae
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 27, 2024
3baa587
Remove user-specific ignore from gitignore
adriantre Aug 27, 2024
be142aa
Properly remove changes from .gitignore
adriantre Aug 27, 2024
e7659a9
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 27, 2024
c88111e
Rename VSI to VFS where appropriate
adriantre Aug 28, 2024
cffd707
Format docstring of listdir_vfs_recursive
adriantre Aug 28, 2024
04740a2
Update torchgeo/datasets/utils.py
adriantre Aug 28, 2024
51a3d64
Update torchgeo/datasets/utils.py
adriantre Aug 28, 2024
7b33550
Don't use os.path.join within VFS
adriantre Aug 28, 2024
b81591d
Update torchgeo/datasets/utils.py
adriantre Aug 28, 2024
c786a57
String format wildcard instead of os.path.join
adriantre Aug 28, 2024
bb3ad78
Document raises
adriantre Aug 28, 2024
0824026
Dont use os.path.join for zip test
adriantre Aug 28, 2024
6a3d128
Update type of error in try except
adriantre Aug 28, 2024
8ea368f
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 28, 2024
0ad5db2
Simplify tests
adriantre Aug 28, 2024
87ea802
Simplify files property
adriantre Aug 28, 2024
94c3835
Remove unnecessary check in if
adriantre Aug 28, 2024
3ddee3a
Merge branch 'refs/heads/main' into feature/support_gdal_virtual_file…
adriantre Sep 6, 2024
46e2375
Temp archive into tmp_path
adriantre Sep 6, 2024
2d54680
Make utility funcitons non-public
adriantre Sep 6, 2024
d1d8a4a
Fix typo in comment
adriantre Sep 6, 2024
d22a594
Move VFS tests to TestGeoDataset
adriantre Sep 6, 2024
30 changes: 24 additions & 6 deletions torchgeo/datasets/geo.py
@@ -10,7 +10,7 @@
 import re
 import sys
 from collections.abc import Sequence
-from typing import Any, Callable, Optional, cast
+from typing import Any, Callable, Optional, Union, cast
 
 import fiona
 import fiona.transform
@@ -29,7 +29,13 @@
 from torchvision.datasets import ImageFolder
 from torchvision.datasets.folder import default_loader as pil_loader
 
-from .utils import BoundingBox, concat_samples, disambiguate_timestamp, merge_samples
+from .utils import (
+    BoundingBox,
+    concat_samples,
+    disambiguate_timestamp,
+    list_directory_recursive,
+    merge_samples,
+)
 
 
 class GeoDataset(Dataset[dict[str, Any]], abc.ABC):
@@ -329,7 +335,7 @@ def dtype(self) -> torch.dtype:
 
     def __init__(
         self,
-        root: str = "data",
+        root: Union[str, list[str]] = "data",
         crs: Optional[CRS] = None,
         res: Optional[float] = None,
         bands: Optional[Sequence[str]] = None,
@@ -339,7 +345,8 @@ def __init__(
         """Initialize a new Dataset instance.
 
         Args:
-            root: root directory where dataset can be found
+            root: root directory or list of absolute filepaths where
+                dataset can be found
             crs: :term:`coordinate reference system (CRS)` to warp to
                 (defaults to the CRS of the first file found)
             res: resolution of the dataset in units of CRS
@@ -358,11 +365,22 @@ def __init__(
         self.bands = bands or self.all_bands
         self.cache = cache
 
+        if isinstance(root, str):
+            root = [root]
+
+        filespaths: list[str] = []
+        for dir_or_file in root:
+            if os.path.exists(dir_or_file) and os.path.isfile(dir_or_file):
+                filespaths.append(dir_or_file)
+            else:
+                filespaths.extend(
+                    list_directory_recursive(dir_or_file, self.filename_glob)
+                )
+
         # Populate the dataset index
         i = 0
-        pathname = os.path.join(root, "**", self.filename_glob)
         filename_regex = re.compile(self.filename_regex, re.VERBOSE)
-        for filepath in glob.iglob(pathname, recursive=True):
+        for filepath in filespaths:
             match = re.match(filename_regex, os.path.basename(filepath))
             if match is not None:
                 try:
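The change to `__init__` above turns `root` into a list and expands each entry into file paths before the index is populated. A minimal standard-library sketch of that normalization step follows. `collect_files` is a hypothetical name used only for illustration, and the directory branch uses plain `glob` rather than the PR's `list_directory_recursive`, so it covers local paths only.

```python
import glob
import os
from typing import Union


def collect_files(root: Union[str, list[str]], filename_glob: str) -> list[str]:
    """Hypothetical helper: expand a root (or list of roots) into file paths."""
    # A single string is treated as a one-element list.
    roots = [root] if isinstance(root, str) else list(root)

    filepaths: list[str] = []
    for dir_or_file in roots:
        if os.path.isfile(dir_or_file):
            # Explicit files are taken as-is; the glob is not applied to them.
            filepaths.append(dir_or_file)
        else:
            # Directories are searched recursively for matching file names.
            pathname = os.path.join(dir_or_file, "**", filename_glob)
            filepaths.extend(glob.iglob(pathname, recursive=True))
    return filepaths
```

Note that in this sketch, as in the diff, an explicitly listed file skips the glob; it is only filtered later, when its basename is matched against `filename_regex`.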
50 changes: 50 additions & 0 deletions torchgeo/datasets/utils.py
@@ -9,6 +9,8 @@
 import bz2
 import collections
 import contextlib
+import fnmatch
+import glob
 import gzip
 import lzma
 import os
@@ -19,9 +21,11 @@
 from datetime import datetime, timedelta
 from typing import Any, cast, overload
 
+import fiona
 import numpy as np
 import rasterio
 import torch
+from fiona.errors import FionaValueError
 from torch import Tensor
 from torchvision.datasets.utils import check_integrity, download_url
 from torchvision.utils import draw_segmentation_masks
@@ -43,6 +47,7 @@
     "draw_semantic_segmentation_masks",
     "rgb_to_mask",
     "percentile_normalization",
+    "list_directory_recursive",
 )
 
 
@@ -737,3 +742,48 @@ def percentile_normalization(
         (img - lower_percentile) / (upper_percentile - lower_percentile + 1e-5), 0, 1
     )
     return img_normalized
+
+
+def _path_is_vsi(path: str) -> bool:
+    from rasterio._path import SCHEMES
+
+    prefix = path.split("://")[0]
+    schemes = prefix.split("+")
+    is_apache_vfs_scheme = set(schemes).issubset(set(SCHEMES))
+    is_gdal_vsi = path.startswith("/vsi")
+    return is_gdal_vsi or is_apache_vfs_scheme
+
+
+def _listdir_vsi_recursive(root: str) -> list[str]:
+    dirs = [root]
+    files = []
+    while dirs:
+        dir = dirs.pop()
+        try:
+            subdirs = fiona.listdir(dir)
+            dirs.extend([os.path.join(dir, subdir) for subdir in subdirs])
+        except FionaValueError as e:
+            if "is not a directory" in str(e):
+                files.append(dir)
+            else:
+                raise e
+    return files
+
+
+def list_directory_recursive(root: str, filename_glob: str) -> list[str]:
+    """Lists files in directory recursively.
+
+    Also supports gdal virtual file systems (vsi).
+
+    Args:
+        root: directory to list. For vsi these can start with
+            e.g. /vsiaz or az:// for azure blob storage
+        filename_glob: filename pattern to filter filenames
+    """
+    if _path_is_vsi(root):
+        filepaths = _listdir_vsi_recursive(root)
+        filepaths = fnmatch.filter(filepaths, filename_glob)
+    else:
+        pathname = os.path.join(root, "**", filename_glob)
+        filepaths = list(glob.iglob(pathname, recursive=True))
+    return filepaths
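The helpers in this diff combine two ideas: deciding whether a path is virtual, and walking a directory tree with an explicit stack when `glob` cannot be used. The sketch below illustrates both with the standard library only, so it runs without `fiona` or `rasterio`. Everything in it is an assumption made for illustration: `KNOWN_SCHEMES` is a small hand-picked set (not rasterio's scheme table), the function names are hypothetical, and `listdir` is an injected callable standing in for `fiona.listdir`.

```python
import fnmatch
import posixpath
from typing import Callable

# Assumption: a small illustrative subset of schemes, not rasterio's full table.
KNOWN_SCHEMES = {"az", "file", "gs", "http", "https", "s3", "tar", "zip"}


def is_virtual_path(path: str) -> bool:
    """Return True for GDAL-style /vsi paths or scheme://-style URIs."""
    if path.startswith("/vsi"):
        return True
    prefix, sep, _ = path.partition("://")
    # Without "://" there is no scheme; "zip+https" splits into two schemes.
    return bool(sep) and set(prefix.split("+")) <= KNOWN_SCHEMES


def walk_files(root: str, listdir: Callable[[str], list[str]]) -> list[str]:
    """Collect files under root; `listdir` raises NotADirectoryError on files."""
    pending = [root]
    files: list[str] = []
    while pending:
        current = pending.pop()
        try:
            children = listdir(current)
        except NotADirectoryError:
            # Anything that cannot be listed is treated as a file.
            files.append(current)
            continue
        # Virtual paths use "/" as separator regardless of platform.
        pending.extend(posixpath.join(current, child) for child in children)
    return files


def list_matching(
    root: str, filename_glob: str, listdir: Callable[[str], list[str]]
) -> list[str]:
    """Walk root and keep files whose basename matches filename_glob."""
    return sorted(
        path
        for path in walk_files(root, listdir)
        if fnmatch.fnmatch(posixpath.basename(path), filename_glob)
    )
```

Two design choices differ from the diff and are worth weighing. Requiring `://` before testing the prefix means a relative local directory that happens to be named like a scheme (for example `zip`) is not classified as virtual. Joining with `posixpath.join` instead of `os.path.join` keeps forward slashes on Windows, which later commits in this PR also move toward.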