Support reading from GDAL virtual file systems (e.g. cloud storage) #1398

Open
adriantre opened this issue Jun 5, 2023 · 3 comments · May be fixed by #1399
Labels
datasets Geospatial or benchmark datasets

Comments

@adriantre
Contributor

adriantre commented Jun 5, 2023

pathname = os.path.join(root, "**", self.filename_glob)
filename_regex = re.compile(self.filename_regex, re.VERBOSE)
for filepath in glob.iglob(pathname, recursive=True):
    match = re.match(filename_regex, os.path.basename(filepath))
    if match is not None:

GDAL virtual file systems, such as /vsigs/ for reading directly from Google Cloud Storage buckets, are natively supported by rasterio (through GDAL).

with rasterio.open("/vsigs/my_bucket/.../my_image.tif") as src:
    src.read()  # etc.

The glob-matching (source code linked above) is the only thing stopping this currently.

What do you think the best way is to do this? My initial guess is that supporting the glob-matching for all the different file systems would take some effort.

The quickest fix (for me at least) would be to add an optional parameter filenames: List[str] that is iterated over; the already existing try/except would then handle any filename that is wrong.
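A minimal sketch of that quick fix, kept independent of torchgeo so it can be run on its own. The names index_readable and open_fn are hypothetical and not part of any library; in practice open_fn would presumably be rasterio.open, and the sketch assumes a failed open surfaces as an OSError or a subclass of it.

```python
from typing import Callable, Iterable, List


def index_readable(
    filepaths: Iterable[str], open_fn: Callable[[str], object]
) -> List[str]:
    """Return the paths that open_fn can open, skipping the ones that fail."""
    readable = []
    for filepath in filepaths:
        try:
            # Assumption: a wrong or unreadable path raises OSError (or a subclass)
            open_fn(filepath)
        except OSError:
            continue
        readable.append(filepath)
    return readable
```

Because the paths are passed in explicitly, no globbing happens at all, so it does not matter whether the paths live on a local disk or behind a virtual file system prefix.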

@adriantre
Contributor Author

adriantre commented Jun 5, 2023

Edit: Better proposal below.

Proposed changes:

class RasterDataset(GeoDataset):
    def __init__(
        self,
        ...,  # existing params
        filenames: Optional[List[str]] = None
    ) -> None:

        ...

        # Populate the dataset index
        i = 0
        if not filenames:
            pathname = os.path.join(root, "**", self.filename_glob)
            filepaths = [filepath for filepath in glob.iglob(pathname, recursive=True)]
        else:
            filepaths = [os.path.join(root, filename) for filename in filenames]
        for filepath in filepaths:
            # continue on line 366 in the original code

Each entry in filenames should include any subdirectories, relative to root.

@adriantre
Contributor Author

adriantre commented Jun 5, 2023

Just found fiona's listdir function. It does not support recursive walks, but it does list sub-blobs in virtual file systems.

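import os

import fiona
from fiona.errors import FionaValueError

def listdir_vsi_recursive(root):
    """Return every file path below root, walking the tree iteratively."""
    dirs = [root]
    files = []
    while dirs:
        path = dirs.pop()
        try:
            # fiona.listdir raises FionaValueError when path is not a directory
            subdirs = fiona.listdir(path)
            dirs.extend([os.path.join(path, subdir) for subdir in subdirs])
        except FionaValueError:
            # Not a directory, so treat it as a file
            files.append(path)
    return files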

class RasterDataset(GeoDataset):
    def __init__(
        self,
        ...,  # existing params
        vsi: bool = False
    ) -> None:

        ...

        # Populate the dataset index
        i = 0
        filename_regex = re.compile(self.filename_regex, re.VERBOSE)
        if vsi:
            filepaths = listdir_vsi_recursive(root)
        else:
            pathname = os.path.join(root, "**", self.filename_glob)
            filepaths = [filepath for filepath in glob.iglob(pathname, recursive=True)]
        for filepath in filepaths:
            # continue on line 366 in the original code
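
In the vsi branch above, filename_glob is no longer applied, since glob only sees the local file system. One possible way to keep that filter is to match the listed paths against the pattern in memory with fnmatch. The sketch below is an assumption about how that could look, not part of the proposal; filter_vsi_paths is a hypothetical helper, and the example paths and patterns are made up.

```python
import fnmatch
import posixpath
import re
from typing import Iterable, List


def filter_vsi_paths(
    filepaths: Iterable[str], filename_glob: str, filename_regex: str
) -> List[str]:
    """Keep paths whose basename matches both the glob and the regex."""
    pattern = re.compile(filename_regex, re.VERBOSE)
    kept = []
    for filepath in filepaths:
        # VSI paths always use forward slashes, so posixpath is used on purpose
        basename = posixpath.basename(filepath)
        # fnmatchcase compares strings only; it never touches the file system
        if not fnmatch.fnmatchcase(basename, filename_glob):
            continue
        if pattern.match(basename) is not None:
            kept.append(filepath)
    return kept


# Made-up paths, standing in for the output of a recursive listing
listed = [
    "/vsigs/my_bucket/2020/scene_20200101.tif",
    "/vsigs/my_bucket/2020/notes.txt",
    "/vsigs/my_bucket/thumbnail.tif",
]
selected = filter_vsi_paths(listed, "*.tif", r"scene_(?P<date>\d{8})")
print(selected)
```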

@adriantre adriantre linked a pull request Jun 5, 2023 that will close this issue
@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Jun 9, 2023
@adamjstewart adamjstewart added this to the 0.5.0 milestone Jun 9, 2023
@adamjstewart adamjstewart removed this from the 0.5.0 milestone Sep 28, 2023
@adamjstewart
Collaborator

Note that we technically support this in 0.5.0, although the user has to manually pass in a list of files.
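
A rough sketch of what manually passing a list of files could look like. The helper build_vsi_paths, the bucket name, and the object keys are all hypothetical. The dataset call at the end is left as a comment because the thread does not name the constructor argument; paths is an assumption here.

```python
import posixpath
from typing import Iterable, List


def build_vsi_paths(
    bucket: str, keys: Iterable[str], prefix: str = "/vsigs"
) -> List[str]:
    """Join a VSI prefix, a bucket name, and object keys into GDAL-style paths."""
    # Strip leading slashes so posixpath.join does not discard the prefix
    return [posixpath.join(prefix, bucket, key.lstrip("/")) for key in keys]


# Hypothetical bucket and object names, for illustration only
paths = build_vsi_paths("my_bucket", ["scenes/a.tif", "/scenes/b.tif"])
print(paths)

# The list could then be handed to the dataset (assumed parameter name, not run here):
# dataset = RasterDataset(paths=paths)
```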
