Support for private remote object storage #441

Open
berombau opened this issue Jan 26, 2024 · 2 comments
@berombau
Contributor

berombau commented Jan 26, 2024

Right now, there is support for local .zarr stores and for remote stores that are publicly accessible via HTTP or S3.
Private remote stores are more difficult, as they need options or credentials that cannot be represented by a plain string or Path. One option is to use a zarr.storage.FSStore, which accepts storage_options or any fsspec.spec.AbstractFileSystem.
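As a rough sketch of why a plain string is not enough: the endpoint and credentials have to travel alongside the URL as storage options. The helper names below (`s3_storage_options`, `open_private_group`) are hypothetical and not part of spatialdata; the sketch assumes zarr-python 2.x (where `zarr.storage.FSStore` exists) with s3fs installed.

```python
from typing import Any, Optional


def s3_storage_options(
    endpoint_url: str,
    key: Optional[str] = None,
    secret: Optional[str] = None,
) -> dict[str, Any]:
    """Build an s3fs-style storage_options dict for an S3-compatible server.

    Hypothetical helper for illustration only.
    """
    options: dict[str, Any] = {"client_kwargs": {"endpoint_url": endpoint_url}}
    # Pass explicit credentials only when both are given; otherwise s3fs uses
    # the usual botocore lookup (environment variables, ~/.aws/credentials).
    if key is not None and secret is not None:
        options.update(key=key, secret=secret)
    return options


def open_private_group(url: str, storage_options: dict[str, Any]):
    """Open a Zarr group read-only through an FSStore.

    Assumes zarr-python 2.x and s3fs; imported lazily so the options helper
    above stays usable without them.
    """
    import zarr

    store = zarr.storage.FSStore(url, mode="r", **storage_options)
    return zarr.open_group(store, mode="r")
```

The returned group could then be handed to `sd.read_zarr(...)` in the same way as the `zarr.open(...)` result in the snippet from this comment, provided the reader accepts a group rather than a string or Path.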

Two pull requests enable this:

Testing is difficult, but this is what I used:

import spatialdata as sd
import zarr

# MINIO_URL is the endpoint URL of the private S3-compatible (MinIO) server; it is not defined here.
# Works now; requires credentials in ~/.aws/credentials.
root = zarr.open(
    's3://BUCKET/spatial-sandbox/visium_associated_xenium_io.zarr',
    storage_options={'client_kwargs': {'endpoint_url': MINIO_URL}},
)
sd.read_zarr(root)
# Still works; I think this depends on .zmetadata?
sd.read_zarr('https://s3.embl.de/spatialdata/spatialdata-sandbox/visium_associated_xenium_io.zarr/')
# Still works.
sd.read_zarr('~/visium_associated_xenium_io.zarr')
@berombau
Contributor Author

I refactored to use UPath, which solves many issues I had with remote support. So I would recommend UPath over Path, str, ZarrLocation...

It works with my own object storage:

from upath import UPath
from spatialdata import SpatialData

p = UPath(
    "s3://BUCKET/spatial-sandbox/visium_associated_xenium_io_tables.zarr",
    endpoint_url="https://objectstor.vib.be",
)
full_sdata.write(p)
sdata = SpatialData.read(p)

I also added tests for the remote datasets, plus mocked remote tests. Current status, including the remaining issues:

  • reading from private remote storage over S3 works
  • writing to private remote storage over S3 works
  • test_remote_mock.py mock reading test using ome-zarr fails, so images and labels fail. I need to test this some more as I'm also using a patched ome_zarr.
  • test_remote.py reading the SpatialData remote datasets over HTTP fails for the points parquet files. I also can't reproduce the working implementation (maybe because of a package update?).

It will likely be a while until I can work on this some more.

@LucaMarconato @ArneDefauw

@berombau
Contributor Author

This is still an open issue. There is an old fork at https://github.com/berombau/spatialdata (main branch), from the closed PR #442, that does support remote reading using UPath. Merging it would be difficult, though, mainly because the spatialdata code base has changed a lot since then.

The current solutions without this support:

Future work plan:

  • once the PRs from the current hackathon are in, start again from the latest version of SpatialData
  • add the remote mock test code from the old PR
  • change str and Path to UPath until the remote tests work
  • do a clean PR with basic read support
  • work on write support
  • add more tests to keep the remote support, try to dissuade the use of str or Path for filesystem locations
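A minimal sketch of what "change str and Path to UPath" could look like at a function boundary. Both helpers (`is_remote`, `to_upath`) are hypothetical names, not existing spatialdata API; `to_upath` assumes the universal_pathlib package is installed and simply forwards storage options such as `endpoint_url` to the `UPath` constructor, as in the example earlier in this thread.

```python
from pathlib import Path
from typing import Any, Union
from urllib.parse import urlparse


def is_remote(location: Union[str, Path]) -> bool:
    """Return True for URLs such as s3://... or https://..., False for local paths."""
    scheme = urlparse(str(location)).scheme
    # A one-letter "scheme" is a Windows drive letter (C:\...), not a protocol.
    return len(scheme) > 1 and scheme != "file"


def to_upath(location: Any, **storage_options: Any):
    """Normalize str, Path, or UPath input to a single UPath type.

    Hypothetical helper; universal_pathlib is imported lazily.
    """
    from upath import UPath

    if isinstance(location, UPath):
        # Already carries its own protocol and storage options.
        return location
    return UPath(location, **storage_options)
```

Accepting all three input types in one place and converting once would keep the rest of the I/O code on a single path type, which is the point of discouraging str and Path for filesystem locations.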

It's still very doable, but it is currently not a priority and will be easier once many of the open PRs have been merged.

@BioinfoTongLI
