Convenience methods for converting STAC objects / linked assets to data containers #846
To me, Intake seems like the best way forward here. In case you do want to go forward with this proposal, I'd like to propose another variation. Instead of having these conversion functions in Pystac itself, provide an …
Thanks for this @TomAugspurger. I am 100% on board with this and would love to see such convenience functions; I end up copying the same function across notebooks to convert STAC Items to a dataframe. I'd generally prefer targeting libraries rather than container types, since container type names can be ambiguous. I prefer this over using intake, if only because I don't use intake-stac myself, and it would just be another library to maintain, …
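For context, the notebook helper described here tends to be short. A stdlib-only sketch (the choice of fields to flatten is illustrative) that produces records you could hand to `pandas.DataFrame(...)` or `geopandas.GeoDataFrame.from_features(...)`:

```python
def items_to_records(items):
    """Flatten STAC Items (as plain GeoJSON-feature dicts) into rows.

    Which fields to keep (id, geometry, properties) is a judgment call;
    this mirrors the kind of per-notebook helper mentioned above.
    """
    records = []
    for item in items:
        record = {"id": item["id"], "geometry": item["geometry"]}
        # Merge properties (datetime, eo:cloud_cover, ...) into the row.
        record.update(item.get("properties", {}))
        records.append(record)
    return records
```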
Thanks @zklaus.
Let me see if I understand this correctly. pystac itself would define a few methods:

```python
class ItemCollection:
    def to_geopandas_geodataframe(self, **kwargs) -> "geopandas.GeoDataFrame":
        entrypoint = importlib.metadata.entry_points().get(...)
        return entrypoint.load()(self, **kwargs)

    def to_xarray_datarray(self, **kwargs) -> "xarray.DataArray":
        entrypoint = importlib.metadata.entry_points().get(...)
        return entrypoint.load()(self, **kwargs)
```

I'm not 100% sure what goes in the …

```python
def to_geopandas_geodataframe(self, **kwargs) -> "geopandas.GeoDataFrame":
    try:
        import some_library
    except ImportError:
        raise  # re-raise with a nice message
    return some_library.item_collection_to_geopandas(self, **kwargs)
```

A benefit / drawback of entrypoints is that you could have multiple implementations for the same "route". Some third party can provide a better implementation without having to touch pystac or …

We might need to figure out what to do if there are multiple implementations registered for a specific entrypoint. Perhaps provide a keyword like …

It might be worth saying a bit more about "why not intake", given that we are encroaching on its territory a bit. A laundry list: …
Thanks for picking this up, @TomAugspurger. What you describe is close, but not quite what I had in mind. Indeed, I would say if Pystac has to know all the exporters, entry points would be a futile exercise. Instead, I think there should be only one entry point, say …

Conceivably, other classes could use the same mechanism with a different entry point, say to export to an …
Mmm, I'm not a huge fan of dynamically adding methods to classes as a side-effect of import. It makes debugging difficult, static typing impossible (I think), and will be incompatible with features like https://peps.python.org/pep-0690/ (I think). What's the reason not to add various …
Do we know how mypy, pylance, etc. feel about the return types in downstream contexts? E.g., if the method returns "geopandas.GeoDataFrame", and the types of pystac are published and consumed by downstream code that doesn't have geopandas installed: are mypy and pylance (VS Code) OK with that, or would that cause weird typing errors?
Both mypy and pylance seem to think the type is …

Setup:

```diff
diff --git a/pystac/item_collection.py b/pystac/item_collection.py
index de38ac3..fdd4292 100644
--- a/pystac/item_collection.py
+++ b/pystac/item_collection.py
@@ -1,11 +1,14 @@
 from copy import deepcopy
 from pystac.errors import STACTypeError
-from typing import Any, Dict, Iterator, List, Optional, Collection, Iterable, Union
+from typing import Any, Dict, Iterator, List, Optional, Collection, Iterable, Union, TYPE_CHECKING
 import pystac
 from pystac.utils import make_absolute_href, is_absolute_href
 from pystac.serialization.identify import identify_stac_object_type
+if TYPE_CHECKING:
+    import xarray
+
 ItemLike = Union[pystac.Item, Dict[str, Any]]
@@ -235,3 +238,6 @@ class ItemCollection(Collection[pystac.Item]):
         identify_stac_object_type(feature) == pystac.STACObjectType.ITEM
         for feature in d.get("features", [])
     )
+
+    def to_xarray_datarray(self) -> "xarray.DataArray":
+        ...
```

Test file:

```python
# file: foo.py
import pystac

ic = pystac.ItemCollection([])
reveal_type(ic.to_xarray_datarray())
```

Output without xarray: …

Output with xarray: …

pylance has the same behavior.
Historically, PySTAC has been very much in the design camp of "do the most simple, scoped thing possible, and carry no dependencies". I think this has served the library well. In that stance, this suggestion could be identified as scope creep: as you mentioned, getting the results these `to_` methods are aimed at is already possible outside of PySTAC, and adding them has the potential to inflate the codebase/scope of PySTAC.

However, I think your other points are really good. Now that it's been around a while, PySTAC "owns the namespace" as you put it, and there's an opportunity to expand on the user convenience this library provides. Your experience with seeing this play out in the pandas ecosystem holds a lot of weight here. As you've laid out, we can keep the dependencies out of the core PySTAC package by using deferred imports. Adding "extras" packages would be a fine way to enable this extra behavior, like we do for …

As far as library scope: I think what you suggested works, if the heavy lifting/implementations live in other libraries. You mentioned the possibility of an …

All in all, I'm +1 on this idea.
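The deferred-import approach mentioned here can be factored into one small helper; this is a sketch, and the `pystac[...]` extras names in the error message are hypothetical:

```python
import importlib


def require(module_name: str, extra: str):
    """Import an optional dependency, failing with an actionable message.

    Sketch of the deferred-import pattern under discussion; the
    suggested `pip install pystac[<extra>]` hint is hypothetical.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{module_name!r} is required for this method but is not "
            f"installed; try `pip install pystac[{extra}]`"
        ) from err


# A to_* method would then start with, e.g.:
#     geopandas = require("geopandas", "geopandas")
```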
Instead of (or complementary to) …

I'm not sure how this would work for Pandas and GeoPandas, but for Xarray there could be …

```python
asset = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
)
ds = xr.open_dataset(asset, engine="stac")
```

```python
item_collection = (
    catalog
    .search(collections="sentinel-2-l2a", bbox=bbox)
    .get_all_items()
)
ds = xr.open_dataset(item_collection, engine="stac")
```

Having …
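For context on how `engine="stac"` would be found: xarray discovers third-party engines through the `xarray.backends` entry-point group, so a backend package could register the engine in its packaging metadata. The module path below is made up for illustration:

```toml
# pyproject.toml of a hypothetical backend package; the target
# "xpystac.core:STACBackend" is illustrative, not a real API.
[project.entry-points."xarray.backends"]
stac = "xpystac.core:STACBackend"
```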
I was poking around on intake-stac today and @scottyhq pointed me to this issue. I really like @benbovy's idea of specifying an IO backend that would be accessed from xarray itself. That seems like a natural fit to me and much more analogous to the …
Just to expand on that. I think one of the issues with a generic …

Maybe the convention of storing the `open_kwargs` in the STAC asset gets around that issue.
Thanks @jsignell!

That's my hope. That's what we're doing at https://github.com/TomAugspurger/xstac/blob/f17ed3bb79f34b747d1532fc425794442d938f9e/examples/terraclimate/terraclimate.json#L51. But that's still compatible with the idea of an …

I think the …
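As a sketch of that convention: the xarray-assets STAC extension stores open kwargs on the asset itself. The href and values below are illustrative, not copied from the linked example:

```json
{
  "zarr-abfs": {
    "href": "abfs://somecontainer/terraclimate.zarr",
    "type": "application/vnd+zarr",
    "xarray:open_kwargs": {"consolidated": true},
    "xarray:storage_options": {"account_name": "example"}
  }
}
```

A reader can then pass `xarray:open_kwargs` straight through to the opener instead of hard-coding per-dataset knowledge.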
Yes exactly! I am going to mock up this approach now.

Ok I made https://github.com/jsignell/xpystac, closely based on https://github.com/TomAugspurger/staccontainers.

I'm trying to think what the next steps are here. I am happy to move …

Then maybe pystac would want to have a …
Yup, we don't have a formal process (see: https://stac-utils.github.io/developer-guide/policies/repositories.html) so if it feels right, go for it!
Makes sense to me. Are there other stubs we should add at the same time? It'd be a splashy feature to release, so it'd be cool to have a couple different adapters, if possible.
Ok! I just transferred the repository over to the stac-utils org.
Tom has a whole list of potential stubs in the original post and some implementations in staccontainers. There are also some good arguments for not implementing stubs. I think if we do want to add stubs, it would make more sense to do that as part of 2.0.

Maybe as a first step, I should update the examples to use xpystac via `open_dataset`.
worksforme

For this sprint I am planning to add a little tutorial notebook that explains how to read an asset into an xarray object using xpystac.

Thanks for continuing to work on this @jsignell!
This sounds like most of my life with Python! ;-)
Now that I know about this …

And, after all, I think it's fair to say that the whole point of STAC is to make large geospatial/temporal raster assets findable and then accessible, so some convenience functions (e.g. living in a namespace package?) that make for a more seamless bridge between the catalog/finding and the raster data exploitation would be hugely welcome.
Does not exist, but it sounds interesting. IMO a separate …
I have been mulling over what would be the most discoverable and ergonomic way to give access to read methods, and I am strongly leaning towards creating an xarray assets extension class. Then on this class there could be an …

```python
import pystac
from pystac.extensions.xarray_assets import XarrayAssetsExtension

asset = pystac.Asset.from_file(...)
xr_ext = XarrayAssetsExtension.ext(asset, add_if_missing=True)
xr_ext.open()
```

Or if something like #1051 gains traction, then:

```python
import pystac

asset = pystac.Asset.from_file(...)
asset.xarray.open()
```
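The `asset.xarray.open()` style in the second snippet could be prototyped with a plain property returning an accessor object. Everything here (the class names, the `open` method, the stand-in `Asset`) is a hypothetical sketch, not pystac's API:

```python
class XarrayAccessor:
    """Hypothetical accessor namespace, in the spirit of #1051."""

    def __init__(self, asset):
        self._asset = asset

    def open(self, **kwargs):
        # A real implementation would defer-import xarray here and
        # dispatch to a reader based on the asset's media type.
        raise NotImplementedError


class Asset:
    """Minimal stand-in for pystac.Asset, just to show the wiring."""

    @property
    def xarray(self):
        return XarrayAccessor(self)
```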
Hi all,

I'm tentatively making a pitch to add convenience methods for converting pystac objects (Asset, Item, ItemCollection, ...) and their linked assets to commonly used data containers (`xarray.Dataset`, `geopandas.GeoDataFrame`, `pandas.DataFrame`, etc.).

I'm opening this in pystac since this is primarily for convenience, so that users can method-chain their way from STAC Catalog to data container, and pystac owns the namespaces I care about. You can already do everything I'm showing today without any changes to pystac, but it feels less nice. I really think that `pd.read_csv` is part of why Python is where it is today for data analytics; I want using STAC from Python to be as easy to use as `pd.read_csv`.

Secondarily, it can elevate the best-practice way to go from STAC to data containers, by providing a top-level method similar to `to_dict()`.

As a couple hypothetical examples, to give an idea: …

Or building a datacube from a pystac-client search (which subclasses pystac).
Implementation details

This would be optional. pystac would not add required dependencies on pandas, xarray, etc. It would merely provide the methods `Item.to_xarray`, `Asset.to_xarray`, ... Internally those methods would try to import the implementation and raise an `ImportError` if the optional dependencies aren't met at runtime.

Speaking of the implementations, there are a few things to figure out. Some relatively complicated conversions (like ItemCollection -> xarray) are implemented multiple times (https://stackstac.readthedocs.io/, https://odc-stac.readthedocs.io/en/latest/examples.html). pystac certainly wouldn't want to re-implement that conversion and would dispatch to one or the other of those libraries (perhaps letting users decide with an `engine` argument).

Other conversions, like Asset -> Zarr, are so straightforward they haven't really been codified in a library yet (though I have a prototype at https://github.com/TomAugspurger/staccontainers/blob/086c2a7d46520ca5213d70716726b28ba6f36ba5/staccontainers/_xarray.py#L61-L63). Maybe those could live in pystac; I'd be happy to maintain them.

https://github.com/TomAugspurger/staccontainers might serve as an idea of what some of these implementations would look like. It isn't too complicated.
Problems

A non-exhaustive list of reasons not to do this:

- Why `to_xarray` and not `to_numpy()`, `to_PIL`, ...? Why `to_pandas()` and not `to_spark()`, `to_modin`, ...?
Alternatives

Alternatively, we could recommend using intake, along with intake-stac, which would wrap pystac-client and pystac. That would be the primary "user-facing" catalog people actually interface with. It already has a rich ecosystem of drivers that convert from files to data containers. I've hit some issues with trying to use intake-stac, but those could presumably be fixed with some effort.

Or, more generally, another library could wrap pystac / pystac-client and provide these convenience methods. But (to me) that feels needlessly complicated.
Examples

A whole bunch of examples, under the `<details>`, to give some ideas of the various conversions. You can view the outputs at https://notebooksharing.space/view/db3c2096bf7e9c212425d00746bb17232e6b26cdc63731022fc2697c636ca4ed#displayOptions=.

- catalog -> collection -> item -> asset -> xarray (raster)
- catalog -> collection -> item -> asset -> xarray (zarr)
- catalog -> collection -> item -> asset -> xarray (references)
- catalog -> collection -> item -> asset -> geodataframe
- search / ItemCollection -> geopandas
Proposed Methods

We should figure out what the "target" in each of these `to_` methods is. I think there are a couple things to figure out:

- Whether dask-backed variants are handled with keywords (e.g. `to_dataframe(..., engine="dask")` or `to_dataframe(..., npartitions=...)`) or alternative methods (`to_dask_dataframe()`).

But I would loosely propose:

- `Asset.to_xarray` -> `xarray.Dataset`
- `{Item,ItemCollection}.to_xarray` -> `xarray.Dataset`
- `Asset.to_geopandas` -> `geopandas.GeoDataFrame`
- `Asset.to_pandas` -> `pandas.DataFrame`
- `Asset.to_dask_geopandas` -> `dask_geopandas.GeoDataFrame`
- `Asset.to_dask_dataframe` -> `dask.dataframe.DataFrame`
- `ItemCollection.to_geopandas` -> `geopandas.GeoDataFrame`
There's a bunch more to figure out, but hopefully that's enough to get started.