Update design_doc.md #24
Thanks for chiming in @danlooo! Thanks for the clarification on DGGS standards.
Yes, I agree we should target that. However, I wouldn't say it is a MUST, since horizontal scaling of some aspects of DGGS (especially indexing) can be very challenging and will certainly require a lot of effort. Also, Xarray is not yet 100% ready to support lazy indexes (e.g., based on a dask array for the DGGS cell coordinate), although we're slowly getting closer (pydata/xarray#8124). The primary goal of xdggs is to democratize DGGS, so I wouldn't mind if it doesn't yet scale to the finest DGGS resolutions in its first released versions. Lots of examples that I've seen in GIS use DGGS (H3 or S2) on spatial domains of fairly limited extent (e.g., as a way to aggregate point data), so xdggs with 10^4 - 10^6 cells would already be useful there. Also, I think that we could already reach acceptable resolutions (100M-500M cells) on a global grid using recent hardware (even laptops and desktops). I haven't run experiments, though. Out of curiosity, do you have examples of existing real-world datasets on a DGGS that are bigger than 10^9 cells?
Although that is certainly possible, there would be a number of challenges in supporting >1-d coordinates with DGGSIndex (at least in its current implementation backed by a PandasIndex). Alternatively, I guess xdggs could provide index classes distinct from DGGSIndex that are designed for these specific cases.
While both are rectilinear, the 1st one is a global grid with geodetic coordinates while the 2nd one is a regional grid with projected coordinates (at least this is what I had in mind when writing those API examples).
design_doc.md

Examples of common DGGS features that `xdggs` should provide or facilitate:

- convert a DGGS from/to another grid (e.g., a DGGS, a latitude/longitude rectilinear grid, a raster grid, an unstructured mesh)
- convert a DGGS from/to vector data (points, lines, polygons, envelopes)
- nearest neighbor search and bounding box queries around a given cell
Aren't these special cases of selection by geometries (already mentioned below)? For example, in xvec you can achieve that on vector data cubes using ds.xvec.query() with a shapely.box object or an array of shapely.Point objects.
Yes. However, bounding boxes are convex polygons so we can use much faster algorithms for this special case.
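To illustrate why the bounding box case can be cheaper, here is a minimal sketch (not taken from xdggs or xvec; both function names are made up for this example). It treats coordinates as planar and ignores the antimeridian and the poles. The box test takes four comparisons, while a general point-in-polygon test has to walk every edge of the ring.

```python
Point = tuple[float, float]  # (lon, lat) in degrees, treated as planar


def in_bbox(p: Point, bbox: tuple[float, float, float, float]) -> bool:
    """Axis-aligned box test: four comparisons, whatever the box size."""
    xmin, ymin, xmax, ymax = bbox
    x, y = p
    return xmin <= x <= xmax and ymin <= y <= ymax


def in_polygon(p: Point, ring: list[Point]) -> bool:
    """Even-odd ray casting: the cost grows with the number of vertices."""
    x, y = p
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can be hit.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


bbox = (0.0, 0.0, 10.0, 10.0)
ring = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
print(in_bbox((5.0, 5.0), bbox), in_polygon((5.0, 5.0), ring))
```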
The index has 2 major properties:
I suggest seeing them as two separate DGGS. Let I haven't seen any native DGGS datasets yet that cover the entire Earth. I think it's because of the lack of tooling and file formats that we are currently developing ;)
I'm not familiar enough with DGGRIDQ2DI (and actually DGGS in general, at least not as much as you :)) but yes I imagine that the specific properties of DGGS (and/or their implementation) may help to index / query the grid cells in a more optimal way than considering them as arbitrary vector geometries. Supporting spatial indexing via converting a DGGS data cube to a vector data cube has the advantage that, although suboptimal, it works for the general cases and also that we already can use Xvec, so it is a low-hanging fruit. More optimal, DGGS-specific indexing is certainly welcome and it is quite complementary (as I doubt we could easily reach the same flexibility, e.g., working with a variety of predicates). The challenging part is figuring out how to support that in a consistent way across all kinds of DGGS and via a common API. This would require some more thinking.
Usually we recommend not to subclass
@danlooo those are all great additions and everything looks good to me except perhaps (very nit picking!) the "must" on scalability that I find too imperative (what do you think about "shall"?). Is there any other update you want to suggest in this PR?
@benbovy All right. I'm fine with making scalability a recommended thing to promote prototyping. This PR can be merged from my side.
Great, thanks @danlooo!
This PR addresses #22 to clarify the design document.
Scope
Clarify: xdggs defines the in-memory representation of DGGS data in Python environments.
Other specifications, e.g. the DGGS data specification draft, aim to define a storage (file) format for DGGS data that can also be used from other programming languages. The data specification and xdggs should work together.
Basic operations should include neighbor search and BBox queries
Getting the cells within a lat/lon bounding box is the fundamental operation used for conversion between geodetic and discrete grids, for plotting, as well as for convolutions. We should define the API for these functions, which are then implemented by each individual DGGS. I think this is generic enough for all DGGSs.
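A minimal sketch of what such a common interface could look like. Everything here is hypothetical and not part of xdggs: `GridOps` and its two method names are invented for illustration, and `ToyLatLonGrid` is a plain regular lat/lon grid standing in for a real DGGS backend (it is not equal-area and assumes `lon_min <= lon_max`).

```python
import math
from abc import ABC, abstractmethod


class GridOps(ABC):
    """Hypothetical per-DGGS interface (names are assumptions)."""

    @abstractmethod
    def cells_in_bbox(self, lon_min, lat_min, lon_max, lat_max) -> list[int]:
        """Return the ids of all cells touching the lat/lon box."""

    @abstractmethod
    def neighbors(self, cell_id: int) -> list[int]:
        """Return the ids of the cells adjacent to `cell_id`."""


class ToyLatLonGrid(GridOps):
    """Stand-in backend: regular grid of `step` degrees, row-major ids."""

    def __init__(self, step: float):
        self.step = step
        self.ncols = round(360 / step)
        self.nrows = round(180 / step)

    def _cell(self, row: int, col: int) -> int:
        # Columns wrap around in longitude; rows do not wrap at the poles.
        return row * self.ncols + col % self.ncols

    def cells_in_bbox(self, lon_min, lat_min, lon_max, lat_max):
        c0 = math.floor((lon_min + 180) / self.step)
        c1 = math.ceil((lon_max + 180) / self.step)
        r0 = max(math.floor((lat_min + 90) / self.step), 0)
        r1 = min(math.ceil((lat_max + 90) / self.step), self.nrows)
        return [self._cell(r, c) for r in range(r0, r1) for c in range(c0, c1)]

    def neighbors(self, cell_id):
        row, col = divmod(cell_id, self.ncols)
        out = []
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                r = row + dr
                if (dr or dc) and 0 <= r < self.nrows:
                    out.append(self._cell(r, col + dc))
        return out


grid = ToyLatLonGrid(step=10.0)
print(grid.cells_in_bbox(0, 0, 20, 10))
print(sorted(grid.neighbors(0)))
```

The point of the abstract base class is only that callers (plotting, regridding, convolutions) depend on the two method signatures, while each DGGS supplies its own implementation.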
Scaling should be a must
In my opinion, scaling to at least billions of cells is a MUST, because we are talking about global grids. For small, sub-country-level projects we would just use an optimized projection instead of a DGGS.
A global grid of 10^6 cells is still very coarse: it corresponds to ~510 km^2 per cell. DGGRID ISEA4H supports up to 4.3 * 10^10 cells and Uber H3 supports up to 5.7 * 10^14 cells.
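The numbers above can be checked with a few lines of arithmetic. This sketch assumes an Earth surface area of about 510.1 million km^2 and uses the H3 cell-count formula 2 + 120 * 7^res; the helper names are made up for this example.

```python
EARTH_SURFACE_KM2 = 510.1e6  # approximate total surface area of the Earth


def mean_cell_area_km2(n_cells: int) -> float:
    """Average area per cell if n_cells tile the whole globe."""
    return EARTH_SURFACE_KM2 / n_cells


def h3_cell_count(res: int) -> int:
    """Number of H3 cells at a given resolution (0 to 15)."""
    return 2 + 120 * 7**res


for n in (10**6, 10**9, h3_cell_count(15)):
    print(f"{n:.1e} cells -> {mean_cell_area_km2(n):.3g} km^2 per cell")
```

At 10^9 cells the mean cell area is still about 0.5 km^2, which is why global grids at fine resolutions quickly reach billions of cells.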
Existing standards
The OGC abstract specification topic 21 defines properties of a DGGS including the reference systems of its grids.
However, there is no consensus yet on an actual specification for how to work with DGGS data.
Dimensionality of DGGSIndex
There are multi-dimensional cell id systems, e.g. DGGRID PROJTRI and Uber H3 IJK. Fortunately, an xarray coordinate may consist of multiple dimensions. How do we want to deal with this in DGGSIndex?
Spatiotemporal DGGS
Some DGGS also have multiple temporal resolutions (e.g. daily, weekly, ...). Like different spatial resolutions, they are stored in different Zarr groups. Therefore, we still have just one spatial and one temporal resolution in a dataset.
Raster vs lat/lon grid
What is the difference between, e.g., the functions ds.dggs.from_latlon_grid and ds.dggs.from_raster?
Staggering
In climate models, different variables, e.g. air pressure and wind speed, are stored at different locations within the cell, e.g. at center points, edges, or vertices. We should at least put this info into the metadata.
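One possible shape for such metadata, as a rough sketch only: the attribute name `cell_location` and its three allowed values are assumptions made for this example, not an existing xdggs or CF convention. The resulting dictionaries could be attached as per-variable attributes.

```python
from enum import Enum


class CellLocation(str, Enum):
    """Where on a cell a variable is defined (hypothetical vocabulary)."""

    CENTER = "center"
    EDGE = "edge"
    VERTEX = "vertex"


def staggering_attrs(location: str) -> dict[str, str]:
    """Build per-variable attributes; raises ValueError on unknown locations."""
    return {"cell_location": CellLocation(location).value}


# Example: pressure at cell centers, wind speed on cell edges.
variable_attrs = {
    "air_pressure": staggering_attrs("center"),
    "wind_speed": staggering_attrs("edge"),
}
print(variable_attrs)
```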