Update design_doc.md #24

Merged
merged 4 commits into from
Nov 21, 2023

Conversation

danlooo
Contributor

@danlooo danlooo commented Nov 14, 2023

This PR addresses #22 to clarify the design document.

Scope

Clarify: xdggs defines the in-memory representation of DGGS data in Python environments.
Other specifications, e.g. the DGGS data specification draft, aim to define a storage (file) format for DGGS data that can also be used from other programming languages. The data specification and xdggs should work together.

Basic operations should include neighbor search and BBox queries

Getting the cells within a lat/lon bounding box is a fundamental operation used for conversion between geodetic and discrete grids, for plotting, as well as for convolutions. We should define the API for these functions, which are then implemented in the individual DGGSs. I think this is generic enough for all DGGSs.

Scaling should be a must

When possible, xdggs operations should scale to fine DGGS resolutions (millions of cells).

In my opinion, scaling to at least billions of cells is a MUST, because we are talking about global grids. For small, sub-country-level projects, we would simply use an optimized projection instead of a DGGS.
10^6 cells is still a very coarse resolution that corresponds to ~510 km^2 per cell. DGGRID ISEA4H supports up to 4.3 * 10^10 cells and Uber H3 supports up to 5.7 * 10^14 cells.
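
As a quick check of the figures above, the mean cell area follows from dividing the Earth's surface area by the number of cells. This is only a back-of-the-envelope sketch: the surface area of roughly 510.1 million km^2 is an assumed round value, the function name is illustrative, and the cell counts are simply the ones quoted in this comment.

```python
EARTH_SURFACE_KM2 = 510.1e6  # assumed approximate total surface area of the Earth


def mean_cell_area_km2(n_cells: float) -> float:
    """Average area per cell if n_cells cover the whole globe."""
    return EARTH_SURFACE_KM2 / n_cells


for label, n_cells in [
    ("10^6 cells", 1e6),
    ("DGGRID ISEA4H maximum (as quoted above)", 4.3e10),
    ("Uber H3 maximum (as quoted above)", 5.7e14),
]:
    print(f"{label}: {mean_cell_area_km2(n_cells):.3g} km^2 per cell")
```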

Existing standards

The OGC abstract specification topic 21 defines properties of a DGGS including the reference systems of its grids.
However, there is no consensus yet about the actual specification of how to work with DGGS data.

Dimensionality of DGGSIndex

There are multi-dimensional cell id systems, e.g. DGGRID PROJTRI and Uber H3 IJK. Fortunately, an xarray coordinate may consist of multiple dimensions. How do we want to deal with this in DGGSIndex?

Spatiotemporal DGGS

Some DGGS also have multiple temporal resolutions (e.g. daily, weekly, ...). Like the different spatial resolutions, they are stored in different zarr groups. Therefore, each dataset still has just one spatial and one temporal resolution.

Raster vs lat/lon grid

What is the difference between, e.g., the functions ds.dggs.from_latlon_grid and ds.dggs.from_raster?

Staggering

In climate models, different variables (e.g. air pressure and wind speed) are stored at different locations within the cell, e.g. at center points, edges, or vertices. We should at least put this information into the metadata.

@benbovy
Member

benbovy commented Nov 14, 2023

Thanks for chiming in @danlooo!

Thanks for the clarification on DGGS standards.

In my opinion, scaling to at least billions of cells is a MUST, because we are talking about global grids.

Yes, I agree we should target that. However, I wouldn't say it is a MUST, since horizontal scaling of some aspects of DGGS (especially indexing) can be very challenging and will certainly require a lot of effort. Also, Xarray is not yet 100% ready for supporting lazy indexes (e.g., based on a dask array for the DGGS cell coordinate), although we're slowly getting closer (pydata/xarray#8124).

The primary goal of xdggs is to democratize DGGS, so I wouldn't mind if it doesn't yet scale to the finest DGGS resolutions in its first released versions. Lots of examples that I've seen in GIS use DGGS (H3 or S2) on spatial domains of fairly limited extent (e.g., as a way to aggregate point data), so xdggs with 10^4 - 10^6 cells would already be useful there. Also, I think that we could already reach acceptable resolutions (100M-500M cells) on a global grid using recent hardware (even laptops and desktops). I haven't run experiments, though.
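
For a rough sense of scale, the raw array sizes at those cell counts can be estimated as below. This is not a benchmark, only arithmetic under stated assumptions: cell ids stored as 64-bit integers and a single data variable stored as 32-bit floats, with no index or working-memory overhead counted. The helper name is illustrative.

```python
def array_size_gb(n_cells: int, bytes_per_value: int) -> float:
    """Size in GB (10^9 bytes) of a dense 1-d array with one value per cell."""
    return n_cells * bytes_per_value / 1e9


for n_cells in (100_000_000, 500_000_000):
    ids_gb = array_size_gb(n_cells, 8)  # assumption: 64-bit integer cell ids
    var_gb = array_size_gb(n_cells, 4)  # assumption: one float32 data variable
    print(f"{n_cells:,} cells: {ids_gb:.1f} GB of ids, {var_gb:.1f} GB per variable")
```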

Out of curiosity, do you have examples of existing real-world datasets on a DGGS that are bigger than 10^9 cells?

There are multi-dimensional cell id systems, e.g. DGGRID PROJTRI and Uber H3 IJK. Fortunately, an xarray coordinate may consist of multiple dimensions. How do we want to deal with this in DGGSIndex?

Although that is certainly possible, there would be a number of challenges in supporting >1-d coordinates with DGGSIndex (at least in its current implementation backed by a PandasIndex).

Alternatively, I guess xdggs could provide index classes distinct from DGGSIndex that are designed for these specific cases.

What is the difference between, e.g., the functions ds.dggs.from_latlon_grid and ds.dggs.from_raster?

While both are rectilinear, the 1st one is a global grid with geodetic coordinates while the 2nd one is a regional grid with projected coordinates (at least this is what I had in mind when writing those API examples).

In climate models, different variables (e.g. air pressure and wind speed) are stored at different locations within the cell, e.g. at center points, edges, or vertices. We should at least put this information into the metadata.

Good point. Maybe we can also add a reference to xgcm here.

design_doc.md Outdated

Examples of common DGGS features that `xdggs` should provide or facilitate:

- convert a DGGS from/to another grid (e.g., a DGGS, a latitude/longitude rectilinear grid, a raster grid, an unstructured mesh)
- convert a DGGS from/to vector data (points, lines, polygons, envelopes)
- nearest neighbor search and bounding box queries around a given cell

Member

Aren't these special cases of selection by geometries (already mentioned below)? For example, in xvec you can achieve that on vector data cubes using ds.xvec.query() with a bounding-box polygon created by shapely.box() or an array of shapely.Point objects.

Contributor Author

Yes. However, bounding boxes are convex polygons so we can use much faster algorithms for this special case.
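
To illustrate why the bounding-box case can be cheaper, here is a minimal sketch comparing a point-in-box test (four comparisons) with a generic point-in-polygon test (ray casting, whose cost grows with the number of vertices). It works on plain lon/lat numbers, ignores boxes crossing the antimeridian, and is not taken from xdggs or xvec; all function names are illustrative.

```python
def in_bbox(lon, lat, bbox):
    """Point-in-box test: a fixed number of comparisons."""
    lon_min, lat_min, lon_max, lat_max = bbox
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


def in_polygon(lon, lat, vertices):
    """Generic ray-casting test: one pass over all polygon edges."""
    inside = False
    n = len(vertices)
    for k in range(n):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


bbox = (9.0, 49.0, 13.0, 52.0)
same_box_as_polygon = [(9.0, 49.0), (13.0, 49.0), (13.0, 52.0), (9.0, 52.0)]
for lon, lat in [(10.0, 50.0), (-70.0, 10.0)]:
    print((lon, lat), in_bbox(lon, lat, bbox), in_polygon(lon, lat, same_box_as_polygon))
```

Both tests agree for these interior and exterior points; the box test simply avoids the loop over edges.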

@danlooo danlooo marked this pull request as ready for review November 14, 2023 14:20
@danlooo
Contributor Author

danlooo commented Nov 14, 2023

The index has 2 major properties:

  1. It affects the cell ordering also on disk and thus chunking and I/O performance
  2. It affects the algorithms used for nearest neighbor search and bounding box queries: In DGGRID Q2DI, we only need to convert the 4 corner points to get all cell ids within that box. However, for DGGRID SEQNUM, we need to query all points on a lat/lon grid individually to get the list of all seqnum cell ids.

I suggest treating them as two separate DGGS. Let DGGS be a subclass of xarray.Dataset. Then we can create a class DGGRIDQ2DI as a subclass of DGGS and implement special index-aware functions, e.g. for .dggs.query().

I haven't seen any native DGGS datasets yet that cover the entire Earth. I think it's because of the lack of tooling and file formats that we are currently developing ;)

@benbovy
Member

benbovy commented Nov 14, 2023

I'm not familiar enough with DGGRIDQ2DI (and actually DGGS in general, at least not as much as you :)) but yes I imagine that the specific properties of DGGS (and/or their implementation) may help to index / query the grid cells in a more optimal way than considering them as arbitrary vector geometries.

Supporting spatial indexing via converting a DGGS data cube to a vector data cube has the advantage that, although suboptimal, it works for the general case and that we can already use Xvec, so it is low-hanging fruit.

More optimal, DGGS-specific indexing is certainly welcome and it is quite complementary (as I doubt we could easily reach the same flexibility, e.g., working with a variety of predicates). The challenging part is figuring out how to support that in a consistent way across all kinds of DGGS and via a common API. This would require some more thinking.

I suggest treating them as two separate DGGS. Let DGGS be a subclass of xarray.Dataset. Then we can create a class DGGRIDQ2DI as a subclass of DGGS and implement special index-aware functions, e.g. for .dggs.query().

Usually we recommend not subclassing xarray.Dataset. However, we could allow special index logic to be implemented in DGGSIndex subclasses (e.g., DGGRIDQ2DIIndex) and be executed via .dggs.query().
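
A rough sketch of what that dispatch could look like, using plain Python classes rather than real Xarray indexes. Everything here is hypothetical: this `DGGSIndex` is a stand-in and not the actual xdggs class, `ToyGridIndex` is a toy grid (cell id `(i, j)` centered at lon=i, lat=j) standing in for something like a DGGRIDQ2DIIndex, and `query()` only plays the role of `.dggs.query()`.

```python
import math


class DGGSIndex:
    """Hypothetical stand-in for a base index class; not the real xdggs API."""

    def __init__(self, centers):
        self.centers = dict(centers)  # cell id -> (lon, lat) of the cell center

    def query_bbox(self, lon_min, lat_min, lon_max, lat_max):
        # Generic fallback: test every cell center against the box.
        return sorted(
            cell
            for cell, (lon, lat) in self.centers.items()
            if lon_min <= lon <= lon_max and lat_min <= lat <= lat_max
        )


class ToyGridIndex(DGGSIndex):
    """Toy subclass: cell id (i, j) has its center at lon=i, lat=j degrees."""

    def query_bbox(self, lon_min, lat_min, lon_max, lat_max):
        # Index-aware shortcut: derive the (i, j) range from the box corners
        # instead of scanning all cells.
        i_range = range(math.ceil(lon_min), math.floor(lon_max) + 1)
        j_range = range(math.ceil(lat_min), math.floor(lat_max) + 1)
        return sorted((i, j) for i in i_range for j in j_range if (i, j) in self.centers)


def query(index, bbox):
    """Hypothetical entry point: delegates to whatever index is attached."""
    return index.query_bbox(*bbox)


centers = {(i, j): (float(i), float(j)) for i in range(5) for j in range(5)}
bbox = (0.5, 1.0, 2.2, 2.9)
generic = query(DGGSIndex(centers), bbox)
fast = query(ToyGridIndex(centers), bbox)
print(generic == fast)
```

The caller never needs to know which index is in use; only the subclass decides whether a faster path exists.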

@benbovy
Member

benbovy commented Nov 20, 2023

@danlooo those are all great additions and everything looks good to me except perhaps (very nit picking!) the "must" on scalability that I find too imperative (what do you think about "shall"?). Is there any other update you want to suggest in this PR?

@danlooo
Contributor Author

danlooo commented Nov 21, 2023

@benbovy All right. I'm fine with making scalability a recommendation in order to encourage prototyping. This PR can be merged from my side.

@benbovy
Member

benbovy commented Nov 21, 2023

Great, thanks @danlooo!

@benbovy benbovy merged commit 6dc6a30 into xarray-contrib:main Nov 21, 2023

2 participants