Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broader overview of (geo)spatial indexes and existing / possible solutions in Python #12

Open
benbovy opened this issue Jul 1, 2020 · 5 comments

Comments

@benbovy
Copy link
Member

benbovy commented Jul 1, 2020

I did further research on spatial indexes, I add below some interesting approaches (with links) that we could experiment with (after some development effort, though). We can use this issue to collect other approaches and have general discussion about it.

R-Tree / STR

The pygeos library (recently used by geopandas to support vectorized spatial operations) exposes the R-Tree structure STRtree implemented in GEOS: pygeos.strtree.

  • not limited to point features, although I'm not sure how to support spatial indexing on more advanced geometries with xarray (that may be useful for model grid cells with arbitrary geometries?)
  • limited to two-dimensional geospatial data
  • The index is built implicitly at the first query, although GEOS seems to allow to building it explicitly, which would be nicer in our case
  • pygeos doesn't seem to support nearest neighbor queries, although here again it seems to be supported in GEOS (It has been implemented in Shapely)

It might be worth to see if we can contribute to pygeos regarding the two latter limitations. I guess it would not require much effort.

GEOS also implements a quadtree, and I guess this could be easily wrapped in pygeos as well. That said, I don't know which cases exactly would benefit from a quadtree over a STRtree.

AABB Tree

From what I understand, AABB trees and R-Trees are pretty much the same thing.

The CGAL library has an implementation of AABB, and the new library scikit-geometry wraps some features and structures implemented in CGAL.

  • AABB in CGAL seems a bit more general than GEOS' STRTree (not only geo / 3 or more dimension?)
  • AABB is not yet available in scikit-geometry
  • Fast-loops or vectorized operations (e.g., using numpy arrays) are not yet well supported in scikit-geometry

Spatial indexes based on space filling curves

This is the approach used in most NoSQL database managers for geospatial indexing (scalability is critical for those database systems).

  • This does not allow exact search, although max. precision can be < 1 cm for the whole Earth.

Geohash is one example, for which there is a couple of Python (non-vectorized?!) implementations here and here. It can then be used for proximity searches, although it doesn't seem straightforward to do. I guess we could use numpy.sortedsearch for this. Unfortunately, dask does not support it yet (see dask/dask#4368).

s2geometry is a C++ library that seems to be used widely in database managers. It should be possible to wrap the simple use cases described here using pybind11 / xtensor-python, so that we can use it with Python / Numpy.

Parallel / distributed indexes

Neither of the solutions above support building distributed indexes, although s2geometry claims that its spatial partitioning cells could be reused for large distributed indexes.

@benbovy
Copy link
Member Author

benbovy commented Jul 2, 2020

s2geometry is a C++ library that seems to be used widely in database managers.

I missed h3, and h3-py.

@benbovy
Copy link
Member Author

benbovy commented Jul 2, 2020

Another interesting reference: spatialpandas has an implementation of a static Hilbert R-tree in numba.

@benbovy
Copy link
Member Author

benbovy commented Aug 11, 2020

Some more related work:

CellTree2d for cell location in (2D) unstructured grids. I seems to require full grid topology (faces, vertices).

Link to the paper: https://escholarship.org/content/qt0vq7q87f/qt0vq7q87f.pdf

There is a numba implementation here: https://github.com/Huite/xugrid/blob/master/xugrid/geometry/cell_tree.py. This one seems to support bbox (range?) queries, in addition to point queries.

@Huite
Copy link

Huite commented Aug 12, 2020

There's also a C++ CellTree (as well as a bunch of other structures) in vtk: https://gitlab.kitware.com/vtk/vtk/-/blob/master/Filters/General/vtkCellTreeLocator.cxx
I did some looking around, but there didn't seem to be Python bindings, but I'm not quite sure I looked very well:
https://gitlab.kitware.com/vtk/vtk/-/tree/master/Wrapping/Python
I just did another check, seeing what tab completion turns up on import vtk; vtk. ... and indeed there is a vtkCellTreeLocator. But I doubt you can just pass in a numpy array with the vertices and the faces.

There's another R*Tree implementation (C++, Boost) in pyinterp: https://github.com/CNES/pangeo-pyinterp#unstructured-grids
(BSD 3 license)

I still intend to implement a ray tracing algorithm for the numba cell tree. The bbox query is supposed to collect all cells within and x and y interval yes (I've barely tested it though, might still contain silly mistakes).

There's some advantages / versatility to R*Tree and CellTree data structures, since they store a bounding box of every cell. I was looking at some mesh-to-mesh remapping. The available kD trees didn't help me much with this, since they only seem to work with point data.

I also did some brief benchmarking for the kDtrees. If I remember correctly, the scipy ckDTree outperformed the pykdtree (to my suprise, but I think the scipy ckDtree has been rewritten at some point). The scikit kDtree was the slowest (but nothing over a factor of 5). The numba implementation was actually the fastest, mentioned here: #9
But it was multi-threading, and I'm not sure that was working properly for the others (if available).

At any rate, it would be pretty interesting to define a number of benchmarks. These wouldn't just test the different implementations, but also feature some typical use cases (e.g. somewhat structured data versus random, etc.)

@benbovy
Copy link
Member Author

benbovy commented Aug 20, 2020

Thanks for those additional pointers @Huite!

There seems to be a general motivation recently in (re)implementing index trees in numba. I think numba is now mature and featured enough to be a pretty safe choice. It is also very convenient for prototyping.

I think we'll start in xoak with structures for indexing point data, but structures like CellTree will likely be useful at some point.

I agree that proper benchmarks are needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants