Broader overview of (geo)spatial indexes and existing / possible solutions in Python #12
Another interesting reference:
Some more related work: CellTree2d for cell location in (2D) unstructured grids. It seems to require the full grid topology (faces, vertices). Link to the paper: https://escholarship.org/content/qt0vq7q87f/qt0vq7q87f.pdf There is a numba implementation here: https://github.com/Huite/xugrid/blob/master/xugrid/geometry/cell_tree.py. This one seems to support bbox (range?) queries in addition to point queries.
There's also a C++ CellTree (as well as a bunch of other structures) in vtk: https://gitlab.kitware.com/vtk/vtk/-/blob/master/Filters/General/vtkCellTreeLocator.cxx

There's another R*Tree implementation (C++, Boost) in pyinterp: https://github.com/CNES/pangeo-pyinterp#unstructured-grids

I still intend to implement a ray tracing algorithm for the numba cell tree. The bbox query is supposed to collect all cells within an x and y interval, yes (I've barely tested it though, so it might still contain silly mistakes).

There are some advantages to the R*Tree and CellTree data structures in terms of versatility, since they store a bounding box for every cell. I was looking at some mesh-to-mesh remapping. The available kD-trees didn't help me much with this, since they only seem to work with point data.

I also did some brief benchmarking of the kD-trees. If I remember correctly, the scipy cKDTree outperformed pykdtree (to my surprise, but I think the scipy cKDTree has been rewritten at some point). The scikit kD-tree was the slowest (but by no more than a factor of 5). The numba implementation was actually the fastest, mentioned here: #9

At any rate, it would be pretty interesting to define a number of benchmarks. These wouldn't just test the different implementations, but would also feature some typical use cases (e.g. somewhat structured data versus random data, etc.)
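For reference, the bbox query described above can be written as a brute-force scan over per-cell bounding boxes, which is the result a CellTree or R*Tree is meant to reproduce faster. This is only a minimal NumPy sketch: the function names `cell_bounds` and `bbox_query_bruteforce` and the `(xmin, xmax, ymin, ymax)` layout are assumptions made for illustration, not the xugrid API.

```python
import numpy as np


def cell_bounds(vertices, faces):
    """Bounding box (xmin, xmax, ymin, ymax) of every cell of a 2D mesh.

    vertices: (n_vertices, 2) float array; faces: (n_cells, n_nodes) int array.
    Hypothetical helper, assumes all cells have the same number of nodes.
    """
    xy = vertices[faces]  # shape (n_cells, n_nodes, 2)
    return np.column_stack([
        xy[..., 0].min(axis=1), xy[..., 0].max(axis=1),
        xy[..., 1].min(axis=1), xy[..., 1].max(axis=1),
    ])


def bbox_query_bruteforce(bounds, xmin, xmax, ymin, ymax):
    """Indices of cells whose bounding box overlaps the query interval."""
    hit = (
        (bounds[:, 0] <= xmax) & (bounds[:, 1] >= xmin)
        & (bounds[:, 2] <= ymax) & (bounds[:, 3] >= ymin)
    )
    return np.flatnonzero(hit)


# Two triangles covering the unit square, plus one triangle far away.
vertices = np.array(
    [[0, 0], [1, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float
)
faces = np.array([[0, 1, 2], [0, 2, 3], [4, 5, 6]])
bounds = cell_bounds(vertices, faces)

near = bbox_query_bruteforce(bounds, 0.2, 0.4, 0.2, 0.4)
far = bbox_query_bruteforce(bounds, 5.5, 7.0, 5.5, 7.0)
```

Note that a bounding-box overlap only yields candidate cells; an exact cell/box intersection test would still be needed afterwards. A scan like this is O(n) per query, so it is mainly useful as a correctness check for a tree-based implementation.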
Thanks for those additional pointers @Huite! There seems to be a general motivation recently to (re)implement index trees in numba. I think numba is now mature and full-featured enough to be a pretty safe choice. It is also very convenient for prototyping. I think we'll start in xoak with structures for indexing point data, but structures like [...]

I agree that proper benchmarks are needed.
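To make the benchmark idea a bit more concrete, here is a rough sketch of a harness that runs nearest-neighbor candidates on "structured" versus "random" point sets and checks them against a brute-force baseline. Everything in it (`make_points`, `nearest_bruteforce`, `run_benchmark`, the dataset sizes) is hypothetical and only meant as a starting point; a tree-based implementation would be plugged in as another entry of `candidates` with the same `(points, queries) -> indices` signature.

```python
import time

import numpy as np


def make_points(kind, n_side=50, seed=0):
    """Return n_side**2 points in the unit square, on a grid or random."""
    if kind == "structured":
        x, y = np.meshgrid(np.linspace(0, 1, n_side), np.linspace(0, 1, n_side))
        return np.column_stack([x.ravel(), y.ravel()])
    rng = np.random.default_rng(seed)
    return rng.random((n_side * n_side, 2))


def nearest_bruteforce(points, queries):
    """Index of the nearest point for each query (O(n * m) baseline)."""
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def run_benchmark(candidates, n_queries=200, seed=1):
    """Time each candidate on each dataset kind.

    Returns {(kind, name): (elapsed_seconds, agrees_with_baseline)}.
    Index construction and querying are timed together here.
    """
    rng = np.random.default_rng(seed)
    queries = rng.random((n_queries, 2))
    results = {}
    for kind in ("structured", "random"):
        points = make_points(kind)
        expected = nearest_bruteforce(points, queries)
        for name, func in candidates.items():
            start = time.perf_counter()
            found = func(points, queries)
            elapsed = time.perf_counter() - start
            results[(kind, name)] = (elapsed, bool(np.array_equal(found, expected)))
    return results


candidates = {"bruteforce": nearest_bruteforce}
results = run_benchmark(candidates)
```

A more careful benchmark would time index construction and querying separately and repeat each measurement several times, since a single `perf_counter` interval is noisy.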
I did some further research on spatial indexes. Below are some interesting approaches (with links) that we could experiment with (after some development effort, though). We can use this issue to collect other approaches and to discuss them in general.
R-Tree / STR
The pygeos library (recently used by geopandas to support vectorized spatial operations) exposes the R-Tree structure STRtree implemented in GEOS: pygeos.strtree.
It might be worth seeing whether we can contribute to pygeos regarding the latter two limitations. I guess it would not require much effort.
GEOS also implements a quadtree, and I guess this could be easily wrapped in pygeos as well. That said, I don't know which cases exactly would benefit from a quadtree over a STRtree.
AABB Tree
From what I understand, AABB trees and R-Trees are pretty much the same thing.
The CGAL library has an implementation of AABB, and the new library scikit-geometry wraps some features and structures implemented in CGAL.
Spatial indexes based on space filling curves
This is the approach used in most NoSQL database managers for geospatial indexing (scalability is critical for those database systems).
Geohash is one example, for which there are a couple of Python (non-vectorized?!) implementations here and here. It can then be used for proximity searches, although it doesn't seem straightforward to do. I guess we could use `numpy.searchsorted` for this. Unfortunately, dask does not support it yet (see dask/dask#4368).

s2geometry is a C++ library that seems to be widely used in database managers. It should be possible to wrap the simple use cases described here using pybind11 / xtensor-python, so that we can use it with Python / NumPy.
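As an illustration of the Geohash idea mentioned above, the sketch below encodes points with the standard Geohash bit interleaving (longitude first, base-32 alphabet) and then uses `numpy.searchsorted` on the sorted hashes to find all points sharing a prefix. The helper names `geohash_encode` and `prefix_range` are made up for this example, and the encoder is a plain (non-vectorized) loop.

```python
import numpy as np

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a (lat, lon) pair in degrees as a Geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, nbits = 0, 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:  # every 5 bits form one base-32 character
            chars.append(_BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)


def prefix_range(sorted_hashes, prefix):
    """Half-open [start, stop) range of sorted hashes starting with prefix."""
    start = np.searchsorted(sorted_hashes, prefix, side="left")
    # "~" sorts after every character of the Geohash alphabet.
    stop = np.searchsorted(sorted_hashes, prefix + "~", side="left")
    return int(start), int(stop)


points = [(57.64911, 10.40744), (57.6495, 10.4080), (48.8566, 2.3522)]
hashes = np.array([geohash_encode(lat, lon) for lat, lon in points])
order = np.argsort(hashes)
sorted_hashes = hashes[order]

start, stop = prefix_range(sorted_hashes, "u4pr")
nearby = order[start:stop]  # indices into `points` within the "u4pr" cell
```

This also shows why the proximity search is not straightforward: two points can be very close and still fall in different cells (no common prefix), so a real search would have to look at the neighboring cells as well.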
Parallel / distributed indexes
None of the solutions above supports building distributed indexes, although s2geometry claims that its spatial partitioning cells could be reused for large distributed indexes.