Implement 2d sel via Xarray accessor? #1
Here's a discussion about distributed KDTrees (https://gist.github.com/brendancol/a3dd4a35ecd94660411112999923d561).
@benbovy Here's a gist (better view on nbviewer) with a very simple ndim selection example.
Thanks @willirath. Some thoughts:
I think there are two aspects here, both of which we can understand and optimize separately (if we design this properly):
# haversine metric
points["temp"] = ds.ndsel.sel(x=points["x"], y=points["y"], metric="haversine")["temp"]
# arbitrary metric
points["temp"] = ds.ndsel.sel(x=points["x"], y=points["y"], metric=metric_func)["temp"]
# or even
points["temp"] = ds.ndsel.sel(points, metric="haversine")["temp"]

For this, we can afford really inefficient internals.
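For such a deliberately inefficient first iteration, the internals could even be a brute-force haversine argmin. A sketch (not from the gist; names are illustrative):

import numpy as np

def naive_nearest_index(grid_lat, grid_lon, pt_lat, pt_lon):
    # brute force: evaluate the haversine term against every grid point, take the argmin
    lat1, lon1 = np.deg2rad(grid_lat), np.deg2rad(grid_lon)
    lat2, lon2 = np.deg2rad(pt_lat), np.deg2rad(pt_lon)
    hav = (np.sin((lat2 - lat1) / 2) ** 2
           + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # great-circle distance is monotonic in hav, so the argmin is the same point
    return int(np.argmin(hav))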
I think we should go for the implementation of the API first, because that makes it easier to play with / compare different optimizations.
Thanks for pointing me to this blog post. Sounds like this is a good way to go for the trees. As Jake's version seems to support arbitrary Python functions as metric, we can generalize this later.
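For what it's worth, sklearn's BallTree (which comes up below) also accepts a plain Python callable as the metric, so a generalized metric could plug in like this sketch (the toy metric is only an illustration, and callable metrics are much slower than the built-in ones):

import numpy as np
from sklearn.neighbors import BallTree

def toy_metric(u, v):
    # any user-defined distance on two 1-d coordinate vectors (pure Python, hence slow)
    return np.sqrt(((u - v) ** 2).sum())

tree = BallTree(np.random.rand(1000, 2), metric=toy_metric)
dist, idx = tree.query(np.random.rand(5, 2), k=1)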
Our target positions can get quite messy. If you think of all the autonomous floats and drifters, you get points that are not really easy to order, because these devices are scattered all across the globe. Here are the latest positions of the ARGO array. This (with completely new positions every ~10 days) is about the most difficult use case we'll have to handle.
I think going for fully distributed tree construction does not really matter. But (for chunked coords and / (x)or chunked ship sections) we could do something like:
Yes, we should think about this. It's probably difficult to do this in a way that survives on disk and across versions. But we could aim at only calculating trees once per session.
For the public API, I was thinking of something like

ds_selected = ds.xsomething.sel_nearest(lat=lat_points, lon=lon_points)
The API should be compatible with Xarray's advanced indexing.

We could also provide a convenient method for the case where we have a set of points (observations) already in a Dataset or DataArray object:

ds_selected = ds.xsomething.reindex_like(ds_points)

Not sure
Ah, that's good to know, I hadn't thought about this case!
So, do we want to support the case where the number of observation points is very high? I haven't thought about that either.
I quickly tested sklearn's BallTree implementation here; it works pretty well (fast and low memory footprint for 1e6 points).
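Roughly what such a test looks like, as a sketch with toy data (sklearn's haversine metric expects [lat, lon] in radians):

import numpy as np
from sklearn.neighbors import BallTree

# toy stand-ins for model grid and observation positions (in degrees)
rng = np.random.default_rng(0)
grid_lat, grid_lon = rng.uniform(-90, 90, 1_000_000), rng.uniform(-180, 180, 1_000_000)
obs_lat, obs_lon = rng.uniform(-90, 90, 10_000), rng.uniform(-180, 180, 10_000)

tree = BallTree(np.deg2rad(np.column_stack([grid_lat, grid_lon])), metric="haversine")
dist_rad, idx = tree.query(np.deg2rad(np.column_stack([obs_lat, obs_lon])), k=1)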
While our use case is positions on the globe, I'd strongly favor some implementation that uses arbitrary ndim indexing with an arbitrary metric and a simple wrapper for georeferenced data. Firstly, there are use cases that are very close to global and realistic Ocean model grids (although they're not really applicable to observational data from the real world): Idealized models with rectangular boxes are often used by the same people that are also interested in the realistic data. Secondly, the ndim selection could easily add value for people that are not Ocean or climate modellers without really adding complexity on our side. (I'm thinking of people who do stuff like plasma simulations in toroidal geometry.) Thirdly (not sure people say this...), clearly separating out the metric ensures we design and code in a generalizable way, while just hard coding dim names would not.
Looks good. I was initially feeling a little hesitant to rely on scikit-learn.
(Although I'm still liking the idea of going for a wider use case. But this is probably something that will eventually be built elsewhere and closer to the Xarray core devs?)
I'd also be in favor of a generic solution for indexing n-d coordinates! However, I wouldn't spend too much time trying to settle the API now. We are going to work very soon on Xarray's flexible indexes refactoring. Once indexes are considered first-class citizens of the Xarray data model, the recommended approach for addressing our problem will be to reuse/provide a custom index class and then directly use Dataset's or DataArray's .sel().

Using an accessor should be a temporary solution before flexible indexes are implemented (hopefully within the next 12 months), so it's probably best to avoid generalizing it too much. Ideally, the accessor methods should mimic the current .sel() API.

Providing generic index classes for some use-cases, including n-d irregular data indexing, is also part of the Xarray development roadmap. I think this will require input from the broader community.
If this is something we want to support as soon as possible (< 12 months), and if those idealized models use non-regular grids, then flexible metrics make perfect sense. But this is something we could still easily support later, if needed.
For the accessor API, we'll need to think about how to handle index construction and index queries. There are two options:

A. implicit index construction (cache)

# 1st time call: triggers index construction
ds.balltree.sel(lat=[...], lon=[...])
# reuse the built index
ds.balltree.sel(lat=[...], lon=[...])

B. explicit index construction

ds.balltree.set_index(['lat', 'lon'], metric='haversine')
ds.balltree.sel(lat=[...], lon=[...])

I'm in favor of option B, which is closer to what it will (most probably) look like after the indexes refactor in Xarray. Additionally, we could provide a
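A rough sketch of what option B could look like as an accessor, assuming sklearn's BallTree and lat/lon coordinates in degrees; all names and internal details are provisional, and only the index construction and raw query are shown (a real .sel() would wrap the query and return a Dataset):

import numpy as np
import xarray as xr
from sklearn.neighbors import BallTree

@xr.register_dataset_accessor("balltree")
class BallTreeAccessor:
    def __init__(self, ds):
        self._ds = ds
        self._tree = None
        self._coords = None

    def set_index(self, coords, metric="haversine", **kwargs):
        # explicit index construction (option B): build the tree once from the
        # flattened coordinates, in radians for the haversine metric
        self._coords = list(coords)
        pts = np.column_stack([np.deg2rad(self._ds[c].values.ravel()) for c in self._coords])
        self._tree = BallTree(pts, metric=metric, **kwargs)

    def query(self, **indexers):
        # return flat positions of the nearest grid points; .sel() would translate
        # these into 2-d indices and call .isel() on the dataset
        pts = np.column_stack([np.deg2rad(np.asarray(indexers[c])) for c in self._coords])
        _, idx = self._tree.query(pts, k=1)
        return idx.ravel()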
If we are going for a more generic solution (which is probably not a bad idea after all :-) ), then I would rather choose a descriptive index name for the name of the accessor, e.g.,
I was too, and I'm still a bit hesitant about depending on the whole scikit-learn library just for reusing its balltree implementation. I'll have a look at the code to see if it would make sense to copy it instead.
Are you available for a quick (5 min) Skype call? I'll be in meetings for most of the rest of the day, then.
For the initial development, depending on it is fine, I think.
So this gist contains a preliminary implementation of an xarray extension for point-wise indexing of irregular data using a ball tree. A few comments:
This is extremely cool! I'll add a few tests later today or tomorrow morning. Also, I'll play with a few cases where Dask is used for the coords and for
I'm wondering if the haversine metric expects the coordinates in radians?
According to the docs, the haversine distances are returned in units of radians too. But this is something that could be inconsistent across different implementations of the metrics?
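For sklearn's implementation that means converting on the way in and on the way out, e.g. (reusing the names from the BallTree sketch further up):

dist_rad, idx = tree.query(np.deg2rad(np.column_stack([obs_lat, obs_lon])), k=1)  # inputs in radians
dist_km = dist_rad * 6371.0  # haversine output is in radians; scale by the mean Earth radius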
@benbovy I've parametrized your gist: https://nbviewer.jupyter.org/gist/willirath/e0380c02da41568eb91bfaf509faefff to
Nice! What are the largest grid sizes for NEMO/FESOM? What would be the largest number of indexer points? I'm wondering if/where we should put effort into enabling dask support (out-of-core / parallel / lazy computation). Possible issues/bottlenecks are:
We could reduce 2 and 3 by increasing the leaf size of the ball tree. This would make queries less efficient, but this is an embarrassingly parallel problem and supporting queries with dask might be as simple as inserting a

Unfortunately, I have no idea on how to address 1 with a simple and efficient solution. The solution that you referenced above looks quite easy to implement, but it is not optimal. Instead of searching in all trees, and considering that NEMO/FESOM grids are dense, we could build a forest of ball trees from overlapping regions, then wrap the query to first determine which tree(s) to use. This is getting quite complicated and it is still not a robust solution, though.
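One way such a dask-backed query could look, as a sketch only (it assumes the ball tree fits in memory on each worker and that the lat/lon indexers are aligned 1-d dask arrays; all names are illustrative):

import dask.array as da
import numpy as np

def _query_chunk(lat_chunk, lon_chunk, tree=None):
    # nearest-neighbour query for one chunk of target positions against the shared tree
    pts = np.deg2rad(np.column_stack([lat_chunk, lon_chunk]))
    _, idx = tree.query(pts, k=1)
    return idx.ravel()

# obs_lat, obs_lon: aligned 1-d dask arrays of target positions; tree: a pre-built BallTree
nearest_idx = da.map_blocks(_query_chunk, obs_lat, obs_lon, tree=tree, dtype="int64")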
Size of the problem
Max number of points on model grid

Our current "big" models have

Max size of indexers

A worst / most expensive case would be a study where we want to sample all the surface positions of all the autonomous devices observing the Ocean. The two biggest efforts are: ~1300 surface drifters measuring surface data continuously and ~4000 ARGO floats sampling profiles every ~10 days. Just handling all the positions for one point in time would need indexers with a length of approx. 10_000 positions. Vessel-based measurements might be in the same range as well. But it gets hard when we go for the high time resolution that is available from, e.g., the surface drifters: with hourly positions from the real drifters, we'd have approx

Taking this to the next level (even though this is beyond the virtual field campaign scope) would be going for millions of virtual particles, each sampled at high time resolution, as is proposed in a paper by Ryan.

Bottom line

We should estimate how much work it is to get to 100_000_000 grid points and selectors of length 10_000_000 in minutes.

Dask / other optimizations
That's an interesting thought. We should at least see how far we get there. Queries are easy to parallelize, because we just need to split the selector ds. But it won't leverage much of the parallel resources we usually have.
I think we can assume that the number of blocks / chunks of the grid is always much smaller than the number of grid points per chunk. (Example: With 150_000_000 grid points, at double precision, we'd have ~ 2.5 GB for lat + lon. With Dask's default chunking, this would be ~ 10 to 20 chunks of ~ 125 to 250 MB each.) As, from a user's perspective, only wall time matters, we'll need to know the timings for
I agree that it's probably just a few
Loading the data from disk is not something we can optimize in a development setting (laptop or other single machine). In real use cases with truly distributed Dask workers, we see that the IO bandwidth scales linearly with the number of workers / nodes we're on.
We can (sort of) assume dense grids for NEMO. But there are exceptions, like folding the Med Sea onto Africa so that a few CPUs in an MPI setting are kept busy that would otherwise just handle very few grid points, etc. Also, on the FESOM side, there is a way of using space-filling curves to number the nodes. But from what I gathered from talking with AWI people, this cannot be taken for granted. So we should be able to handle the worst case: completely unstructured grid positions.
Thanks for all the details about the size of the problem, that helps a lot!
I'm not sure I fully understand your last point. It all depends on whether/how the indexers are chunked, right? I was thinking of just reusing the indexer chunks (if they are chunked) to perform the queries in parallel. It requires that the chunks of each indexer (lat, lon) are aligned (we could re-chunk internally, but it would be better if it is done explicitly and externally IMO). One potential big limitation with those parallel queries is in a distributed setting. Because the ball tree structure is not distributed, this would result in poor performance (lots of data transfer) and/or high memory consumption (ball tree index data copied to all workers).
Are you going to try implementing chunked queries on a single tree? Or constructing a forest of ball trees and performing queries using this forest? Or both? I'm not sure how dask arrays would speed up the construction of a single ball tree. Coordinate transform (if any) and stack operations are cheap compared to the sequential computation of the ball tree.
Before considering disk IO optimization, I was rather thinking about the memory limits / out-of-core issues of (1) on small machines.
Yes, that's what I had in mind. Assuming alignment of lat and lon is something we should do.
I'm still working on the timings. So far: building a tree from 125 MB of lat+lon takes 20-30 seconds (on a single MacBook CPU core), while querying for 125 MB of positions takes forever (not sure how long, it's still running... :/). So the biggest benefit would be from chunked queries on a single tree.
I'm thinking in the direction of "have a flat collection of single-machine kd-trees and, when we need to query something, we check them all". This is not optimal. But we can gain a lot more (wrt wall time) if we just make use of easy scaling to multiple CPUs and accept the overhead of finding the actual nearest neighbors among the candidates returned by each tree.
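A sketch of that "check them all, then reduce" step, assuming a list of per-chunk trees plus each chunk's offset into the flattened grid index (names are illustrative):

import numpy as np

def query_forest(trees, offsets, targets_rad):
    # query every per-chunk tree, then keep the globally nearest candidate per target
    dists, idxs = [], []
    for tree, offset in zip(trees, offsets):
        d, i = tree.query(targets_rad, k=1)
        dists.append(d.ravel())
        idxs.append(i.ravel() + offset)  # shift to indices into the full, flattened grid
    dists, idxs = np.stack(dists), np.stack(idxs)  # shape: (n_trees, n_targets)
    best = dists.argmin(axis=0)
    return idxs[best, np.arange(idxs.shape[1])]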
Ah. Not sure memory use for the coords is the first wall we'll hit. Coords are tiny compared to the data. The horizontal positions of the ENORMOUS LLC4320 cost 2.5 GB.
I'm curious how this would go if you reduce the leaf size. Probably the size of the tree would explode :-)
Yes, I agree, in the end this could take more advantage of a highly distributed computing environment. Ideally, we should provide different solutions (i.e., different indexes / accessors) in a package so that we can pick the best one depending on the problem and the computing environment. Maybe other structures are more adapted? Here we don't take full advantage of the ball tree, which outperforms other tree-based indexing structures especially for high-dimensional data. Maybe R-Trees would help? I've found this paper about a distributed R-Tree structure. R-Trees require more storage than kd-trees or ball trees, but in our 2-dimensional case that's not a big issue. I don't know if it's compatible with any metric, though (e.g., haversine). Also, we would have to implement it.
Another question, closer to the scientific aspect: what are the potential implications of doing unconstrained / naive nearest neighbor lookup? What if the selected nearest neighbor corresponds to a node where you need to cross land area before reaching it? Is that a big issue for the scientific applications? |
This can be a problem. We could probably solve this by implementing a metric that checks for this, but this easily gets really complicated. Also, for the FESOM data, there are no land points in the grid. So we'd start handling this with shape files, and then we'd need a way of ensuring that these shapes really represent the model geometry rather than that of the real Earth (people use little modifications to the model topography, especially in regions where the model resolution is of the same order as geographical features like seamounts, islands, straits, etc). So I think we should handle these cases outside of the balltree activities for now. What could be helpful would be a way of explicitly returning the distances (like the tolerance approach, but with the user being able to check this outside of the accessor).
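For example, a sketch of such a user-side check, reusing the sklearn query names from above (the 50 km tolerance is only an illustration):

dist_rad, idx = tree.query(obs_rad, k=1)      # obs_rad: stacked [lat, lon] in radians
dist_km = dist_rad.ravel() * 6371.0           # convert haversine radians to kilometres
close_enough = dist_km <= 50.0                # the user decides what "too far away" means
valid_idx = idx.ravel()[close_enough]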
Yes, I don't know whether or not it's an edge case, but this could be addressed later. In the mid/long-term, it will be nice to have implementations of different tree-based indexes (xarray compatible) in a dedicated package. What would be nice is a flexible numba implementation that would allow plugging in user-defined, jitted metric functions.
Maybe like this:

But I'm not sure if we really need it.
It would at least be very easy to support KD trees and ball trees from scikit-learn. Maybe it would be a good design decision to make the tree construction and the query completely pluggable? If I see that correctly, we'd only have to make self._index = BallTree(X, **kwargs) from
This is a good idea. |
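A minimal sketch of that pluggability, assuming sklearn's KDTree and BallTree (which share the same constructor and query interface); the class and parameter names are illustrative:

from sklearn.neighbors import BallTree, KDTree

class TreeIndex:
    def __init__(self, points, tree_cls=BallTree, **kwargs):
        # anything with a (points, **kwargs) constructor and a .query(X, k) method fits here
        self._index = tree_cls(points, **kwargs)

    def query(self, targets, k=1):
        return self._index.query(targets, k=k)

# e.g. TreeIndex(pts, tree_cls=KDTree, leaf_size=40) or TreeIndex(pts, metric="haversine")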
Looking through what we learned in this issue discussion, I wonder if we should move

Also, we could rename it to
NNTree is already taken for a neural-network based tree. But what about ANTS?
I think it would be a good design choice, unless the logic currently put in
Yes!
I like it. Although the library scope could be extended to other kinds of selection than nearest neighbor lookup. For example, a range tree implementation would be useful for orthogonal indexing. We could also use kd-trees and ball trees for point radius selection. What about " |
This would probably happen if we use tricks to parallelize / distribute the trees.
I like
Just a quick heads-up: I'll create a repo with the latest version from your Gist. |
Great, thanks! I don't have a strong opinion about package structure. I haven't used poetry yet. Recently I've started putting source files under

Do you mind if I have commit rights on this repo? I'll continue contributing through PRs anyway.
I think you have them per your status in ESM-VFC. Let me check... |
There's no visible merge button on my side. |
It works now, thanks! |
Needed to add write access per repository. I've added you as full admin. |
We have https://git.geomar.de/python/xorca_lonlat2ij which basically finds the model grid indices (i, j) for a given lat, lon pair. (xorca_lonlat2ij.get_ij is a good entry point to get an overview of the logic.) Then, we usually use these indices for a vectorized .sel() as seen here: https://github.com/ESM-VFC/esm-vfc-workflows/blob/7edd9bbc7cd3532c0537f89f667154a9dd27156d/meteor_m85.1_vs_NEMO_FOCI_hydrography/compare_m81.1_hydrography_with_ORCA05_FOCI_Test_data.ipynb

What I'd like to have is this implemented as an Xarray accessor or something that is similarly easy to use.
This would not act on the staggered C grids, which we'll need XGCM for, but it could be easily generalized to XGCM-compatible datasets if xgcm/xgcm#200 became reality at a later point.
(@benbovy This could be a first thing to work on.)
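For reference, the vectorized selection step described above could look roughly like this (a sketch with made-up dimension names and toy indices, using isel since the indices are positional):

import numpy as np
import xarray as xr

# toy stand-ins: a model field on a (y, x) grid and integer indices as they
# would come out of xorca_lonlat2ij.get_ij (its signature is not shown here)
model_ds = xr.Dataset({"temp": (("y", "x"), np.random.rand(50, 100))})
i_idx = np.array([3, 17, 42])   # x indices of the observation points
j_idx = np.array([10, 5, 33])   # y indices of the observation points

# vectorized (point-wise) selection: both indexers share the new "obs" dimension
points = xr.Dataset({"i": ("obs", i_idx), "j": ("obs", j_idx)})
profiles = model_ds.isel(x=points["i"], y=points["j"])  # "temp" now has an "obs" dimension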