Interpolation of geo-referenced data #629

Closed
fbriol opened this issue May 14, 2019 · 16 comments

@fbriol

fbriol commented May 14, 2019

To my knowledge, there is no Python library that can interpolate geo-referenced data both simply and very quickly. This is even more true for a mesh (a cloud of points) representing a geophysical quantity (the dynamic height of the oceans, for example).

Cartesian grids

For Cartesian grids, SciPy offers RegularGridInterpolator, which can be adapted to process geophysical data. The difficulty lies in correctly handling the discontinuity in longitude on a global grid. One method I tried to overcome this problem is to duplicate the first column of the matrix after the last one, in order to handle the transition across the Prime meridian properly. It works, but the performance is not great. The code below does not handle the longitude discontinuity.

import scipy.interpolate
import netCDF4

ds = netCDF4.Dataset("mss.nc")
z = ds.variables['mss'][:].T
interp = scipy.interpolate.RegularGridInterpolator(
    (ds.variables['lon'][:], ds.variables['lat'][:]),
    z.data,
    method='linear',
    bounds_error=False)
# 672 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# x and y are the 2D arrays of target longitudes and latitudes.
interp((x.flatten(), y.flatten()))
# 65.6 ms ± 97.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
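
For reference, a minimal sketch of the column-duplication workaround mentioned above, assuming a global grid whose longitude axis spans one full revolution (the names lon_wrap, z_wrap and qx are illustrative, not part of any library API):

import numpy as np

# Duplicate the first longitude slice after the last one, so that queries
# falling between the last and first longitudes have two bracketing nodes.
lon = np.asarray(ds.variables['lon'][:])
lat = np.asarray(ds.variables['lat'][:])
lon_wrap = np.append(lon, lon[0] + 360.0)
z_wrap = np.concatenate([z.data, z.data[:1, :]], axis=0)  # z is (lon, lat)

interp_wrap = scipy.interpolate.RegularGridInterpolator(
    (lon_wrap, lat), z_wrap, method='linear', bounds_error=False)

# Normalize the query longitudes into [lon[0], lon[0] + 360).
qx = (x.flatten() - lon[0]) % 360.0 + lon[0]
interp_wrap((qx, y.flatten()))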

By comparison, a method developed in C++ with a Python binding solves this problem in 46 ms in total on a single thread, handling the longitude discontinuity transparently. As the problem is easily parallelizable, the same computation takes only 15 ms with 12 threads. Here, the C++ routine is 2 to 4.5 times faster.

import pyinterp.core as core

interp = core.interp2d.Bivariate(
    core.Axis(ds.variables['lon'][:], is_circle=True),
    core.Axis(ds.variables['lat'][:]),
    z.data)
# 387 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
interp.evaluate(x.flatten(), y.flatten(),
                core.interp2d.Bivariate.Type.kNearest, num_threads=1)
# 46.2 ms ± 473 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
interp.evaluate(x.flatten(), y.flatten(),
                core.interp2d.Bivariate.Type.kNearest, num_threads=12)
# 14.6 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Unstructured grids

For unstructured grids, I used the KDTree available in SciPy. This object can handle the problem, as long as the number of points is not too large. Here is the Python code I used to interpolate my mesh.

import numpy as np
import xarray as xr
import scipy.spatial as spatial
import pyproj

def lla_2_ecef(x, y):
    """Convert geodetic coordinates to Earth-centered Cartesian ones."""
    ecef = pyproj.Proj(proj='geocent', ellps='WGS84', datum='WGS84')
    lla = pyproj.Proj(proj='latlong', ellps='WGS84', datum='WGS84')

    return pyproj.transform(
        lla, ecef, x, y, np.zeros(x.shape), radians=False)

def nearest(x, y, tree, k=4):
    """Query the k nearest mesh nodes for each target point."""
    points = np.dstack(lla_2_ecef(x, y))[0]
    return tree.query(points, k=k, n_jobs=-1)

def interpolate(distance, index, values):
    """Inverse-distance weighting of the selected neighbours."""
    x = values.flatten()[index.flatten()]
    x = x.reshape(distance.shape)
    mask = ~x.mask

    wk = (1 / (distance * distance)) * mask
    n = np.sum(mask, axis=1)
    result = (np.sum(x * wk, axis=1) / np.sum(wk, axis=1)).data
    result[n == 0] = np.nan
    return result

ds = xr.open_dataset("ssu.nc")
xc = ds.lon.values
yc = ds.lat.values
x, y, z = lla_2_ecef(xc.flatten(), yc.flatten())
kdtree = spatial.cKDTree(np.vstack([x, y, z]).T)
lon = np.arange(-180, 180, 0.125)
lat = np.arange(-90, 90, 0.125)
mlon, mlat = np.meshgrid(lon, lat, indexing='xy')
result = nearest(mlon.flatten(), mlat.flatten(), kdtree)
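
For completeness, the two helpers above chain together as follows (a sketch; it assumes the mesh variable in ssu.nc is named ssu and is read as a masked array, which the original snippet does not show):

values = ds.ssu.to_masked_array()   # masked array of mesh values (name assumed)
distance, index = result            # cKDTree.query returns (distances, indices)
gridded = interpolate(distance, index, values).reshape(mlat.shape)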

This code works and handles the longitude discontinuity. But when the problem size grows (I tried to interpolate a mesh of 74,649,600 points), the algorithm no longer completes within a reasonable time (a few hours).

I coded an interpolator using the R-tree implemented in the C++/Boost library. For the same problem size, the time elapsed with this new library is amazing: 11 seconds to build the tree, against more than 30 minutes with cKDTree, and 4.5 seconds to perform the interpolation onto a numerical grid of 2880 × 1440 points.

mesh = core.interp2d.Mesh()
mesh.packing(xc.flatten(), yc.flatten(), data.flatten())
# 11 s ± 601 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
v, n = mesh.interpolate(
    mlon.flatten(), mlat.flatten(),
    radius=20000, k=4, around=False, num_threads=0)
# 4.54 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) with 12 threads

Could the small library I wrote be of any interest to your community? Have you been able to solve this same type of problem easily and efficiently using another method or library that I may not know of?

@rabernat
Member

@fbriol this is great! Thanks for sharing!

There has been a lot of discussion in our community about how best to provide this important functionality (see e.g. #356, where we discuss all the different packages related to "gridding"). Your experiments could be an important part of the solution.

It would be awesome to explore using this prototype to build a custom indexer for Xarray (see pydata/xarray#475).

@fbriol
Author

fbriol commented May 14, 2019

Finding the nearest points is only part of the problem if you want to compute an interpolation and not an extrapolation. For each point, once the nearest neighbors have been selected, you must check that the point lies within the polygon formed by its neighbors. I implemented this in my prototype, and this check takes little time with Boost. I don't think it's possible to do this check in a reasonable time with a KDTree, but maybe I'm wrong.

%timeit mesh.idv(x.flatten(), y.flatten(), around=False, radius=18000, k=16, num_threads=12)
# 1.09 s ± 26.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mesh.idv(x.flatten(), y.flatten(), around=True, radius=18000, k=16, num_threads=12)
# 1.11 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
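
For illustration only, the containment check described above could be sketched in pure Python with shapely (this is not the Boost implementation, and it would be far slower on millions of points):

import numpy as np
from shapely.geometry import MultiPoint, Point

def inside_neighbours(px, py, neighbours_x, neighbours_y):
    # True if (px, py) lies within the convex hull of its neighbours,
    # i.e. the query is an interpolation, not an extrapolation.
    hull = MultiPoint(np.column_stack([neighbours_x, neighbours_y])).convex_hull
    return hull.contains(Point(px, py))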

I will try to write an indexer for xarray using my prototype.

@fbriol
Author

fbriol commented May 21, 2019

Hello,

I didn't fully understand how the indexes work in xarray, but if you give me a complete example I could take a look at it.

I have modified a large part of my interpolation software to extract the different indexers used: R-tree (Cartesian & geodetic) and 1D axis.

I have created a Conda package, for the moment only for Linux, but I will make Windows and OSX versions tonight. You can try it; to install: conda install pyindex -c fbriol. Python 3.7 only. The online help is accessible using the help command.

There is no need to search for the source code: the core library is a dynamic shared module written in C++. For the moment, the code is not public, but I could publish it if the library proves useful.

You will find an example of using the different objects in the GIST.

@amfriesz

@fbriol, have you looked into pyresample (https://pyresample.readthedocs.io/en/latest/#)? There might be some overlap with what you've presented here.

@fbriol
Author

fbriol commented May 21, 2019

Yes, I know this library, but like all the Python libraries I have used, it is not very efficient and does not handle longitude discontinuities well enough. The geographical trees used are KD-trees, and this structure is not very efficient when the number of points becomes large.

The R-tree structure is much more efficient and is not implemented in other libraries.

Today, I use a cross-platform library written entirely in C++ that is very efficient, and I am looking for ways to share it. A version is available on my Anaconda channel if you want to try it: conda install pyinterp -c fbriol

@jhamman
Member

jhamman commented May 21, 2019

I don't want to dissuade you in any way from working on the flexible indexes work in xarray (this is super important work) but I did want to point out Verde which already exists and fits in this space (cc @leouieda).

@leouieda

@jhamman thanks for the shout out. If @fbriol wants performance, then Verde is probably not the best solution yet. We're working on performance but it will never be as fast as a nearest-neighbor algorithm. The longitude problem is something we're trying to address as well (see fatiando/verde#173 and fatiando/verde#181). I'm curious to see what you come up with and having native support in xarray would be a boost to everyone who relies on it (Verde included).

@apatlpo
Member

apatlpo commented May 22, 2019

@fbriol, would it be too early for you to talk about this (even briefly) at tomorrow's meeting in Paris?

@delandmeterp
Contributor

Hi @fbriol,

Thanks for this great library! It would be very useful for Parcels.

However, a difference between Parcels and what you propose is that you interpolate from a cloud of points, while we interpolate using a mesh, which includes a topology.

In this small notebook, I added the part that we would need in Parcels (we use structured curvilinear grids). Should we implement this extension in our own code or is it already present in your library?

@fbriol
Author

fbriol commented Jun 5, 2019

Can you send me or make available one of your files so that I can do a unit test? I will look at what can be done.


@delandmeterp
Contributor

delandmeterp commented Jun 5, 2019

The mesh file is available here
https://surfdrive.surf.nl/files/index.php/s/ozDfKsPAr7dxLLv

Thanks!

@fbriol
Author

fbriol commented Jun 7, 2019

Hello,

I just looked at your code. I think what you're doing is too specific to introduce into the library. On the other hand, you can significantly accelerate your code by changing your approach: query the tree for all your points at once, process them all, and remember the points that could not be resolved; then restart the search only for the points for which the interpolation did not work. Example:

import numpy as np

def find_cell(tree, lon, lat, x, y, k=16):
    if k > 2000:
        raise ValueError(
            'find_cell must not request more than k=2000 nearest '
            'neighbours in the rtree query')

    n, m = lon.shape

    _, index = tree.query(np.asarray((x, y)).T, k=k, within=False)
    result = np.empty((index.shape[0], 2))

    # You can also accelerate this loop with numba by taking it out of
    # this function. `loop` is the cell-search routine from the notebook.
    for ix in range(index.shape[0]):
        yi = (index[ix] // m).astype(int)
        xi = (index[ix] % m).astype(int)
        xi_final, yi_final, xsi, eta = loop(lon, lat, x[ix], y[ix], xi, yi, n, m)
        if xi_final == -1:
            result[ix, 0] = result[ix, 1] = np.nan
        else:
            result[ix, 0], result[ix, 1] = interpolate_lonlat(
                lon, lat, x[ix], y[ix], xi_final, yi_final, xsi, eta)
    return result

def find_particles(n):
    plon = np.random.uniform(-180.0, 180.0, n)
    plat = np.random.uniform(-70.0, 88.0, n)
    result = find_cell(tree, glon, glat, plon, plat)
    idx = np.where(np.isnan(result[:, 0]))
    # Restart the search on the unresolved points with k*2:
    # find_cell(tree, glon, glat, plon[idx], plat[idx], k=32)

This approach allows you to remove your main loop, which will speed things up. On my machine, two to three iterations (32 and 64 neighbours) were enough to process your entire test set.
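
A minimal sketch of that driver loop, reusing find_cell from above (the doubling strategy and the k_max bound are illustrative):

def find_cells_iterative(tree, lon, lat, x, y, k=16, k_max=2000):
    # Process all points once, then retry only the unresolved ones with
    # twice as many neighbours, until everything resolves or k hits k_max.
    result = find_cell(tree, lon, lat, x, y, k=k)
    unresolved = np.where(np.isnan(result[:, 0]))[0]
    while unresolved.size > 0 and k * 2 <= k_max:
        k *= 2
        result[unresolved] = find_cell(
            tree, lon, lat, x[unresolved], y[unresolved], k=k)
        unresolved = unresolved[np.isnan(result[unresolved, 0])]
    return result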

Then, you can accelerate the interpolation functions that contain loops using numba. For example:

@numba.njit()
def interpolate_lonlat(lon, lat, x, y, xi, yi, xsi, eta):
    '''bi-linear interpolation within cell [j:j+2, i:i+2] as a function of the relative coordinates
       Here we simply interpolate the x, y coordinates, retrieving original coordinates
    '''

    phi = np.array(
        [(1 - xsi) * (1 - eta), xsi * (1 - eta), xsi * eta, (1 - xsi) * eta])
    px = np.array(
        [lon[yi, xi], lon[yi, xi + 1], lon[yi + 1, xi + 1], lon[yi + 1, xi]])
    px = np.where(px[:] - x > 180, px - 360, px)
    px = np.where(px[:] - x < -180, px + 360, px)
    py = np.array(
        [lat[yi, xi], lat[yi, xi + 1], lat[yi + 1, xi + 1], lat[yi + 1, xi]])
    #np.allclose([np.dot(phi, px), np.dot(phi, py)], [x, y])
    return np.dot(phi, px), np.dot(phi, py)

@numba.njit()
def get_relative_coordinates(lon, lat, x, y, xi, yi):
    '''returns relative coordinates xsi, eta
       that are the coordinates of the (x, y) point remapped into a square cell [0,1] x [0,1]
    '''
    invA = np.array([
        1.0, 0.0, 0.0, 0.0, -1.0, 1.0, 0.0, 0.0, -1.0, 0.0, 0.0, 1.0, 1.0,
        -1.0, 1.0, -1.0
    ]).reshape(4, 4)
    px = np.array(
        [lon[yi, xi], lon[yi, xi + 1], lon[yi + 1, xi + 1], lon[yi + 1, xi]])
    px = np.where(px[:] - x > 180, px - 360, px)
    px = np.where(px[:] - x < -180, px + 360, px)
    py = np.array(
        [lat[yi, xi], lat[yi, xi + 1], lat[yi + 1, xi + 1], lat[yi + 1, xi]])
    a = np.dot(invA, px)
    b = np.dot(invA, py)

    aa = a[3] * b[2] - a[2] * b[3]
    bb = (a[3] * b[0] - a[0] * b[3] + a[1] * b[2] - a[2] * b[1]
          + x * b[3] - y * a[3])
    cc = a[1] * b[0] - a[0] * b[1] + x * b[1] - y * a[1]
    if abs(aa) < 1e-12:  # Rectilinear cell, or quasi
        eta = -cc / bb
    else:
        det2 = bb * bb - 4 * aa * cc
        if det2 > 0:  # so, if det is nan we keep the xsi, eta from previous iter
            det = np.sqrt(det2)
            eta = (-bb + det) / (2 * aa)
        else:  # should not happen, apart from singularities
            eta = 1e6
    if abs(a[1] + a[3] * eta) < 1e-12:
        # This happens when a rectilinear cell is rotated by 90 degrees.
        xsi = ((y - py[0]) / (py[1] - py[0]) +
               (y - py[3]) / (py[2] - py[3])) * .5
    else:
        xsi = (x - a[0] - a[2] * eta) / (a[1] + a[3] * eta)
    return (xsi, eta)
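
For context, the quadratic solved above is the standard inverse bilinear mapping. Writing the forward map in monomial form, with $a = \mathrm{invA} \cdot px$ and $b = \mathrm{invA} \cdot py$ as in the code,

$$x = a_0 + a_1\xi + a_2\eta + a_3\xi\eta, \qquad y = b_0 + b_1\xi + b_2\eta + b_3\xi\eta,$$

eliminating $\xi = (x - a_0 - a_2\eta)/(a_1 + a_3\eta)$ from the second equation leaves a quadratic in $\eta$ whose coefficients are exactly the aa, bb and cc computed in the code:

$$(a_3 b_2 - a_2 b_3)\,\eta^2 + (a_3 b_0 - a_0 b_3 + a_1 b_2 - a_2 b_1 + x b_3 - y a_3)\,\eta + (a_1 b_0 - a_0 b_1 + x b_1 - y a_1) = 0.$$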

Let me know if it helps you.

@fbriol
Author

fbriol commented Jul 8, 2019

Hello,

I have finalized a first test version of the interpolation library for geo-referenced data. If you want to try it, you can find information here. The goal is to provide a low-level interpolation library on top of which other, more advanced libraries can build. This version is far from perfect, but it is a first draft to continue the discussion around something concrete.

@delandmeterp
Contributor

Thanks Frédéric, for both the library and the cleaning of my small script!

@stale

stale bot commented Sep 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 10, 2019
@stale

stale bot commented Sep 17, 2019

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
