
Add a convex hull masking function #237

Merged · 10 commits into master · Mar 17, 2020
Conversation

leouieda (Member):

Use scipy's Delaunay triangulation to determine points that are outside
the convex hull of the data points. Useful to mask out wildly
extrapolated points from grids. It's easier to use than distance_mask
because it doesn't require distance calculations (and the associated
projections). Use the new function in some examples and tutorials.

Fixes #126
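The approach described above can be sketched with SciPy directly. This is an illustrative stand-in, not Verde's actual API: the helper name `convexhull_mask_sketch` and its signature are assumptions.

```python
import numpy as np
from scipy.spatial import Delaunay

def convexhull_mask_sketch(data_coordinates, grid_coordinates):
    """
    Boolean mask that is True where grid points fall inside the convex
    hull of the data points (hypothetical helper, not Verde's real API).
    """
    # Stack the (easting, northing) arrays into (n, 2) point arrays.
    points = np.column_stack([np.ravel(c) for c in data_coordinates])
    targets = np.column_stack([np.ravel(c) for c in grid_coordinates])
    tri = Delaunay(points)
    # find_simplex returns -1 for points outside every triangle, i.e.
    # outside the convex hull of the input points.
    inside = tri.find_simplex(targets) >= 0
    return inside.reshape(np.shape(grid_coordinates[0]))

# Example: mask a regular grid using three data points forming a triangle.
east, north = np.meshgrid(np.linspace(0, 10, 11), np.linspace(0, 10, 11))
mask = convexhull_mask_sketch(
    (np.array([0.0, 10.0, 5.0]), np.array([0.0, 0.0, 10.0])),
    (east, north),
)
```

Grid points with a `False` mask value would then be set to NaN to hide wildly extrapolated regions.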

Reminders:

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst and the base __init__.py file for the package.
  • Write detailed docstrings for all functions/classes/methods. It often helps to design better code if you write the docstrings first.
  • If adding new functionality, add an example to the docstring, gallery, and/or tutorials.
  • Add your full name, affiliation, and ORCID (optional) to the AUTHORS.md file (if you haven't already) in case you'd like to be listed as an author on the Zenodo archive of the next release.

verde/mask.py (outdated snippet under review):
"""
coordinates, shape = _get_grid_coordinates(coordinates, grid)
n_coordinates = len(data_coordinates)
triangles = Delaunay(np.transpose(n_1d_arrays(data_coordinates, n_coordinates)))

ahartikainen:

Does Delaunay still use the qhull library, which uses 32-bit floats (or something similarly odd) at some point in its processing pipeline? This forces one to use "normalized" coordinates; otherwise points are truncated to a "grid".

leouieda (Member, Author):

That's a good question. From the docs, I would say "yes": https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.Delaunay.html#scipy.spatial.Delaunay but they don't say anything about precision.

What do you mean by "normalized"? Scale them to [0, 1]? If you have any pointers, that would be great :)

leouieda (Member, Author):

I found this page but there is nothing on that: http://www.qhull.org/html/qh-impre.htm

leouieda (Member, Author):

By the way, thanks for finding these things Ari. Always things I would never have thought to look for 👍

ahartikainen:

"Normalized" --> small enough.

I once did stuff with a 1000 m x 1000 m area (points in random positions) with real coordinates (northern Sweden, some coordinate system with large values) and had this problem. It took some time to figure out why the results were funny.

The good thing is that the normalization can be done only for the data going into the Delaunay; the original data can then be used for the rest of the workflow.

Maybe some simple script with a large offset from the origin could tell if this is still a problem or not.
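A minimal script along those lines might look like this. The `normalization` helper and the coordinate values are illustrative assumptions, not code from this PR:

```python
import numpy as np
from scipy.spatial import Delaunay

def normalization(coords):
    """Per-axis shift and scale that map the coordinates into [0, 1]."""
    mins, maxs = coords.min(axis=0), coords.max(axis=0)
    return mins, np.where(maxs > mins, maxs - mins, 1.0)

# A 1000 m x 1000 m survey with random points, offset to large projected
# coordinates (roughly like a UTM zone in northern Sweden).
rng = np.random.default_rng(seed=0)
data = rng.uniform(0, 1000, size=(300, 2)) + np.array([750_000.0, 7_400_000.0])

# Normalize only the copies fed to qhull; keep the originals for the rest
# of the workflow. Query points must use the *data's* shift and scale.
shift, scale = normalization(data)
tri = Delaunay((data - shift) / scale)
queries = np.array([[750_500.0, 7_400_500.0],   # center of the survey area
                    [760_000.0, 7_500_000.0]])  # far outside it
inside = tri.find_simplex((queries - shift) / scale) >= 0
```

The key point is that the shift and scale are computed once from the data and reused for every point queried against the triangulation, so the masking result maps back onto the original, unnormalized coordinates.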

leouieda (Member, Author):

Thanks for the pointers! I'll try to add a test for that to see if it's something we need to worry about.

ahartikainen:

https://gist.github.com/ahartikainen/bf2c0eb2ed493040ff490e0c97435ade

I'm not sure whether this can also fail for the convex hull, but it probably can.

leouieda (Member, Author):

Ah, that's severe actually. It probably will, since the process I'm using checks whether points fall in any simplex. I'll add this and see if the gallery example changes much.

leouieda (Member, Author):

I added the normalization code and support for projecting the coordinates prior to masking, so we can handle really large coordinates. This is the gallery example without normalization:

[image: qhull]

And this is after normalization:

[image: qhull-norm]

So no difference in this case. I'll leave the normalization code in since it doesn't seem to hurt the results and might prevent headaches in the future.

santisoler (Member) left a comment:

This is much better than masking by distance.
In fact, I was having some holes in the grid on fatiando/harmonica#152 that were surrounded by data points, but because of the distance threshold those grid points were being masked.

I left a single inline comment regarding how we take coordinates from the xr.Dataset grid.

Comment on lines +199 to +201
if coordinates is None:
dims = [grid[var].dims for var in grid.data_vars][0]
coordinates = np.meshgrid(grid.coords[dims[1]], grid.coords[dims[0]])
santisoler (Member):

Choosing the northing and easting based on the order of dims could be a source of errors if the dimensions are not in the expected order.

leouieda (Member, Author):

I thought of this after #229, but we're getting dims from the DataArray so it should be fine. The only thing that might cause problems is if the dims are different for the different vars. I could add a check for that and error if that's the case.

This is why adding these things back to xarray is hard. I don't think I could generalize this to an arbitrary Dataset.
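The check mentioned above could look something like this. A sketch only: it assumes dims are ordered as (northing, easting), and the helper name `grid_coordinates_from_dataset` is hypothetical.

```python
import numpy as np
import xarray as xr

def grid_coordinates_from_dataset(grid):
    """
    Build 2D coordinate arrays from a Dataset, erroring out if the data
    variables don't all share the same dims (hypothetical helper).
    """
    dims_per_var = [grid[var].dims for var in grid.data_vars]
    if any(dims != dims_per_var[0] for dims in dims_per_var):
        raise ValueError(
            "Data variables have mismatched dimensions: {}".format(dims_per_var)
        )
    # Assumed order: dims = (northing, easting).
    dims = dims_per_var[0]
    return np.meshgrid(grid.coords[dims[1]].values, grid.coords[dims[0]].values)

# Example: a 3 x 4 grid with dims in the assumed (northing, easting) order.
grid = xr.Dataset(
    {"temperature": (("northing", "easting"), np.zeros((3, 4)))},
    coords={"northing": [0, 1, 2], "easting": [0, 1, 2, 3]},
)
east, north = grid_coordinates_from_dataset(grid)
```

This catches the mismatched-dims case but not a Dataset whose dims are consistently in (easting, northing) order, which is why documenting the assumed order still matters.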

santisoler (Member):

The problem may appear if the user has a different order for the dims of the DataArray. I've seen that after applying some xarray functions, the order of dims is sometimes changed (sorry, I don't remember exactly which ones, but I found this issue that may be related).
I don't have a final solution for this, just wanted to raise the discussion.

leouieda (Member, Author):

Right, I hadn't thought of that. It might be worth mentioning in the docs that we're assuming this order of dimensions then.

leouieda (Member, Author) commented Mar 13, 2020:

TODO:

  • Test case with projection
  • Mention in the docs that we assume that xarray dims are northing, easting

leouieda (Member, Author):

Should be done with this. @santisoler @ahartikainen if you take another look that would be great. As always, no rush on this :)

ahartikainen left a comment:

Normalization looks good.

santisoler (Member) left a comment:

This is great @leouieda.

I think the warning for dims order is good enough.

Just made a minor suggestion on one comment.

tutorials/weights.py (outdated suggestion, resolved)
Co-Authored-By: Santiago Soler <santiago.r.soler@gmail.com>
@leouieda leouieda merged commit 5c300d4 into master Mar 17, 2020
@leouieda leouieda deleted the convex-hull branch March 17, 2020 15:42
leouieda (Member, Author):

Merged! Thank you for the reviews @santisoler and @ahartikainen

jessepisel added a commit to jessepisel/verde that referenced this pull request Oct 28, 2020
* Bug fix in grid_to_table for wrong order of Dataset coords (fatiando#229)

We were relying on the order of the coords attribute of the data set to meshgrid 
the coordinates. This is wrong because xarray takes the coordinate order from 
dims instead, which is what we should also do. Took this chance to refactor the 
code and included a test that would have caught the bug.

* Update files from fatiando/contributing (fatiando#233)

Changes have been made in https://github.com/fatiando/contributing to:
MAINTENANCE.md
Update the copies in this repository to match.
See fatiando/community@91699a0

* Use newer Mac on Azure Pipelines (fatiando#234)

Changed from 10.13 (High Sierra) to 10.15 (Catalina) because Azure will
remove the 10.13 VMs soon.

* Allow grid_to_table to take DataArrays (fatiando#235)

Previously would fail because we rely on getting the data variable name
from the Dataset. Need to check if it's a Dataset (has the `data_vars`
attribute) and set the data column name to "scalars" if it's not.

Fixes fatiando#225

* Add function for rolling windows on irregular data (fatiando#236)

The `rolling_window` function returns the indices of points falling on
rolling/moving windows. It works for irregularly sampled data and is a
nice complement to `xarray.DataArray.rolling`  (which only works on
grids). Uses a KDTree to select points but is not compatible with
pykdtree (it's missing the method we need).

* Add function to split data on an expanding window (fatiando#238)

For selecting data in windows of various size around a center point.
Sizes do not have to be increasing necessarily. Returns only the indices
of points falling inside each window.

* Explicitly set PR trigger for all branches on Azure (fatiando#244)

For some reason, Azure Pipelines CI wasn't starting up PRs, only on
master and tags. Explicitly setting a trigger for PRs against any branch
seems to work.

* Fix bug for 3+ coordinates in windowing functions (fatiando#243)

The `rolling_window` and `expanding_window` functions fail if given more than 2 coordinates.
If we had full support for N-D coordinates, they could be N-D windowing functions
but until that time, they will have to be 2D (horizontal) only functions. Add test cases using
`extra_coords` in `grid_coordinates` to generate a third coordinate.

Fixes fatiando#242

* Generalize coordinate systems in BaseGridder docs (fatiando#240)

Make docstrings of `BaseGridder` more general, without specifying
a coordinate system. Add `dims` as a class variable that will be used by
default on `grid`, `scatter` and `profile` methods.

* Add a convex hull masking function (fatiando#237)

Use scipy's Delaunay triangulation to determine points that are outside
the convex hull of the data points. Useful to mask out wildly
extrapolated points from grids. It's easier to use than `distance_mask`
because it doesn't require distance calculations (and the associated
projections). Use the new function in some examples and tutorials.
Normalize the coordinates prior to calculations to avoid errors
with floating point conversion when triangulating (as suggested by 
Ari Hartikainen).

Fixes fatiando#126

* Fix bug with profile distances being unprojected (fatiando#231)

`BaseGridder.profile` has the option to apply a projection to the
coordinates before predicting so we can pass geographic coordinates to
cartesian gridders. In these cases, the distance along the profile is
calculated by `profile_coordinates` with the unprojected coordinates (in
the geographic case it would be degrees). The profile point calculation
is also done assuming that coordinates are Cartesian, which is clearly
wrong if inputs are longitude and latitude. To fix this, project the input
points prior to passing to `profile_coordinates`. This means that the 
distances are Cartesian and generation of profile points is also 
Cartesian (as is assumed by `profile_coordinates`). The generated 
coordinates are projected back so that the user gets longitude and 
latitude but distances are still projected Cartesian. Add a warning in the
docstring of the modified behaviour due to this bug and a section in the
tutorial on projections.

* Update TravisCI configuration to new formats (fatiando#247)

Travis started warning that some options in the config were deprecated.
Update them to the new format to avoid crashing builds later on.

* Add function to project 2D gridded data (fatiando#246)

The `project_grid` function transforms a grid to the given projection. 
It re-samples the data using `ScipyGridder` (by default) and runs a 
blocked mean (optional) to avoid aliasing when the points aren't evenly 
distributed in the projected coordinates (like in polar projections). 
Always applies a `convexhull_mask` to the grid to avoid extrapolation to
points that had no original data (scipy already does this for linear and
cubic interpolation). The function only works for
`xarray.DataArray` (not `Dataset`) because I couldn't find an easy way
to guarantee that all variables have the same dimensions. Move the new
function and `project_region` to the new module `verde/projections.py`.

* Changelog entry for v1.4.0 (fatiando#248)

New minor release with new features and bug fixes. No breaking changes
or deprecations.

* Fix broken TravisCI deploy (fatiando#250)

The docs and configuration validation tool had recommended using
`cleanup: false` instead of `skip_cleanup: true` when deploying. Turns
out Travis decided to ignore the new parameter and our Github Pages
deploy was broken since the docs were deleted before we could deploy.
Revert back to the old argument until Travis really removes it.

* Add a block version of ShuffleSplit cross-validator (fatiando#251)

Cross-validation of spatial data with random splits can often lead to 
overestimation of accuracy (see https://doi.org/10.1111/ecog.02881 
for a nice overview). To account for this, the splits can be done by 
spatial blocks. In this case, the data are partitioned into blocks and 
blocks are split randomly. That guarantees some independence of 
measurements between blocks assigned to test/train sets. This 
change adds the `BlockShuffleSplit` cross-validator which does the 
random splitting of blocks in a scikit-learn compatible class. It can be 
used with `verde.cross_val_score` like any other scikit-learn cross-
validator. In fact, it can be used entirely outside of Verde for any type 
of machine learning on spatial data. The class takes care of balancing 
the random splits so that the right amount of train/test data are 
included in each (since blocks can have wildly different numbers of 
points).

* Add optional argument to least_squares to copy Jacobian matrix (fatiando#255)

Add a new `copy_jacobian` optional argument to `least_squares` to avoid
scaling the Jacobian matrix in place. Add a warning in its docstring
about memory consumption when `copy_jacobian` is True. Add a test
function for this new feature.

* Update files from fatiando/contributing (fatiando#256)

Changes have been made in https://github.com/fatiando/contributing to:
MAINTENANCE.md
Update the copies in this repository to match.
See fatiando/community@340dafd

* Add a blocked option for train_test_split (fatiando#253)

Add option `blocked=False` to `verde.train_test_split` to split along 
blocks instead of randomly. Uses `BlockShuffleSplit` if `True` instead 
of the scikit-learn `ShuffleSplit`. Add a gallery example comparing 
the two options.

* Fix typo in README contributing section (fatiando#258)

Equality -> Equally

* Pin pylint to 2.4.* (fatiando#259)

Version 2.5.0 introduced a bug with the multiprocessing option. This
happens often enough that we should have pylint pinned and only upgrade
when we choose to do it.

* Update tests to latest release of Scikit Learn (fatiando#267)

Change the expected representation of any gridder based on the new
default behaviour of Scikit Learn. After 0.23, Scikit Learn only shows
those parameters whose default value has been changed when giving
a string representation of the gridder. See
scikit-learn/scikit-learn#17061 and its changelog for more information.

* Gridders apply projections using only the first two coordinates (fatiando#264)

Create a private function used by BaseGridder that applies projections only on the first two
elements of the coordinates, but returns the whole set of coordinates, replacing the first two with their projected versions. Add a test function that passes extra_coords to include more than two coordinates when predicting through the grid, scatter and profile methods.

* Remove "blocked" argument from train_test_split (fatiando#257)

It was a bit redundant since we have to specify `shape` or `spacing`
along with it. So we can just ask for `spacing` or `shape` and assume
`blocked=True` if they are provided. Also fix the missing description in
the gallery example and make slight changes to its wording.

Co-authored-by: Santiago Soler <santiago.r.soler@gmail.com>

* Add extra_coords on BaseGridder generated objects (fatiando#265)

After predicting through the profile, scatter and grid methods,
BaseGridder returns generated objects like xr.Dataset or pd.DataFrame.
Any extra coordinate passed through the extra_coords keyword argument
is now included on those objects as new columns or non-dimension
coordinates (depending on the generated object).

* Add a blocked K-Fold cross-validator (fatiando#254)

The `BlockKFold` class splits the data into spatial blocks and does
K-fold cross-validation splits on the blocks. Refactor code from 
`BlockShuffleSplit` into a base class that can be shared with the
new class. Folds are balanced by making splits based on the number
of data points in each fold instead of the number of blocks.

* Pin Cartopy to 0.17.* until we sort out problems (fatiando#270)

For some reason, the coast lines features are plotting on the entire
figure instead of only the land or ocean. Might be an error on our side.
For now, pin cartopy on CI to fix our docs.

* Changelog entry for v1.5.0 (fatiando#271)

Includes the new cross-validators and fixes to our base classes to
accommodate Harmonica.

* Allow specifying the scoring function in cross_val_score (fatiando#273)

Currently, `cross_val_score` only uses the estimator's `.score` method (which
is R² for all our gridders). But sometimes we want to use a different metric,
like MSE or MAE. These changes allow passing in any scikit-learn compatible
scorer through a new `scoring` parameter. This can be a string or callable,
like in scikit-learn's `cross_val_score`. Needed to delegate score computation
to a separate function in `verde.base.utils` to reuse in `cross_val_score` and
a dummy class to be able to use the scikit-learn scorers correctly.

* Format the doc/conf.py file with Black (fatiando#275)

Ran black on `doc/conf.py` and added it to the `black` and `flake8` runs
in the main Makefile. Removed the setting of `sys.path` since we're
installing the package in editable mode, so it should be importable
without that hack.

* Require Black >=20.8b1 (fatiando#284)

After version 19, black changed the formatting slightly in a few files.
This updates the format and sets a minimum version of black in the
development environment and the CI.

* Fix Cartopy issue and require >= 0.18 (fatiando#283)

Unpin Cartopy because Cartopy 0.17.0 has a compatibility issue with Matplotlib > 3.3.0.
Remove the unneeded tight_layout that was causing errors.
Replace the filled polygons used for plotting land and ocean for the coastlines.

* Fix typos on docstrings (fatiando#281)

Fix typo on description of data_names parameters on BaseGridder.grid
method. Fix typo on description of BaseGridder.project_coordinates
method.

* Raise errors for invalid rolling windows arguments (fatiando#280)

Raise a specific error if no spacing or shape is passed to
rolling_window(). Raise a specific error if window size is larger than
the smaller dimension of the region. Add tests for these new methods
that include matching error messages to identify which error has been
raised.

* Remove configuration files for unused CI (fatiando#292)

We no longer use Stickler, Codacy, or CodeClimate. So there is no reason
to keep these files around.

* Update files from fatiando/contributing (fatiando#294)

Changes have been made in https://github.com/fatiando/contributing to:
CONTRIBUTING.md
Update the copies in this repository to match.
See fatiando/community@e899f3f

* Add function to convert grid to Dataset (fatiando#282)

Add a function that converts coordinates and data as np.arrays to
xarray.Dataset. This would make handling grids easier, with the option of
efficiently saving them as netCDF files. Make BaseGridder.grid method to use
this function instead of building the xarray.Dataset by itself. Add new
check_data_names function that converts a str into a tuple with a single
str. Make BaseGridder.grid, BaseGridder.scatter and BaseGridder.profile
able to take a single str for the data_names argument. Add test function
for default data_names. Change data_names on all examples and tutorials to
a single str.

* Refactor get_data_names and related check functions (fatiando#295)

Refactor `get_data_names` as a private method of `BaseGridder`: simplify
`get_data_names` logic, make `get_data_names` to use `check_data_names` to
convert single string to a tuple and check that `data_names` has the same
number of elements as data. Add `data_names_defaults` as a class attribute of
`BaseGridder`. Move `check_extra_coords_names` to `verde/base/utils.py` along
with the other check functions. Make `check_data_names` to raise error if
`data_names` is `None`. But `BaseGridder.get_data_names` will return default
values if `data_names` is `None`. Rename test functions for `make_xarray_grid`.

Co-authored-by: Leonardo Uieda <leouieda@gmail.com>
Co-authored-by: Fatiando a Terra Bot <50936856+fatiando-bot@users.noreply.github.com>
Co-authored-by: Santiago Soler <santiago.r.soler@gmail.com>
Co-authored-by: Rowan Cockett <rowanc1@gmail.com>

Successfully merging this pull request may close these issues.

Create a convexhull masking function
3 participants