
ENH: add zonal_stats #52

Merged: 88 commits into xarray-contrib:main, Dec 14, 2023

Conversation

masawdah
Contributor

Thanks for providing this valuable package. The idea of this PR is to aggregate a raster data cube over a set of polygon geometries (#35), as this is needed for the openEO process in Open-EO/openeo-processes-dask#124. It is based on dask arrays and parallelization, which makes it efficient and fast. It has been tested aggregating polygons over continental-scale data and shows good performance.

I'd like to know your thoughts about that.

Here is an example:

import geopandas as gpd
import numpy as np
import pandas as pd
import xarray as xr
import xvec  # noqa: F401  (registers the .xvec accessor)
from shapely.geometry import Polygon

# Create a dataset with 2 variables and 3 time steps
np.random.seed(0)

temperature = 15 + 8 * np.random.randn(20, 20, 3)
precipitation = 15 + 10 * np.random.randn(20, 20, 3)
lat = np.linspace(30, 40, 20)
lon = np.linspace(10, 20, 20)

time = pd.date_range("2014-09-06", periods=3)
reference_time = pd.Timestamp("2014-09-05")

ds = xr.Dataset(
    data_vars=dict(
        temperature=(["x", "y", "time"], temperature),
        precipitation=(["x", "y", "time"], precipitation),
    ),
    coords=dict(
        x=lon,
        y=lat,
        time=time,
        reference_time=reference_time,
    ),
    attrs=dict(description="Weather related data."),
)
ds = ds.chunk(dict(x=4, y=4))

# Create random polygon geometries within the bounding box of the dataset
num_polygons = 2  # adjust the number of polygons as needed
polygons = []

for _ in range(num_polygons):
    xs = np.random.uniform(ds.x.min(), ds.x.max(), 4)
    ys = np.random.uniform(ds.y.min(), ds.y.max(), 4)
    polygons.append(Polygon(zip(xs, ys)))

gdf = gpd.GeoDataFrame(geometry=gpd.GeoSeries(polygons), crs="EPSG:4326")

polys = gdf.geometry.values
extracted = ds.xvec.agg_polys(polys, stat="mean")
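
Based on the description above, the polygons should end up as a new geometry-like dimension replacing x and y. One hedged way to inspect the result (the dimension names below are assumptions, not confirmed by this PR):

# Illustrative check only; the exact output dimension names may differ.
print(extracted.dims)                  # expect the 2 polygons as one dimension, plus time
print(extracted["temperature"].shape)  # e.g. (2, 3): (polygon, time)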


@martinfleis
Member

Thanks! I'll have a look later, but note that you need to add the new deps to https://github.com/xarray-contrib/xvec/tree/main/ci

@codecov

codecov bot commented Sep 25, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (6bed799) 99.06% vs. head (c230edb) 99.20%.
The report is 3 commits behind head on main.

@@            Coverage Diff             @@
##             main      #52      +/-   ##
==========================================
+ Coverage   99.06%   99.20%   +0.13%     
==========================================
  Files           3        4       +1     
  Lines         321      377      +56     
==========================================
+ Hits          318      374      +56     
  Misses          3        3              


@martinfleis
Member

Hey, a first pass, so far without running the code. A few notes below and some others in the code.

Do we need dask here? I mean, optimally we'd make use of it if available but not require it for smaller use cases.
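
For illustration, a minimal sketch of one common way to make dask optional; the function name and fallback logic here are hypothetical, not xvec's actual code:

import numpy as np

try:
    import dask.array  # noqa: F401
    HAS_DASK = True
except ImportError:
    HAS_DASK = False

def aggregate_mean(arr, use_dask=None):
    # By default, use dask only when it is installed and the array is chunked.
    if use_dask is None:
        use_dask = HAS_DASK and hasattr(arr, "chunks")
    if use_dask:
        return arr.mean().compute()  # lazy, parallel reduction
    return np.asarray(arr).mean()    # eager NumPy fallback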

If possible, I would like to avoid storing data in the repository. We can use geodatasets to fetch US counties and xarray example data to get some raster cubes.

Can you ensure that pre-commit checks pass?

Review threads (outdated, resolved): ci/310.yaml, ci/39.yaml, ci/dev.yaml, ci/latest.yaml, environment.yml, xvec/accessor.py (5 threads)

@masawdah
Contributor Author

I'll go through the notes and work on them.

@masawdah
Contributor Author

Hey, I went through the notes. The code now passes pre-commit, and I made dask optional, with a configurable number of jobs for when the data is big, as follows:

# Without dask: eager, in-memory aggregation
extracted = ds.xvec.zonal_stats(polys, stat="mean", dask=False, n_jobs=-1)

# With dask: chunk first, then aggregate in parallel
ds1 = ds.chunk(dict(x=4, y=4))
extracted = ds1.xvec.zonal_stats(polys, stat="mean", dask=True, n_jobs=-1)

I also think it's possible to use geodatasets and xarray example data to get some raster cubes instead of storing the data in the repo. I'll look into that and let you know.

@masawdah
Contributor Author

I've tested it using counties fetched from geodatasets and the 'air_temperature' data cube. It works, but the order of the (x, y) dimensions is not consistent with what is implemented now (which follows the openEO data cube structure), so I had to modify the order manually.

We could add a function to make sure the dimension order is as expected. What do you think?

from geodatasets import get_path

# US counties as the zonal geometries
counties = gpd.read_file(get_path("geoda.natregimes"))
polys = counties.geometry.values

# Example raster cube from the xarray tutorial data
ds = xr.tutorial.open_dataset("air_temperature")

extracted = ds.xvec.zonal_stats(polys[:2], stat="sum", dask=False, n_jobs=-1)
extracted
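
One possible manual fix for the order mismatch described above; a sketch, assuming the implementation at this point expected a (lon, lat, time) layout:

# air_temperature comes as (time, lat, lon); reorder to the assumed layout.
ds = ds.transpose("lon", "lat", "time")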


@martinfleis
Member

> We could add a function to make sure the dimension order is as expected. What do you think?

In extract_points, we have the dimension mapping exposed directly as x_coords and y_coords (https://xvec.readthedocs.io/en/stable/generated/xarray.Dataset.xvec.extract_points.html). We could maybe use something like that?
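
Hypothetically, mirroring extract_points, the call could name the spatial coordinates explicitly. The keywords below are borrowed from extract_points and were not part of zonal_stats at this point in the PR:

# Sketch of the suggested API, not the implemented signature.
extracted = ds.xvec.zonal_stats(polys, x_coords="lon", y_coords="lat", stat="mean")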

@masawdah
Contributor Author

> > We could add a function to make sure the dimension order is as expected. What do you think?
>
> In extract_points, we have the dimension mapping exposed directly as x_coords and y_coords (https://xvec.readthedocs.io/en/stable/generated/xarray.Dataset.xvec.extract_points.html). We could maybe use something like that?

I'll take a look at that. For now, I made (lon, lat) the default dimension names to aggregate over, and added a function that renames other possible names (from a dictionary) to (lon, lat) to be consistent with what the function requires. It now works for data cubes from both openEO and xarray.
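
A minimal sketch of such a renaming step; the alias table and function name are illustrative assumptions, not the PR's actual code:

import xarray as xr

# Map common spatial-dimension aliases onto the (lon, lat) names expected here.
DIM_ALIASES = {"x": "lon", "longitude": "lon", "y": "lat", "latitude": "lat"}

def normalize_dims(ds: xr.Dataset) -> xr.Dataset:
    return ds.rename({d: DIM_ALIASES[d] for d in ds.dims if d in DIM_ALIASES})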

masawdah and others added 3 commits October 28, 2023 13:59
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
@masawdah
Contributor Author

Thanks for your review and feedback. I went through the suggestions and modified the code accordingly.

Review threads (outdated, resolved): xvec/accessor.py (3 threads)

@masawdah
Contributor Author

masawdah commented Nov 3, 2023

Hi, I've adjusted the code to be more flexible and dimension-agnostic. I've hard-coded the name of one coordinate as 'data_variables' to store the data variable names temporarily, but we could make it a user input. However, I believe this is not critical, as it's an internal step that doesn't affect the output, unless we intend to return the output as an xarray.DataArray.
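
The internal step described above might look roughly like the following sketch, built on xarray's to_array/to_dataset round trip; it is not the PR's exact code:

import xarray as xr

def stack_reduce_restore(ds: xr.Dataset) -> xr.Dataset:
    # Temporarily stack all data variables along a "data_variables" dimension,
    stacked = ds.to_array(dim="data_variables")
    # reduce dimension-agnostically over everything except that dimension,
    reduced = stacked.mean([d for d in stacked.dims if d != "data_variables"])
    # then split back into one data variable per entry.
    return reduced.to_dataset(dim="data_variables")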


@martinfleis
Member

Hey, I did a deep dive into this, which resulted in a refactoring of the code. I opted to use more xarray logic in the process than before, which simplifies the code quite a bit.

I also noticed that the implementation was producing wrong results when I compared it to one based on geocube, so I fixed that. Since I now have a geocube-based implementation, I will make a follow-up PR exposing it as another engine; it is faster but not suitable for small polygons. Hopefully, we can also expose exactextract once its Python bindings are released (a prototype is there).

I have removed chunking, as I believe that is something a user can do themselves if they run into issues; in most use cases I tried, it was more of a bottleneck.

Plus some other changes you can see in the code.

I plan on adding tests etc. and then merging. I will follow up with the geocube version, a user guide notebook, and a release.

@martinfleis
Member

Looking at it in more detail, we can probably use rasterio.features.rasterize ourselves without depending on geocube. I'll look into that during the follow-up.
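
For reference, a bare-bones sketch of what a rasterize-based zonal mean could look like; it assumes a 2D regular grid with a known affine transform and is not the follow-up implementation:

import numpy as np
import rasterio.features
from affine import Affine

def zonal_mean(values: np.ndarray, polygons, transform: Affine) -> list[float]:
    # Burn polygon indices (offset by 1 so that 0 means "outside any polygon")
    # into an integer label grid aligned with the value raster.
    labels = rasterio.features.rasterize(
        ((geom, i + 1) for i, geom in enumerate(polygons)),
        out_shape=values.shape,
        transform=transform,
        fill=0,
    )
    # Average the cells that fall inside each polygon.
    return [float(values[labels == i + 1].mean()) for i in range(len(polygons))]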

@martinfleis changed the title from "Spatial aggregation" to "ENH: add zonal_stats" on Dec 14, 2023
@martinfleis
Member

Going to merge this now and will continue in separate PRs with the remaining stuff.

@masawdah thanks for starting this, my refactor was way easier than if I'd been starting from scratch! Let me know if there are any specific requirements on the openEO side you'd like to see included before a release.

@martinfleis martinfleis merged commit 18ee579 into xarray-contrib:main Dec 14, 2023
11 checks passed
@masawdah
Contributor Author

Sounds great, looking forward to the new release :)
