ENH: add zonal_stats #52

masawdah · 2023-09-25T09:17:30Z

Thanks for providing this valuable package. The idea of this PR to aggregate raster data cube over a set of polygon geometries #35 as it needed to use within openEO process Open-EO/openeo-processes-dask#124 . It is based on dask array and parallelization which make it efficient and fast. It has been tested to aggregate polygons over data on continental level and shows a good performance.

I'd like to know your thoughts about that.

Here is an example:

import datetime
import geopandas as gpd
import pandas as pd
import xarray as xr
import numpy as np
import xvec
from shapely.geometry import Polygon

# Create a dataset with 2 variables and 3 time steps 
np.random.seed(0)

temperature = 15 + 8 * np.random.randn(20, 20, 3)
precipitation = 15 + 10 * np.random.randn(20, 20,3)
lat = np.linspace(30,40,20)
lon = np.linspace(10,20,20)



time = pd.date_range("2014-09-06", periods=3)
reference_time = pd.Timestamp("2014-09-05")


ds = xr.Dataset(
    data_vars=dict(
        temperature=(["x", "y", "time"], temperature),
        precipitation=(["x", "y", "time"], precipitation),
    ),
    coords=dict(
        x=lon,
        y=lat,
        time=time,
        reference_time=reference_time,
    ),
    attrs=dict(description="Weather related data."),
)

ds = ds.chunk(dict(x=4,y=4))

# Create geometries over the dataset
num_polygons = 2  # Adjust the number of polygons as needed
polygons = []

for _ in range(num_polygons):
    # Generate random polygon coordinates within the bounding box of the downsampled dataset
    lon = np.random.uniform(ds.x.min(), ds.x.max(), 4)
    lat = np.random.uniform(ds.y.min(), ds.y.max(), 4)
    polygons.append(Polygon(zip(lon, lat)))


geoseries = gpd.GeoSeries(polygons)
gdf = gpd.GeoDataFrame(geometry=geoseries)

gdf = gdf.set_geometry('geometry')
gdf.crs = 'EPSG:4326'

polys = gdf.geometry.values

extracted = ds.xvec.agg_polys(polys, stat="mean")

…heir ids

… make it faster

martinfleis · 2023-09-25T09:32:44Z

Thanks! I'll have a look later but note that you need to add new deps to https://github.com/xarray-contrib/xvec/tree/main/ci

codecov · 2023-09-25T09:54:17Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6bed799) 99.06% compared to head (c230edb) 99.20%.
Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #52      +/-   ##
==========================================
+ Coverage   99.06%   99.20%   +0.13%     
==========================================
  Files           3        4       +1     
  Lines         321      377      +56     
==========================================
+ Hits          318      374      +56     
  Misses          3        3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

martinfleis

Hey, a first pass, so far without running the code. A few notes below and some others in the code.

Do we need dask here? I mean, optimally we make use of it if available but don't require it for smaller use cases.

If possible, I would like to avoid storing data in the repository here. We can use geodatasets to fetch us counties and xarray example data to get some raster cubes.

Can you ensure that pre-commit checks pass?

ci/310.yaml

ci/39.yaml

ci/dev.yaml

ci/latest.yaml

environment.yml

xvec/accessor.py

masawdah · 2023-09-26T11:50:13Z

I'll go through the notes and work on them.

masawdah · 2023-09-27T13:46:34Z

Hey, I went through the notes. Now the code passed the pre-commit. And I made it an optional to use dask and number of jobs if the data is big as the following:

extracted = ds.xvec.zonal_stats(polys, stat="mean", dask = False, n_jobs = -1)

ds1 = ds.chunk(dict(x=4,y=4))
extracted = ds1.xvec.zonal_stats(polys, stat="mean", dask = True, n_jobs = -1)

And I think it's possible to use geodatasets and xarray example data to get some raster cubes instead of storing the data in the repo. I'll look into that and let u know.

masawdah · 2023-09-28T10:43:34Z

I've tested it using counties fetched from geodatasets and 'air_temperature' data cube. It works but the order of (x, y ) dimensions is not consistent with what implemented now (similar to data cube structure in openEO) so I had to modify the order manually.

We can add function to make sure the dimension order is as expected. What do u think ?

from geodatasets import get_path
counties = gpd.read_file(get_path("geoda.natregimes"))
polys = counties.geometry.values

ds = xr.tutorial.open_dataset("air_temperature")

extracted = ds.xvec.zonal_stats(polys[:2], stat="sum", dask = False, n_jobs = -1)

extracted

martinfleis · 2023-09-28T12:46:04Z

We can add function to make sure the dimension order is as expected. What do u think ?

In extract_points, we have the dimension mapping exposed directly as x_coords and y_coords (https://xvec.readthedocs.io/en/stable/generated/xarray.Dataset.xvec.extract_points.html). We could maybe use something like that?

masawdah · 2023-09-28T14:10:24Z

We can add function to make sure the dimension order is as expected. What do u think ?

In extract_points, we have the dimension mapping exposed directly as x_coords and y_coords (https://xvec.readthedocs.io/en/stable/generated/xarray.Dataset.xvec.extract_points.html). We could maybe use something like that?

I'll take a look at that. But for now I made (lon, lat) as default dims names to aggregate over them. And added a function to rename any other possible names (in a dictionary) to (lon, lat) to be consistent with the function requirement. So, now it works for both data cubes from openEO and data cubes from xarray.

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

…into spatial_aggregation

masawdah · 2023-10-28T12:22:34Z

Thanks for your review and feedback. I reviewed the suggestions and made the necessary modifications to the code accordingly.

xvec/accessor.py

masawdah · 2023-11-03T18:53:11Z

Hi , I've adjusted the code to be more flexible and dimension-agnostic, as you can see below. I've just hard-coded the name of one coordinate as 'data_variables' to save the data variable names temporary, but we can make it a user input. However, I believe this is not critical, as it's an internal step that doesn't affect the output, unless we intend to have the output as an xarray.DataArray.

martinfleis · 2023-12-13T20:16:48Z

Hey, I did a deep dive into this which resulted in refactoring of the code. I opted to use more xarray logic on the process than before, which simplifies the code quite a bit.

I also noticed that the implementation was providing wrong results when I compared it to the one based on geocube so I also fixed that and since I have a geocube-based Implementation now I will make a follow-up PR exposing it as another engine. It is faster but not suitable for small polygons. Hopefully, we can also expose exactextract once the Python bindings are released (a prototype is there).

I have removed chunking as I believe that is something a user can do if they run into issues but it was more of a bottleneck in most use cases I tried.

Plus some other changes you can see in the code.

I plan on adding tests etc and then merge. I will follow-up with the geocube version, user guide notebook and a release.

martinfleis · 2023-12-13T20:56:56Z

Looking at it in more detail, we can probably use rasterio.features.rasterize ourselves without depending on geocube. I'll look into that during the follow up.

martinfleis

Going to merge this now and will continue in separate PRs with the remaining stuff.

@masawdah thanks for starting this, my refactor was way easier than if I'd be starting from scratch! Let me know if there are some specific requirements on the openEO side you'd like to see included before a release.

masawdah · 2023-12-15T14:04:20Z

Sounds great, looking forward to the new release :)

masawdah added 9 commits July 20, 2023 16:23

example

2a2e504

assign the pixels ids based on their polygons then group them using t…

e1e0501

…heir ids

fix importing xagg and run an example

4a3d144

xagg using openEO data , it needs to implement parallel processing to…

e3a2443

… make it faster

clean up

d48a019

spatial_aggregation

4ee3653

clean up

5c008a5

rioxarray

9e5e462

packages

35121fe

masawdah added 3 commits September 25, 2023 11:45

deps

15c04ca

deps

5c2a97e

deps

f732a4a

martinfleis reviewed Sep 26, 2023

View reviewed changes

masawdah added 6 commits September 27, 2023 14:49

fix the notes

6fab486

zonal stats in pytest

fb1e4a1

handle long lines

6f2dbab

handle long lines

ce18648

fix pre-commit

4197063

updated files

b291d9f

masawdah added 2 commits September 28, 2023 15:20

support lat&lon names

64a976f

fix the dimension names on the fly

ae9ece0

masawdah added 2 commits September 28, 2023 16:16

testpy

5b808e0

dask support just axis as integer

ce594ba

masawdah and others added 3 commits October 28, 2023 13:59

Update xvec/accessor.py

d573e31

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

accessor

ba20f48

Merge branch 'spatial_aggregation' of https://github.com/masawdah/xvec …

7b48a6c

…into spatial_aggregation

martinfleis reviewed Nov 2, 2023

View reviewed changes

xvec/accessor.py Outdated Show resolved Hide resolved

xvec/accessor.py Outdated Show resolved Hide resolved

xvec/accessor.py Outdated Show resolved Hide resolved

masawdah added 3 commits November 3, 2023 19:28

dimension-agnostic

7b5dd1f

clean up

f63a662

update pytest

5b4a89a

martinfleis added 2 commits December 13, 2023 12:01

refactor

20c88bd

refactor structure

7060895

martinfleis added 3 commits December 13, 2023 21:47

geodatasets

bc42888

pyogrio for IO

c4e8946

api

2a3bbd1

martinfleis added 5 commits December 13, 2023 23:07

include vectorized rasterize-based method

2e40779

rasterio link

707f7e9

fix docstring

aa7cb1a

fmt

cd00342

fix for DataArray

cc51123

martinfleis mentioned this pull request Dec 14, 2023

Aggregate spatial Open-EO/openeo-processes-dask#194

Merged

testing

ff94fd4

martinfleis changed the title ~~Spatial aggregation~~ ENH: add zonal_stats Dec 14, 2023

stat -> stats

4a3d77d

martinfleis approved these changes Dec 14, 2023

View reviewed changes

fix keyword

c230edb

martinfleis merged commit 18ee579 into xarray-contrib:main Dec 14, 2023
11 checks passed

martinfleis mentioned this pull request Dec 15, 2023

ENH: support callable as stats in zonal_stats #55

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: add zonal_stats #52

ENH: add zonal_stats #52

masawdah commented Sep 25, 2023

martinfleis commented Sep 25, 2023

codecov bot commented Sep 25, 2023 •

edited

Loading

martinfleis left a comment

masawdah commented Sep 26, 2023

masawdah commented Sep 27, 2023

masawdah commented Sep 28, 2023

martinfleis commented Sep 28, 2023

masawdah commented Sep 28, 2023

masawdah commented Oct 28, 2023

masawdah commented Nov 3, 2023

martinfleis commented Dec 13, 2023

martinfleis commented Dec 13, 2023

martinfleis left a comment

masawdah commented Dec 15, 2023

ENH: add zonal_stats #52

ENH: add zonal_stats #52

Conversation

masawdah commented Sep 25, 2023

martinfleis commented Sep 25, 2023

codecov bot commented Sep 25, 2023 • edited Loading

Codecov Report

martinfleis left a comment

Choose a reason for hiding this comment

masawdah commented Sep 26, 2023

masawdah commented Sep 27, 2023

masawdah commented Sep 28, 2023

martinfleis commented Sep 28, 2023

masawdah commented Sep 28, 2023

masawdah commented Oct 28, 2023

masawdah commented Nov 3, 2023

martinfleis commented Dec 13, 2023

martinfleis commented Dec 13, 2023

martinfleis left a comment

Choose a reason for hiding this comment

masawdah commented Dec 15, 2023

codecov bot commented Sep 25, 2023 •

edited

Loading