Add a function to enrich STM using data from another dataset #66

SarahAlidoost · 2024-03-12T14:23:27Z

closes #63

This PR implements two functions:

stm.enrich_from_dataset: for both points and raster datasets
utils.crop: a util function for cropping a dataset, either points or raster
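
For orientation, cropping a raster dataset to the area covered by a set of points could be sketched as below. This is only an illustration under stated assumptions: the helper name crop_raster_to_points, its buffer argument, and the toy datasets are hypothetical and are not the signature of utils.crop in this PR. It also assumes lat and lon are sorted in increasing order.

```python
import numpy as np
import xarray as xr


def crop_raster_to_points(raster, points, buffer):
    """Hypothetical helper: keep raster cells inside the buffered bounding box of the points."""
    lat_min = float(points["lat"].min()) - buffer
    lat_max = float(points["lat"].max()) + buffer
    lon_min = float(points["lon"].min()) - buffer
    lon_max = float(points["lon"].max()) + buffer
    # Label-based slicing assumes lat and lon are sorted in increasing order.
    return raster.sel(lat=slice(lat_min, lat_max), lon=slice(lon_min, lon_max))


# Toy raster on a regular 1-degree grid
raster = xr.Dataset(
    {"temperature": (("lat", "lon"), np.arange(100.0).reshape(10, 10))},
    coords={"lat": np.arange(10.0), "lon": np.arange(10.0)},
)

# Toy point set: lat and lon are coordinates along the "space" dimension
points = xr.Dataset(
    coords={
        "lat": ("space", np.array([2.2, 3.7, 4.1])),
        "lon": ("space", np.array([5.5, 6.0, 6.4])),
    }
)

cropped = crop_raster_to_points(raster, points, buffer=1.0)
```

Cropping before the nearest-neighbour lookup keeps the amount of data that has to be searched small, which is presumably why a crop utility sits next to the enrich function.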

see #69 on why I changed _io.py in this PR.

rogerkuou

Hi Sarah, thanks for the nice implementation! It works well in general.

I have two comments regarding adding more checks on the incoming data. See above.

These comments come from a test I made with KNMI data. I also attached the notebook and the small example dataset here.

example_test.zip

rogerkuou · 2024-04-02T15:13:28Z

stmtools/stm.py

+
+    # do selection
+    indexers = {coord: ds[coord] for coord in list(datapoints.coords.keys())}
+    selections = datapoints.sel(indexers, method="nearest")


I ran a test using sel to get the nearest time and location, with time in datetime64[ns] format. This works under two conditions:

For the time dimension, in the specific combination of method="nearest" and datetime64[ns] format, Xarray requires the time coordinates of ds to be monotonically increasing or decreasing.

For the lat and lon dimensions, duplicated values should not exist in the corresponding coordinates.

Can we add a check at the beginning of enrich_from_dataset to enforce this?

An example check for time:

# This should be True
np.all(np.diff(ds['time'].values) > 0) or np.all(np.diff(ds['time'].values) < 0)
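
The time check can be wrapped in a small reusable function. The sketch below is an assumption about how such a helper might look; the name is_monotonic is hypothetical and is not the util function added in this PR. It builds a zero of the same dtype as the differences so that the comparison works for both numeric and datetime64 values.

```python
import numpy as np


def is_monotonic(values):
    """Return True if values are strictly increasing or strictly decreasing."""
    diffs = np.diff(values)
    # A zero of the same dtype as diffs (e.g. timedelta64[ns] for datetime input)
    zero = np.zeros((), dtype=diffs.dtype)
    return bool(np.all(diffs > zero) or np.all(diffs < zero))


time_sorted = np.array(
    ["2024-01-01", "2024-01-02", "2024-01-03"], dtype="datetime64[ns]"
)
time_shuffled = time_sorted[[1, 0, 2]]

sorted_ok = is_monotonic(time_sorted)
reversed_ok = is_monotonic(time_sorted[::-1])
shuffled_ok = is_monotonic(time_shuffled)
```

Here sorted_ok and reversed_ok are True, while shuffled_ok is False.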

An example check for lat and lon:

# Following should be True
np.unique(ds['lat'].values).shape == ds['lat'].values.shape
np.unique(ds['lon'].values).shape == ds['lon'].values.shape

Good point. It seems that xarray does not enforce CF conventions. I added two util functions to do the checks and call them in stm's _enrich_from_points_block here.

I noticed that with KDTree, these two conditions are not needed. So I removed the checks from _enrich_from_points_block.

I added some tests for these conditions

@rogerkuou to check whether the points are unique, the test np.unique(ds['lat'].values).shape == ds['lat'].values.shape is not enough, because it only checks for duplicates in one dimension, here lat. For example, distinct points can lie on one line and share the same lat value.

Instead, we need a test for cases where the combination (lat, lon, time) is duplicated. Functions like xarray.Dataset.drop_duplicates and pandas.DataFrame.duplicated can be used to write such a test, but they only work on dims and not on coords. In our case, lat and lon are coords and space is the dim, so we might need to use unstack, which leads to memory problems.
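
To illustrate the difference between the per-coordinate check and a joint check, here is a minimal NumPy sketch on toy (lat, lon) values. It is not the test added in this PR, and it loads the coordinates into memory, so it does not address the memory concern for large or lazy datasets.

```python
import numpy as np

# Four points on one line of constant latitude; only the first and last coincide
lat = np.array([52.0, 52.0, 52.0, 52.0])
lon = np.array([4.1, 4.2, 4.3, 4.1])

# Per-coordinate check: reports lat as non-unique although most points differ
lat_is_unique = np.unique(lat).shape == lat.shape

# Joint check: stack into (n_points, 2) rows and count identical rows
pairs = np.column_stack([lat, lon])
unique_pairs, counts = np.unique(pairs, axis=0, return_counts=True)
duplicated_pairs = unique_pairs[counts > 1]
```

The per-coordinate check fails for every point on the line, while the joint check isolates the single location that really occurs twice.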

As discussed, scipy KDTree works if coordinates (lat, lon, time) are duplicated and the values of the variables, e.g. temperature, are the same too. I added a test for this. If the values of the variables are not the same for duplicated coordinates, macOS and Linux behave differently when picking the value of the nearest neighbor. For this extreme case, I suggest creating a new issue and fixing it in another PR.
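
The behaviour described above can be reproduced with a small nearest-neighbour lookup using scipy.spatial.KDTree. This is a sketch on toy arrays, not the implementation in stmtools; it uses plain Euclidean distance in degrees and leaves out the time coordinate. The source points are deliberately unsorted and contain one duplicated location with identical values.

```python
import numpy as np
from scipy.spatial import KDTree

# Source points (lat, lon): unsorted, with one duplicated location
source_coords = np.array([[52.3, 4.9], [51.9, 4.5], [52.3, 4.9], [52.1, 5.1]])
source_temperature = np.array([11.0, 12.5, 11.0, 10.2])

# Points to enrich
query_coords = np.array([[52.29, 4.91], [51.95, 4.48]])

tree = KDTree(source_coords)
# With the default k=1, query returns the distance to and index of the nearest neighbour
distances, indices = tree.query(query_coords)
enriched = source_temperature[indices]
```

Which of the two duplicated source points is returned for the first query is not something to rely on, but because both carry the same temperature the enriched value is the same either way. If the duplicated points carried different values, the result would depend on that tie-breaking, which is the platform-dependent case deferred to a follow-up issue.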

SarahAlidoost · 2024-04-12T11:47:36Z

Thanks for the review. It is ready for another look.

rogerkuou

Hi @SarahAlidoost, thanks for the nice implementation!

I documented the follow-up issues in #75 and #76.

I will merge it now.

SarahAlidoost added 16 commits March 12, 2024 15:21

draft implementation of querying data

a64bb06

add scipy and xarray io to dependencies

9a5feb4

refactor enrich_from_dataset to two approaches of point and raster input data

ba41174

use KDTree instead of cKDTree

e133f5f

replace KDTree with sel method of xarray, fix a bug in fields

90ef61d

fix tests

99844c8

remove ds copy

6926a4b

add test if operations are lazy

f61f355

add util functions for cropping and unstack operations

b3101c8

fix stm enrich function

605db7e

fix and refactor util function for cropping

dc33ecb

fix an error msg

2002854

fix linter errors

9113b02

fix linters

86d7b6d

remove scipy because it is included in xarray io

da020a4

fix linter errors in _io

3dc1756

SarahAlidoost mentioned this pull request Mar 25, 2024

Github workflow in a pull request should only run on changed files #69

Open

SarahAlidoost requested a review from rogerkuou March 25, 2024 08:32

SarahAlidoost marked this pull request as ready for review March 25, 2024 08:32

fix minor things

0691161

rogerkuou requested changes Apr 2, 2024

SarahAlidoost and others added 4 commits April 8, 2024 11:27

Update stmtools/stm.py

8c473a7

Co-authored-by: Ou Ku <o.ku@esciencecenter.nl>

add two utils functions for checking coordinates

92d5301

fix test unique coords in test_util

70089d1

add a check if coords are monotonic and unigue is stm

b85415a

rogerkuou mentioned this pull request Apr 9, 2024

57 add documentation #71

Merged

SarahAlidoost added 4 commits April 10, 2024 12:21

use scipy KDTree instead of xarray unstack and sel functions

c030cf9

fix linter errors

f123336

add test for non monotonic an dduplicates coords

73acbbd

add a test for non monotonic time

3d01481

SarahAlidoost added 8 commits April 12, 2024 10:08

add type to coordinates in tests

9b3ddd8

fix a linter error

45e1900

debug: add debuging to pytest in workflow, and comment the test to check on macos

0412f10

debug comment the test to check on macos

ea580a2

fix tests comparing values instead of data arrays

9dc7804

remove util function for checking unique values

0666878

fix the test

fc784c8

remove -vv from action build

90ef7be

SarahAlidoost requested a review from rogerkuou April 12, 2024 11:47

rogerkuou mentioned this pull request May 6, 2024

Unit test for duplicated points in an STM #75

Open

rogerkuou approved these changes May 6, 2024

rogerkuou merged commit 2c62a0f into main May 6, 2024
16 checks passed

SarahAlidoost deleted the fix_63 branch May 6, 2024 14:24

rogerkuou mentioned this pull request Jun 5, 2024

STM add temporal contextual information #28

Closed


rogerkuou Apr 2, 2024

SarahAlidoost Apr 8, 2024 (edited)

SarahAlidoost Apr 10, 2024

SarahAlidoost Apr 11, 2024

SarahAlidoost Apr 12, 2024
