Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement the PintIndex #163

Merged
merged 61 commits into from
Jul 9, 2024
Merged

implement the PintIndex #163

merged 61 commits into from
Jul 9, 2024

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Mar 26, 2022

As mentioned in #162, it is possible to get the indexing functions to work, although there still is no public API.

I also still don't quite understand how other methods work since the refactor, so this only implements sel.

Usage, for anyone who wants to play around with it
import xarray as xr
from pint_xarray.index import PintMetaIndex

ds = xr.tutorial.open_dataset("air_temperature")
arr = ds.air

new_arr = xr.DataArray(
    arr.variable,
    coords={
        "lat": arr.lat.variable,
        "lon": arr.lon.variable,
        "time": arr.time.variable,
    },
    indexes={
        "lat": PintMetaIndex(arr.xindexes["lat"], {"lat": arr.lat.attrs.get("units")}),
        "lon": PintMetaIndex(arr.xindexes["lon"], {"lon": arr.lon.attrs.get("units")}),
        "time": arr.xindexes["time"],
    },
    fastpath=True,
)
new_arr.sel(
    lat=ureg.Quantity([75, 70, 65], "deg"),
    lon=ureg.Quantity([200, 202.5], "deg"),
)

This will fail at the moment because xarray treats dask arrays differently from duck-dask arrays, but passing single values works!



class PintMetaIndex(Index):
# TODO: inherit from MetaIndex once that exists
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm actually I'm not sure how a MetaIndex class would look like. So far we used the generic term "meta-index" to refer to indexes that would wrap one or several indexes, but I don't know if there will be a need to provide a generic class for that.

Copy link
Collaborator Author

@keewis keewis Mar 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it doesn't really look like we actually need a base class for that, but I noticed that a few methods don't make sense for meta-indexes, from_variables for example. It's probably fine to use the default for those, though.

@benbovy
Copy link
Member

benbovy commented Mar 28, 2022

I also still don't quite understand how other methods work since the refactor, so this only implements sel.

Here are a few comments. Happy to answer questions if any.

There are some Index methods of like isel, stack, rename that I guess do not depend on units and where you could just forward the call + args to the wrapped index.

For some other methods like concat, join, equals, reindex, the easiest would probably to require that the other indexes given as arguments must have the same units (and raise an error otherwise). An alternative option would be to implicitly convert the units of those indexes, but it's difficult to do that in an index-agnostic way unless we define and adopt in Xarray some protocol for that purpose.

The general approach used in the Xarray indexes refactor heavily relies on the type of the indexes (at least when we need to compare them together). That's not super flexible with the PintMetaIndex solution here, but I think it's reasonable enough. For example, alignment will only work when the indexes found for a common set of coordinates / dimensions are all PintMetaIndex objects. This might behave weirdly when the PintMetaIndex objects do not wrap indexes of the same type, although this would be easy to check.

I wonder whether whether PintMetaIndex should accept one unit or a map of units. The latter makes sense if, e.g., you want to reuse it to wrap multi-indexes where the corresponding coordinate variables (levels) may have different units.

Regarding Index methods like from_variables and create_variables, I guess PintMetaIndex would implement a lightweight wrapping layer to respectively get and assign a pint quantity from / to the Xarray variables.

You should also be careful when converting the units of indexed coordinates as it may get out of sync with their index. As there's no concept of "duck" index, the easiest would probably be to drop the index (and maybe reconstruct it from scratch) when the coordinates are updated.

@keewis keewis mentioned this pull request Jun 14, 2023
@benbovy
Copy link
Member

benbovy commented Aug 29, 2023

@keewis I have been looking into this once again and now I think I better understand what you'd like to achieve with the PintMetaIndex wrapper. A few other suggestions:

Wrap index coordinate variables as unit-aware variables

I'm not familiar with pint, but if a pint.Quantity works with any duck-array without coercing it into a specific type, I think that something like this may work:

class PintMetaIndex:

    def create_variables(self, variables=None):
        index_vars = self.index.create_variables(variables)

        index_vars_units = {}
        for name, var in index_vars.items():
            data = array_attach_unit(var.data, self.units[name])
            var_units = xr.Variable(var.dims, data, attrs=var.attrs, encoding=var.encoding)
            index_vars_units[name] = var_units
        
        return index_var_units

We cannot use IndexVariable since (if I remember well) it coerces the data as a pd.Index. I think it is fine to stick with Variable for now, it will work in most cases except for, e.g., xr.testing.assert_identical() that maybe checks for the variable type (although maybe check_default_indexes=False may disable it). We'll need to change that in Xarray anyway.

Set new Pint index(es)

Since PintMetaIndex is meant to wrap any existing index, it could be done via the accessors while quantifying the variables. Something like this might work:

@register_dataset_accessor("pint")
class PintDatasetAccessor:

    def quantify(self, units=_default, unit_registry=None, **unit_kwargs)):
        ...

        ds_xindexes = self.ds.xindexes
        new_indexes, new_index_vars = ds_xindexes.copy_indexes()

        for idx, idx_vars in ds_xindexes.group_by_index():
            idx_units = {k: v for k, v in units.items() if k in idx_vars}
            new_idx = PintMetaIndex(idx, idx_units)
            new_indexes.update({k: new_idx for k in idx_vars})
            new_index_vars.update(new_idx.create_variables(idx_vars))

        new_coords = xr.Coordinates(new_index_vars, new_indexes)

        # needs https://github.com/pydata/xarray/pull/8094 to work properly
        ds_updated_temp = self.ds.assign_coords(new_coords)

        ...

It is still useful to implement PintMetaIndex.from_variables() with the common PandasIndex case so that it also works with Dataset.set_xindex:

class PintMetaIndex:
    
    @classmethod
    def from_variables(cls, variables, options):
        index = xr.indexes.PandasIndex.from_variables(variables)
        units_dict = {index.index.name: options.get("units")}
        return = cls(index, units_dict)


ds = xr.Dataset(coords={"x": [1, 2]})

ds_units = ds.drop_indexes("x").set_xindex("x", PintMetaIndex, units="m")

Data selection

Beware that Index.sel() may return new index objects along with the dimension integer indexers. the PintMetaIndex wrapper will need to wrap them before returning the query results.

@benbovy
Copy link
Member

benbovy commented Aug 29, 2023

nit: I would rename PintMetaIndex to just PintIndex.

@benbovy
Copy link
Member

benbovy commented Aug 29, 2023

Further comments:

Implementing Dataset.pint.dequantify() would look very much the same than in Dataset.pint.quantify() except that it would extract the wrapped index: new_idx = self.index.

For Dataset.pint.to(), I think it is possible to convert the variables first, then re-create the underlying index with converted_idx = type(self.index).from_variables(converted_vars) and then wrap the latter with PintMetaIndex(converted_idx, new_units). I don't think it is always a good idea, though, as re-building the underlying index from scratch may be expensive and may load lazy coordinate data.

@keewis
Copy link
Collaborator Author

keewis commented Sep 13, 2023

@benbovy, with a few tweaks to your suggestions this:

In [1]: import xarray as xr
   ...: import pint_xarray
   ...: 
   ...: ureg = pint_xarray.unit_registry
   ...: ds = xr.tutorial.open_dataset("air_temperature")
   ...: q = ds.pint.quantify({"lat": "degrees", "lon": "degrees"})
   ...: q.sel(lat=ureg.Quantity(75, "deg").to("rad"))
.../xarray/core/indexes.py:473: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  index = pd.Index(np.asarray(array), **kwargs)
Out[1]: 
<xarray.Dataset>
Dimensions:  (time: 2920, lon: 53)
Coordinates:
    lat      float32 [deg] 75.0
  * lon      (lon) float32 [deg] 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lon) float32 [K] 241.2 242.5 243.5 ... 241.48999 241.79
Indexes:
    lon      PintMetaIndex
    time     PintMetaIndex
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

does work 🎉

The only hickup is that somehow we seem to call safe_cast_to_index, which gives us the warning above. This doesn't seem to affect the functionality of the index, though (at least for sel, which is the only thing I tested).

@keewis
Copy link
Collaborator Author

keewis commented Sep 13, 2023

(the failing tests are expected, I will have to update some of the workaround code)

Copy link
Member

@benbovy benbovy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keewis Excellent!

Hmm the warning is weird. safe_cast_to_index is called when a new PandasIndex object is created but there no reason to create one in your example.

Small suggestion: you could implement PintMetaIndex._repr_inline_ so that it displays the type of the wrapped index (e.g., PintMetaIndex(PandasIndex))

pint_xarray/conversion.py Outdated Show resolved Hide resolved
@benbovy
Copy link
Member

benbovy commented Sep 19, 2023

Regarding all the errors in CI TypeError: <pint_xarray.index.PintMetaIndex object at ...> cannot be cast to a pandas.Index object, an easy fix would be to implement PintMetaIndex.to_pandas_index() and just call that method for the wrapped index. This could be done for most of the rest of the Index API I think.

@keewis
Copy link
Collaborator Author

keewis commented Sep 19, 2023

would that just delegate to the underlying index, or also wrap it (probably the former, but I wanted to make sure)?

In any case, I wonder if we should just fix diff_* to not make use of obj.indexes? That way, we could avoid the conversion.

@benbovy
Copy link
Member

benbovy commented Sep 19, 2023

would that just delegate to the underlying index, or also wrap it (probably the former, but I wanted to make sure)?

In the case of .to_pandas_index() just delegate is enough. For other methods (.rename, .roll, .isel, etc.) wrapping the result will also be needed. For .sel it might also be needed to check and wrap any index returned within the IndexSelResults.

In any case, I wonder if we should just fix diff_* to not make use of obj.indexes? That way, we could avoid the conversion.

Yes definitely, I think I fixed it in one of my open PRs in the Xarray repo.

keewis added 3 commits July 6, 2024 16:09
In theory, we could also use `sel` and `loc` directly, but that would
not allow us to change the units of the result (or at least, as far as
I can tell).
@keewis
Copy link
Collaborator Author

keewis commented Jul 6, 2024

with this, all the tests pass locally (except the doctests, which I didn't try yet). For them to pass in CI as well, we need a released version of xarray's main branch (otherwise the indexes will be cast to pandas indexes in the tests). looks like for some reason they pass with the released xarray as well.

@keewis
Copy link
Collaborator Author

keewis commented Jul 6, 2024

with the recent commits, this should be ready for reviews (cc @TomNicholas, @benbovy, @jthielen).

Note that even though the index implements a couple of the methods of the custom index base class, the wrapper methods (.pint.*) should still be preferred. The reason for that is that these implementations are not yet complete: for example, PintIndex.sel should modify the IndexSelResult object returned by the wrapped index. This will be fixed in follow-up PRs, as this PR is already way too big.

@keewis keewis requested review from benbovy and TomNicholas July 6, 2024 14:33
Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool!

So right now indexes is a valid argument to the DataArray constructor but not to the Dataset constructor?

pint_xarray/conversion.py Outdated Show resolved Hide resolved
@keewis
Copy link
Collaborator Author

keewis commented Jul 8, 2024

So right now indexes is a valid argument to the DataArray constructor but not to the Dataset constructor?

What exactly are you referring to? Not sure if this is it, but the entire code base is built upon call_on_dataset, which (just like much of the methods on DataArray) converts to a temporary dataset, applies the function, then converts back.

@TomNicholas
Copy link
Member

What exactly are you referring to?

I was actually referring to your top-level example in your github comment:

Usage, for anyone who wants to play around with it
import xarray as xr
from pint_xarray.index import PintMetaIndex

ds = xr.tutorial.open_dataset("air_temperature")
arr = ds.air

new_arr = xr.DataArray(
    arr.variable,
    coords={
        "lat": arr.lat.variable,
        "lon": arr.lon.variable,
        "time": arr.time.variable,
    },
    indexes={
        "lat": PintMetaIndex(arr.xindexes["lat"], {"lat": arr.lat.attrs.get("units")}),
        "lon": PintMetaIndex(arr.xindexes["lon"], {"lon": arr.lon.attrs.get("units")}),
        "time": arr.xindexes["time"],
    },
    fastpath=True,
)
new_arr.sel(
    lat=ureg.Quantity([75, 70, 65], "deg"),
    lon=ureg.Quantity([200, 202.5], "deg"),
)

This will fail at the moment because xarray treats dask arrays differently from duck-dask arrays, but passing single values works!

But that seems to be the case when I look at the DataArray source.

@benbovy
Copy link
Member

benbovy commented Jul 9, 2024

So right now indexes is a valid argument to the DataArray constructor but not to the Dataset constructor?

Yes but this is really for internal use along with the fastpath argument. Ideally we might want to remove both and have a fast-path factory instead for DataArray, but last time I checked it looked less straightforward to refactor than it seemed.

@benbovy
Copy link
Member

benbovy commented Jul 9, 2024

Great to see this ready @keewis ! I went through the changes and it looks good to me (cannot tell much about the pint-specific logic, though).

@keewis
Copy link
Collaborator Author

keewis commented Jul 9, 2024

I was actually referring to your top-level example in your github comment:

Yes but this is really for internal use along with the fastpath argument

that comment is also two years old, you'd use the Coordinates object now

@LecrisUT
Copy link

LecrisUT commented Jul 9, 2024

I don't wish to block or postpone this PR in any way. But can someone give a quick overview of how the workflow changes with this? From the commit I see that .pint is no longer necessary (apart from .quantify, .to operations)? My rough understanding is that .integrate and such operations are now possible without .dequantify first.

@keewis
Copy link
Collaborator Author

keewis commented Jul 9, 2024

right now .pint would still be necessary, with the big advantage that indexed coordinates can carry units (which means that, yes, .integrate will actually give you the right units if you integrate by a indexed coordinate)

In new PRs, we can step by step extend the index and deprecate most of the workaround methods in the .pint namespace (I'd have to look into which ones can actually be replaced).

@LecrisUT
Copy link

LecrisUT commented Jul 9, 2024

Again unrelated, but speaking about deprecation. I have played around with warnings.deprecated with the backport to typing-extensions. It works wonderfully, highly recommend it.

@keewis keewis changed the title implement the PintMetaIndex implement the PintIndex Jul 9, 2024
@keewis
Copy link
Collaborator Author

keewis commented Jul 9, 2024

thanks all, I'll merge this now and we can continue improving in separate PRs.

The only thing I was slightly worried about is backwards-compatibility, but given that this is such a major improvement I guess we can get away with it (also, we have been recommending .pint.quantify as a way to construct quantified objects, so we might be breaking less code than expected).

@keewis keewis merged commit 20a9301 into xarray-contrib:main Jul 9, 2024
16 checks passed
@keewis keewis deleted the pint-meta-index branch July 9, 2024 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for set_xindex? Wrong units when using da.integrate() PintMetaIndex
4 participants