dropna() for a Series indexed by a CFTimeIndex #2688

Closed
spencerkclark opened this issue Jan 17, 2019 · 3 comments

@spencerkclark
Member

Code Sample, a copy-pastable example if possible

Currently something like the following raises an error:

In [1]: import xarray as xr

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: times = xr.cftime_range('2000', periods=3)

In [5]: series = pd.Series(np.array([0., np.nan, 1.]), index=times)

In [6]: series
Out[6]:
2000-01-01 00:00:00    0.0
2000-01-02 00:00:00    NaN
2000-01-03 00:00:00    1.0
dtype: float64

In [7]: series.dropna()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-45eb0c023203> in <module>
----> 1 series.dropna()

~/pandas/pandas/core/series.py in dropna(self, axis, inplace, **kwargs)
   4169
   4170         if self._can_hold_na:
-> 4171             result = remove_na_arraylike(self)
   4172             if inplace:
   4173                 self._update_inplace(result)

~/pandas/pandas/core/dtypes/missing.py in remove_na_arraylike(arr)
    539         return arr[notna(arr)]
    540     else:
--> 541         return arr[notna(lib.values_from_object(arr))]

~/pandas/pandas/core/series.py in __getitem__(self, key)
    801         key = com.apply_if_callable(key, self)
    802         try:
--> 803             result = self.index.get_value(self, key)
    804
    805             if not is_scalar(result):

~/xarray-dev/xarray/xarray/coding/cftimeindex.py in get_value(self, series, key)
    321         """Adapted from pandas.tseries.index.DatetimeIndex.get_value"""
    322         if not isinstance(key, slice):
--> 323             return series.iloc[self.get_loc(key)]
    324         else:
    325             return series.iloc[self.slice_indexer(

~/xarray-dev/xarray/xarray/coding/cftimeindex.py in get_loc(self, key, method, tolerance)
    300         else:
    301             return pd.Index.get_loc(self, key, method=method,
--> 302                                     tolerance=tolerance)
    303
    304     def _maybe_cast_slice_bound(self, label, side, kind):

~/pandas/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2595                                  'backfill or nearest lookups')
   2596             try:
-> 2597                 return self._engine.get_loc(key)
   2598             except KeyError:
   2599                 return self._engine.get_loc(self._maybe_cast_indexer(key))

~/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '[ True False  True]' is an invalid key

Problem description

We currently rely on this in the resampling logic within xarray for a Series indexed by a DatetimeIndex:

first_items = first_items.dropna()

It would be nice if we could do the same with a Series indexed by a CFTimeIndex, e.g. in #2593.

Expected Output

In [7]: series.dropna()
Out[7]:
2000-01-01 00:00:00    0.0
2000-01-03 00:00:00    1.0
dtype: float64
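
For reference, the expected result is the same as selecting the non-missing rows positionally with a boolean mask. The following is a minimal sketch, not something taken from the report: it uses a plain object-dtype index of `datetime.datetime` values as a stand-in for a CFTimeIndex (an assumption, made so the snippet runs without cftime installed), and whether `.iloc` avoids the failing code path on the pandas development version listed below has not been verified here.

```python
import datetime

import numpy as np
import pandas as pd

# Stand-in for a CFTimeIndex: an object-dtype index of datetime objects
# (assumption for illustration only).
times = pd.Index(
    [datetime.datetime(2000, 1, day) for day in (1, 2, 3)], dtype=object
)
series = pd.Series(np.array([0.0, np.nan, 1.0]), index=times)

# Build the boolean mask explicitly and apply it by position.
mask = series.notna().to_numpy()
expected = series.iloc[mask]
```

Here `expected` holds the two rows for 2000-01-01 and 2000-01-03, which is what `series.dropna()` is expected to return.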

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42) [Clang 9.0.0 (clang-900.0.37)]
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.10.9+117.g80914e0.dirty
pandas: 0.24.0.dev0+1332.g5d134ec
numpy: 1.15.4
scipy: 1.1.0
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
cyordereddict: None
dask: 1.0.0
distributed: 1.25.2
matplotlib: 3.0.2
cartopy: None
seaborn: 0.9.0
setuptools: 40.6.3
pip: 18.1
conda: None
pytest: 3.10.1
IPython: 7.2.0
sphinx: None

@spencerkclark
Member Author

The issue seems to stem from the fact that the TypeError produced by index.get_value(series, [True, False, True]) is not one of the exceptions that pandas.Series.__getitem__ is written to handle.

In the case of a DatetimeIndex, index.get_value(series, [True, False, True]) also raises a TypeError initially, but pandas catches it and raises an InvalidIndexError in its place:

In [1]: import xarray as xr; import pandas as pd; import numpy as np

In [2]: times = pd.date_range('2000', periods=3)

In [3]: series = pd.Series([0., np.nan, 1.], index=times)

In [4]: times.get_value(series, [True, False, True])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/pandas/pandas/core/indexes/base.py in get_value(self, series, key)
  4290             return self._engine.get_value(s, k,
-> 4291                                           tz=getattr(series.dtype, 'tz', None))
  4292         except KeyError as e1:

~/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

~/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

~/pandas/pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()

TypeError:

During handling of the above exception, another exception occurred:

InvalidIndexError                         Traceback (most recent call last)
<ipython-input-7-1b8f8313de2a> in <module>
----> 1 times.get_value(series, [True, False, True])

~/pandas/pandas/core/indexes/datetimes.py in get_value(self, series, key)
   934
   935         try:
--> 936             return com.maybe_box(self, Index.get_value(self, series, key),
   937                                  series, key)
   938         except KeyError:

~/pandas/pandas/core/indexes/base.py in get_value(self, series, key)
  4310             if is_scalar(key):  # pragma: no cover
  4311                 raise IndexError(key)
-> 4312             raise InvalidIndexError(key)
  4313
  4314     def set_value(self, arr, key, value):

InvalidIndexError: [True, False, True]

This would seem to offer a simple fix for CFTimeIndex.get_value (i.e. catch the TypeError and raise an InvalidIndexError); however, InvalidIndexError is unfortunately not a public exception in pandas. Raising a KeyError instead happens to work, but I'm not sure it's safe to rely on that either, because we would be at the whim of how it gets handled in Series.__getitem__.
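
The catch-and-re-raise idea can be illustrated with a toy example. This is only a sketch under stated assumptions: `ToyIndex` is a hypothetical class written for illustration, not pandas or xarray code, and it says nothing about how a given pandas version's `Series.__getitem__` reacts to the `KeyError`.

```python
class ToyIndex:
    """Hypothetical label index backed by a dict (illustration only)."""

    def __init__(self, labels):
        self._positions = {label: i for i, label in enumerate(labels)}

    def get_loc(self, key):
        # A dict lookup raises TypeError for unhashable keys such as lists.
        return self._positions[key]

    def get_value(self, values, key):
        try:
            return values[self.get_loc(key)]
        except TypeError as err:
            # Translate the TypeError into an exception type that the
            # caller is assumed to know how to handle.
            raise KeyError(repr(key)) from err


index = ToyIndex(["a", "b", "c"])
values = [0.0, float("nan"), 1.0]

caught = None
try:
    index.get_value(values, [True, False, True])
except KeyError as err:
    caught = err
```

After this runs, `caught` is a `KeyError` whose `__cause__` is the original `TypeError`, while ordinary label lookups such as `index.get_value(values, "c")` are unaffected.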

@shoyer do you think you might have a recommendation here? Does either one of those options make sense, or might there be an alternative?

@shoyer
Member

shoyer commented Jan 29, 2019

@spencerkclark Ugh, this part of pandas is a real mess. Probably the easiest option would be to support boolean indexers directly in CFTimeIndex.get_value (by checking for a boolean dtype).
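
One way this suggestion might look is sketched below. It is an assumption-laden illustration rather than the eventual xarray change: `get_value_sketch` is a hypothetical free function, and a `DatetimeIndex` from `pd.date_range` stands in for a CFTimeIndex so that the snippet runs without cftime.

```python
import numpy as np
import pandas as pd


def get_value_sketch(index, series, key):
    """Hypothetical stand-in for an index's get_value method."""
    # Boolean indexers (lists or arrays of True/False) are applied by
    # position instead of being looked up as labels.
    if np.asarray(key).dtype == np.dtype(bool):
        return series.iloc[np.asarray(key)]
    # Slices select a range of labels.
    if isinstance(key, slice):
        return series.iloc[index.slice_indexer(key.start, key.stop, key.step)]
    # Anything else is treated as a single label.
    return series.iloc[index.get_loc(key)]


times = pd.date_range("2000", periods=3)
series = pd.Series([0.0, np.nan, 1.0], index=times)

# The kind of mask that dropna() builds internally.
kept = get_value_sketch(times, series, series.notna().to_numpy())
# A single-label lookup still goes through get_loc.
single = get_value_sketch(times, series, times[1])
```

Checking the dtype up front means the boolean case never reaches `get_loc`, so no exception has to be translated at all.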

@spencerkclark
Member Author

> Probably the easiest option would be to support boolean indexers directly in CFTimeIndex.get_value (by checking for a boolean dtype).

Good idea -- I'll see what I can do.
