API: return Index instead of array from DatetimeIndex field accessors (GH15022) #15589

jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Mar 6, 2017

This changes the datetime field accessors of a DatetimeIndex (and PeriodIndex, etc.) to return an Index object instead of a plain array. For example:

# PR

In [1]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [2]: idx
Out[2]: 
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 10:00:00',
               '2015-01-01 20:00:00', '2015-01-02 06:00:00',
               '2015-01-02 16:00:00'],
              dtype='datetime64[ns]', freq='10H')

In [3]: idx.hour
Out[3]: Int64Index([0, 10, 20, 6, 16], dtype='int64')

instead of

# master

In [1]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [2]: idx.hour
Out[2]: array([ 0, 10, 20,  6, 16], dtype=int32)

@jorisvandenbossche jorisvandenbossche added API Design Datetime Datetime data dtype labels Mar 6, 2017
@jorisvandenbossche
Member Author

One failing test I am not sure what to do about. On master, the following preserves the name:

In [1]: idx = pd.date_range("2015-01-01", periods=10, freq='10H', name='name!')

In [2]: idx.map(lambda x: x.hour)
Out[2]: Int64Index([0, 10, 20, 6, 16, 2, 12, 22, 8, 18], dtype='int64', name='name!')

but with this PR it no longer does. The reason is the DatetimeIndex.map implementation: if the function returns an Index, that result is returned as-is; otherwise Index.map is used, which passes the attributes through.
There actually also seems to be a bug in the map implementation:

def map(self, f):
    try:
        result = f(self)
        # Try to use this result if we can
        if isinstance(result, np.ndarray):
            self._shallow_copy(result)
        if not isinstance(result, Index):
            raise TypeError('The map function must return an Index object')
        return result
    except Exception:
        return self.asobject.map(f)
(On line 337, _shallow_copy is called, but nothing is done with the result.) cc @nateyoder
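
To make the discarded-result problem concrete, here is a minimal, self-contained sketch. It does not use pandas: `ToyIndex`, `map_discarding` and `map_assigning` are hypothetical names invented for illustration, and a plain list stands in for `np.ndarray`. In the real method the TypeError would be swallowed by the surrounding `except Exception` and the code would fall back to `self.asobject.map(f)`; the sketch leaves that fallback out so the difference is visible.

```python
class ToyIndex:
    """Hypothetical stand-in for an Index: a list of values plus a name."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name

    def _shallow_copy(self, values):
        # Re-wrap raw values while carrying over attributes such as the name.
        return ToyIndex(values, name=self.name)

    def map_discarding(self, f):
        # Mirrors the quoted pattern: the wrapped copy is built, then dropped.
        result = f(self)
        if isinstance(result, list):  # a list stands in for np.ndarray
            self._shallow_copy(result)
        if not isinstance(result, ToyIndex):
            raise TypeError("The map function must return an Index object")
        return result

    def map_assigning(self, f):
        # Same logic, but the wrapped copy replaces the raw result.
        result = f(self)
        if isinstance(result, list):
            result = self._shallow_copy(result)
        if not isinstance(result, ToyIndex):
            raise TypeError("The map function must return an Index object")
        return result


def double(index):
    # Returns raw values (a list), not a ToyIndex.
    return [v * 2 for v in index.values]


idx = ToyIndex([1, 2, 3], name="name!")

try:
    idx.map_discarding(double)
    discarding_raised = False
except TypeError:
    discarding_raised = True

fixed = idx.map_assigning(double)
print(discarding_raised, fixed.values, fixed.name)
```

With the assignment in place, a raw-array result is wrapped and keeps the name, so the type check passes instead of raising.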

@jorisvandenbossche
Member Author

Maybe also more in general: should those field accessors preserve the index name?

@jreback
Contributor

jreback commented Mar 6, 2017

these should have the name of the Series themselves (e.g. the name of the values)

@@ -106,6 +106,8 @@ def _delegate_property_get(self, name):
elif not is_list_like(result):
return result

result = np.asarray(result)

Contributor

hmm, why are you converting back to an ndarray here? I don't think this is necessary

Member Author

The take_1d two lines below needs an array, not an Index.
I could convert it only for that call, but thought it could do no harm to put it here, since it is otherwise passed to Series as values and so will be converted to an array anyway.

Contributor

you can change it to

result.take(...) which will handle this. It was done this way because it was an array originally.

Member Author

Another reason to convert to an array is so that Series does not take a copy of the values (which it does if you pass an Index object I think)

@jorisvandenbossche
Member Author

these should have the name of the Series themselves

There is no Series here, it is only about Index

@@ -77,16 +77,19 @@ def f(self):

result = tslib.get_start_end_field(values, field, self.freqstr,
month_kw)
result = self._maybe_mask_results(result, convert='float64')

Contributor

I think you can have a single result = self._maybe_mask_results(result, convert='float64') just before returning; it won't do anything to something w/o nan's anyhow (and is more clear code)

Member Author

the problem is with the weekday_name, which gives strings, and for this the astype('float64') will fail

Member Author

And I am also not sure why is_leap_year is treated differently, but converting missing values back to NaN would be an API change, as for some reason that attribute currently keeps its missing values as False:

In [14]: idx = pd.DatetimeIndex(['2012-01-01', pd.NaT, '2013-01-01'])

In [15]: idx.is_leap_year
Out[15]: Index([True, False, False], dtype='object')

Member Author

About is_leap_year: this was done on purpose in #13739, citing @sinhrks: "pd.NaT.is_leap_year results in False, as I think users want bool array."

But this does not seem very consistent with the other is_ methods (though I would keep that for another issue/PR).

Contributor

@jreback jreback Mar 6, 2017

oh, this is a bug in _mask_missing_values then. It needs to ignore object and boolean dtypes (or better yet, only work on is_numeric_dtype).

If you can't get it to work (in the time you have allowed), lmk and I'll take a look.


return self._maybe_mask_results(result, convert='float64')
return Index(result)
Contributor

name=self.values.name

@jreback
Contributor

jreback commented Mar 6, 2017

Maybe also more in general: should those field accessors preserve the index name?

yes

@jreback
Contributor

jreback commented Mar 6, 2017

obviously when finished, need a sub-section in whatsnew for this, it is technically an API change, though actually should be back-compat

@jreback
Contributor

jreback commented Mar 6, 2017

these should have the name of the Series themselves
There is no Series here, it is only about Index

no what I mean is the name of the result index should be the name of the original Series values

IOW

In [16]: s = Series(pd.date_range('20130101',periods=3), name='foo')

In [17]: s.dt.day
Out[17]: 
0    1
1    2
2    3
Name: foo, dtype: int64

In [18]: Index(s.dt.day, name='foo')
Out[18]: Int64Index([1, 2, 3], dtype='int64', name='foo')

@jorisvandenbossche
Member Author

Sorry, I still don't understand. Do you mean that eg the s.index.day attribute takes s.name as its name (instead of s.index.name)?
But an Index can live completely independently of a Series, so I don't see why this should be the case (or do we have examples of that somewhere else in pandas?)

@jreback
Contributor

jreback commented Mar 6, 2017

Sorry, I still don't understand. Do you mean that eg the s.index.day attribute takes s.name as its name (instead of s.index.name)?
But an Index can live completely independently of a Series, so I don't see why this should be the case (or do we have examples of that somewhere else in pandas?)

yes of course, you are working on the values, so you return the values .name attribute. this is standard practice, for example any type of operation.

This is de facto the same as doing:

In [1]: s = Series([1,2,3],index=Index(list('abc'), name='bar'), name='foo')

In [2]: s
Out[2]: 
bar
a    1
b    2
c    3
Name: foo, dtype: int64

In [3]: pd.Index(s)
Out[3]: Int64Index([1, 2, 3], dtype='int64', name='foo')

@jorisvandenbossche
Member Author

jorisvandenbossche commented Mar 6, 2017

@jreback the starting object in this PR is an index, not a series. So the values I pass to Index are coming from an Index, not from a Series.
So I suppose it has to take the name of the Index, but there is no Series involved here.

(self.name is the Index name)
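
As a rough illustration of what "take the name of the Index" means for a field accessor, the following sketch builds accessors with a small property factory. It is loosely modelled on the `_field_accessor` pattern visible in the diffs, but `ToyIndex` and `ToyDatetimeIndex` are hypothetical classes written for this example, not the pandas implementation.

```python
from datetime import datetime


class ToyIndex:
    """Hypothetical stand-in for an Index: a list of values plus a name."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name


def _field_accessor(field):
    # Build a read-only property that extracts `field` from every element
    # and wraps the result in a new index carrying the parent's name.
    def f(self):
        result = [getattr(v, field) for v in self.values]
        return ToyIndex(result, name=self.name)

    f.__name__ = field
    return property(f)


class ToyDatetimeIndex(ToyIndex):
    hour = _field_accessor("hour")
    day = _field_accessor("day")


idx = ToyDatetimeIndex(
    [datetime(2015, 1, 1, 0), datetime(2015, 1, 1, 10), datetime(2015, 1, 1, 20)],
    name="name!",
)
hours = idx.hour
print(hours.values, hours.name)
```

No Series is involved anywhere: the only name available to the accessor is the one on the index it is called on.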

@jreback
Contributor

jreback commented Mar 6, 2017

@jorisvandenbossche this only affects the delegates (which is always a Series). NOT directly from the index (though actually that should also propagate the name).

@chris-b1
Contributor

chris-b1 commented Mar 6, 2017

@jorisvandenbossche - I don't feel strongly about this, but given that the dt accessors return a like shaped array, wouldn't it make sense to wrap the results back in a Series? E.g., no different than this:

In [35]: s = pd.Series(['a', 'b', 'c'])

In [36]: s.str.upper()
Out[36]: 
0    A
1    B
2    C
dtype: object

@jorisvandenbossche
Member Author

@chris-b1 this PR is about Index, not Series (will add better description at the top and whatsnew to make this more clear). So the equivalent example is:

In [55]: s = pd.Index(['a', 'b', 'c'])

In [56]: s.str.upper()
Out[56]: Index(['A', 'B', 'C'], dtype='object')

So in fact this makes the datetime fields more consistent with the str methods: the former currently return an array, while the string methods already return the result wrapped in an Index.

@chris-b1
Contributor

chris-b1 commented Mar 6, 2017

Oh, yep that makes sense then, sorry I basically only read the title.

@jorisvandenbossche jorisvandenbossche changed the title [WIP] API: return Index instead of array from datetime field accessors (GH15022) [WIP] API: return Index instead of array from DatetimeIndex field accessors (GH15022) Mar 6, 2017
@jorisvandenbossche
Member Author

jorisvandenbossche commented Mar 6, 2017

Ah, yes :-) updated the title to make that more clear (although it is not only for DatetimeIndex, but also PeriodIndex and TimedeltaIndex). And that reminds me, I don't think I already changed this for TimedeltaIndex

TODO:

  • same change for TimedeltaIndex field accessors for consistency? (-> days, seconds, total_seconds)

@jreback
Contributor

jreback commented Mar 6, 2017

same change for TimedeltaIndex field accessors for consistency? (-> days, seconds, total_seconds)

this should be for all datetime-like accessors I think (no exclusions).

@@ -509,6 +509,10 @@ def test_fields(self):
tm.assert_series_equal(s.dt.seconds, Series(
[10 * 3600 + 11 * 60 + 12, np.nan], index=[0, 1]))

# preserve name (GH15589)
Contributor

might be better to add something to

pandas/tests/indexes/datetimelike.py. These are inherited by all of the datetimelike test indexes.

Member Author

The only problem is that they don't have a common field attribute.

Member Author

I now ensured I have a test for each of period, timedelta, datetime that checks the name preservation, but indeed, ideally would have a test in datetimelike.py for that.

Contributor

The only problem is that they don't have a common field attribute.

you should simply run it for index._datetimelike_ops which are defined per-class

but no big deal

Member Author

That would indeed be a possibility, but I just checked and e.g. freq is also included in this list (and has a different return type). So I would have to start skipping those, which would not be that clean either.

Contributor

yeah prob should just define these as fixtures I think, then would make it really easy

https://github.com/pandas-dev/pandas/blob/master/pandas/tests/series/test_datetime_values.py#L30

@@ -52,7 +52,8 @@
def _field_accessor(name, alias, docstring=None):
def f(self):
base, mult = _gfc(self.freq)
return get_period_field_arr(alias, self._values, base)
result = get_period_field_arr(alias, self._values, base)
return Index(result)
Contributor

name=self.name

Member Author

yes, still busy :-)

@jorisvandenbossche
Member Author

same change for TimedeltaIndex field accessors for consistency? (-> days, seconds, total_seconds)

this should be for all datetime-like accessors I think (no exclusions).

Yep, I did that, but also noticed that there are quite some other operations on Index objects that also return an array. Maybe we should have a more general discussion on where to draw the line (but that is for another issue)

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Mar 7, 2017
@jorisvandenbossche jorisvandenbossche changed the title [WIP] API: return Index instead of array from DatetimeIndex field accessors (GH15022) API: return Index instead of array from DatetimeIndex field accessors (GH15022) Mar 7, 2017
@jreback
Contributor

jreback commented Mar 7, 2017

Yep, I did that, but also noticed that there are quite some other operations on Index objects that also return an array. Maybe we should have a more general discussion on where to draw the line (but that is for another issue)

yes pls create an issue (maybe with checkboxes)?

@jorisvandenbossche
Member Author

@jreback the failing test is one where a boolean Series is now object dtyped. This is because we don't have a boolean index, so it gets object dtype. But if it is then converted to a Series, this keeps object dtype

In [24]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [25]: idx.is_month_start
Out[25]: Index([True, True, True, False, False], dtype='object')

In [26]: pd.Series(idx).dt.is_month_start
Out[26]: 
0     True
1     True
2     True
3    False
4    False
dtype: object

Is there a good way to deal with this? (I can infer the dtype when it is object within the Properties delegator)

@jorisvandenbossche
Member Author

That is actually a side effect of this PR I did not consider. Returning an object index with booleans is not really good ..

On second thoughts, it is actually totally not acceptable, because filtering with a mask (boolean indexing) does not work anymore.
So if I want to keep this PR, I will have to make the return type (array vs Index) depend on the dtype of the result (bool vs numerical/string), unless we add bool support to Index.

@jreback
Contributor

jreback commented Mar 8, 2017

when we have a comparison method that returns a boolean array we just return the array directly

see _add_comparison_methods in indexes/base

so i would do the same here, just return the ndarray

@jorisvandenbossche
Member Author

@jreback updated this, if you could have a look again

This PR has the consequence that it introduces an inconsistency between the return type of different datetime field accessors (-> array for boolean fields and Index for all others). So we have to be sure we are OK with introducing this.

@jreback
Contributor

jreback commented Mar 22, 2017

@jorisvandenbossche yes will put some comments.

FYI don't cancel any travis jobs....testing the deduping auto cancellation.

@jreback
Contributor

jreback commented Mar 22, 2017

This PR has the consequence that it introduces an inconsistency between the return type of different datetime field accessors (-> array for boolean fields and Index for all others). So we have to be sure we are OK with introducing this.

as I said before, I think this is ok. but let me look.

Contributor

@jreback jreback left a comment

minor comments. suggestion for consolidating how fields are referenced a bit.

@@ -471,6 +471,38 @@ New Behavior:

s.map(lambda x: x.hour)


Contributor

can you add a ref here

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The several datetime-related attributes (see :ref:`here <timeseries.components>`
for an overview) of DatetimeIndex, PeriodIndex and TimedeltaIndex previously
Contributor

double-backticks on DatetimeIndex etc.

The several datetime-related attributes (see :ref:`here <timeseries.components>`
for an overview) of DatetimeIndex, PeriodIndex and TimedeltaIndex previously
returned numpy arrays, now they will return a new Index object (:issue:`15022`).
Only in case of a boolean field, still a boolean array is returned to support
Contributor

only in the case of a

Contributor

The last sentence is awkward, see if you can reword.

Contributor

maybe explicitly list the Index boolean methods? (e.g. is_quarter_start.....)


# boolean fields
fields = ['is_leap_year']
# other boolean fields like 'is_month_start' and 'is_month_end'
Contributor

I suppose let's make an issue for this NaT enhancement?

Member Author

Yes, will open an issue for that.

@@ -64,6 +64,7 @@ def f(self):
if self.tz is not utc:
values = self._local_timestamps()

# boolean accessors -> return array
Contributor

I think it might be worth it to add something like this:

class DatetimeIndex....:


    _boolean_ops = ['is_month_start'......]
    _datetimelike_ops = [....] + _boolean_ops

then you can use that here.
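
One way the suggested class-level lists could drive the return type is sketched below. This is only an illustration under stated assumptions: `ToyIndex`, `ToyDatetimeIndex`, `_other_ops` and `get_field` are hypothetical names, a plain list stands in for the boolean ndarray, and `is_month_start` is approximated as "day equals 1".

```python
from datetime import datetime


class ToyIndex:
    """Hypothetical stand-in for an Index: a list of values plus a name."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name


class ToyDatetimeIndex(ToyIndex):
    # Field names are declared once and reused for dispatch.
    _boolean_ops = ["is_month_start"]
    _other_ops = ["hour", "day"]
    _datetimelike_ops = _other_ops + _boolean_ops

    def get_field(self, field):
        if field not in self._datetimelike_ops:
            raise AttributeError(field)
        if field == "is_month_start":
            raw = [v.day == 1 for v in self.values]
        else:
            raw = [getattr(v, field) for v in self.values]
        # Boolean fields stay a plain array (a list here) so they can be
        # used directly as a mask; everything else is wrapped in an index.
        if field in self._boolean_ops:
            return raw
        return ToyIndex(raw, name=self.name)


idx = ToyDatetimeIndex(
    [
        datetime(2015, 1, 1, 0),
        datetime(2015, 1, 1, 10),
        datetime(2015, 1, 1, 20),
        datetime(2015, 1, 2, 6),
        datetime(2015, 1, 2, 16),
    ],
    name="name!",
)

mask = idx.get_field("is_month_start")
selected = [v for v, keep in zip(idx.values, mask) if keep]
hours = idx.get_field("hour")
print(mask, len(selected), hours.values)
```

Keeping the boolean result unwrapped is what allows it to be used for filtering, at the cost of the array-versus-Index inconsistency discussed in this thread.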

Member Author

In principle, that is indeed cleaner. But, the problem is that I would still have to distinguish here in another way, as the is_leap_year is also a boolean one, but has to be processed differently. So not sure if that is then worth it.

Contributor

why is is_leap_year different? seems that it should be the same

Contributor

that's wrong in the code, it can be treated exactly like the others.

Member Author

Because the handling of NaNs is different (that is related to the other issue of NaT not having the boolean fields, will open an issue about that). For is_leap_year (which returns False for NaT), the handling of missing values in self._maybe_mask_results(result, convert='float64') would return the wrong result.
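
The effect of a float-converting mask step on different kinds of field can be sketched in plain Python. `mask_as_float` is a hypothetical helper written for this example; it only imitates the idea of "convert to float64 and put NaN where the value is missing" and is not the pandas `_maybe_mask_results` implementation.

```python
def mask_as_float(result, is_missing):
    # Hypothetical stand-in for a float64-converting mask step:
    # cast every value to float and put NaN where the input was missing.
    return [
        float("nan") if missing else float(value)
        for value, missing in zip(result, is_missing)
    ]


is_missing = [False, True, False]  # the middle element plays the role of NaT

# Numeric field: masking gives the expected NaN for the missing entry.
hours = mask_as_float([0, 0, 10], is_missing)

# Boolean field that reports False for NaT: masking replaces that False
# with NaN and turns the remaining booleans into floats.
leap = mask_as_float([True, False, False], is_missing)

# String field (such as weekday names): the float conversion fails outright.
try:
    mask_as_float(["Sunday", "", "Tuesday"], is_missing)
    string_failed = False
except ValueError:
    string_failed = True

print(hours, leap, string_failed)
```

This is why a single masking call before returning does not work for every field: numeric fields want the NaN, a boolean field that deliberately reports False for NaT must skip it, and string fields cannot be converted at all.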

Contributor

ahh ok. pls open a new issue and I will do a followup to fix this up. It's much too special-casey. So good to go when you are ready.

elif field in ['is_leap_year']:
# no need to mask NaT
return libts.get_date_field(values, field)

# non-boolean accessors -> return Index
elif field in ['weekday_name']:
Contributor

same as above: maybe list this in DatetimeIndex, e.g. _other_ops = ['weekday_name'] or something,
just to avoid explicitly listing these in two places.

@jreback
Contributor

jreback commented Mar 22, 2017

@jorisvandenbossche let's merge this unless anything else (I'll rebase on top after).

@codecov

codecov bot commented Mar 22, 2017

Codecov Report

Merging #15589 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15589      +/-   ##
==========================================
- Coverage   91.02%   90.99%   -0.03%     
==========================================
  Files         143      143              
  Lines       49403    49407       +4     
==========================================
- Hits        44967    44960       -7     
- Misses       4436     4447      +11
Impacted Files                Coverage Δ
pandas/tseries/util.py        100% <100%> (ø) ⬆️
pandas/tseries/common.py      88.09% <100%> (-1.07%) ⬇️
pandas/tseries/converter.py   62.95% <100%> (ø) ⬆️
pandas/tseries/period.py      92.67% <100%> (+0.01%) ⬆️
pandas/tseries/tdi.py         90.23% <100%> (ø) ⬆️
pandas/tseries/index.py       95.4% <100%> (ø) ⬆️
pandas/io/gbq.py              25% <0%> (-58.34%) ⬇️
pandas/core/common.py         90.96% <0%> (-0.34%) ⬇️
pandas/core/frame.py          97.86% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 79581ff...ffacd38. Read the comment docs.

@jreback
Contributor

jreback commented Mar 22, 2017

thanks!

@jreback jreback closed this in 1a266ee Mar 22, 2017
mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
… (GH15022)

closes pandas-dev#15022

Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Closes pandas-dev#15589 from jorisvandenbossche/api-dt-fields-index and squashes the following commits:

ffacd38 [Joris Van den Bossche] doc fixes
41728a9 [Joris Van den Bossche] FIX: boolean fields should still return array
6317b6b [Joris Van den Bossche] Add whatsnew
96ed069 [Joris Van den Bossche] Preserve name for PeriodIndex field accessors
cdf6cae [Joris Van den Bossche] Preserve name for DatetimeIndex field accessors
f2831e2 [Joris Van den Bossche] Update timedelta accessors
52f9008 [Joris Van den Bossche] Fix tests
41008c7 [Joris Van den Bossche] API: return Index instead of array from datetime field accessors (GH15022)
Labels
API Design Datetime Datetime data dtype
Development

Successfully merging this pull request may close these issues.

API: let DatetimeIndex date/time components return a new Index instead of array
3 participants