Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/REF: TimedeltaIndex.__new__ #23539

Merged
merged 19 commits into from
Nov 11, 2018
Merged

Conversation

jbrockmendel
Copy link
Member

TimedeltaIndex.__new__ type-checking is tough to decipher, and hides multiple bugs

>>> dti = pd.date_range('2016-01-01', periods=3)
>>> pd.TimedeltaIndex(np.array(dti))
TimedeltaIndex(['16801 days', '16802 days', '16803 days'], dtype='timedelta64[ns]', freq=None)

In addition to this incorrect behavior being supported, it is actually tested in tests.indexes.timedeltas.test_ops in what I have to assume is a copy/paste mixup of some kind.

>>> pd.TimedeltaIndex([0.5])
TimedeltaIndex(['0 days'], dtype='timedelta64[ns]', freq=None)

This PR fixes the constructor to

  1. be much more linear and clear than the existing constructor
  2. reject datetime64 and anything that isn't explicitly supported
  3. accept floating-dtype if and only if they can be converted losslessly: np.array([19.0]) is OK, np.array([19.5]) is not. np.nan is correctly converted to NaT

Other assorted small changes:
Move an incorrectly placed # ---- section divider in scalar timedelta tests.
Change a repeated-4-times 4-liner (setting result.freq = inferred_freq) to a 2-linear
Move raising and returning-early cases to the top of DatetimeIndex.__new__ and TimedeltaIndex.__new__

Has tests, does not have whatsnew note, pending suggestions as to which parts merit notes.

Does not significantly change the TimedeltaArray constructor; in an upcoming step we can move much of the logic there and have the TDI constructor wrap a/the TDA constructor.

@pep8speaks
Copy link

Hello @jbrockmendel! Thanks for submitting the PR.

@gfyoung gfyoung added Bug Refactor Internal refactoring of code Datetime Datetime data dtype Timedelta Timedelta data type labels Nov 7, 2018
pandas/core/arrays/datetimes.py Show resolved Hide resolved
pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved
pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved
pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved
pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 7, 2018

@jbrockmendel It seems you are adding net a lot of code. Is that because some of the things were previously handled by the to_timedelta call?

But that said, I think some of those things (eg the float handling) should actually be fixed in to_timedelta, and not here, and the TimedeltaIndex constructor can keep making use of that.

pandas/core/indexes/datetimes.py Outdated Show resolved Hide resolved
pandas/core/indexes/timedeltas.py Show resolved Hide resolved

# Convert whatever we have into timedelta64[ns] dtype
if is_object_dtype(data) or is_string_dtype(data):
# no need to make a copy, need to convert if string-dtyped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would u check is_string_dtype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we have several tests that specifically pass e.g. np.array(['1 days', '4 days'], dtype='<S6'). (BTW we don't have similar tests for DatetimeIndex, which I intend to fix in the DTI analogue of this PR)

pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved
pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

Does not significantly change the TimedeltaArray constructor; in an upcoming step we can move much of the logic there and have the TDI constructor wrap a/the TDA constructor.

@jbrockmendel as also mentioned in the other PR, Tom and I assumed that the consensus was to not do this (keep the TDA constructor simple), so if you still want to do this, please respond to the issue I linked then: #23212

@TomAugspurger
Copy link
Contributor

What's the issue with passing a datetime64[ns] array to TDI? NumPy is happy with it

In [50]: np.asarray(dr).astype("timedelta64[ns]")
Out[50]:
array([1452816000000000000, 1452902400000000000, 1452988800000000000,
       1453075200000000000, 1453161600000000000, 1453248000000000000],
      dtype='timedelta64[ns]')

and we match that

In [51]: np.asarray(dr).astype("timedelta64[ns]") == np.asarray(pd.TimedeltaIndex(np.asarray(dr)))
Out[51]: array([ True,  True,  True,  True,  True,  True])

@jreback
Copy link
Contributor

jreback commented Nov 7, 2018

@TomAugspurger you can do that but IMHO this is a TypeError
if you actually want a timedelta from a datetime the. you need to subtract a datetime
this is really preventing foot shooting

@jorisvandenbossche
Copy link
Member

I would personally be fine with raising a TypeError, since I think this can be quite surprising behaviour (it's just using the integer values). But also fine to not deviate from numpy behaviour.

But given it is numpy behaviour, so if we want to change it, I think we should deprecate it first.

@TomAugspurger
Copy link
Contributor

IMO this is a TypeError

"is a" -> "should be"?

I don't really disagree, and we do raise for TimedeltaIndex(DatetimeIndex), so some consistency there would be nice. But I suspect that we have people relying on this, so a deprecation cycle would be needed.

@jreback
Copy link
Contributor

jreback commented Nov 7, 2018

ok by me to deprecate first

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 7, 2018

this is really preventing foot shooting

I do realize my argument of "but NumPy allows it" isn't great :)

In [3]: np.array([1, np.nan]).astype(int)
Out[3]: array([                   1, -9223372036854775808])

setting aside the TDI(ndarray[timedelta64]) stuff. I think the important question is whether we want

Does not significantly change the TimedeltaArray constructor; in an upcoming step we can move much of the logic there and have the TDI constructor wrap a/the TDA constructor.

@jbrockmendel By "constructor" do you mean TimedeltaArray.__init__, or just a generic term for creating a TimedeltaArray from some input?

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche

But that said, I think some of those things (eg the float handling) should actually be fixed in to_timedelta, and not here,

It can be difficult to tell exactly what is handled within the to_timedelta calls because there are two of them reached under separate conditions, and most of the relevant logic is then done again below that anyway. I'd be +1 on improving to_timedelta later.

as also mentioned in the other PR, Tom and I assumed that the consensus was to not do this (keep the TDA constructor simple), so if you still want to do this, please respond to the issue I linked then: #23212

I'll restate my opinion there when I'm ready, thank you. I avoided changing the Array constructors specifically to avoid this being an issue; really shouldn't have mentioned it all in the OP.

@jbrockmendel
Copy link
Member Author

I think the important question is whether we want [...] By "constructor" do you mean TimedeltaArray.init, or just a generic term for creating a TimedeltaArray from some input?

@TomAugspurger same as above; that is an important question, but I went out of my way to keep it totally orthogonal to this PR.

@jbrockmendel
Copy link
Member Author

It sounds like the discussion w/r/t TimedeltaIndex(np.array(dti)) has resolved itself overnight with an agreement that there should be a deprecation before a TypeError. Am I reading the room correctly?

@TomAugspurger
Copy link
Contributor

It sounds like the discussion w/r/t TimedeltaIndex(np.array(dti)) has resolved itself overnight with an agreement that there should be a deprecation before a TypeError. Am I reading the room correctly?

I think so.

I'll restate my opinion there when I'm ready, thank you. I avoided changing the Array constructors specifically to avoid this being an issue; really

Ok. I think I saw "refactor TimedeltaIndex.__new__" and thought it meant something closer to what we did for PeriodIndex.__new__, but we may not be there yet.

@jorisvandenbossche
Copy link
Member

It can be difficult to tell exactly what is handled within the to_timedelta calls because there are two of them reached under separate conditions, and most of the relevant logic is then done again below that anyway. I'd be +1 on improving to_timedelta later.

The bug you are mentioning (pd.TimedeltaIndex([0.5]) ignoring the decimal) is coming from to_timedelta doing that (I just verified it is indeed taking this path). So you are repeating and fixing here what to_timedelta is already doing, so if you want to fix this behaviour, it really seems logical to me to fix this in to_timedelta instead of expanding the code here.

It sounds like the discussion w/r/t TimedeltaIndex(np.array(dti)) has resolved itself overnight with an agreement that there should be a deprecation before a TypeError. Am I reading the room correctly?

Yes, I think so.

@jbrockmendel
Copy link
Member Author

So you are repeating and fixing here what to_timedelta is already doing, so if you want to fix this behaviour, it really seems logical to me to fix this in to_timedelta instead of expanding the code here.

AFAICT the to_timedelta call (L168 in master) is almost superfluous given what comes below (which was recently moved there IIRC because it was bizarrely located in _simple_new before that). If we disable the branch that calls to_timedelta, the only other changes we need to make everything work (or specifically, pass existing tests) is to a) convert generators to lists before passing into data = np.array(data, copy=False) and b) catch string-dtypes in the same block that catches np.object_ just below that. Which is to say: calling to_timedelta at all is highly redundant.

That said, an argument that I would find compelling is that we definitely don't want the behavior of TimedeltaIndex([x, y, z]) to diverge from the behavior of to_timedelta([x, y, z]), which this PR currently does. So I'll take a look at combining these after some more caffeine.

@jorisvandenbossche
Copy link
Member

If we disable the branch that calls to_timedelta, the only other changes we need to make everything work (or specifically, pass existing tests) is to a) convert generators to lists before passing into data = np.array(data, copy=False) and b) catch string-dtypes in the same block that catches np.object_ just below that. Which is to say: calling to_timedelta at all is highly redundant.

If you are going to look at it, I would recommend trying to make the checks you mention here redundant, by letting to_timedelta handle it. Eg strings is something that to_timedelta is made for to handle.
I think the object dtype handling could be removed, see below (but certainly didn't check all cases, so this might be too optimistic)

Dived into the (current master) code for a moment, and you have this:

# convert if not already
if getattr(data, 'dtype', None) != _TD_DTYPE:
data = to_timedelta(data, unit=unit, box=False)
elif copy:
data = np.array(data, copy=True)
data = np.array(data, copy=False)
if data.dtype == np.object_:
data = array_to_timedelta64(data)

Here, the conversion of everything that is not yet a timedelta64 array is first handled with to_timedelta. But at the end of the snippet, we still call array_to_timedelta64(data) (what is used inside to_timedelta) if the data is object dtype. But didn't we already just convert everything not m8 with to_timedelta?
So this happens if data is a TimedeltaArrayMixin (or at least one of the cases): TimedeltaArrayMixin still has a m8[ns] numpy dtype, so the getattr(data, 'dtype', None) != _TD_DTYPE is passed; so we then convert it with np.array(data) to an array of Timedelta objects, to then convert it back to m8 with array_to_timedelta64.
So this is of course just a redundant back and forth conversion, that can be avoided with a check if data is a TimedeltaArrayMixin, and in that case get the underlying numpy array of it.
(similar to what DatetimeIndex also does, it has a check if isinstance(data, DatetimeArrayMixin))

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche I think we're in agreement on the relevant part which is that the status quo is redundant and unclear. Give me some time to adjust the PR to see if we can reach something closer to optimal.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some minor comments.

pandas/core/arrays/timedeltas.py Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Show resolved Hide resolved
pandas/core/tools/timedeltas.py Outdated Show resolved Hide resolved
sn = pd.to_timedelta(Series([pd.NaT]))
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
# Passing datetime64-dtype data to TimedeltaIndex is deprecated
sn = pd.to_timedelta(Series([pd.NaT]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what i mean is that this is all nan and inference is ambiguous at this point. so I wouldn't expect the warning (unless the dtype IS explicity passed). Can address in a followup, but pls open an issue, this is IMHO a bug.

@jreback
Copy link
Contributor

jreback commented Nov 11, 2018

@gfyoung if you have any further comments.

@jreback jreback added this to the 0.24.0 milestone Nov 11, 2018
@jreback
Copy link
Contributor

jreback commented Nov 11, 2018

@jbrockmendel ok let's merge this on green. a couple of minor followups (they are unresolved conversations).

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some suggestions and info about the docstrings format.

pandas/core/arrays/timedeltas.py Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Show resolved Hide resolved
@jreback jreback merged commit 2c982df into pandas-dev:master Nov 11, 2018
@jreback
Copy link
Contributor

jreback commented Nov 11, 2018

thanks @jbrockmendel if you'd apply @datapythonista comments next pass would be great. trying to not block your followups.

thoo added a commit to thoo/pandas that referenced this pull request Nov 11, 2018
* upstream/master:
  BUILD: Simplifying contributor dependencies (pandas-dev#23522)
  BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)
@jbrockmendel jbrockmendel deleted the pre-fixes branch November 11, 2018 23:42
@jbrockmendel
Copy link
Member Author

@datapythonista I'm adding suggested edits into #23587 (needs rebasing anyway). Definitely good suggestions.

For the "unit" description the best I've come up with is "The timedelta unit to treat integers as multiples of"; I can't think of a natural-sounding way to rephrase this that doesn't end in a preposition. Thoughts?

@datapythonista
Copy link
Member

That description sounds good to me. Can't think of anything better right now.

thoo added a commit to thoo/pandas that referenced this pull request Nov 12, 2018
…fixed

* upstream/master:
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
  BUILD: Simplifying contributor dependencies (pandas-dev#23522)
  BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)
  BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524)
  BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
  API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561)
  CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583)
  DOC: Fix Order of parameters in docstrings (pandas-dev#23611)
  TST: Unskip some Categorical Tests (pandas-dev#23613)
  TST: Fix integer ops comparison test (pandas-dev#23619)
box2 = pd.Series if box is pd.Index else box
expected = tm.box_expected(expected, box2)

result = idx * Series(rng5f + 0.1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel why did you change this? Left-over from initially changing the behaviour on floats?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Refactor Internal refactoring of code Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants