-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EA: revert treatment of i8values #24623
Conversation
No idea how we would go about fixing the failing feather tests. |
@TomAugspurger any idea about the feather thing. AFAICT any change that restores the DTA/Timestamp i8 equivalence will run into this problem. |
I think you're correct. On your branch In [32]: df = pd.DataFrame({"A": pd.date_range('2000', periods=2, tz="US/Central")})
In [33]: df
Out[33]:
A
0 2000-01-01 00:00:00-06:00
1 2000-01-02 00:00:00-06:00
In [34]: pa.Table.from_pandas(df).to_pandas()
Out[34]:
A
0 2000-01-01 06:00:00-06:00
1 2000-01-02 06:00:00-06:00 Which seems strange to me... Why would us changing back to the old way break pyarrow? I'll do some digging. |
pandas/core/arrays/datetimes.py
Outdated
|
||
if values.dtype != _NS_DTYPE: | ||
msg = ( | ||
if values.dtype == np.bool_: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do u only check for book here? shouldn’t this. e inside _from_sequeneve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because up until moments ago I thought we had a test checking that pd.to_datetime(False)
returned NaT
. No idea where I got that idea... Just moved this check down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, the good news is that I'm not hallucinating non-existent test cases. the relevant test was for to_datetime(False, errors="coerce")
, so I'm moving this check back up to __init__
. To be clear, I don't think the three checks in __init__
should be there at all, but for the moment I'm trying to minimize behavior changes and just get the tests passing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not because to_datetime
has that behaviour that there should be a check for that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not because to_datetime has that behaviour that there should be a check for that here?
It's because to_datetime
has a coerce
kwarg that specified fallback behavior. If we want to disallow bools (and other dtypes presumably, bools were just the only one tested in 24024) in DatetimeArray but not in sequence_to_dt64ns, the check needs to be here.
Codecov Report
@@ Coverage Diff @@
## master #24623 +/- ##
===========================================
- Coverage 92.37% 43.03% -49.35%
===========================================
Files 166 166
Lines 52384 52368 -16
===========================================
- Hits 48390 22536 -25854
- Misses 3994 29832 +25838
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24623 +/- ##
==========================================
+ Coverage 92.38% 92.38% +<.01%
==========================================
Files 166 166
Lines 52327 52293 -34
==========================================
- Hits 48343 48312 -31
+ Misses 3984 3981 -3
Continue to review full report at Codecov.
|
@jbrockmendel this needs additional tests? |
Yes. Waiting to get weigh-in from @TomAugspurger @jorisvandenbossche @mroeschke. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am personally -1 on adding yet another constructor that can parse datetime-like values. We already have to_datetime
(and DatetimeIndex and to some extent pd.array
and Series) that can do that. Users that want to parse such values, can use those.
I would personally keep the discussion around the handling of integer values separated from that.
Adding a few comments asking for clarifications on some changes.
pandas/core/arrays/datetimes.py
Outdated
|
||
if values.dtype != _NS_DTYPE: | ||
msg = ( | ||
if values.dtype == np.bool_: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not because to_datetime
has that behaviour that there should be a check for that here?
@@ -340,7 +340,7 @@ def test_from_array_keeps_base(self): | |||
arr = np.array(['2000-01-01', '2000-01-02'], dtype='M8[ns]') | |||
dta = DatetimeArray(arr) | |||
|
|||
assert dta._data is arr | |||
assert dta._data.base is arr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because a view of arr
is taken, so the assertion is untrue as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then why was it passing on master? Because you changed it to take a view? Is it important to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I changed it back to use _from_sequence, which uses sequence_to_dt64ns, which yes takes a view. And it is important in as much as in the PR implementing sequence_to_dt64ns when the original version only took a view if it was necessary, there was a request to do it unconditionally to save a line
pandas/core/arrays/datetimes.py
Outdated
# TODO: get tz out of here altogether | ||
assert dtype is None | ||
tz = timezones.tz_standardize(tz) | ||
dtype = DatetimeTZDtype(tz=tz) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this tz normalization needs to be applied to the case of user-provided dtype=DatetimeTZDtype(...)
as well. I'll try to construct a failing test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this (roughly) passes on master, fails here
def test_foo():
idx = pd.DatetimeIndex(['2000', '2001'], tz='US/Central')
t1 = pd.DatetimeTZDtype(tz=idx.tz)
t2 = pd.DatetimeTZDtype(tz=idx[0].tz)
x = pd.arrays.DatetimeArray._simple_new(idx.view("i8"), dtype=t1)
y = pd.arrays.DatetimeArray._simple_new(idx.view("i8"), dtype=t2)
assert x.dtype.tz == y.dtype.tz
I think it should pass.
I've been out of the loop with DatetimeArray development, but overall I am +1 with the goals outlined in #24559 (comment) It would be nice if there was 1 entry point to parse all datetime values, but given that we have so many constructors already; this PR's step of maintaining consistency with the others is good. |
@jorisvandenbossche can you clarify this objection? You would like the following to raise, as it does on master? In [6]: pd.arrays.DatetimeArray(np.array(['2015-01-01']))
Out[6]:
<DatetimeArray>
['2015-01-01 00:00:00']
Length: 1, dtype: datetime64[ns] |
@jbrockmendel can you rebase |
Yes, I would prefer the situation as it is on master. Reasons for my current thinking on that:
It's basically what we have been discussing regularly before (in #23212, and probably some other places that I don't directly recall), and what we have done for IntegerArray. |
Thanks. There's a 3rd (less important) argument: performance. The slowdown in #24666 comes down to using |
This PR is on hold pending discussion in #24559. There are parts of this PR that I expect to remain unchanged regardless of how that is resolved:
|
@jbrockmendel the performance in
that's fine by me, as long as there's a fastpath when the input is known to be an ndarray, Datetime/TimedeltaArray (or box around those) with the correct dtype. |
Are we holding the RC for this? I haven't followed too closely, but IIRC there's two issues
Where are we on number 2? Do we have agreement on the desired behavior, and does this PR implement that? |
On the desired behavior (let's keep the discussion here now that the PR is open): #24559 (comment) has a nice summary. @jorisvandenbossche prefers
I think that's my preference as well, if I understand things correctly. @jbrockmendel do you need / want any help pushing this across the line? I'm busy with there things most of today, but can carve out some time if needed. It'd be good to do the RC today I think. |
As far as I understand, this PR is not related to the handling of integers (at least that is what I understood from asking @jbrockmendel about it), so I would prefer to keep the discussion about that in the issue. |
About this PR, if we want the changes of this PR in general, it's indeed easy to defer the discussion about the flexibility of the constructor by simply adding a check if dtype is M8 or int, and otherwise raising an error. But I thought there was still disagreement about whether we want this freq validation or not? (because of the performance hit) Or is that resolved? |
One thing that it does, though (from #24559 (comment)) is changing the behaviour of DatetimeArray. But so not the reverting of DatetimeIndex + adding a warning there. I would personally keep that for a separate PR, to keep it distinct from the unrelated discussion here (eg flexibility of constructor, freq validation) |
I propose to do the following:
Everyone on board? |
Not fully on board, that still changes the behaviour of DatetimeArray regarding M8 input, which is something that first should be decided upon (see the table I added in #24559, writing yet another comment with possible options) |
I'm specifically proposing to revert changes to the behavior of DatetimeArray with regard to M8 input. edit: "revert" as in revert changes made in an earlier version of this PR. |
But so without using |
The suggested game-plan didn't discuss how to get from here to there. Just that the DTA behavior would revert to how it in master. |
Hello @jbrockmendel! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 09, 2019 at 17:54 Hours UTC |
ATM I'm trying to put together a branch that only adds freq validation. De-duplicating code, making _from_sequence match |
Can't you move the freq validation of |
(that comment was about the current diff, not about your last comment, to be clear) |
That's close to what I'm doing in the other branch I mentioned. It's a little inelegant to do as a helper function since it needs to do a bit at the beginning and a bit at the end. But It's only 3 lines in the new branch, so I'm not going to sweat it for now. |
See #24686. |
One option for #24559, with some spillover into #24567.
Restores freq validation (needs test)
Retains a private fastpath