-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: Simplify Datetimelike constructor dispatching #23140
Conversation
Hello @jbrockmendel! Thanks for submitting the PR.
|
Codecov Report
@@ Coverage Diff @@
## master #23140 +/- ##
==========================================
+ Coverage 92.19% 92.19% +<.01%
==========================================
Files 169 169
Lines 50959 50986 +27
==========================================
+ Hits 46980 47009 +29
+ Misses 3979 3977 -2
Continue to review full report at Codecov.
|
pandas/core/arrays/datetimelike.py
Outdated
@@ -332,6 +344,9 @@ def _validate_frequency(cls, index, freq, **kwargs): | |||
# Frequency validation is not meaningful for Period Array/Index | |||
return None | |||
|
|||
# DatetimeArray may pass `ambiguous`, nothing else allowed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this? can you comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will clarify this comment. kwargs
gets passed below to cls._generate_range
, and the only kwarg that is valid there is "ambiguous", and that is only for DatetimeArray.
values = dt64arr_to_periodarr(values, freq) | ||
|
||
elif is_object_dtype(values) or isinstance(values, (list, tuple)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be is_list_like? (for the isinstance check)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is specifically for object dtype (actually, I need to add dtype=object
to the np.array
call below) since we're calling libperiod.extract_ordinals, which expects object dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specifically what happens if other non ndarray list likes hit this path? do they need handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They do need handling, but we're not there yet. The thought process for implementing these constructors piece-by-piece is
a) The DatetimeIndex/TimedeltaIndex/PeriodIndex constructors are overgrown; let's avoid that in the Array subclasses.
b) Avoid letting the implementations get too far ahead of the tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other question: where was this handled previously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard for me to say what's better in the abstract.
From the WIP PeriodArray PR, I found that having to think carefully about what type of data I had forced some clarity in the code. I liked having to explicitly reach for that _from_periods
constructor.
Regardless, I think our two goals with the array constructors should be
- Maximizing developer happiness (i.e. not users at the moment)
- Making it easier to reuse code between Index & Array subclasses
If you think we're likely to end up in a situation where being able to pass an array of objects to the main __init__
will make things easier, then by all means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am a bit puzzled why you would handle lists and and ndarray differently (tom and joris); these are clearly doing the same thing and we have a very similar handling for list likes throughout pandas
separating these is a non starter - even having a separate constructor is also not very friendly. pandas does inference on the construction which is one of the big selling points. trying to change this, esp at the micro level is a huge mental disconnect.
if you want to propose something like that pls do it in other issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am a bit puzzled why you would handle lists and and ndarray differently (tom and joris)
I don't think we are.
But, my only argument was
From the WIP PeriodArray PR, I found that having to think carefully about what type of data I had forced some clarity in the code. I liked having to explicitly reach for that _from_periods constructor.
If that's not persuasive then I'm not going to argue against handling them in the init.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having to think carefully
+1
Maximizing developer happiness
+1
Making it easier to reuse code
+1
If you think we're likely to end up in a situation where being able to pass an array of objects to the main
Yes, I think we should be pretty forgiving about what gets accepted into __init__
(for all three of Period/Datetime/Timedelta Arrays). Definitely don't want the start, end, periods
currently in the Index subclass constructors. I think by excluding those we'll keep these constructors fairly straightforward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am a bit puzzled why you would handle lists and and ndarray differently
It's not about lists vs arrays, it's about arrays of Period objects vs arrays of ordinal integers, which is something very different.
I think we should be pretty forgiving about what gets accepted into init
Being forgiving is exactly what lead to the complex Period/DatetimeIndex constructors. I think we should not make the same choice for our Array classes.
Of course it doesn't need to be that complex, as I think there are here two main usecases discussed: an array of scalar objects (eg Periods or Timestamps), or an array of the underlying storage type (eg datetime64 or ordinal integers).
I personally also think it makes the code clearer to even separate those two concepts (basically what we also did with IntegerArray), but maybe let's open an issue to further discuss that instead of here in a hidden review comment thread? (i can only open one later today )
pandas/core/arrays/period.py
Outdated
values = dt64arr_to_periodarr(values, freq) | ||
|
||
elif is_object_dtype(values) or isinstance(values, (list, tuple)): | ||
# e.g. array([Period(...), Period(...), NaT]) | ||
values = np.array(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if this is an int array? or is that prohibited? (except via _from_ordinals)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it gets passed through simple_new unchanged.
pandas/core/indexes/datetimelike.py
Outdated
@@ -430,6 +430,10 @@ def min(self, axis=None, *args, **kwargs): | |||
-------- | |||
numpy.ndarray.min | |||
""" | |||
if axis is not None and axis >= self.ndim: | |||
raise ValueError("`axis` must be fewer than the number of " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't do this here, rather this should be in valididate_* functions (if you think this is really necessary and you have a test for it)
pandas/core/indexes/datetimelike.py
Outdated
@@ -458,6 +462,10 @@ def argmin(self, axis=None, *args, **kwargs): | |||
-------- | |||
numpy.ndarray.argmin | |||
""" | |||
if axis is not None and axis >= self.ndim: | |||
raise ValueError("`axis` must be fewer than the number of " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for all of these
pandas/core/indexes/datetimes.py
Outdated
return cls._generate_range(start, end, periods, name, freq, | ||
tz=tz, normalize=normalize, | ||
closed=closed, ambiguous=ambiguous) | ||
out = cls._generate_range(start, end, periods, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out -> result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update.
@@ -45,6 +45,19 @@ def datetime_index(request): | |||
return pi | |||
|
|||
|
|||
@pytest.fixture | |||
def timedelta_index(request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eventually promote these to conftest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. For now this is a pretty bare-bones version to get the ball rolling.
@@ -344,7 +344,8 @@ def _validate_frequency(cls, index, freq, **kwargs): | |||
# Frequency validation is not meaningful for Period Array/Index | |||
return None | |||
|
|||
# DatetimeArray may pass `ambiguous`, nothing else allowed | |||
# DatetimeArray may pass `ambiguous`, nothing else will be accepted | |||
# by cls._generate_range below |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why wouldn’t u just pop the kwarg for key and pass it directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm actually that ends up being appreciably more verbose. We have to do separate cls._generate_range calls for TimedeltaArray vs DatetimeArray
values = dt64arr_to_periodarr(values, freq) | ||
|
||
elif is_object_dtype(values) or isinstance(values, (list, tuple)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specifically what happens if other non ndarray list likes hit this path? do they need handling?
pandas/core/indexes/datetimelike.py
Outdated
raise ValueError("`axis` must be fewer than the number of " | ||
"dimensions ({ndim})".format(ndim=self.ndim)) | ||
|
||
_validate_minmax_axis(axis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not what i mean
add this specifically to no.validate_* there are mechanisms for this already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, done.
pandas/core/indexes/datetimelike.py
Outdated
Raises | ||
------ | ||
ValueError | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comment above
pandas/core/indexes/datetimes.py
Outdated
# TODO: Remove this when we have a DatetimeTZArray | ||
# Necessary to avoid recursion error since DTI._values is a DTI | ||
# for TZ-aware | ||
return self._ndarray_values.size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you removing those? Those will need to be added back once we do the actual index/array split anyway, as they will be calling in the underlying array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you removing those? Those will need to be added back
Because I am OK with needing to add them back in a few days (hopefully)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But can you then try to explain me what the advantage is of moving it now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
To make it clear what still needs to be moved/implemented at the Array level. e.g. Tom's PeriodArray PR implements some things in PeriodArray that should instead be in DatetimeLikeArrayMixin. Moving these prevents this kind of mixup.
-
Because there are already a bunch of things that are going to need to be inherited from
self.values
, its better to get them all in one place and do that all at once. -
Because in the next pass I'll be implementing a decorator to do something like:
# TODO: enable this decorator once Datetime/Timedelta/PeriodIndex .values
# points to a pandas ExtensionArray
# @inherit_from_values(["ndim", "shape", "size", "nbytes",
# "asi8", "freq", "freqstr"])
class DatetimeIndexOpsMixin(DatetimeLikeArrayMixin):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving these prevents this kind of mixup.
As long as one of the index classes is still inheriting from the ArrayMixin, there will be wrong / strange mixups, that need to be cleaned up
Because in the next pass I'll be implementing a decorator to do something like:
But how would you do that if the underlying values don't yet have those attributes, because it is not yet our internal array class?
And why not move them when implementing such a decorator? Then you actually have overview of the full changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have sufficiently frustrated me into reverting this so we can move this down the field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche if you're still up, can you take a look at the newest push and verify that the parts you have a problem with have been removed?
@@ -211,6 +219,10 @@ def astype(self, dtype, copy=True): | |||
# ------------------------------------------------------------------ | |||
# Null Handling | |||
|
|||
def isna(self): | |||
# EA Interface | |||
return self._isnan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it needed to have the _isnan
concept on the arrays? We use it in some internal methods on the Index class, but for Arrays it seems to me additional complexity compared to simply defining isna
appropriately on each Array ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed elsewhere; can we mark as resolved?
pandas/core/indexes/datetimelike.py
Outdated
@@ -430,6 +430,7 @@ def min(self, axis=None, *args, **kwargs): | |||
-------- | |||
numpy.ndarray.min | |||
""" | |||
nv.validate_minmax_axis(axis) | |||
nv.validate_min(args, kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there reason not to add the axis validation to the existing validate_min
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly I don't want another function, rather you can simply check this in side the function which is already there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I'm not wild about the fact that the nv.validate_(min|max|argmin|argmax) functions now implicitly assume they are only being called on 1-dim objects, but at least the assumption is correct for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yeah, that makes sense.
And adding them in a single validation is actually also mixing two kinds of validation: validation of arguments that are purely for numpy compat (things like out
), opposed to validation of valid arguments for pandas (axis
in the Series and Index methods is also there for consistency with DataFrame, than for compat with numpy)
values = dt64arr_to_periodarr(values, freq) | ||
|
||
elif is_object_dtype(values) or isinstance(values, (list, tuple)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other question: where was this handled previously?
pandas/core/indexes/datetimelike.py
Outdated
@@ -430,6 +430,7 @@ def min(self, axis=None, *args, **kwargs): | |||
-------- | |||
numpy.ndarray.min | |||
""" | |||
nv.validate_minmax_axis(axis) | |||
nv.validate_min(args, kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly I don't want another function, rather you can simply check this in side the function which is already there.
See joris's comment above. |
The non-controversial parts of this have been ported to separate PRs. Closing. |
Implement several missing tests, particularly for TimedeltaArray
Move several things to DatetimeLikeArrayMixin that will need to be there eventually.
Misc cleanups.