
REF: Simplify Datetimelike constructor dispatching #23140

Closed
jbrockmendel wants to merge 22 commits into master from dlike8

Conversation

jbrockmendel (Member)

Implement several missing tests, particularly for TimedeltaArray

Move several things to DatetimeLikeArrayMixin that will need to be there eventually.

Misc cleanups.

@pep8speaks

Hello @jbrockmendel! Thanks for submitting the PR.

codecov bot commented Oct 14, 2018

Codecov Report

Merging #23140 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #23140      +/-   ##
==========================================
+ Coverage   92.19%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50959    50986      +27     
==========================================
+ Hits        46980    47009      +29     
+ Misses       3979     3977       -2
Flag Coverage Δ
#multiple 90.62% <100%> (ø) ⬆️
#single 42.28% <49.18%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/period.py 93.22% <ø> (-0.02%) ⬇️
pandas/core/arrays/datetimelike.py 94.97% <100%> (+0.07%) ⬆️
pandas/core/indexes/datetimelike.py 98.25% <100%> (+0.02%) ⬆️
pandas/core/arrays/timedeltas.py 94.47% <100%> (+0.5%) ⬆️
pandas/compat/numpy/function.py 87.97% <100%> (+1.31%) ⬆️
pandas/core/indexes/timedeltas.py 90.65% <100%> (-0.12%) ⬇️
pandas/core/arrays/period.py 95.97% <100%> (+0.4%) ⬆️
pandas/io/pytables.py 92.44% <100%> (ø) ⬆️
pandas/core/indexes/datetimes.py 96.47% <100%> (-0.03%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 913f71f...b5827c7.

@@ -332,6 +344,9 @@ def _validate_frequency(cls, index, freq, **kwargs):
# Frequency validation is not meaningful for Period Array/Index
return None

# DatetimeArray may pass `ambiguous`, nothing else allowed
Contributor:

Why is this? Can you add a comment?

Member Author:

Will clarify this comment. kwargs gets passed below to cls._generate_range, and the only kwarg that is valid there is "ambiguous", which applies only to DatetimeArray.

values = dt64arr_to_periodarr(values, freq)

elif is_object_dtype(values) or isinstance(values, (list, tuple)):
Contributor:

shouldn't this be is_list_like? (for the isinstance check)

Member Author:

This is specifically for object dtype (actually, I need to add dtype=object to the np.array call below) since we're calling libperiod.extract_ordinals, which expects object dtype.
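A rough sketch of what that branch is doing (the helper name _periods_to_ordinals and the exact import path are illustrative approximations, not the pandas internals):

import numpy as np
from pandas._libs.tslibs import period as libperiod

def _periods_to_ordinals(values, freq):
    # Force object dtype so that lists/tuples of Period scalars (possibly
    # containing NaT) reach extract_ordinals in the dtype it expects.
    values = np.array(values, dtype=object)
    return libperiod.extract_ordinals(values, freq)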

Contributor:

Specifically, what happens if other non-ndarray list-likes hit this path? Do they need handling?

Member Author:

They do need handling, but we're not there yet. The thought process for implementing these constructors piece by piece is:

a) The DatetimeIndex/TimedeltaIndex/PeriodIndex constructors are overgrown; let's avoid that in the Array subclasses.
b) Avoid letting the implementations get too far ahead of the tests.

Member:

Other question: where was this handled previously?

Contributor:

It's hard for me to say what's better in the abstract.

From the WIP PeriodArray PR, I found that having to think carefully about what type of data I had forced some clarity in the code. I liked having to explicitly reach for that _from_periods constructor.

Regardless, I think our two goals with the array constructors should be

  1. Maximizing developer happiness (i.e. not users at the moment)
  2. Making it easier to reuse code between Index & Array subclasses

If you think we're likely to end up in a situation where being able to pass an array of objects to the main __init__ will make things easier, then by all means.

Contributor (@jreback, Oct 16, 2018):

I am a bit puzzled why you would handle lists and ndarrays differently (Tom and Joris); these are clearly doing the same thing, and we have very similar handling for list-likes throughout pandas.

Separating these is a non-starter; even having a separate constructor is not very friendly. pandas does inference on construction, which is one of its big selling points. Trying to change this, especially at the micro level, is a huge mental disconnect.

If you want to propose something like that, please do it in other issues.

Contributor:

I am a bit puzzled why you would handle lists and ndarrays differently (Tom and Joris)

I don't think we are.

But, my only argument was

From the WIP PeriodArray PR, I found that having to think carefully about what type of data I had forced some clarity in the code. I liked having to explicitly reach for that _from_periods constructor.

If that's not persuasive then I'm not going to argue against handling them in the init.

Member Author:

having to think carefully

+1

Maximizing developer happiness

+1

Making it easier to reuse code

+1

If you think we're likely to end up in a situation where being able to pass an array of objects to the main

Yes, I think we should be pretty forgiving about what gets accepted into __init__ (for all three of Period/Datetime/Timedelta Arrays). Definitely don't want the start, end, periods currently in the Index subclass constructors. I think by excluding those we'll keep these constructors fairly straightforward.

Member:

I am a bit puzzled why you would handle lists and ndarrays differently

It's not about lists vs arrays, it's about arrays of Period objects vs arrays of ordinal integers, which is something very different.

I think we should be pretty forgiving about what gets accepted into init

Being forgiving is exactly what led to the complex Period/DatetimeIndex constructors. I think we should not make the same choice for our Array classes.
Of course it doesn't need to be that complex, as I think there are two main use cases discussed here: an array of scalar objects (e.g. Periods or Timestamps), or an array of the underlying storage type (e.g. datetime64 values or ordinal integers).

I personally also think it makes the code clearer to even separate those two concepts (basically what we also did with IntegerArray), but maybe let's open an issue to further discuss that instead of doing it here in a hidden review comment thread? (I can only open one later today.)
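A minimal sketch of the separation being described, loosely analogous to IntegerArray (the class and method names here are illustrative, not the actual pandas API):

import numpy as np
import pandas as pd

class PeriodArraySketch:
    # Illustration only: __init__ accepts just the underlying storage,
    # an int64 array of ordinals plus a frequency.
    def __init__(self, ordinals, freq):
        self._data = np.asarray(ordinals, dtype="int64")
        self.freq = freq

    @classmethod
    def _from_ordinals(cls, ordinals, freq):
        # Explicit path for data that is already in ordinal (integer) form.
        return cls(ordinals, freq)

    @classmethod
    def _from_periods(cls, periods, freq=None):
        # Explicit path for an object array of Period scalars; the caller
        # has to say what kind of data they are passing.
        periods = np.asarray(periods, dtype=object)
        freq = freq or periods[0].freq
        return cls([p.ordinal for p in periods], freq)

arr = PeriodArraySketch._from_periods([pd.Period("2018-01"), pd.Period("2018-02")])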

values = dt64arr_to_periodarr(values, freq)

elif is_object_dtype(values) or isinstance(values, (list, tuple)):
# e.g. array([Period(...), Period(...), NaT])
values = np.array(values)
Contributor:

what if this is an int array? or is that prohibited? (except via _from_ordinals)

Member Author:

Then it gets passed through simple_new unchanged.

@@ -430,6 +430,10 @@ def min(self, axis=None, *args, **kwargs):
--------
numpy.ndarray.min
"""
if axis is not None and axis >= self.ndim:
raise ValueError("`axis` must be fewer than the number of "
Contributor:

Don't do this here; rather, this should be in the validate_* functions (if you think this is really necessary and you have a test for it).

@@ -458,6 +462,10 @@ def argmin(self, axis=None, *args, **kwargs):
--------
numpy.ndarray.argmin
"""
if axis is not None and axis >= self.ndim:
raise ValueError("`axis` must be fewer than the number of "
Contributor:

same for all of these

return cls._generate_range(start, end, periods, name, freq,
tz=tz, normalize=normalize,
closed=closed, ambiguous=ambiguous)
out = cls._generate_range(start, end, periods,
Contributor:

out -> result

Member Author:

Will update.

@@ -45,6 +45,19 @@ def datetime_index(request):
return pi


@pytest.fixture
def timedelta_index(request):
Contributor:

eventually promote these to conftest

Member Author:

Agreed. For now this is a pretty bare-bones version to get the ball rolling.
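For reference, a minimal sketch of what such a shared fixture could look like once promoted to conftest.py (the sample values are assumptions, not the final fixture):

import pandas as pd
import pytest

@pytest.fixture
def timedelta_index():
    # A small TimedeltaIndex including a missing value, enough to exercise
    # the TimedeltaArray constructor and reduction tests.
    return pd.TimedeltaIndex(["1 days", "2 days", pd.NaT, "4 days"])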

@jreback added the Datetime, Reshaping, and Period labels on Oct 14, 2018
@@ -344,7 +344,8 @@ def _validate_frequency(cls, index, freq, **kwargs):
# Frequency validation is not meaningful for Period Array/Index
return None

# DatetimeArray may pass `ambiguous`, nothing else allowed
# DatetimeArray may pass `ambiguous`, nothing else will be accepted
# by cls._generate_range below
Contributor:

Why wouldn't you just pop the kwarg by key and pass it directly?

Member Author:

Sure.

Member Author:

Hmm, actually that ends up being appreciably more verbose. We would have to do separate cls._generate_range calls for TimedeltaArray vs DatetimeArray.
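To illustrate the trade-off (the classes and the _check_frequency helper below are hypothetical stand-ins, not the pandas implementation): with **kwargs a single forwarding call serves every subclass, whereas an explicit ambiguous argument forces the caller to branch on the array type.

class _TimedeltaLike:
    @classmethod
    def _generate_range(cls, start, end, periods, freq):
        return "{}: {} periods of {}".format(cls.__name__, periods, freq)

class _DatetimeLike:
    @classmethod
    def _generate_range(cls, start, end, periods, freq, ambiguous="raise"):
        return "{}: {} periods of {} ({})".format(cls.__name__, periods, freq, ambiguous)

def _check_frequency(cls, periods, freq, **kwargs):
    # Only DatetimeArray-like callers ever put "ambiguous" into kwargs, so one
    # forwarding line covers both subclasses; popping the kwarg instead would
    # require an if/else with two separate _generate_range calls.
    return cls._generate_range(None, None, periods, freq, **kwargs)

print(_check_frequency(_TimedeltaLike, 3, "D"))
print(_check_frequency(_DatetimeLike, 3, "D", ambiguous="NaT"))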

raise ValueError("`axis` must be fewer than the number of "
"dimensions ({ndim})".format(ndim=self.ndim))

_validate_minmax_axis(axis)
Contributor:

Not what I meant. Add this specifically to nv.validate_*; there are mechanisms for this already.

Member Author:

I see, done.

Raises
------
ValueError
"""
Contributor:

see my comment above

# TODO: Remove this when we have a DatetimeTZArray
# Necessary to avoid recursion error since DTI._values is a DTI
# for TZ-aware
return self._ndarray_values.size
Member:

Why are you removing those? Those will need to be added back once we do the actual index/array split anyway, as they will be calling into the underlying array?

Member Author:

Why are you removing those? Those will need to be added back

Because I am OK with needing to add them back in a few days (hopefully)

Member:

But can you then try to explain to me what the advantage is of moving it now?

Member Author:

  1. To make it clear what still needs to be moved/implemented at the Array level. e.g. Tom's PeriodArray PR implements some things in PeriodArray that should instead be in DatetimeLikeArrayMixin. Moving these prevents this kind of mixup.

  2. Because there are already a bunch of things that are going to need to be inherited from self.values, it's better to get them all in one place and do that all at once.

  3. Because in the next pass I'll be implementing a decorator to do something like:

# TODO: enable this decorator once Datetime/Timedelta/PeriodIndex .values
#   points to a pandas ExtensionArray
# @inherit_from_values(["ndim", "shape", "size", "nbytes",
#                       "asi8", "freq", "freqstr"])
class DatetimeIndexOpsMixin(DatetimeLikeArrayMixin):
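A rough sketch of what such a decorator could look like (inherit_from_values is the hypothetical helper named above; the eventual implementation may differ):

def inherit_from_values(names):
    # Class decorator: for each listed attribute, install a property on the
    # Index mixin that simply delegates to the underlying array in self.values.
    def decorator(cls):
        for name in names:
            def _make_delegate(attr):
                return property(lambda self: getattr(self.values, attr))
            setattr(cls, name, _make_delegate(name))
        return cls
    return decorator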

Member:

Moving these prevents this kind of mixup.

As long as one of the index classes is still inheriting from the ArrayMixin, there will be wrong/strange mixups that need to be cleaned up.

Because in the next pass I'll be implementing a decorator to do something like:

But how would you do that if the underlying values don't yet have those attributes, because it is not yet our internal array class?

And why not move them when implementing such a decorator? Then you actually have overview of the full changes.

Member Author:

You have sufficiently frustrated me into reverting this so we can move this down the field.

Member Author:

@jorisvandenbossche if you're still up, can you take a look at the newest push and verify that the parts you have a problem with have been removed?

@@ -211,6 +219,10 @@ def astype(self, dtype, copy=True):
# ------------------------------------------------------------------
# Null Handling

def isna(self):
# EA Interface
return self._isnan
Member:

Is it needed to have the _isnan concept on the arrays? We use it in some internal methods on the Index class, but for Arrays it seems to me to be additional complexity compared to simply defining isna appropriately on each Array?

Member Author:

Discussed elsewhere; can we mark as resolved?

@@ -430,6 +430,7 @@ def min(self, axis=None, *args, **kwargs):
--------
numpy.ndarray.min
"""
nv.validate_minmax_axis(axis)
nv.validate_min(args, kwargs)
Member:

Is there a reason not to add the axis validation to the existing validate_min?

Contributor:

Exactly, I don't want another function; rather, you can simply check this inside the function which is already there.

Member Author:

Done. I'm not wild about the fact that the nv.validate_(min|max|argmin|argmax) functions now implicitly assume they are only being called on 1-dim objects, but at least the assumption is correct for now.
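Conceptually, the check that ends up in pandas/compat/numpy/function.py looks roughly like this (a simplified sketch, with ndim hard-coded to 1, which is exactly the one-dimensional assumption mentioned above):

def validate_minmax_axis(axis):
    # Sketch: these reductions are currently only called on 1-dimensional
    # objects, so ndim is effectively fixed at 1 here.
    ndim = 1
    if axis is None:
        return
    if axis >= ndim or (axis < 0 and ndim + axis < 0):
        raise ValueError("`axis` must be fewer than the number of "
                         "dimensions ({ndim})".format(ndim=ndim))

validate_minmax_axis(None)  # OK
validate_minmax_axis(0)     # OK for a 1-d array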

Member:

Hmm, yeah, that makes sense.
And adding them in a single validation is actually also mixing two kinds of validation: validation of arguments that are purely for numpy compat (things like out), as opposed to validation of arguments that are valid for pandas (axis in the Series and Index methods is there more for consistency with DataFrame than for compat with numpy).


@jbrockmendel (Member Author):

can you just make the validation for axis generic?

See joris's comment above.

@jbrockmendel (Member Author):

The non-controversial parts of this have been ported to separate PRs. Closing.

@jbrockmendel deleted the dlike8 branch on October 18, 2018.