
ENH: general concat with ExtensionArrays through find_common_type #33607

Merged
merged 14 commits into pandas-dev:master from concat-protocol
May 2, 2020

Conversation

jorisvandenbossche
Member

Exploring what is being discussed in #22994. @TomAugspurger your idea seems to be working nicely! (it almost removes as much code as it adds (apart from tests/docs), and fixes the EA concat bugs ;))

Few notes compared to proposal in #22994:

  • Since we already have a find_common_type function, I decided to use this as the "get_concat_dtype", since it seems to do what is needed for concat
  • Extending the EA interface with an ExtensionDtype._get_common_type method that is used in pandas' find_common_type function to dispatch the logic to the extension type

What I already handled:

  • general protocol, using this for concat with axis=0 (e.g. concatenating Series) and using it as an example for IntegerDtype
  • handling categoricals this way (which removes the concat_categorical helper). This turned up a few cases where we currently have value-dependent behaviour, which we can't easily preserve (mainly regarding NaNs and int dtype, see below)

Still need to handle sparse (those have failing tests now) and maybe datetime, and check the other failures.
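The dispatch idea described above can be sketched roughly as follows. These are toy classes modeled on the PR's description; ExtensionDtype, IntegerDtype, and find_common_type here are illustrative stand-ins, not pandas' actual implementations:

```python
from typing import List, Optional


class ExtensionDtype:
    """Toy stand-in for pandas' ExtensionDtype base class."""

    def _get_common_type(self, dtypes: List[object]) -> Optional[object]:
        # Default: this extension type does not know a common dtype;
        # the caller then falls back to object.
        return None


class IntegerDtype(ExtensionDtype):
    def _get_common_type(self, dtypes):
        # If everything in the list is a nullable-integer dtype,
        # the result can stay a nullable-integer dtype.
        if all(isinstance(d, IntegerDtype) for d in dtypes):
            return IntegerDtype()
        return None


def find_common_type(dtypes):
    # Dispatch to the first extension dtype in the list, mirroring the
    # "dispatch the logic to the extension type" idea from the PR text.
    for d in dtypes:
        if isinstance(d, ExtensionDtype):
            common = d._get_common_type(dtypes)
            return common if common is not None else object
    return object  # simplified fallback for this sketch
```

In this sketch, returning None signals "no common dtype known" and the caller falls back to object; the real pandas logic handles numpy dtypes and more fallbacks than shown here.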

Comment on lines 111 to 121
if (
    is_categorical_dtype(arr.dtype)
    and isinstance(dtype, np.dtype)
    and np.issubdtype(dtype, np.integer)
):
    # problem case: categorical of int -> gives int as result dtype,
    # but categorical can contain NAs -> fall back to object dtype
    try:
        return arr.astype(dtype, copy=False)
    except ValueError:
        return arr.astype(object, copy=False)
Member Author

This complication is to try to preserve some of the value-dependent behaviour of Categorical (in the case of integer categories: are missing values present or not?)

E.g. when concatenating an integer categorical with an integer series:

pd.concat([pd.Series([1, 2], dtype="category"), pd.Series([3, 4])]) 
-> results in int dtype
pd.concat([pd.Series([1, None], dtype="category"), pd.Series([3, 4])]) 
-> results in object dtype

Currently, when concatenating, a Categorical with integer categories gets converted to an int numpy array if no missing values are present, but to an object numpy array if missing values are present (to preserve the integers).

@jorisvandenbossche
Member Author

jorisvandenbossche commented Apr 17, 2020

I needed to make one change to the tests related to categorical (another value-dependent behaviour). Consider these examples involving an integer categorical and float arrays:

In [13]: pd.concat([pd.Series([1, None], dtype="category"), pd.Series([3.5, 4.5])])    
Out[13]: 
0      1
1    NaN
0    3.5
1    4.5
dtype: object

In [14]: pd.concat([pd.Series([1, 2], dtype="category"), pd.Series([3.5, 4.5])])    
Out[14]: 
0    1.0
1    2.0
0    3.5
1    4.5
dtype: float64

With a find_common_type, we will always find float. But the old behaviour was that the integer categorical becomes object if there are missing values present.
However, I would say: if you are concatenating with a float anyway, it doesn't make sense to try to preserve the "integer categories" by keeping the result object instead of casting to float.

So I would say that the new behaviour (always returning float in the two examples above) is better.

pandas/core/arrays/integer.py
@@ -322,3 +323,33 @@ def _is_boolean(self) -> bool:
bool
"""
return False

def _get_common_type(self, dtypes: List[DtypeObj]) -> Optional[DtypeObj]:
Contributor

Mmm can we keep the return type as ExtensionDtype? Do you envision cases where we'd like to return a plain NumPy dtype?

Oh... I suppose tz-naive DatetimeArray might break this, since it wants to return a NumPy dtype...

Member Author

Yes, that was my first thought as well. But right now, e.g. Categorical can end up with any kind of numpy dtype (depending on the dtype of its categories).

As long as not all dtypes have an EA version yet, I don't think it is feasible to require ExtensionDtype here.

@jorisvandenbossche
Member Author

I ran into one other hairy question:

  • Should _get_common_type be able to handle "not fully initialized" dtype objects, like a CategoricalDtype without categories? (Which we need to handle the "category" string.)
    And if so, what is the expected behaviour?

Right now I added some special cases in Categorical._get_common_type to get the find_common_type tests passing, but I am not sure it is doing the correct thing in a general way.
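One possible way to special-case such a "not fully initialized" dtype can be sketched with a toy class. CategoricalLike and its wildcard rule are hypothetical illustrations of the question raised above, not what Categorical._get_common_type actually does:

```python
class CategoricalLike:
    """Toy categorical dtype: categories=None means 'not fully initialized',
    as for the bare "category" string."""

    def __init__(self, categories=None):
        self.categories = categories

    def _get_common_dtype(self, dtypes):
        # Collect the category sets of all categorical dtypes in the list.
        cats = [d.categories for d in dtypes if isinstance(d, CategoricalLike)]
        known = [c for c in cats if c is not None]
        if not known:
            # Only uninitialized dtypes: nothing to base a result on.
            return None
        # One option: treat an uninitialized dtype (categories=None) as a
        # wildcard that matches any concrete categorical.
        if all(c == known[0] for c in known):
            return CategoricalLike(known[0])
        # Differing categories: signal "no common dtype" to the caller.
        return None
```

Whether the wildcard interpretation is the right general behaviour is exactly the open question in the comment above.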

@jorisvandenbossche jorisvandenbossche added API Design Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. labels Apr 17, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Apr 17, 2020
@@ -95,6 +95,15 @@ def construct_array_type(cls) -> Type["IntegerArray"]:
"""
return IntegerArray

def _get_common_type(self, dtypes: List[DtypeObj]) -> Optional[DtypeObj]:
Member

should this be common_type or common_dtype? We've been loose about this distinction so far and I think it has caused ambiguity.

Member Author

I don't care that much. I mainly used "type" because it is meant to be used in find_common_type.

(That find_common_type name is inspired by the numpy function, and that one actually handles both dtypes and scalar types, which I assume is the reason for the name. The pandas version, though, doesn't really make the distinction, so it could have been named "find_common_dtype".)

Member Author

Renamed to "common_dtype" instead of "common_type". The internal function that uses this is still find_common_type, but that name, taken from numpy, is actually a misnomer here, since we are only dealing with dtypes, not scalar types.

Member

Thanks for indulging me on this nitpick

Contributor

@jreback jreback left a comment

Generally a good concept; will look in more detail later.

pandas/core/arrays/categorical.py
pandas/core/arrays/integer.py
and arr.dtype.kind in ["m", "M"]
and dtype is np.dtype("object")
):
# wrap datetime-likes in EA to ensure astype(object) gives Timestamp/Timedelta
Member

+1 quality comment


elif _contains_datetime or "timedelta" in typs or _contains_period:
elif _contains_datetime or "timedelta" in typs:
return concat_datetime(to_concat, axis=axis, typs=typs)
Member

if we do the DTA/TDA casting above, and do isinstance(obj, ExtensionArray) checks, can all of the dt64/td64 cases be handled by the EA code above?

Member Author

I don't think so, because they are not using ExtensionDtype.

@jbrockmendel
Member

I like this idea, will want to give it another look after more caffeine. Haven't looked at the tests or Categorical-specific nuances yet.


return concat_categorical(to_concat)
return union_categoricals(to_concat)
Member

IIRC only a relatively small part of the logic of concat_categorical/union_categoricals is actually needed here. I'd prefer for that to live here and for union_categoricals to call it, rather than the other way around (since union_categoricals handles a lot of cases). Could be considered orthogonally to this PR.

Member Author

I'd prefer for that to live here and for union_categoricals to call it, rather than the other way around

Yes, that's indeed a good idea (union_categoricals does way more than is needed here).

Member

are you planning to update this, or is that a topic for a separate PR?

@jorisvandenbossche
Member Author

Are we implicitly assuming that self.dtype is a member of dtypes?

Right now, this is the case, yes. What do you mean by making that explicit? (Just better documenting it?)

Should _get_common_type be a classmethod?

It could be, but personally I don't see a reason not to make it an instance method. Yes, self will always be in the list of dtypes (see the question above), but I can imagine that for the implementation it can be useful to know self (instead of needing to infer it from the list of dtypes). Not that this should impact the result; I only think it can be handy in the implementation (but this is a rather hypothetical feeling at the moment).
The main thing is: we always call this from the instance, so I don't see much value in making it a class method instead of an instance method.

In the case (mentioned in a comment on the base class) where dtypes is a singleton, can we require that the return value be that dtype?

This is the only sensible return value in such a case (an EA doing anything different would behave very strangely). But what do you mean by "require"? Better document it? Or could we have a base extension test that asserts it?

But see also my question above (#33607 (comment)): in principle we don't even need to call this _get_common_type method if there is only one dtype in the list. In the concat / find_common_type code, we can simply use that dtype and shortcut calling _get_common_type.
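The singleton shortcut mentioned here could look roughly like this. A toy sketch, not pandas' actual find_common_type; the object fallback stands in for the full dispatch logic:

```python
def find_common_type(dtypes):
    # Deduplicate while preserving order (dtypes may be unhashable,
    # so avoid set()).
    unique = []
    for d in dtypes:
        if d not in unique:
            unique.append(d)
    if len(unique) == 1:
        # Shortcut: a single distinct dtype is trivially its own common
        # dtype, so the per-dtype _get_common_type hook never needs to
        # handle the singleton case.
        return unique[0]
    # Otherwise dispatch to the per-dtype hook / numpy promotion
    # (elided in this sketch; fall back to object).
    return object
```

With this shortcut in the caller, implementations of the hook only ever see lists with at least two distinct dtypes.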

Can we nail down the common_type vs common_dtype semantics before pushing this through? I feel like that has been a major cause of trouble with the existing concat code.

Can you explain this a bit more? I am not aware of any troubles regarding that.

@jorisvandenbossche
Member Author

I have also been reading up on numpy dtype proposals regarding this. Specifically this part of the draft NEP relates to "common dtypes": https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst#common-dtype-operations

It's actually quite similar in idea, but there are some notable differences:

  • The dtype special method is limited to a single other dtype (they have Dtype.__common_dtype__(self, other) where other is a single dtype), instead of receiving the list of all dtypes
  • They split the "common dtype" special method in two: determine the common class (from two different dtype classes), and determine the common instance (from two instances of the same class). It is actually split in three, because those two steps are then combined with a cast lookup.

The first is an interesting idea. The second might only complicate things for current pandas (although I understand the desire to split it into steps to avoid duplicating logic in different parts).
And I also don't know whether first doing it at the class level will work for pandas' current promotion logic. E.g. the common dtype of Categorical + int depends on the dtype of the categories of the Categorical.
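The relation between the NEP's pairwise hook and this PR's list-based hook can be illustrated with a toy promotion rule: a pairwise rule can be folded over the whole list with reduce. The promotion table below is invented purely for illustration and is not numpy's or pandas' actual rule:

```python
from functools import reduce

# Toy widening order: later entries are "wider".
_ORDER = ["int64", "float64", "object"]


def common_dtype_pairwise(a, b):
    # NEP-style hook: takes exactly two dtypes, like
    # Dtype.__common_dtype__(self, other).
    return max(a, b, key=_ORDER.index)


def find_common_type(dtypes):
    # The list-based result falls out of folding the pairwise rule,
    # provided the rule is associative.
    return reduce(common_dtype_pairwise, dtypes)
```

As noted above, this reduction only reproduces the list-based behaviour when the pairwise rule is associative; value- or instance-dependent cases like Categorical with integer categories are exactly where the two designs may diverge.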

@jbrockmendel
Member

Can we nail down the common_type vs common_dtype semantics before pushing this through? I feel like that has been a major cause of trouble with the existing concat code.

Can you explain this a bit more? I am not aware of any troubles regarding that.

"trouble" in that we haven't been consistent about where some logic belongs. e.g. DTA._concat_same_type requires a single dtype, not just same-type.

@jorisvandenbossche
Member Author

With this PR, EA._concat_same_type will only be called with arrays that all have the same dtype.
(This also needs to be better documented now, as it was not clear from the docstring what the expectations were before.)

Does that answer your concern? Otherwise you will need to clarify further what you mean.

@jbrockmendel
Member

Does that answer your concern? Otherwise you will still need to clarify further what you mean.

Close enough.

@jorisvandenbossche
Member Author

I made a bunch of updates (renamed to use "dtype" instead of "type", documented the public API change, updated the EA interface docs, and added a base extension test).
Ready for another round of review (or approval ;))

@jreback jreback merged commit fd3e5a1 into pandas-dev:master May 2, 2020
@jreback
Contributor

jreback commented May 2, 2020

thanks @jorisvandenbossche very nice

@jorisvandenbossche jorisvandenbossche deleted the concat-protocol branch May 2, 2020 17:26
@jorisvandenbossche
Member Author

And thanks all for the review!
