-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add in extension dtype registry #21185
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21185 +/- ##
==========================================
+ Coverage 91.9% 91.91% +<.01%
==========================================
Files 154 154
Lines 49659 49673 +14
==========================================
+ Hits 45640 45657 +17
+ Misses 4019 4016 -3
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.24.0.txt
Outdated
^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()``, and attribute ``array_type`` (:issue:`21185`) | ||
- ``ExtensionDtype`` has gained the ability to instantiate from string dtypes, e.g. ``decimal`` would instaniate a registered ``DecimalDtype`` (:issue:`21185`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo instantiate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that there are changes in this PR that go beyond just adding in the extension dtype registry. Wasn't your goal to just make this PR about the registry?
split out ops to a separate PR #21191 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't look at all the tests yet.
In addition:
- We need to discuss what is the exact contract for this registry and
array_type
: where is it used, and how is it used?
I think now it is used in constructors or astype if you specify a (string) dtype? But then it calls the class constructor of the extension array? Shouldn't this rather be the_from_sequence
method ? - Can you add some explanation of the registry to the extension array documentation?
- I think this also gives an enhancement for the existing types (interval, datetimetz, period, ..) that you can now specify string dtypes? (since they are added to the registry) In that case, can you add tests for this and a whatsnew notice?
doc/source/whatsnew/v0.24.0.txt
Outdated
ExtensionType Changes | ||
^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()``, and attribute ``array_type`` (:issue:`21185`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the addition of dropna
and append
orthogonal to the registry, or are they in some way needed for that?
I think we should be hesitant in adding more and more methods to the ExtensionArray if they are not strictly needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, these are necessary for followons. The api is not fleshed out enough.
pandas/core/arrays/base.py
Outdated
"""Construct a new ExtensionArray from a sequence of scalars. | ||
|
||
Parameters | ||
---------- | ||
scalars : Sequence | ||
Each element will be an instance of the scalar type for this | ||
array, ``cls.dtype.type``. | ||
copy : boolean, default True | ||
if True, copy the underlying data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this exactly mean?
Often, the scalars objects are either not stored as is, and then copy or not copy makes no difference. Or if they are stored, they don't necessarily have a 'copy' method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same guarantee with copy we have already, if its True, then copy if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add that to the docstring?
pandas/core/arrays/base.py
Outdated
* _concat_same_type | ||
* array_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an attribute of the array class I think?
|
||
# dispatch on extension dtype if needed | ||
if is_extension_array_dtype(dtype): | ||
return dtype.array_type._from_sequence(arr, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should there be any validation about arr
? Eg that it is 1D?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this validation is a contract of the Array itself
pandas/core/dtypes/dtypes.py
Outdated
""" | ||
Parameters | ||
---------- | ||
dtype : PandasExtension Dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document constructor parameter?
and "PandasExtension Dtype" -> "ExtensionDtype"
pandas/core/dtypes/dtypes.py
Outdated
---------- | ||
dtype : PandasExtension Dtype | ||
""" | ||
if not issubclass(dtype, (PandasExtensionDtype, ExtensionDtype)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't ExtensionDtype be enough? (our internals one subclass both I think?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not yet, are extension dtypes are not all ExtensiDtypes yet (this is why we have the whole PandasExtensionDtype bizness)
pandas/core/dtypes/dtypes.py
Outdated
if string.startswith('period[') or string.startswith('Period['): | ||
# do not parse string like U as period[U] | ||
return PeriodDtype.construct_from_string(string) | ||
raise TypeError("could not construct PeriodDtype") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider rather changing construct_from_string
itself? As those don't really adhere to the ExtensionDtype requirements I think?
If we keep both, I would make this one private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/tests/extension/base/ops.py
Outdated
|
||
class BaseOpsTests(BaseExtensionTests): | ||
"""Various Series and DataFrame ops methos.""" | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a left-over from splitting it from the previous PR?
all fixed up @jorisvandenbossche (except for docs on Registry in Extension section) |
see if can make this dispatch to create_array_type as a function |
92a1322
to
10aab3c
Compare
updated docs @jorisvandenbossche @TomAugspurger if any final comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some more comments.
From my previous review:
We need to discuss what is the exact contract for this registry and array_type: where is it used, and how is it used?
Can you add some explanation about this to the docs/docstrings? (eg that it relies on _from_sequence
)
pandas/core/arrays/base.py
Outdated
"""Construct a new ExtensionArray from a sequence of scalars. | ||
|
||
Parameters | ||
---------- | ||
scalars : Sequence | ||
Each element will be an instance of the scalar type for this | ||
array, ``cls.dtype.type``. | ||
copy : boolean, default True | ||
if True, copy the underlying data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add that to the docstring?
pandas/core/dtypes/base.py
Outdated
type | ||
""" | ||
if array is None: | ||
return cls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think cls
is never the correct return? (since it is a dtype class, while this should return an array class?)
pandas/core/dtypes/dtypes.py
Outdated
""" Registry for dtype inference | ||
|
||
We can directly construct dtypes in pandas_dtypes if they are | ||
a type; the registry allows us to register an extension dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sentence "We can directly construct dtypes in pandas_dtypes if they are a type" is not really clear to me. What does "if they are a type" mean? To which context does this refer?
'dtype, expected', | ||
[('int64', None), | ||
('interval', IntervalDtype()), | ||
('interval[int64]', IntervalDtype()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you replace this (or also add) one that is not the default?
Eg ('interval[datetime64[ns]]', IntervalDtype('datetime64[ns]')),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(although that is maybe covered with the period or datetime64 one below. Is 'interval[datetime64[ns]]'
already tested for IntervalDtype?)
dtype = data.dtype | ||
|
||
expected = pd.Series(data) | ||
result = pd.Series(np.array(data), dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a restriction on what the result of __array__
should be? Maybe not necessarily the scalars?
So maybe better to do list(data)
or data.astype(object)
@pytest.mark.xfail(reason="not implemented constructor from dtype") | ||
def test_from_dtype(self, data): | ||
# construct from our dtype & string dtype | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you actually implement this? (only change needed would be to register DecimalDtype I think?)
As that way we actually have a test for external dtypes registering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do in subsequent PR's trying to keep this diff down.
# TODO(extension) | ||
@pytest.mark.xfail(reason=( | ||
"raising AssertionError as this is not implemented, " | ||
"though easy enough to do")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can change the AssertionError to a ValueError in the code, and then we can still test this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is ok for now
@pytest.mark.xfail(reason="not implemented constructor from dtype") | ||
def test_from_dtype(self, data): | ||
# construct from our dtype & string dtype | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we test here that an appropriate error is raised (instead of skipping the test) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
@@ -109,6 +109,11 @@ class ExtensionDtype(_DtypeOpsMixin): | |||
* name | |||
* construct_from_string | |||
|
|||
Optionally one can override construct_array_type for construction | |||
with the name of this dtype via the Registry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would give some additional explanation of what the Registry is, because now this is not explained here?
doc/source/whatsnew/v0.24.0.txt
Outdated
ExtensionType Changes | ||
^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()`` (:issue:`21185`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said before, I am not convinced we should necessarily add those methods (certainly append), so would personally leave this for another PR to discuss
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with not adding append.
Could go either way on dropna.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i need them for subsquent PR's I guess will move them.
doc/source/whatsnew/v0.24.0.txt
Outdated
ExtensionType Changes | ||
^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()`` (:issue:`21185`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with not adding append.
Could go either way on dropna.
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()`` (:issue:`21185`) | ||
- ``ExtensionDtype`` has gained the ability to instantiate from string dtypes, e.g. ``decimal`` would instantiate a registered ``DecimalDtype``; furthermore | ||
the dtype has gained the ``construct_array_type`` (:issue:`21185`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'the dtype' -> ExtensionDtype has gained :meth:`ExtensionDtype.construct_array_type`
doc/source/whatsnew/v0.24.0.txt
Outdated
.. _whatsnew_0240.api.other: | ||
|
||
Other API Changes | ||
^^^^^^^^^^^^^^^^^ | ||
|
||
- | ||
- Invalid consruction of ``IntervalDtype`` will now always raise a ``TypeError`` rather than a ``ValueError`` if the subdtype is invalid (:issue:`21185`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consruction -> construction.
@@ -379,6 +383,16 @@ def fillna(self, value=None, method=None, limit=None): | |||
new_values = self.copy() | |||
return new_values | |||
|
|||
def dropna(self): | |||
""" Return ExtensionArray without NA values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the leading space. Add a trailing .
pandas/core/dtypes/base.py
Outdated
|
||
Parameters | ||
---------- | ||
array : array-like, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche by "this" you mean the array
argument right? I'm also wondering that.
@@ -8,6 +8,64 @@ | |||
from .base import ExtensionDtype, _DtypeOpsMixin | |||
|
|||
|
|||
class Registry(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without looking at the uses yet, could we simplify this a by just allowing string lookup? Ideally, registry
would be a simple class holding a dict mapping dtype.name -> ExtensionDtype
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current code also supports finding the dtype for eg 'interval[int64]'
and not just interval
(so parametrized versions of the strings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and I suppose we want that to support .astype('interval[int64]')
. That's fair...
pandas/core/indexes/interval.py
Outdated
@@ -800,7 +800,7 @@ def astype(self, dtype, copy=True): | |||
@cache_readonly | |||
def dtype(self): | |||
"""Return the dtype object of the underlying data""" | |||
return IntervalDtype.construct_from_string(str(self.left.dtype)) | |||
return IntervalDtype(str(self.left.dtype)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left.dtype.name
? I'm not sure when these differ, but using .name
seems safer.
ok all fixed up. @jorisvandenbossche @TomAugspurger if you want to look. |
@jreback I'm wondering if there is a meta issue with the registry that should be thought about. Let's say that pandas registers 10 types - they are always in the registry for all pandas users. Now let's say you have 2 different people who independently extend pandas with the EA capabilities creating libraries ABC and DEF. But they happen to both call their extension array type I realize this is a highly unlikely occurrence. Put another way, shouldn't there be code for the registry that says "this type has already been registered"? Also, the implementation of the registry is using a list. Shouldn't you use a dict or set? |
@Dr-Irv you don;t need to register the EA types, this is only for the purpose of translating a string dtype name -> EA dtype (e.g. 'categorical' -> Categorical), basically for convenience. so sure you could have a collision, but then user would have to deal with that (this wouldn't show up on pandas because we won't have colliding names). |
@Dr-Irv I had this an OrderdedDict but its actually not necessary, and is actually not correct. If we have a string, we have to ask the dtypes can you construct a type from it. The issue is that multiple strings can map to a dtype when its parameterized, e.g. |
@jreback Thanks for the clarification of the need versus require of registration of EA types. On the implementation, wouldn't OrderedDict be faster once there are a lot of types? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from adding EA.append, I'm +1 on this.
Given that NumPy doesn't implement it, I don't think we should either.
I had removed append FYI (it should not longer be in this PR) |
This reverts commit 6d5c67b.
any final comments. going to rebase. |
merging. |
precursor to #21160