Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add in extension dtype registry #21185

Merged
merged 22 commits into from
Jul 3, 2018
Merged

ENH: add in extension dtype registry #21185

merged 22 commits into from
Jul 3, 2018

Conversation

jreback
Copy link
Contributor

@jreback jreback commented May 24, 2018

precursor to #21160

@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label May 24, 2018
@jreback jreback added this to the 0.24.0 milestone May 24, 2018
@codecov
Copy link

codecov bot commented May 24, 2018

Codecov Report

Merging #21185 into master will increase coverage by <.01%.
The diff coverage is 92.98%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21185      +/-   ##
==========================================
+ Coverage    91.9%   91.91%   +<.01%     
==========================================
  Files         154      154              
  Lines       49659    49673      +14     
==========================================
+ Hits        45640    45657      +17     
+ Misses       4019     4016       -3
Flag Coverage Δ
#multiple 90.29% <92.98%> (ø) ⬆️
#single 41.95% <75.43%> (+0.05%) ⬆️
Impacted Files Coverage Δ
pandas/io/formats/format.py 98.25% <ø> (ø) ⬆️
pandas/core/algorithms.py 94.85% <100%> (ø) ⬆️
pandas/core/dtypes/base.py 92.3% <100%> (+0.2%) ⬆️
pandas/core/series.py 94.19% <100%> (ø) ⬆️
pandas/core/indexes/interval.py 93.13% <100%> (ø) ⬆️
pandas/core/dtypes/common.py 95.07% <100%> (+0.44%) ⬆️
pandas/core/arrays/base.py 87.85% <100%> (+0.26%) ⬆️
pandas/core/dtypes/cast.py 88.36% <50%> (-0.13%) ⬇️
pandas/core/internals.py 95.52% <81.81%> (-0.07%) ⬇️
pandas/core/dtypes/dtypes.py 95.98% <96.77%> (+0.03%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad76ffc...d2c91d7. Read the comment docs.

^^^^^^^^^^^^^^^^^^^^^

- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()``, and attribute ``array_type`` (:issue:`21185`)
- ``ExtensionDtype`` has gained the ability to instantiate from string dtypes, e.g. ``decimal`` would instaniate a registered ``DecimalDtype`` (:issue:`21185`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo instantiate

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that there are changes in this PR that go beyond just adding in the extension dtype registry. Wasn't your goal to just make this PR about the registry?

@jreback
Copy link
Contributor Author

jreback commented May 24, 2018

split out ops to a separate PR #21191

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't look at all the tests yet.

In addition:

  • We need to discuss what is the exact contract for this registry and array_type: where is it used, and how is it used?
    I think now it is used in constructors or astype if you specify a (string) dtype? But then it calls the class constructor of the extension array? Shouldn't this rather be the _from_sequence method ?
  • Can you add some explanation of the registry to the extension array documentation?
  • I think this also gives an enhancement for the existing types (interval, datetimetz, period, ..) that you can now specify string dtypes? (since they are added to the registry) In that case, can you add tests for this and a whatsnew notice?

ExtensionType Changes
^^^^^^^^^^^^^^^^^^^^^

- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()``, and attribute ``array_type`` (:issue:`21185`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the addition of dropna and append orthogonal to the registry, or are they in some way needed for that?

I think we should be hesitant in adding more and more methods to the ExtensionArray if they are not strictly needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, these are necessary for followons. The api is not fleshed out enough.

"""Construct a new ExtensionArray from a sequence of scalars.

Parameters
----------
scalars : Sequence
Each element will be an instance of the scalar type for this
array, ``cls.dtype.type``.
copy : boolean, default True
if True, copy the underlying data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this exactly mean?
Often, the scalars objects are either not stored as is, and then copy or not copy makes no difference. Or if they are stored, they don't necessarily have a 'copy' method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same guarantee with copy we have already, if its True, then copy if possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add that to the docstring?

* _concat_same_type
* array_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an attribute of the array class I think?


# dispatch on extension dtype if needed
if is_extension_array_dtype(dtype):
return dtype.array_type._from_sequence(arr, copy=copy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be any validation about arr ? Eg that it is 1D?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this validation is a contract of the Array itself

"""
Parameters
----------
dtype : PandasExtension Dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document constructor parameter?
and "PandasExtension Dtype" -> "ExtensionDtype"

----------
dtype : PandasExtension Dtype
"""
if not issubclass(dtype, (PandasExtensionDtype, ExtensionDtype)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't ExtensionDtype be enough? (our internals one subclass both I think?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet, are extension dtypes are not all ExtensiDtypes yet (this is why we have the whole PandasExtensionDtype bizness)

if string.startswith('period[') or string.startswith('Period['):
# do not parse string like U as period[U]
return PeriodDtype.construct_from_string(string)
raise TypeError("could not construct PeriodDtype")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider rather changing construct_from_string itself? As those don't really adhere to the ExtensionDtype requirements I think?

If we keep both, I would make this one private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


class BaseOpsTests(BaseExtensionTests):
"""Various Series and DataFrame ops methos."""
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a left-over from splitting it from the previous PR?

@jreback
Copy link
Contributor Author

jreback commented May 24, 2018

all fixed up @jorisvandenbossche (except for docs on Registry in Extension section)

@jreback
Copy link
Contributor Author

jreback commented May 25, 2018

see if can make this dispatch to create_array_type as a function

@jreback
Copy link
Contributor Author

jreback commented Jun 3, 2018

updated docs

@jorisvandenbossche @TomAugspurger if any final comments.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more comments.

From my previous review:

We need to discuss what is the exact contract for this registry and array_type: where is it used, and how is it used?

Can you add some explanation about this to the docs/docstrings? (eg that it relies on _from_sequence)

"""Construct a new ExtensionArray from a sequence of scalars.

Parameters
----------
scalars : Sequence
Each element will be an instance of the scalar type for this
array, ``cls.dtype.type``.
copy : boolean, default True
if True, copy the underlying data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add that to the docstring?

type
"""
if array is None:
return cls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think cls is never the correct return? (since it is a dtype class, while this should return an array class?)

""" Registry for dtype inference

We can directly construct dtypes in pandas_dtypes if they are
a type; the registry allows us to register an extension dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentence "We can directly construct dtypes in pandas_dtypes if they are a type" is not really clear to me. What does "if they are a type" mean? To which context does this refer?

'dtype, expected',
[('int64', None),
('interval', IntervalDtype()),
('interval[int64]', IntervalDtype()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you replace this (or also add) one that is not the default?
Eg ('interval[datetime64[ns]]', IntervalDtype('datetime64[ns]')),

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(although that is maybe covered with the period or datetime64 one below. Is 'interval[datetime64[ns]]' already tested for IntervalDtype?)

dtype = data.dtype

expected = pd.Series(data)
result = pd.Series(np.array(data), dtype=dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a restriction on what the result of __array__ should be? Maybe not necessarily the scalars?

So maybe better to do list(data) or data.astype(object)

@pytest.mark.xfail(reason="not implemented constructor from dtype")
def test_from_dtype(self, data):
# construct from our dtype & string dtype
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you actually implement this? (only change needed would be to register DecimalDtype I think?)
As that way we actually have a test for external dtypes registering?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do in subsequent PR's trying to keep this diff down.

# TODO(extension)
@pytest.mark.xfail(reason=(
"raising AssertionError as this is not implemented, "
"though easy enough to do"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can change the AssertionError to a ValueError in the code, and then we can still test this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is ok for now

@pytest.mark.xfail(reason="not implemented constructor from dtype")
def test_from_dtype(self, data):
# construct from our dtype & string dtype
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test here that an appropriate error is raised (instead of skipping the test) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

@@ -109,6 +109,11 @@ class ExtensionDtype(_DtypeOpsMixin):
* name
* construct_from_string

Optionally one can override construct_array_type for construction
with the name of this dtype via the Registry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would give some additional explanation of what the Registry is, because now this is not explained here?

ExtensionType Changes
^^^^^^^^^^^^^^^^^^^^^

- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()`` (:issue:`21185`)
Copy link
Member

@jorisvandenbossche jorisvandenbossche Jun 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said before, I am not convinced we should necessarily add those methods (certainly append), so would personally leave this for another PR to discuss

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with not adding append.

Could go either way on dropna.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i need them for subsquent PR's I guess will move them.

ExtensionType Changes
^^^^^^^^^^^^^^^^^^^^^

- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()`` (:issue:`21185`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with not adding append.

Could go either way on dropna.


- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()`` (:issue:`21185`)
- ``ExtensionDtype`` has gained the ability to instantiate from string dtypes, e.g. ``decimal`` would instantiate a registered ``DecimalDtype``; furthermore
the dtype has gained the ``construct_array_type`` (:issue:`21185`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'the dtype' -> ExtensionDtype has gained :meth:`ExtensionDtype.construct_array_type`

.. _whatsnew_0240.api.other:

Other API Changes
^^^^^^^^^^^^^^^^^

-
- Invalid consruction of ``IntervalDtype`` will now always raise a ``TypeError`` rather than a ``ValueError`` if the subdtype is invalid (:issue:`21185`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consruction -> construction.

@@ -379,6 +383,16 @@ def fillna(self, value=None, method=None, limit=None):
new_values = self.copy()
return new_values

def dropna(self):
""" Return ExtensionArray without NA values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the leading space. Add a trailing .


Parameters
----------
array : array-like, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche by "this" you mean the array argument right? I'm also wondering that.

@@ -8,6 +8,64 @@
from .base import ExtensionDtype, _DtypeOpsMixin


class Registry(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without looking at the uses yet, could we simplify this a by just allowing string lookup? Ideally, registry would be a simple class holding a dict mapping dtype.name -> ExtensionDtype.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current code also supports finding the dtype for eg 'interval[int64]' and not just interval (so parametrized versions of the strings)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and I suppose we want that to support .astype('interval[int64]'). That's fair...

@@ -800,7 +800,7 @@ def astype(self, dtype, copy=True):
@cache_readonly
def dtype(self):
"""Return the dtype object of the underlying data"""
return IntervalDtype.construct_from_string(str(self.left.dtype))
return IntervalDtype(str(self.left.dtype))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left.dtype.name? I'm not sure when these differ, but using .name seems safer.

@jreback
Copy link
Contributor Author

jreback commented Jun 19, 2018

ok all fixed up. @jorisvandenbossche @TomAugspurger if you want to look.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jun 19, 2018

@jreback I'm wondering if there is a meta issue with the registry that should be thought about. Let's say that pandas registers 10 types - they are always in the registry for all pandas users. Now let's say you have 2 different people who independently extend pandas with the EA capabilities creating libraries ABC and DEF. But they happen to both call their extension array type MyEAType . Now if someone wants to use both libraries ABC and DEF, there will be a collision in the registry of MyEAType. So does there need to be some "master" registry in the documentation containing the names of all EA types that people have registered?

I realize this is a highly unlikely occurrence.

Put another way, shouldn't there be code for the registry that says "this type has already been registered"?

Also, the implementation of the registry is using a list. Shouldn't you use a dict or set?

@jreback
Copy link
Contributor Author

jreback commented Jun 19, 2018

@Dr-Irv you don;t need to register the EA types, this is only for the purpose of translating a string dtype name -> EA dtype (e.g. 'categorical' -> Categorical), basically for convenience.

so sure you could have a collision, but then user would have to deal with that (this wouldn't show up on pandas because we won't have colliding names).

@jreback
Copy link
Contributor Author

jreback commented Jun 19, 2018

@Dr-Irv I had this an OrderdedDict but its actually not necessary, and is actually not correct. If we have a string, we have to ask the dtypes can you construct a type from it. The issue is that multiple strings can map to a dtype when its parameterized, e.g. datetime64[ns, US/Eastern] is a valid string dtype as is datetime64[ns, Asia/Tokyo] and w/o imbuing the knowledge to say its 'good' inside the Register (bad idea), we can't know something works unless we duck type it.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jun 19, 2018

@jreback Thanks for the clarification of the need versus require of registration of EA types.

On the implementation, wouldn't OrderedDict be faster once there are a lot of types?

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from adding EA.append, I'm +1 on this.

Given that NumPy doesn't implement it, I don't think we should either.

@jreback
Copy link
Contributor Author

jreback commented Jun 20, 2018

I had removed append FYI (it should not longer be in this PR)

@jreback
Copy link
Contributor Author

jreback commented Jul 2, 2018

any final comments. going to rebase.

@jreback
Copy link
Contributor Author

jreback commented Jul 2, 2018

@jreback
Copy link
Contributor Author

jreback commented Jul 3, 2018

merging.

@jreback jreback merged commit 2f14faf into pandas-dev:master Jul 3, 2018
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Aug 14, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants