General repr format for our internal ExtensionArrays #22846

jorisvandenbossche · 2018-09-26T21:41:53Z

Triggered by the discussion I am having in #22511, I was also thinking we should look at the reprs of our EAs.

Currently we have for IntegerArray:

In [14]: pd.core.arrays.integer_array([1, 2, 3, None])
Out[14]: IntegerArray([1, 2, 3, nan], dtype='Int64')

which looks like code, but is actually not valid code (because the constructor needs an array, has no dtype argument, and nan is not actually defined).
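To make the problem concrete, here is a minimal sketch using a hypothetical stand-in class (ToyIntegerArray is not a pandas class, and its values/mask constructor is only an assumption for illustration). It produces a repr in the same style and shows that evaluating that repr as code fails:

```python
class ToyIntegerArray:
    """Hypothetical stand-in: the constructor takes values and a mask, no dtype."""

    def __init__(self, values, mask):
        self.values = list(values)
        self.mask = list(mask)

    def __repr__(self):
        # Masked entries are shown as a bare `nan`, as in the repr quoted above.
        shown = ["nan" if missing else str(v)
                 for v, missing in zip(self.values, self.mask)]
        return f"ToyIntegerArray([{', '.join(shown)}], dtype='Int64')"


arr = ToyIntegerArray([1, 2, 3, 0], mask=[False, False, False, True])
text = repr(arr)

try:
    # Evaluate the repr with only the class itself in scope.
    eval(text, {"ToyIntegerArray": ToyIntegerArray})
    evaluates = True
    error_name = None
except (NameError, TypeError) as exc:
    # `nan` is not a defined name; the constructor also has no `dtype` argument.
    evaluates = False
    error_name = type(exc).__name__

print(text)
print("evaluates as code:", evaluates)
```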

So here too (since we still have the freedom to choose a format without worrying about changing existing behaviour), I think we should have some discussion about what we ideally want.


jorisvandenbossche · 2018-09-26T21:44:17Z

So, a random thought similar to the one I mentioned in #22511: one possible alternative is

<pandas.IntegerArray>
[1, 2, 3, nan, ..., 1, 2, 3]
Length: 400, dtype: Int64
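As a rough sketch of how such a three-line layout could be assembled (the function name sketch_array_repr and the edge_items parameter are made up for illustration, and this is not the pandas implementation):

```python
def sketch_array_repr(class_name, values, dtype, edge_items=3):
    """Return a three-line repr: <header>, bracketed data, Length/dtype footer."""
    # Render missing values as "nan", mirroring the example in this thread.
    items = ["nan" if v is None else str(v) for v in values]
    if len(items) > 2 * edge_items:
        # Keep only the first and last few items, with "..." in between.
        items = items[:edge_items] + ["..."] + items[-edge_items:]
    data = "[" + ", ".join(items) + "]"
    return f"<{class_name}>\n{data}\nLength: {len(values)}, dtype: {dtype}"


long_repr = sketch_array_repr("pandas.IntegerArray", [1, 2, 3, None] * 100, "Int64")
short_repr = sketch_array_repr("pandas.IntegerArray", [1, None], "Int64")
print(long_repr)
print(short_repr)
```

The header in angle brackets signals that the output is not meant to be pasted back as code, while the footer carries the length and dtype that a constructor-style repr would have put in keyword arguments.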

TomAugspurger · 2018-09-27T12:41:01Z

That seems fine.

Would we update Categorical to include <pandas.Categorical>, or are we happy with its current repr?

jbrockmendel · 2018-09-27T21:42:48Z

This and some other common-but-not-necessary methods could go in a mixin in something like core.arrays.common and get mixed in to all of the pandas-internal EA subclasses.
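A minimal sketch of what such a mixin might look like, assuming the bracketed three-line format proposed earlier in the thread. All names here (ArrayReprMixin, _format_items, the Toy* classes) are hypothetical and not part of pandas:

```python
class ArrayReprMixin:
    """Hypothetical mixin: subclasses supply `dtype`, `__len__` and `_format_items`."""

    def __repr__(self):
        data = "[" + ", ".join(self._format_items()) + "]"
        return (f"<{type(self).__name__}>\n{data}\n"
                f"Length: {len(self)}, dtype: {self.dtype}")


class ToyPeriodArray(ArrayReprMixin):
    def __init__(self, labels, freq):
        self._labels = list(labels)
        self.dtype = f"period[{freq}]"

    def __len__(self):
        return len(self._labels)

    def _format_items(self):
        # Quote each label, since periods are shown as strings.
        return [repr(label) for label in self._labels]


class ToyIntegerArray(ArrayReprMixin):
    dtype = "Int64"

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def _format_items(self):
        # Missing values are rendered as a bare nan.
        return ["nan" if v is None else str(v) for v in self._values]


print(repr(ToyPeriodArray(["2000", "2001"], "A-DEC")))
print(repr(ToyIntegerArray([1, 2, None])))
```

Each array class only decides how its items are formatted; the overall layout lives in one place, so the internal arrays stay visually consistent.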

TomAugspurger · 2018-10-26T14:01:11Z

PeriodArray is using a repr like #22846 (comment).

In [19]: arr = pd.core.arrays.period_array(['2000', '2001'], freq='A')

In [20]: arr
Out[20]:
<PeriodArray>
['2000', '2001']
Length: 2, dtype: period[A-DEC]

I'd be happy with something similar for IntegerArray.

SparseArray can't share it, because it needs to indicate the sparse points. But I think it can be a bit more uniform, something like

>>> pd.SparseArray([1, 0, 0, 4])
<pandas.SparseArray>
[1, 0, 0, 4]
Indices: array([0, 3]), dtype: Sparse[int64, 0]

Categorical could maybe be updated, but not a big deal.

jorisvandenbossche · 2018-10-26T14:05:53Z

I personally like the fact that we have 'pandas' in the repr, like I did in the IntegerArray example above, or how you did the SparseArray (which PeriodArray doesn't have), to make it clear from its repr that it is a pandas array, without needing to check type(..).
But of course, if we don't add them to the top level API (which still needs to be discussed), then <pandas.IntegerArray> is a bit confusing.

TomAugspurger · 2018-10-26T14:27:06Z

But of course, if we don't add them to the top level API (which still needs to be discussed), then <pandas.IntegerArray> is a bit confusing.

That's why I left it out of PeriodArray. I don't have a strong opinion there though. It's clearly not code since it's in brackets. We could even have <pandas PeriodArray> maybe.

jreback · 2018-11-12T14:25:26Z

I am puzzled why you are arbitrarily changing formats here from what we already have longstanding for all Index types. The repr for Index looks much nicer, doesn't have angle brackets, fits on a single line if it's short, and properly separates commas.

Creating a new format is non-trivial, which is ok to do generally, but only if we change everything. This is a huge, undesired breaking change. So big -1 on doing this. This impacts #23601 and is blocking for me.

TomAugspurger · 2018-11-12T15:03:41Z

already have longstanding for all Index types

This is for arrays, not index. It's similar to Categorical.

doesn't have angle brackets

We don't know if the repr is code, so angle brackets are appropriate.

properly separates commas

Do you have an example of #23617 where commas aren't handled correctly?

This is a huge, undesired breaking change

#23617 is backwards compatible.


jorisvandenbossche · 2018-11-12T15:05:27Z

This is a huge, undesired breaking change.

You may not like the proposed format, but it is not "breaking". Nothing will be broken as this only affects new API (the internal EAs), not the reprs of Index or Series.

I am puzzled why you are arbitrarily changing formats here from what we already have longstanding for all Index types.

It is not arbitrary. There are reasons to do this, mentioned somewhat in the original issue.
You are correct this is deviating from the typical format we have been using for the Index reprs. And therefore, we indeed need to have good reasons to do things differently.

Trying to put the original reasoning a bit more clearly:

- Index and Array are different objects. I don't think it is a problem that they have a visually distinct repr.
- The current Index repr also has some disadvantages (which we don't necessarily have to carry over to the Array repr):
  - It tries to "look" like actual code, and it is in many cases, but not in all (e.g. when there are missing values, when it is truncated, ...). Specifically for EAs, those cases are much more common (see the points below).
  - A minor other point: when truncated, it has a length=... inside the repr, which I find odd-looking in a repr that mimics a constructor.
- Specifically for EAs, we don't recommend using the actual class constructors (unlike for the different Index subclasses). I therefore think that having a repr that does not look like a class constructor is more important for Array than for Index.
- Specifically for EAs, we decided to keep the class constructor simple. This means that, e.g., a repr of a DatetimeIndex like DatetimeIndex(['2012-01-01'], dtype='datetime64[ns]', freq=None) is actually a valid constructor call for DatetimeIndex, but a similar repr for DatetimeArray would not be a valid constructor call for DatetimeArray. So IMO a repr that mimics Python code, as we do for Index, is much less fitting for Array.

Since the Array class is a new API, we have the opportunity to discuss more freely what repr we would like to have. So let's discuss that (that's the reason I originally opened this issue).

jreback · 2018-11-15T13:14:08Z

Index and Array are different objects. I don't think it is a problem that they have a visually distinct repr.

The more things are different, the more inconsistencies there are across the code base in code, testing, and user experience. This is such a jarring, huge change that you now don't know what the relationship between Index and Array is. This is a bad thing.

The current Index repr also has some disadvantages (which we don't necessarily have to carry over to the Array repr):
It tries to "look" like actual code, and it is in many cases, but not in all (e.g. when there are missing values, when it is truncated, ...). Specifically for EAs, those cases are much more common (see the points below).

Sure, this is the case, but it is not a reason to deviate.

A minor other point: when truncated, it has a length=... inside the repr, which I find odd-looking in a repr that mimics a constructor.
Specifically for EAs, we don't recommend using the actual class constructors (unlike for the different Index subclasses). I therefore think that having a repr that does not look like a class constructor is more important for Array than for Index.

This is a problem for MultiIndex as well. Again I am all for changing everything, just not a part.

Specifically for EAs, we decided to keep the class constructor simple. This means that, e.g., a repr of a DatetimeIndex like DatetimeIndex(['2012-01-01'], dtype='datetime64[ns]', freq=None) is actually a valid constructor call for DatetimeIndex, but a similar repr for DatetimeArray would not be a valid constructor call for DatetimeArray. So IMO a repr that mimics Python code, as we do for Index, is much less fitting for Array.

This is also not true for Index generally, e.g. Period, not sure why this is an issue here.

I think this breaking change (and adding associated hacky things like removing commas) is just a bad code smell. I would be OK with adding a repr that is the same as Index. Then if you want to change the repr generally, that is OK, but half and half does not fly.

jreback · 2018-11-15T13:18:41Z

I might be more accepting if the repr is not multi-lined by default, I think this is also the cause of the 'need to remove the commas'.

jorisvandenbossche · 2018-11-15T13:23:52Z

@jreback can you try to explain why you don't like the fact it is multi-line? Once you have a bit longer index, our current index repr is also multi-line (IntervalIndex is even multi-line by default, but that one is a bit inconsistent with the others)

jreback · 2018-11-15T13:46:45Z

@jreback can you try to explain why you don't like the fact it is multi-line? Once you have a bit longer index, our current index repr is also multi-line (IntervalIndex is even multi-line by default, but that one is a bit inconsistent with the others)

For a short repr this is very awkward. Sure, when it gets longer it is multi-line. The point is the separation of the class from the data. I personally think the Index repr is just fine.

You need a much stronger argument to arbitrarily change it, and that is what you are proposing. Consistency is king here. You are simply breaking this.

TomAugspurger · 2018-11-15T14:01:57Z

So

In [2]: pd.core.arrays.period_array(['2000', '2001'], 'D')
Out[2]: <PeriodArray>['2000-01-01', '2001-01-01'], Length: 2, dtype: period[D]

rather than

In [2]: pd.core.arrays.period_array(['2000', '2001'], 'D')
Out[2]:
<PeriodArray>
['2000-01-01', '2001-01-01']
Length: 2, dtype: period[D]

would move you to a +1 on #23601?

jreback · 2018-11-15T14:11:15Z

Currently

In [1]: pd.core.arrays.period_array(['2000', '2001'], 'D')
Out[1]: 
<PeriodArray>
['2000-01-01', '2001-01-01']
Length: 2, dtype: period[D]

In [2]: pd.Index(pd.core.arrays.period_array(['2000', '2001'], 'D'))
Out[2]: PeriodIndex(['2000-01-01', '2001-01-01'], dtype='period[D]', freq='D')

So I see some departures here which just don't make sense: the quoting on the dtype, and the length (in the current EA repr), which normally only shows up if the data is too long for the display. The Index repr is one line, and I just don't like the angle brackets.

TomAugspurger · 2018-11-15T14:20:19Z

and I just don't like the angle brackets.

One thing at a time. Do we agree that angle brackets are appropriate for the base repr, since we don't know if the repr is code?

jreback · 2018-11-15T15:54:28Z

One thing at a time. Do we agree that angle brackets are appropriate for the base repr, since we don't know if the repr is code?

Again, no, I don't agree. We have not done that elsewhere, and therefore this is a completely breaking change from a user perspective.

TomAugspurger · 2018-11-15T15:55:28Z

We haven't defined a repr for objects with arbitrary constructors though. This is different.

jreback · 2018-11-15T16:01:50Z

Sure, but we already have this repr, for example, so I would assert this is NOT different.

In [1]: pd.Categorical(list('abc'))
Out[1]: 
[a, b, c]
Categories (3, object): [a, b, c]

TomAugspurger · 2018-11-15T16:03:32Z

We defined Categorical.__init__ though. We don't define ExtensionArray.__init__, so we can't know whether the repr is valid code. Remember, I'm talking about the base repr here.

jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Sep 26, 2018

TomAugspurger added the Output-Formatting __repr__ of pandas objects, to_string label Sep 27, 2018

jorisvandenbossche mentioned this issue Oct 19, 2018

REF: Make PeriodArray an ExtensionArray #22862

Merged

TomAugspurger mentioned this issue Nov 9, 2018

Add default repr for EAs #23601

Merged

jreback closed this as completed in #23601 Dec 4, 2018

jreback added this to the 0.24.0 milestone Dec 4, 2018
