General repr format for our internal ExtensionArrays #22846

Closed
jorisvandenbossche opened this issue Sep 26, 2018 · 20 comments
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Output-Formatting __repr__ of pandas objects, to_string

@jorisvandenbossche
Member

Triggered by the discussion I am having in #22511, I was also thinking we should look at our EAs reprs.

Currently we have for IntegerArray:

In [14]: pd.core.arrays.integer_array([1, 2, 3, None])
Out[14]: IntegerArray([1, 2, 3, nan], dtype='Int64')

which looks like code, but actually is not valid code (because the constructor needs an array, has no dtype argument, and nan is not actually defined).
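To make the point concrete, here is a minimal sketch of what happens when such a repr is evaluated as code. `ToyIntegerArray` is a hypothetical stand-in with an assumed `(values, mask)` signature, not the real pandas class, and the `try_eval` helper is likewise invented for illustration.

```python
class ToyIntegerArray:
    """Hypothetical stand-in for illustration; NOT the pandas class."""

    def __init__(self, values, mask):
        # Assumed signature: no ``dtype`` keyword is accepted.
        self.values = values
        self.mask = mask


def try_eval(text, namespace):
    """Return the name of the exception raised by eval(), or None on success."""
    try:
        eval(text, dict(namespace))
    except Exception as exc:
        return type(exc).__name__
    return None


repr_text = "IntegerArray([1, 2, 3, nan], dtype='Int64')"
namespace = {"IntegerArray": ToyIntegerArray}

# ``nan`` is not a defined name, so evaluation fails before the call happens.
first_error = try_eval(repr_text, namespace)

# Even with ``nan`` defined, the constructor rejects the arguments.
second_error = try_eval(repr_text, {**namespace, "nan": float("nan")})

print(first_error, second_error)
```

The repr reads like a constructor call, but under these assumptions it fails twice over when treated as one.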

So also here (since here we still have the freedom to choose something without the concern of changing something), I think we should have some discussion about what we ideally want.

@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Sep 26, 2018
@jorisvandenbossche
Member Author

jorisvandenbossche commented Sep 26, 2018

Similar to the random thought I mentioned in #22511, one possible alternative:

<pandas.IntegerArray>
[1, 2, 3, nan, ..., 1, 2, 3]
Length: 400, dtype: Int64
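As a rough sketch of how a layout like this could be produced, the following pure-Python helper builds the three lines (bracketed class name, truncated values, length and dtype footer). The function name `format_array` and the `max_items` truncation rule are assumptions for illustration, not anything pandas defines, so the exact number of items shown differs slightly from the example above.

```python
def format_array(class_name, values, dtype, max_items=6):
    """Render values in a bracketed, three-line layout (illustrative only)."""
    items = [str(v) for v in values]
    if len(items) > max_items:
        # Keep the same number of items from each end and elide the middle.
        head = max_items // 2
        items = items[:head] + ["..."] + items[-head:]
    return "\n".join(
        [
            f"<{class_name}>",
            "[" + ", ".join(items) + "]",
            f"Length: {len(values)}, dtype: {dtype}",
        ]
    )


data = [1, 2, 3, float("nan")] * 100
text = format_array("pandas.IntegerArray", data, "Int64")
print(text)
```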

@TomAugspurger
Contributor

That seems fine.

Would we update Categorical to include <pandas.Categorical>, or are we happy with its current repr?

@TomAugspurger TomAugspurger added the Output-Formatting __repr__ of pandas objects, to_string label Sep 27, 2018
@jbrockmendel
Member

This and some other common-but-not-necessary methods could go in a mixin in something like core.arrays.common and get mixed in to all of the pandas-internal EA subclasses.
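A minimal sketch of what such a mixin might look like, using toy classes rather than the real pandas arrays. The names `ArrayReprMixin`, `ToyPeriodArray`, `ToyIntegerArray`, and the `_format_values` hook are all hypothetical and only illustrate the idea of sharing one `__repr__`.

```python
class ArrayReprMixin:
    """Hypothetical mixin; the name and layout are assumptions for illustration."""

    def _format_values(self):
        # Default formatting; subclasses may override this hook.
        return [repr(v) for v in self._data]

    def __len__(self):
        return len(self._data)

    def __repr__(self):
        body = ", ".join(self._format_values())
        return (
            f"<{type(self).__name__}>\n"
            f"[{body}]\n"
            f"Length: {len(self)}, dtype: {self.dtype}"
        )


class ToyPeriodArray(ArrayReprMixin):
    dtype = "period[A-DEC]"

    def __init__(self, values):
        self._data = list(values)


class ToyIntegerArray(ArrayReprMixin):
    dtype = "Int64"

    def __init__(self, values):
        self._data = list(values)

    def _format_values(self):
        # Show missing entries as "nan" instead of "None".
        return ["nan" if v is None else str(v) for v in self._data]


print(repr(ToyPeriodArray(["2000", "2001"])))
print(repr(ToyIntegerArray([1, 2, 3, None])))
```

Each subclass only supplies its data, its dtype, and (optionally) how individual values are formatted; the overall layout lives in one place.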

@TomAugspurger
Contributor

PeriodArray is using a repr like #22846 (comment).

In [19]: arr = pd.core.arrays.period_array(['2000', '2001'], freq='A')

In [20]: arr
Out[20]:
<PeriodArray>
['2000', '2001']
Length: 2, dtype: period[A-DEC]

I'd be happy with something similar for IntegerArray.

SparseArray can't share it, because it needs to indicate the sparse points. But I think it can be a bit more uniform, something like

>>> pd.SparseArray([1, 0, 0, 4])
<pandas.SparseArray>
[1, 0, 0, 4]
Indices: array([0, 3]), dtype: Sparse[int64, 0]
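For illustration only, the sparse points in a repr like this could be derived as the positions whose value differs from the fill value. The helper below is a pure-Python sketch with a hypothetical name (`sparse_repr`); it assumes integer input (the `int64` in the dtype string is hard-coded) and prints a plain list rather than a NumPy array.

```python
def sparse_repr(values, fill_value=0):
    """Build a bracketed repr that also lists the non-fill positions."""
    # Positions holding something other than the fill value.
    indices = [i for i, v in enumerate(values) if v != fill_value]
    return "\n".join(
        [
            "<pandas.SparseArray>",
            "[" + ", ".join(str(v) for v in values) + "]",
            # Assumption: integer data, so the dtype string is fixed here.
            f"Indices: {indices}, dtype: Sparse[int64, {fill_value}]",
        ]
    )


sparse_text = sparse_repr([1, 0, 0, 4])
print(sparse_text)
```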

Categorical could maybe be updated, but not a big deal.

@jorisvandenbossche
Member Author

I personally like the fact that we have 'pandas' in the repr, as I did in the IntegerArray example above and as you did for the SparseArray (which PeriodArray doesn't have), to make it clear from its repr that it is a pandas array, without needing to check type(..).
But of course, if we don't add them to the top level API (which still needs to be discussed), then <pandas.IntegerArray> is a bit confusing.

@TomAugspurger
Contributor

TomAugspurger commented Oct 26, 2018

But of course, if we don't add them to the top level API (which still needs to be discussed), then <pandas.IntegerArray> is a bit confusing.

That's why I left it out of PeriodArray. I don't have a strong opinion there though. It's clearly not code since it's in brackets. We could even have <pandas PeriodArray> maybe.

@jreback
Contributor

jreback commented Nov 12, 2018

I am puzzled why you are arbitrarily changing formats here from what we already have longstanding for all Index types. The repr for Index looks much nicer, doesn't have angle brackets, fits on a single line if it's short, and properly separates commas.

Creating a new format is non-trivial, which is ok to do generally, but only if we change everything. This is a huge, undesired breaking change. So big -1 on doing this. This impacts #23601 and is blocking for me.

@TomAugspurger
Contributor

TomAugspurger commented Nov 12, 2018 via email

@jorisvandenbossche
Member Author

This is a huge, undesired breaking change.

You may not like the proposed format, but it is not "breaking". Nothing will be broken as this only affects new API (the internal EAs), not the reprs of Index or Series.

I am puzzled why you are arbitrarily changing formats here from what we already have longstanding for all Index types.

It is not arbitrary. There are reasons to do this, mentioned somewhat in the original issue.
You are correct this is deviating from the typical format we have been using for the Index reprs. And therefore, we indeed need to have good reasons to do things differently.

Trying to put the original reasoning a bit clearer:

  • Index and Array are different objects. I don't think it is a problem that they have a visually distinct repr.
  • The current Index repr also has some disadvantages (which we don't necessarily have to carry over to the Array repr):
    • It tries to "look" like actual code, and is that in many cases, but not in all (eg when there are missing values, when it is truncated, ..). But specifically for EAs, those cases are much more common (see points below).
    • Minor other points: when truncated, it has a length=... inside the repr, which I find looks odd in a repr that mimics a constructor.
  • Specifically for EAs, we don't recommend to use the actual class constructors (as opposed to the different Index subclasses). I think therefore having a repr that does not look like a class constructor is more important for Array than Index.
  • Specifically for EAs, we decided to keep the class constructor simple. This means that, e.g., a repr of a DatetimeIndex like DatetimeIndex(['2012-01-01'], dtype='datetime64[ns]', freq=None) is actually a valid constructor for DatetimeIndex, but a similar repr for DatetimeArray would not be a valid constructor for DatetimeArray. So IMO a repr that mimics Python code, as we do for Index, is much less fitting for Array.
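The distinction in the last bullet can be sketched with two toy classes (both hypothetical, unrelated to the pandas implementations): one whose repr is a valid constructor call and therefore round-trips through `eval`, and one whose bracketed repr is plainly not Python code.

```python
class CodeLikeRepr:
    """Toy class whose repr is a valid call to its own constructor."""

    def __init__(self, values):
        self.values = list(values)

    def __repr__(self):
        return f"CodeLikeRepr({self.values!r})"

    def __eq__(self, other):
        return isinstance(other, CodeLikeRepr) and self.values == other.values


class BracketRepr:
    """Toy class whose repr makes no attempt to look like code."""

    def __init__(self, values):
        self.values = list(values)

    def __repr__(self):
        return f"<BracketRepr>\n{self.values!r}\nLength: {len(self.values)}"


def round_trips(obj):
    """Return True if eval(repr(obj)) rebuilds an equal object."""
    try:
        return eval(repr(obj)) == obj
    except SyntaxError:
        # The repr is not valid Python, so it cannot be mistaken for code.
        return False


print(round_trips(CodeLikeRepr([1, 2, 3])))
print(round_trips(BracketRepr([1, 2, 3])))
```

A constructor-style repr is only helpful when the round trip actually works; when the constructor is kept simple and cannot accept what the repr shows, the bracketed form at least makes no false promise.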

Since the Array class is a new API, we have the opportunity to discuss more freely what repr we would like to have. So let's discuss that (that's the reason I originally opened this issue).

@jreback
Contributor

jreback commented Nov 15, 2018

Index and Array are different objects. I don't think it is a problem that they have a visually distinct repr.

The more things differ, the more inconsistencies there are across the code base in code, testing, and user experience. This is such a jarring, huge change that you no longer know what the relationship between Index and Array is. This is a bad thing.

The current Index repr also has some disadvantages (which we don't necessarily have to carry over to the Array repr):
It tries to "look" like actual code, and is that in many cases, but not in all (eg when there are missing values, when it is truncated, ..). But specifically for EAs, those cases are much more common (see points below).

Sure, this is the case, but it is not a reason to deviate.

Minor other points: when truncated, it has a length=... inside the repr, which I find looks odd in a repr that mimics a constructor.
Specifically for EAs, we don't recommend to use the actual class constructors (as opposed to the different Index subclasses). I think therefore having a repr that does not look like a class constructor is more important for Array than Index.

This is a problem for MultiIndex as well. Again I am all for changing everything, just not a part.

Specifically for EAs, we decided to keep the class constructor simple. This means that, e.g., a repr of a DatetimeIndex like DatetimeIndex(['2012-01-01'], dtype='datetime64[ns]', freq=None) is actually a valid constructor for DatetimeIndex, but a similar repr for DatetimeArray would not be a valid constructor for DatetimeArray. So IMO a repr that mimics Python code, as we do for Index, is much less fitting for Array.

This is also not true for Index generally, e.g. Period, not sure why this is an issue here.

I think making this change (and adding associated hacky things like removing commas) is just a bad code smell. I would be OK with adding a repr that is the same as Index. Then if you want to change the repr generally, that is OK, but half and half does not fly.

@jreback
Contributor

jreback commented Nov 15, 2018

I might be more accepting if the repr is not multi-lined by default, I think this is also the cause of the 'need to remove the commas'.

@jorisvandenbossche
Member Author

@jreback can you try to explain why you don't like the fact it is multi-line? Once you have a bit longer index, our current index repr is also multi-line (IntervalIndex is even multi-line by default, but that one is a bit inconsistent with the others)

@jreback
Contributor

jreback commented Nov 15, 2018

@jreback can you try to explain why you don't like the fact it is multi-line? Once you have a bit longer index, our current index repr is also multi-line (IntervalIndex is even multi-line by default, but that one is a bit inconsistent with the others)

For a short repr this is very awkward. Sure, when it gets longer it is multi-line. The point is the separation of the class from the data. I personally think the Index repr is just fine.

You need a much stronger argument to arbitrarily change it, and that is what you are proposing. Consistency is king here. You are simply breaking this.

@TomAugspurger
Contributor

So

In [2]: pd.core.arrays.period_array(['2000', '2001'], 'D')
Out[2]: <PeriodArray>['2000-01-01', '2001-01-01'], Length: 2, dtype: period[D]

rather than

In [2]: pd.core.arrays.period_array(['2000', '2001'], 'D')
Out[2]:
<PeriodArray>
['2000-01-01', '2001-01-01']
Length: 2, dtype: period[D]

would move you to a +1 on #23601?
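One way to picture the compromise being floated here is a formatter that stays on one line when the result is short and falls back to the multi-line layout otherwise. This is only a sketch: `format_repr` and its `width` parameter are hypothetical names, and the 80-character threshold is an assumption.

```python
def format_repr(class_name, values, dtype, width=80):
    """Use a single line when it fits within ``width``, else three lines."""
    header = f"<{class_name}>"
    body = "[" + ", ".join(repr(v) for v in values) + "]"
    footer = f"Length: {len(values)}, dtype: {dtype}"

    one_line = f"{header}{body}, {footer}"
    if len(one_line) <= width:
        return one_line
    return "\n".join([header, body, footer])


values = ["2000-01-01", "2001-01-01"]
short = format_repr("PeriodArray", values, "period[D]")
narrow = format_repr("PeriodArray", values, "period[D]", width=40)
print(short)
print(narrow)
```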

@jreback
Contributor

jreback commented Nov 15, 2018

Currently

In [1]: pd.core.arrays.period_array(['2000', '2001'], 'D')
Out[1]: 
<PeriodArray>
['2000-01-01', '2001-01-01']
Length: 2, dtype: period[D]

In [2]: pd.Index(pd.core.arrays.period_array(['2000', '2001'], 'D'))
Out[2]: PeriodIndex(['2000-01-01', '2001-01-01'], dtype='period[D]', freq='D')

So I see some departures here which just don't make sense: the quoting on the dtype, and the length (in the current EA repr), which normally only shows up if it's too long for the display. It's 1 line, and I just don't like the angle brackets.

@TomAugspurger
Contributor

and I just don't like the angle brackets.

One thing at a time. Do we agree that angle brackets are appropriate for the base repr, since we don't know if the repr is code?

@jreback
Contributor

jreback commented Nov 15, 2018

One thing at a time. Do we agree that angle brackets are appropriate for the base repr, since we don't know if the repr is code?

Again, no, I don't agree. We have not done that elsewhere, and therefore this is a completely breaking change from a user perspective.

@TomAugspurger
Contributor

We haven't defined a repr for objects with arbitrary constructors though. This is different.

@jreback
Contributor

jreback commented Nov 15, 2018

Sure, but we already have this repr, for example, so I would assert this is NOT different.

In [1]: pd.Categorical(list('abc'))
Out[1]: 
[a, b, c]
Categories (3, object): [a, b, c]

@TomAugspurger
Contributor

We defined Categorical.__init__ though. We don't define ExtensionArray.__init__, so we can't know whether the repr is valid code. Remember, I'm talking about the base repr here.

@jreback jreback added this to the 0.24.0 milestone Dec 4, 2018