-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add default repr for EAs #23601
Add default repr for EAs #23601
Conversation
Hello @TomAugspurger! Thanks for submitting the PR.
|
In [3]: integer_array([1, 2, 3])
Out[3]:
<IntegerArray>
[1, 2, 3]
Length: 3, dtype: Int64
In [4]: period_array(['2000', '2001'], freq='D')
Out[4]:
<PeriodArray>
[2000-01-01, 2001-01-01]
Length: 2, dtype: period[D]
In [5]: IntervalArray.from_breaks([1, 2, 3])
Out[5]:
<IntervalArray>
[(1, 2], (2, 3]]
Length: 2, dtype: interval[int64]
In [6]: integer_array([1, 2, 3] * 1000)
Out[6]:
<IntegerArray>
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1,
...
3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
Length: 3000, dtype: Int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to add one for DatetimeArray and TimedeltaArray here as well?
Do we need a mechanism to indicate which attributes to print in addition to length and dtype? (in case we want to keep printing the freq
for DatetimeArray/Timedelta)Array
Is there any control over the number of elements shown?
For DatetimeArray & Timedelta I was waiting to see what happens on #23587. If @jbrockmendel reverts the repr changes before merging then I'll add it here. Otherwise I'll just delete them after that's merged.
Sure. I suppose that info doesn't belong in the dtype (right?) so we can add hooks for extra attrs.
At the array level, no. But I think following the option at |
The main difference for PeriodArray seems to be that the values are no longer quoted? (not a strong opinion here though, for Index reprs we use quotes and also numpy quotes dates) |
diff --git a/pandas/core/arrays/period.py b/pandas/core/arrays/period.py
index f6996f8e6..4bc7841fe 100644
--- a/pandas/core/arrays/period.py
+++ b/pandas/core/arrays/period.py
@@ -330,6 +330,10 @@ class PeriodArray(dtl.DatetimeLikeArrayMixin, ExtensionArray):
def end_time(self):
return self.to_timestamp(how='end')
+ @property
+ def _formatter(self):
+ return "'{}'".format
+
def __setitem__(
self,
key, # type: Union[int, Sequence[int], Sequence[bool]] will quote. Gonna mixup the py2 issues. |
Update:
In [5]: integer_array([1, 2, None] * 1000)
Out[5]:
<IntegerArray>
[ 1, 2, nan, 1, 2, nan, 1, 2, nan, 1,
...
nan, 1, 2, nan, 1, 2, nan, 1, 2, nan]
Length: 3000, dtype: Int64
In [6]: IntervalArray.from_breaks(list(range(1000)))
Out[6]:
<IntervalArray>
[ (0, 1], (1, 2], (2, 3], (3, 4], (4, 5], (5, 6],
(6, 7], (7, 8], (8, 9], (9, 10],
...
(989, 990], (990, 991], (991, 992], (992, 993], (993, 994], (994, 995],
(995, 996], (996, 997], (997, 998], (998, 999]]
Length: 999, dtype: interval[int64]
In [7]: period_array(['2000', '2001'] * 1000, freq='D')
Out[7]:
<PeriodArray>
['2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01',
'2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01',
...
'2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01',
'2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01']
Length: 2000, dtype: period[D] |
has this been around for a while? its a private attribute, why deprecate? |
It's part of the interface and was around since 0.23. |
This are into a bit larger of a refactor... I removed {Categorica,Period,Interval}ArrayFormatter in favor of a generic ExtensionArrayFormatter. EAs will get control over formatting of individual values by overriding |
@TomAugspurger I was thinking for a moment again on the quoting issue. It would be nice to have a somewhat general rule about this, and also to reflect this in the base repr. From observing the results, it seems that - mostly - we use a If we like this general pattern, we could also do this for the default repr (and sorry, that goes against what I commented earlier #23601 (comment) and what you changed). So instead of
we could have
That would also fit a possible StringArray to not quote it in Series/DataFrame, but to show it quoted in the array repr. Anyway, since this is configurable, and I think none of the internal ones inherits the base |
Implemented |
---------- | ||
boxed: bool, default False | ||
An indicated for whether or not your array is being printed | ||
within a Series, DataFrame, or Index (True), or just by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add this arg to the Index formatters as well for compatiblity.
pandas/io/formats/printing.py
Outdated
defaults to the class name of the obj | ||
|
||
Pass ``False`` to indicate that subsequent lines should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be another parameter then? it seems like it is used for 2 purposes
summary += '],' | ||
|
||
# right now close is either '' or ', ' | ||
# Now we want to include the ']', but not the maybe space. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so another difference this is highliting is that EA have the attributes on another line, while the Index does not (as they are args).
* docs * removed overloading of name=False * added indent_for_name
can you rebase |
@@ -503,7 +503,7 @@ def __array_wrap__(self, result, context=None): | |||
|
|||
@property | |||
def _formatter_func(self): | |||
return lambda x: "'%s'" % x | |||
return self.array._formatter(boxed=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this the only index sublcass that you need to do this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the only extension-array backed index that's using the formatting right now.
- CategoricalIndex always uses the default formatter for the underlying categories (since it can be a container for any type, it dispatches the formatting).
- IntervalIndex /IntervalArray don't use this formatting
Datetime / TImedelta will use this too, so this will be pushed up into DatetimeIndexOpsMixin
later.
def __init__(self, values, *args, **kwargs): | ||
GenericArrayFormatter.__init__(self, values, *args, **kwargs) | ||
if is_categorical_dtype(values.dtype): | ||
# Categorical is special for now, so that we can preserve tzinfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need a TODO here? this is until DatetimeArray is fully pushed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That depends on whether we're willing to change __array__
for datetime-backed series / index (right now . I'm writing up an issue now to discuss that specific point.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#23569 (comment) for that.
data = integer_array([1, 2, None] * 1000) | ||
expected = ( | ||
"<IntegerArray>\n" | ||
"[ 1, 2, NaN, 1, 2, NaN, 1, 2, NaN, 1,\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are somehow justified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Justified, as in "vertically aligned"? Here's the repr
In [10]: data
Out[10]:
<IntegerArray>
[ 1, 2, NaN, 1, 2, NaN, 1, 2, NaN, 1,
...
NaN, 1, 2, NaN, 1, 2, NaN, 1, 2, NaN]
Length: 3000, dtype: Int64
The NaN pattern there makes the formatting a bt strange, but I think unavoidable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no this is ok, was just wondering about
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of comments
I'm not sure how hard it would be, but in <IntegerArray>
[ 1, 2, NaN, 1, 2, NaN, 1, 2, NaN, 1,
...
NaN, 1, 2, NaN, 1, 2, NaN, 1, 2, NaN]
Length: 3000, dtype: Int64 I don't like that we devote an entire line just to <IntegerArray>
[ 1, 2, NaN, 1, 2, NaN, 1, 2, NaN, ...,
..., 1, 2, NaN, 1, 2, NaN, 1, 2, NaN]
Length: 3000, dtype: Int64 Although, that makes the continuation harder to see (for me)... Edit: thinking about it more, I prefer the one with I'll try to followup with a short one-line repr before the release. |
this is fine: #23601 (comment) we do this elsewhere, try this with a large Array and the own line makes a lot of sense. |
👍 all good then? |
thanks a lot @TomAugspurger ; this consolidates a lot of disparate code! |
|
||
def _format_strings(self): | ||
fmt_values = format_array(self.values.get_values(), self.formatter, | ||
fmt_values = format_array(array, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger : i'm struggling to resolve some formatting issues. what is the reason for calling format_array
here. As far as I can tell is looping back round to create a GenericArrayFormatter
instance with a formatter
specified to pick up the display options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess, to be more succinct, why is super()._format_strings()
not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not that familiar with this code, but from a quick look: calling super()._format_strings()
would be different, as this would call GenericArrayFormatter._format_strings
, while the generic format_array
can still result in using custom formatters like Datetime64(TZ)Formatter
or Timedelta64Formatter
, depending on what the values of the underlying EA are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although that most of those custom Formatter classes don't do much special if formatter is specified.
Eg Datetime64Formatter
has this in _format_strings
:
pandas/pandas/io/formats/format.py
Lines 1174 to 1175 in 181f972
if self.formatter is not None and callable(self.formatter): | |
return [self.formatter(x) for x in values] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if ExtensionArrayFormatter
is not inheriting from GenericArrayFormatter
but calling format_array
to dispatch to another ...ArrayFormatter
class, why wouldn't the logic in ExtensionArrayFormatter
be in format_array
?
def _formatter(self, boxed=False): | ||
def fmt(x): | ||
if isna(x): | ||
return 'NaN' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should NaN
have been hardcoded here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? This is used for the Integer data display, where we currently use NaN.
(whether we should use rather 'NA' instead of 'NaN', that's another question)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just wondering whether it would be a problem with to_string(na_rep=...)
. will do some tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, that's a good reason. But in general, this _formatter
does not follow display options at all, is that correct?
In which case this is something to think about in general.
Closes #22846
Closes #23590