ENH: better MultiIndex.__repr__ #22511

Merged: 12 commits, Jun 19, 2019

@topper-123 topper-123 commented Aug 26, 2018

closes #13480
closes #12423

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This PR proposes a new repr for MultiIndex. A MultiIndex will be displayed as vertically stacked tuples, as discussed in #13480, which makes its structure easier to understand.

In the proposal we get:

  • item formatting according to each level's formatting rule,
  • right-justification for each tuple item,
  • row-wise truncation according to pd.options.display.max_seq_items,
  • column-wise truncation according to pd.options.display.width.
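
The four rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: `format_stacked` and its parameters `max_rows` and `width` are hypothetical names standing in for the pandas display options, and the tuple items are assumed to be already formatted as strings.

```python
def format_stacked(rows, max_rows=6, width=80, name="MultiIndex"):
    """Hypothetical helper: render pre-formatted tuples as stacked lines.

    `rows` is assumed to be a non-empty list of equal-length tuples of strings.
    """
    # Row-wise truncation: keep only the first and last max_rows // 2 rows.
    truncated = len(rows) > max_rows
    half = max(max_rows // 2, 1)
    head = list(rows[:half]) if truncated else list(rows)
    tail = list(rows[-half:]) if truncated else []

    # Right-justify every tuple position to the widest item shown there.
    col_widths = [max(len(item) for item in col) for col in zip(*(head + tail))]

    prefix = name + "(["
    indent = " " * len(prefix)

    def render(row):
        items = [item.rjust(w) for item, w in zip(row, col_widths)]
        line = "(" + ", ".join(items) + ")"
        # Column-wise truncation: drop trailing items until the line fits.
        while len(indent) + len(line) > width and len(items) > 1:
            items.pop()
            line = "(" + ", ".join(items + ["..."]) + ")"
        return line

    lines = [render(row) + "," for row in head]
    if truncated:
        lines.append("...")
        lines.extend(render(row) + "," for row in tail)
    lines[-1] = lines[-1][:-1] + "],"  # close the list after the last tuple

    footer = " " * len(name + "(") + f"length={len(rows)})"
    return prefix + ("\n" + indent).join(lines) + "\n" + footer


rows = [("'a'", "9")] * 3 + [("'bcd'", "10")] * 3
print(format_stacked(rows, max_rows=4))
```

Passing a smaller `width` makes each tuple end in `...`, which mirrors the column-wise truncation in the large example from the PR description.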

A large MultiIndex example will now look like this:

>>> import pandas as pd
>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n) + (['bcd'] * n),
...                          categories=['a', 'bcd'], ordered=True)
>>> dti = pd.date_range('2000-01-01', freq='s', periods=2 * n)
>>> mi = pd.MultiIndex.from_arrays([ci, ci.codes + 9, dti, dti, dti],
...                                names=['a', 'b', 'x', 'x2', 'x3'])
>>> mi
MultiIndex([(  'a',  9, '2000-01-01 00:00:00', '2000-01-01 00:00:00', ...),
            (  'a',  9, '2000-01-01 00:00:01', '2000-01-01 00:00:01', ...),
            (  'a',  9, '2000-01-01 00:00:02', '2000-01-01 00:00:02', ...),
            (  'a',  9, '2000-01-01 00:00:03', '2000-01-01 00:00:03', ...),
            (  'a',  9, '2000-01-01 00:00:04', '2000-01-01 00:00:04', ...),
            (  'a',  9, '2000-01-01 00:00:05', '2000-01-01 00:00:05', ...),
            (  'a',  9, '2000-01-01 00:00:06', '2000-01-01 00:00:06', ...),
            (  'a',  9, '2000-01-01 00:00:07', '2000-01-01 00:00:07', ...),
            (  'a',  9, '2000-01-01 00:00:08', '2000-01-01 00:00:08', ...),
            (  'a',  9, '2000-01-01 00:00:09', '2000-01-01 00:00:09', ...),
            ...
            ('bcd', 10, '2000-01-24 03:33:10', '2000-01-24 03:33:10', ...),
            ('bcd', 10, '2000-01-24 03:33:11', '2000-01-24 03:33:11', ...),
            ('bcd', 10, '2000-01-24 03:33:12', '2000-01-24 03:33:12', ...),
            ('bcd', 10, '2000-01-24 03:33:13', '2000-01-24 03:33:13', ...),
            ('bcd', 10, '2000-01-24 03:33:14', '2000-01-24 03:33:14', ...),
            ('bcd', 10, '2000-01-24 03:33:15', '2000-01-24 03:33:15', ...),
            ('bcd', 10, '2000-01-24 03:33:16', '2000-01-24 03:33:16', ...),
            ('bcd', 10, '2000-01-24 03:33:17', '2000-01-24 03:33:17', ...),
            ('bcd', 10, '2000-01-24 03:33:18', '2000-01-24 03:33:18', ...),
            ('bcd', 10, '2000-01-24 03:33:19', '2000-01-24 03:33:19', ...)],
           dtype='object', names=['a', 'b', 'x', 'x2', 'x3'], length=2000000)

For further examples, see the added tests in pandas/tests/indexes/multi/test_format.py.

@@ -57,49 +57,6 @@ def test_repr_with_unicode_data():
assert "\\u" not in repr(index) # we don't want unicode-escaped


def test_repr_roundtrip():

Contributor Author

Note that the new implementation breaks round-tripping. IMO this is a worthwhile trade-off, as we get better clarity with the new repr.

Contributor

Can you put back a test that asserts that this raises on round-trip now, though? (Just a simple example with a comment is enough.)
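
A minimal sketch of the kind of test being requested, assuming a pandas release that includes the new repr. The exception type (`TypeError`) is an assumption based on how the `MultiIndex` constructor treats a list of tuples passed as its first argument; the thread itself does not name it. The function name is illustrative only.

```python
import pandas as pd


def test_repr_roundtrip_raises():
    mi = pd.MultiIndex.from_product(
        [list("ab"), range(3)], names=["first", "second"]
    )
    try:
        # The stacked-tuple repr is not valid input for the constructor,
        # so evaluating it is expected to fail (assumed: TypeError).
        eval(repr(mi), {"MultiIndex": pd.MultiIndex})
    except TypeError:
        return
    raise AssertionError("repr unexpectedly round-tripped")


test_repr_roundtrip_raises()
```

In the pandas test suite this would more naturally use `pytest.raises`; the plain `try`/`except` keeps the sketch free of extra dependencies.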

@topper-123 topper-123 force-pushed the MultiIndex.__repr__ branch 3 times, most recently from dd81bdd to bbee14e Compare August 26, 2018 08:36
@codecov

codecov bot commented Aug 26, 2018

Codecov Report

Merging #22511 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #22511      +/-   ##
==========================================
- Coverage   91.73%   91.73%   -0.01%     
==========================================
  Files         178      178              
  Lines       50774    50794      +20     
==========================================
+ Hits        46579    46595      +16     
- Misses       4195     4199       +4
Flag        Coverage Δ
#multiple   90.32% <100%> (ø) ⬆️
#single     41.18% <46.66%> (-0.12%) ⬇️

Impacted Files                   Coverage Δ
pandas/core/strings.py           98.92% <ø> (ø) ⬆️
pandas/core/indexes/base.py      96.71% <ø> (ø) ⬆️
pandas/core/indexes/multi.py     95.73% <100%> (+0.06%) ⬆️
pandas/io/formats/printing.py    86.72% <100%> (+1.16%) ⬆️
pandas/io/gbq.py                 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py             96.88% <0%> (-0.12%) ⬇️


@topper-123 topper-123 force-pushed the MultiIndex.__repr__ branch 3 times, most recently from 8508304 to 661e3be Compare August 26, 2018 11:40
defaults to the class name of the obj
is_multi : bool, default False

Contributor

this is not going to be acceptable
this cannot know anything about a MultiIndex
you can override the formatters in multi if you really really need

Contributor Author

Can it be called line_break_on_values?

Contributor Author

format_object_summary needs to know whether the formatter func returns a string or a tuple of strings, as the treatment of each is different (but only slightly different).

An alternative is, as you say, to make a different function for MultiIndex-likes, but the functions are going to be very similar and you asked for code reuse in #13480.

Contributor

Yes, line_break_on_values is fine. This function just cannot reference any pandas internals.
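
To illustrate the design point in this thread, here is a rough sketch of a generic summary function that switches behavior on a flag instead of knowing about MultiIndex. `summarize` is a hypothetical stand-in, not pandas' `format_object_summary`; only the flag name `line_break_on_values` comes from the discussion above.

```python
def summarize(values, formatter, line_break_on_values=False, name="Index"):
    """Hypothetical generic summary; it knows nothing about MultiIndex."""
    formatted = [formatter(v) for v in values]
    if line_break_on_values:
        # The formatter returned a tuple of strings per value:
        # put every value on its own line.
        indent = " " * (len(name) + 2)
        rows = ["(" + ", ".join(parts) + ")" for parts in formatted]
        body = (",\n" + indent).join(rows)
    else:
        # The formatter returned one string per value: keep a flat list.
        body = ", ".join(formatted)
    return f"{name}([{body}])"


print(summarize([1, 2, 3], str))
print(
    summarize(
        [("a", 1), ("b", 2)],
        lambda t: tuple(repr(x) for x in t),
        line_break_on_values=True,
        name="MultiIndex",
    )
)
```

The caller decides which formatter and flag to pass, so the function itself stays free of any index-specific logic, which is the constraint the reviewer asks for.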

@topper-123
Contributor Author

The failures are unrelated:

travis:

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

circle-ci: py27_compat:

ImportError: libgfortran.so.1: cannot open shared object file: No such file or directory

Will rebase and force push to see if this is an intermittent failure.


.. ipython:: python

index1=range(1000)

Contributor

formatting here (spaces around =)

.. ipython:: python

index1=range(1000)
index2 = pd.Index(['a'] * 500 + ['abc'] * 500)

Contributor

can you use a more familiar construction, e.g. .from_product

index2 = pd.Index(['a'] * 500 + ['abc'] * 500)
pd.MultiIndex.from_arrays([index1, index2])

For number of rows smaller than :attr:`options.display.max_seq_items`, all

Contributor

need a/the

For number of rows smaller than :attr:`options.display.max_seq_items`, all
values will be shown (default: 100 items). Horizontally, the output will
truncate, if it's longer than :attr:`options.display.width` (default: 80 characters).
This solves the problem with outputting large MultiIndex instances to the console.

Contributor

don't need the last sentence
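
The two display options named in the whatsnew text under review can be tried out as follows. This is a sketch that assumes a pandas release including this change; it builds the index with `from_product`, as suggested earlier in the review, and the variable names are illustrative.

```python
import pandas as pd

# 500 x 2 = 1000 entries, far more than the row limit set below.
mi = pd.MultiIndex.from_product([range(500), ["a", "abc"]])

# Row-wise truncation is controlled by display.max_seq_items.
with pd.option_context("display.max_seq_items", 10):
    short = repr(mi)

print(short)
```

With the limit lowered, the repr shows only some leading and trailing tuples around a `...` line and reports the full length at the end. `display.width` can be set the same way through `pd.option_context` to explore the horizontal truncation.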

Invoked by unicode(df) in py2 only. Yields a Unicode String in both
py2/py3.
"""
klass = self.__class__.__name__

Contributor

this looks like a dupe of Index.base.unicode

Contributor Author

Good catch. Removed.

head = [x.rjust(max_len) for x in head]
tail = [x.rjust(max_len) for x in tail]
head, tail = _justify(head, tail, display_width, best_len,
is_truncated, is_multi)

Contributor

justify seems incompatible with is_multi (well the new option)?

"""
Justify each item in head and tail, so they align properly.
"""
if is_multi:

Contributor

this is getting pretty complicated. e.g. the nested calling of this. maybe ban justify / is_multi

Contributor Author

is_multi also needs to justify, but per position within each tuple, rather than across a flat list of values.

I see it's a bit complicated, but it's also difficult to make it simpler. I've tried containing the new functionality; maybe that's better.

@pytest.mark.skipif(PY2, reason="repr output is different for python2")
class TestRepr(object):

def setup_class(self):

Contributor

ugg, pls don't use the old unittest style setup, make fixtures instead

@topper-123
Contributor Author

I think I've adjusted for all the comments.

@topper-123 topper-123 force-pushed the MultiIndex.__repr__ branch 4 times, most recently from 1807702 to 906e0f7 Compare August 31, 2018 22:10
@topper-123
Contributor Author

The Travis failure was a ResourceWarning, so it is unrelated to this PR.

@topper-123 topper-123 force-pushed the MultiIndex.__repr__ branch 4 times, most recently from c3a76d0 to 359b2a3 Compare September 2, 2018 09:16
@gfyoung
Member

gfyoung commented Sep 2, 2018

@topper-123 : FYI, Anaconda has been having some bad servicing issues, so unfortunately, I don't think CI is going to be very cooperative at this point in time.

@topper-123
Contributor Author

topper-123 commented Sep 2, 2018

Ok, thanks for notifying me.

wrt. the PR, all comments by @jreback should have been addressed. Some further simplifications have also been made: the methods _format_space and _format_attrs have been removed, and MultiIndex now inherits them instead.

@topper-123
Contributor Author

Ping. I would appreciate a resolution to this. To me it starts feeling like a second brexit (i.e. a decision isn't being made) ;-).

@jreback
Contributor

jreback commented Apr 20, 2019

@jorisvandenbossche this is better than the current. perfection can be in another PR.

@topper-123
Contributor Author

Ping.

@jreback
Contributor

jreback commented Jun 3, 2019

so this has been outstanding for quite a long time. It's better than the current repr. Any remaining objections?

@WillAyd
Member

WillAyd commented Jun 3, 2019

No I think this is a good enhancement

@topper-123
Contributor Author

I think this should get a decision now, the PR is almost a year old now. If needed it could be elevated to the BDFL, rather than languishing.

Out of optimism, I've just rebased again ;-)

@jreback
Contributor

jreback commented Jun 12, 2019

let's not let the perfect be the enemy of the good.

unless there is an actionable counter-proposal within 72 hours, I am going to merge.

@jorisvandenbossche

@jreback
Contributor

jreback commented Jun 12, 2019

cc @pandas-dev/pandas-core

Member

@WillAyd WillAyd left a comment

Sorry didn't realize I was still in "Request Changes" - this lgtm!

@jreback jreback merged commit d47947a into pandas-dev:master Jun 19, 2019
@jreback
Contributor

jreback commented Jun 19, 2019

thanks @topper-123 very nice!

I am sure there will be some followups.

Labels
Enhancement MultiIndex Output-Formatting __repr__ of pandas objects, to_string
Development

Successfully merging this pull request may close these issues.

Change MultiIndex repr ? Abbreviate MultiIndex representation
7 participants