Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add to_flat_index method to MultiIndex #22866

Merged
merged 16 commits into from
Nov 13, 2018

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Sep 27, 2018

Very simple implementation at the moment. The thought here is to introduce this method and perhaps subsequently extend to allow for string concatenation of the elements. Longer term there could also be a keyword added to .agg of GroupBy which will dispatch to this instead of simply returning a MultiIndex column, which could alleviate some of the pain users are experience when trying to rename columns after an aggregation.

@TomAugspurger and @jorisvandenbossche from the dev chat today

@pep8speaks
Copy link

Hello @WillAyd! Thanks for submitting the PR.

@codecov
Copy link

codecov bot commented Sep 28, 2018

Codecov Report

Merging #22866 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22866      +/-   ##
==========================================
+ Coverage   92.24%   92.24%   +<.01%     
==========================================
  Files         161      161              
  Lines       51317    51318       +1     
==========================================
+ Hits        47338    47339       +1     
  Misses       3979     3979
Flag Coverage Δ
#multiple 90.63% <100%> (ø) ⬆️
#single 42.31% <50%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.47% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.46% <100%> (ø) ⬆️
pandas/core/series.py 93.68% <0%> (-0.03%) ⬇️
pandas/core/frame.py 97.02% <0%> (-0.01%) ⬇️
pandas/core/generic.py 96.82% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20bdb3e...bea4e85. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Oct 1, 2018

can you add to api.rst. I would add the sep=None arg here, maybe can defer the actual handling of the string concat to a later PR

@WillAyd
Copy link
Member Author

WillAyd commented Oct 1, 2018

Yep my goal is to defer the actual sep implementation, if only because I'm not sure how we'd want to handle say timestamp and categorical objects so probably needs further discussion.

For now I've stubbed it out and raise a NotImplementedError but if you want to go about in a different way lmk

@jorisvandenbossche
Copy link
Member

I'm not sure how we'd want to handle say timestamp and categorical objects so probably needs further discussion.

I think if you specify sep, everything gets stringified? (just using the default conversion to string, without any ability to use a custom format, the user can always do that before they use this method)


About the name, do we want to call this to_index ? It might feel a bit strange, since you already have an index to start with (it's only a MultiIndex, but which is still an index).
So you would get something like df.index.to_index() or df.columns.to_index(), which seems a bit strange.

Alternative would be .flatten(..), but I don't know to what extent it is a problem that this clashes with the numpy method.
Or to_single, collapse_levels, ... (but I don't really like those :-))

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 3, 2018 via email

@jreback
Copy link
Contributor

jreback commented Oct 3, 2018

maybe as_index(), though I agree I like the symmetry. btw is Index.to_index() tested? (iow identity)

@WillAyd
Copy link
Member Author

WillAyd commented Oct 3, 2018

Agreed with all comments so far - the name was initially off-putting to me but ultimately went with it given the consistency of it with the API, and I didn't want to obfuscate what flatten meant from a NumPy perspective.

I think if you specify sep, everything gets stringified?

Probably. I'm probably being ultra-conservative here but something about the string conversion and "round-tripability" of this is why I was planning on holding off until a separate PR to even go down that path. If it's a blocker here can certainly add the simple implementation.

Index.to_index() tested? (iow identity)

No but I also don't believe this is implemented atm. Are you thinking this implementation in this PR is better served in the base.py module?

@WillAyd
Copy link
Member Author

WillAyd commented Oct 23, 2018

Any other thoughts on this?

@@ -1198,6 +1199,28 @@ def to_frame(self, index=True, name=None):
result.index = self
return result

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about to_flat ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also i would add the identity method to Index itself

@WillAyd
Copy link
Member Author

WillAyd commented Oct 26, 2018

Added the identity method. I did not change the name from to_index to to_flat because I personally prefer the API consistency with to_series and to_frame though I'm not going to get terribly hung up on that if we feel to_flat is better.

I also kept the documentation geared only towards MultiIndex atm. Not sure if we want to document the identity and if it serves a purpose outside of not breaking chained ops (at least in its current state) but happy to update with whatever we feel necessary

@shoyer
Copy link
Member

shoyer commented Oct 27, 2018

What about to_flat_index()? I think to_index() is pretty confusing for transforming a MultiIndex to a base pandas.Index (my expectation would be that this is the identity function), and it would be better to pick something slightly verbose rather than slightly confusing.

@jorisvandenbossche
Copy link
Member

As said above, I am also -1 on to_index, this is just a confusing name. I agree with @shoyer that it's then better to have a verbose but at least readable / clear name.

Other names I was thinking about: .flatten_levels() or flatten_index_levels (this last one is maybe too verbose, although explicit). But to_flat_index() is fine for me.

Now you added it to the Index/MultiIndex. But if you want to do be able to directly do this is in a method chain (eg after a groupby operation), it would be needed on the DataFrame as well. Is this something we would also want to consider (but perfectly fine with start to add it at least to the index itself)

@toobaz
Copy link
Member

toobaz commented Nov 7, 2018

Sorry for coming late. I'm also not convinced by to_index, and weakly in favor of to_flat (although flatten would not disturb me).

I'm also not convinced by sep. I think that:

  • the necessity that emerged lately (e.g. in the mailing list, when discussing MultiIndexes produced by GroupBy.apply) was for instance to transform a level of operation names ('mean', 'min', 'max') and one of variables ('a', 'b', 'c') in an index of strings such as 'mean of a', 'mean of b' and so on. Now, sep=' of ' does (once implemented) allow this, but just because it happens that there are two levels, and in the correct order, not because it makes much sense to think to the labels for each level as a flat ordered list
  • I think giving less freedom today in terms of aggregation methods just calls for adding more arguments later

My idea would be to pass a callable (e.g. fmt) accepting an iterable and returning anything (not necessarily a string, by the way). The sep solution would just become, on a MultiIndex with 2 levels, fmt='{} of {}'.format.

In any case, if we want to discuss this and not block this PR, I would just remove the sep argument for the moment.

EDIT: one objection I can see to fmt is that it would be doing precisely what can be done as mi.to_flat().apply(fmt). But this is an objection to sep too I think. And probably a weak objection in any case.

@jorisvandenbossche
Copy link
Member

My idea would be to pass a callable (e.g. fmt) accepting an iterable and returning anything

The idea is that this would be a keyword of this (to be renamed) to_index method, or a method on its own?

In any case, I like this idea, it gives indeed more flexibility than a sep.

We had some discussion about this before (I think in light of deprecating add_prefix/add_suffix) to have an easier way to format values in an Index/Series (related issues: #18347, #17211)

@jorisvandenbossche
Copy link
Member

This is kind of

In [142]: idx = pd.MultiIndex.from_product([['mean', 'sum'], ['a', 'b']])

In [143]: idx.map(lambda x: "{0} of {1}".format(*x))
Out[143]: Index(['mean of a', 'mean of b', 'sum of a', 'sum of b'], dtype='object')

@toobaz
Copy link
Member

toobaz commented Nov 8, 2018

The idea is that this would be a keyword of this (to be renamed) to_index method, or a method on its own?

A keyword of this.

This is kind of

Right, I had missed this. Basically idx.map(lambda x: "{0} of {1}".format(*x)) would just become idx.to_flat("{0} of {1}".format). So maybe this would be a bit redundant. Still, sep seems too restrictive to me, and this added sugar candy would have a very low implementation cost.

pandas/core/indexes/base.py Outdated Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Nov 11, 2018

after reading #23141 maybe we should call this .squeeze()?

@jreback jreback mentioned this pull request Nov 11, 2018
4 tasks
@TomAugspurger
Copy link
Contributor

This is a bit different from the squeeze semantics.

squeeze will reduce dimensionality if possible (only one item on that axis).

.flatten / .to_index / whatever will always reduce the dimensionality.

@jreback
Copy link
Contributor

jreback commented Nov 11, 2018

from #22866 (comment)

ahh right, ok then.

@toobaz
Copy link
Member

toobaz commented Nov 11, 2018

This is a bit different from the squeeze semantics.

squeeze will reduce dimensionality if possible (only one item on that axis).

.flatten / .to_index / whatever will always reduce the dimensionality.

Just for pickiness: the two differ in fact also for the "only one item on that axis" case, as the former will return the labels as items, while the latter will return length one tuples as items.

@WillAyd
Copy link
Member Author

WillAyd commented Nov 13, 2018

Renamed to to_flat_index and removed sep parameter based off feedback above

@toobaz
Copy link
Member

toobaz commented Nov 13, 2018

Seems good to me

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we merge?

('bar', 'baz'), ('bar', 'qux')],
dtype='object')
"""
if not isinstance(self, ABCMultiIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idle style question (no need to change here): what are people's preferences on

  1. Defining a method for both Index and MultiIndex here, but using an isnstance to figure out which one we're working with
  2. Defining a default implementation here that's return self, and overriding in MultiIndex, so no if isintance?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preference for 2, and in general, for leaving the base Index class unaware (every time it is possible/reasonable) of the existence and specificities of Index subclasses.

@WillAyd
Copy link
Member Author

WillAyd commented Nov 13, 2018

The inheritance model does seem to make more sense here so I've updated the latest commit to reflect that

@WillAyd WillAyd changed the title Add to_index method to MultiIndex Add to_flat_index method to MultiIndex Nov 13, 2018
@TomAugspurger
Copy link
Contributor

Thanks @WillAyd!

@TomAugspurger TomAugspurger merged commit 454ecfc into pandas-dev:master Nov 13, 2018

.. versionadded:: 0.24.0

This is implemented for compatability with subclass implementations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compatability -> compatibility

By the way: reusing the same docstring would be a + for me. I would rather describe the actual operation (so on MuiltiIndex) and then say that it is idempotent on flat Indexes.

@WillAyd WillAyd deleted the mi-to-index branch November 13, 2018 16:58
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
thoo added a commit to thoo/pandas that referenced this pull request Nov 15, 2018
* upstream/master: (25 commits)
  DOC: Delete trailing blank lines in docstrings. (pandas-dev#23651)
  DOC: Change release and whatsnew (pandas-dev#21599)
  DOC: Fix format of the See Also descriptions (pandas-dev#23654)
  DOC: update pandas.core.groupby.DataFrameGroupBy.resample docstring. (pandas-dev#20374)
  ENH: Allow export of mixed columns to Stata strl (pandas-dev#23692)
  CLN: Remove unnecessary code (pandas-dev#23696)
  Pin flake8-rst version (pandas-dev#23699)
  Implement _most_ of the EA interface for DTA/TDA (pandas-dev#23643)
  CI: raise clone depth limit on CI
  BUG: Fix Series/DataFrame.rank(pct=True) with more than 2**24 rows (pandas-dev#23688)
  REF: Move Excel names parameter handling to CSV (pandas-dev#23690)
  DOC: Accessing files from a S3 bucket. (pandas-dev#23639)
  Fix errorbar visualization (pandas-dev#23674)
  DOC: Surface / doc mangle_dupe_cols in read_excel (pandas-dev#23678)
  DOC: Update is_sparse docstring (pandas-dev#19983)
  BUG: Fix read_excel w/parse_cols & empty dataset (pandas-dev#23661)
  Add to_flat_index method to MultiIndex (pandas-dev#22866)
  CLN: Move to_excel to generic.py (pandas-dev#23656)
  TST: IntervalTree.get_loc_interval should return platform int (pandas-dev#23660)
  CI: Allow to compile docs with ipython 7.11 pandas-dev#22990 (pandas-dev#23655)
  ...
thoo added a commit to thoo/pandas that referenced this pull request Nov 15, 2018
…fixed

* upstream/master:
  DOC: Delete trailing blank lines in docstrings. (pandas-dev#23651)
  DOC: Change release and whatsnew (pandas-dev#21599)
  DOC: Fix format of the See Also descriptions (pandas-dev#23654)
  DOC: update pandas.core.groupby.DataFrameGroupBy.resample docstring. (pandas-dev#20374)
  ENH: Allow export of mixed columns to Stata strl (pandas-dev#23692)
  CLN: Remove unnecessary code (pandas-dev#23696)
  Pin flake8-rst version (pandas-dev#23699)
  Implement _most_ of the EA interface for DTA/TDA (pandas-dev#23643)
  CI: raise clone depth limit on CI
  BUG: Fix Series/DataFrame.rank(pct=True) with more than 2**24 rows (pandas-dev#23688)
  REF: Move Excel names parameter handling to CSV (pandas-dev#23690)
  DOC: Accessing files from a S3 bucket. (pandas-dev#23639)
  Fix errorbar visualization (pandas-dev#23674)
  DOC: Surface / doc mangle_dupe_cols in read_excel (pandas-dev#23678)
  DOC: Update is_sparse docstring (pandas-dev#19983)
  BUG: Fix read_excel w/parse_cols & empty dataset (pandas-dev#23661)
  Add to_flat_index method to MultiIndex (pandas-dev#22866)
  CLN: Move to_excel to generic.py (pandas-dev#23656)
  TST: IntervalTree.get_loc_interval should return platform int (pandas-dev#23660)
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add method to flatten all multi-index levels
7 participants