Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add method to flatten all multi-index levels #19950

Closed
cbrnr opened this issue Mar 1, 2018 · 8 comments · Fixed by #22866
Closed

Add method to flatten all multi-index levels #19950

cbrnr opened this issue Mar 1, 2018 · 8 comments · Fixed by #22866

Comments

@cbrnr
Copy link
Contributor

cbrnr commented Mar 1, 2018

Sometimes it is useful to flatten all levels of a multi-index. For example, when pivoting data into a wide format, the new columns are generally multi-indexed. However, when exporting to CSV, sometimes it might be desirable to have only one header row.

AFAIK, there is no dedicated method to flatten an existing multi-index. Assuming I want to combine the multi-index levels with an underscore, I would do this to get one header row:

df.columns = ["_".join(v) for v in df.columns.values]

This is not really obvious, and maybe a dedicated MultiIndex method should be added, e.g. df.columns.flatten(sep="_", inplace=False). What do you think?

@jreback
Copy link
Contributor

jreback commented Mar 1, 2018

having a .to_index(sep=None) would be useful, the default would be equivalent of .tolist(), IOW return tuples, while a sep could stringify and join. Note that this is a bit tricky because for example what do you do with a Timestamp?

@cbrnr
Copy link
Contributor Author

cbrnr commented Mar 1, 2018

I like your idea of a to_index method. Good point, specifying a sep character implies that individual entries are converted to strings. How do Timestamp objects behave when I try to str them? If this is not possible, the method could raise an error in this case (and only work for string or int index objects).

@cbrnr
Copy link
Contributor Author

cbrnr commented Mar 1, 2018

Actually, is this even possible in the use case I described? I mean, I was referring to a multi-index created by pivot_table - can a level consist of timestamps?

@jreback
Copy link
Contributor

jreback commented Mar 1, 2018

In [29]: index=pd.MultiIndex.from_product([list('abc'), pd.date_range('20180101', periods=3)])

In [30]: index
Out[30]: 
MultiIndex(levels=[['a', 'b', 'c'], [2018-01-01 00:00:00, 2018-01-02 00:00:00, 2018-01-03 00:00:00]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])


In [32]: pd.Index(['_'.join([str(_) for _ in v]) for v in index.values])
Out[32]: 
Index(['a_2018-01-01 00:00:00', 'a_2018-01-02 00:00:00',
       'a_2018-01-03 00:00:00', 'b_2018-01-01 00:00:00',
       'b_2018-01-02 00:00:00', 'b_2018-01-03 00:00:00',
       'c_2018-01-01 00:00:00', 'c_2018-01-02 00:00:00',
       'c_2018-01-03 00:00:00'],
      dtype='object')

@jreback
Copy link
Contributor

jreback commented Mar 1, 2018

slightly OT: note that .wide_to_long basically does the inverse of this, meaning it splits columns names on regex.

@cbrnr
Copy link
Contributor Author

cbrnr commented Mar 1, 2018

Looks good! So I guess no problem then? I didn't know about .wide_to_long, maybe adding a .long_to_wide would make sense?

Just to motivate this idea: I love how unite and spread from the tidyr R package do exactly what I describe here.

@joelostblom
Copy link
Contributor

FWIW There is a very popular question related to this on SO, so by that metric it seems like quite a few people could benefit from a solution similar to what is suggested here.

@oguzhanogreden
Copy link
Contributor

The answer to the the popular question hints that it's okay to overwrite columns with the result of this method.

That quickly got me here, which doesn't work as I expected with version 0.25.1:

df = pd.DataFrame({"a": [1,1,2,2], "b": [1, 2, 3, 4], "c": [1, 2, 3, 4]})
df_agg = df.groupby("a").agg({"b": ["min", "max"], "c":"sum"})
df_agg[("b", "min")] # works
df_agg.loc[:, ("b", "min")] # works

df_agg.columns = df_agg.columns.to_flat_index()
df_agg[("b", "min")] # works
df_agg.loc[:, ("b", "min")] # won't work.

Do you consider this expected behavior? I'm not very familiar with selector implementations, but my user intuition tells me this should work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants