Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pivot table margins brittleness #3334

Closed
wesm opened this issue Apr 13, 2013 · 3 comments
Closed

Pivot table margins brittleness #3334

wesm opened this issue Apr 13, 2013 · 3 comments
Labels
Milestone

Comments

@wesm
Copy link
Member

wesm commented Apr 13, 2013

I'm trying to find the most efficient way to tabulate responses to a survey using pandas. Here's a sample survey:

df = DataFrame({'Response' : ['Y', 'N' ,'N', 'Y', 'Y', 'N'],
                'Type' : ['A', 'A', 'B', 'B', 'B', 'C']})

I want to count the number of responses of each value (Y or N), with the different types across the top.

  1. A simple way to do this is
df.groupby('Type')['Response'].value_counts().unstack(level=0)

Type     A      B        C
N       1       1       1
Y       1       2       NaN

This is almost what I want, except I'd like row and column totals as well. Of course, I could calculate these manually, but that seems ugly. What I'd really like to do is create a pivot table with margins=True, but this isn't quite as easy as I thought it would be.

  1. Again without totals, I can use
    pivot_table(df, rows='Response',cols='Type',aggfunc=len)

This produces almost the same output as above, though now the index has a name:

Type            A       B       C
Response
N                       1       1        1
Y                       1       2       NaN

So far so good, but adding margins=True results in an error.

  1. To get exactly what I want, the following works, but it seems there should be a nicer way:
df['Count'] = 1
pivot_table(df, rows='Response',cols='Type',values='Count',aggfunc=len,margins=True)     # Can also use sum

I have three questions:

  1. Why does adding 'margins=True' in reindex_like function #2 above result in an error, while it works in Binary operations on int DataMatrix #3? I'm not sure I understand what's happening here.

  2. If this behavior is intentional, is there a better way to get the output I'm looking for, without introducing an extra, unwanted column?

  3. How do I remove the name of the index from a dataframe (to get the output from reindex_like function #2 to look the same as in Enable element-wise comparison operations in DataMatrix objects #1)?

@guyrt
Copy link
Contributor

guyrt commented Jul 3, 2013

I've traced this bug to the type of index we have on the columns of the pivoted table. The pivoted table is first constructed here:

https://github.com/pydata/pandas/blob/master/pandas/tools/pivot.py#L100

If the rows and columns account for the entire frame, then the index is a single level index. Otherwise, it's multilevel with nlevels=2. This difference causes problems later when we group by column:

https://github.com/pydata/pandas/blob/master/pandas/tools/pivot.py#L156

Rather than grouping by level 0 of a multi-index, we are grouping by the only level in a single index. This picks up a particular value in a column rather than the level 0 value in the index, which was the name of a column.

I'll try to push a fix tomorrow or the next day pending time to work on it. We may need to handle this case specially.

@guyrt
Copy link
Contributor

guyrt commented Jul 3, 2013

I guess the most obvious question is whether a suitable fix is to append an extra column of ones if rows + cols is the entire DataFrame. Thoughts?

guyrt added a commit to guyrt/pandas that referenced this issue Jul 25, 2013
…_table

Adds support for margin computation when all columns are used in rows and cols.
guyrt added a commit to guyrt/pandas that referenced this issue Aug 1, 2013
Adds support for margin computation when all columns are used in rows and cols
guyrt added a commit to guyrt/pandas that referenced this issue Aug 1, 2013
…_table

Adds support for margin computation when all columns are used in rows and cols.
guyrt added a commit to guyrt/pandas that referenced this issue Aug 1, 2013
…_table

Adds support for margin computation when all columns are used in rows and cols.
jreback pushed a commit that referenced this issue Aug 12, 2013
Adds support for margin computation when all columns are used in rows and cols.

DOC: Fixed release notes

BUG: Fixed failing test due to use of basestring in python 3
@jreback
Copy link
Contributor

jreback commented Aug 12, 2013

closed via #4432

@jreback jreback closed this as completed Aug 12, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants