crosstab in 0.23.4 does not respect categorical variables #22453

Closed
andrewcassidy opened this issue Aug 21, 2018 · 7 comments
Labels
Bug; Categorical (Categorical Data Type); Regression (Functionality that used to work in a prior pandas version); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@andrewcassidy

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

# Arrays copied from the docs example; they are not used in the call below.
a = np.array(["foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "foo", "foo", "foo"], dtype=object)
b = np.array(["one", "one", "one", "two", "one", "one", "one", "two", "two", "two", "one"], dtype=object)
c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny", "shiny", "dull", "shiny", "shiny", "shiny"], dtype=object)
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
pd.crosstab(foo, bar)

col_0  d  e
row_0
a      1  0
b      0  1

Problem description

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html

The example in the docs does not work: the categories that never occur in the data ('c' and 'f') are missing from the output above.

Expected Output

col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: 0.6.0
pandas_datareader: None

@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label Aug 21, 2018
@WillAyd
Member

WillAyd commented Aug 21, 2018

Can you try on master? The examples from the docs worked fine for me and I think we have CI to check that as well

@jorisvandenbossche
Member

Example run on master:

In [13]: foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])

In [14]: bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

In [15]: pd.crosstab(foo, bar)
Out[15]: 
col_0  d  e
row_0      
a      1  0
b      0  1

@jorisvandenbossche jorisvandenbossche added Bug and removed Needs Info Clarification about behavior needed to assess issue labels Aug 22, 2018
@jorisvandenbossche
Member

It seems to be a regression in 0.23.0, since on 0.22 I still get the correct result (on pandas 0.20 it also only partially works: only the index is correct).

@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version and removed Bug labels Aug 22, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.23.5 milestone Aug 22, 2018
@jorisvandenbossche
Member

@andrewcassidy if you have time, you are very welcome to look into what the cause could be.

@kokes
Contributor

kokes commented Aug 28, 2018

Hi there, a handy bisect led me to this commit: b020891, which references PR #20583 (in short: there is no reindexing of categoricals now, so empty groups are omitted). This appears to be intentional behaviour wherein empty groups are dropped in a groupby, which affects other operations that use groupby under the hood.

They don't mention crosstab specifically, but they do deal with a very similar issue in pivot_table. The discussion is quite lengthy, so I don't know what the outcome is precisely (maybe one of those participants will shed some light on this matter), but I managed to reproduce the desired output using pandas 0.23.3, by adding dropna=False to the crosstab call.

In [1]: import numpy as np
   ...: import pandas as pd
   ...: from pandas import crosstab
   ...: 
   ...: 
   ...: a = np.array(["foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "foo", "foo", "foo"], dtype=object)
   ...: b = np.array(["one", "one", "one", "two", "one", "one", "one", "two", "two", "two", "one"], dtype=object)
   ...: c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny", "shiny", "dull", "shiny", "shiny", "shiny"], dtype=object)
   ...: foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
   ...: bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
   ...: 
   ...: 

In [2]: crosstab(foo, bar)
Out[2]: 
col_0  d  e
row_0      
a      1  0
b      0  1

In [3]: crosstab(foo, bar, dropna=False)
Out[3]: 
col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0
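
The groupby behaviour described above (empty categorical groups being dropped) can be illustrated in isolation. This is only a minimal sketch: it uses the `observed` keyword of `DataFrame.groupby`, and the assumption that this is the mechanism crosstab hits under the hood is mine, not something confirmed in this thread. The column names `key` and `val` are made up for the illustration.

```python
import pandas as pd

# A categorical column where category "c" is declared but never occurs.
df = pd.DataFrame({
    "key": pd.Categorical(["a", "b"], categories=["a", "b", "c"]),
    "val": [1, 1],
})

# observed=True keeps only the categories that actually appear in the data.
observed_only = df.groupby("key", observed=True)["val"].sum()

# observed=False keeps every declared category, including the empty one.
all_categories = df.groupby("key", observed=False)["val"].sum()

print(list(observed_only.index))
print(list(all_categories.index))
```

The first result has no entry for "c", while the second one does, which mirrors the difference between Out[2] and Out[3] above.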

@kokes
Contributor

kokes commented Aug 28, 2018

By the way, not only does the example not work, the documentation preceding it does not reflect the current functionality either:

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

This was already pointed out in a related (open!) issue, #16367.
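
For anyone who needs the documented behaviour regardless of the pandas version, one possible approach is to reindex the result against the declared categories. This is a sketch of a workaround that is not proposed anywhere in this thread; it assumes the inputs are `pd.Categorical` objects so that `.categories` is available.

```python
import pandas as pd

foo = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
bar = pd.Categorical(["d", "e"], categories=["d", "e", "f"])

# Whatever crosstab returns, force every declared category onto both axes
# and fill the cells that had no observations with 0.
table = pd.crosstab(foo, bar)
full = table.reindex(index=foo.categories, columns=bar.categories, fill_value=0)

print(full)
```

Because the reindex is explicit, the result does not depend on how `dropna` is interpreted by a given release.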

@jreback jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018
@jbrockmendel jbrockmendel added Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 28, 2019
@mroeschke mroeschke added the Bug label May 11, 2020
@MarcoGorelli
Member

Pretty sure this is the same as #16367, so closing in favour of that one (though please do let me know if I've misunderstood).

8 participants