crosstab in 0.23.4 does not respect categorical variables #22453

andrewcassidy · 2018-08-21T22:23:10Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
from pandas import crosstab

a = np.array(["foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "foo", "foo", "foo"], dtype=object)
b = np.array(["one", "one", "one", "two", "one", "one", "one", "two", "two", "two", "one"], dtype=object)
c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny", "shiny", "dull", "shiny", "shiny", "shiny"], dtype=object)
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
crosstab(foo, bar)

col_0  d  e
row_0      
a      1  0
b      0  1

Problem description

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html

The example in the docs does not work: the unused categories (c and f) are missing from the output.

Expected Output

col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: 0.6.0
pandas_datareader: None

WillAyd · 2018-08-21T22:55:28Z

Can you try on master? The examples from the docs worked fine for me and I think we have CI to check that as well

jorisvandenbossche · 2018-08-22T09:10:08Z

Example run on master:

In [13]: foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])

In [14]: bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

In [15]: pd.crosstab(foo, bar)
Out[15]: 
col_0  d  e
row_0      
a      1  0
b      0  1

jorisvandenbossche · 2018-08-22T09:21:36Z

This seems to be a regression in 0.23.0: on 0.22 I still get the correct result. (On pandas 0.20 it also only partially works; only the index is correct.)

jorisvandenbossche · 2018-08-22T09:22:21Z

@andrewcassidy if you have time, you are very welcome to look into what the cause could be.

kokes · 2018-08-28T06:57:40Z

Hi there, a handy bisect led me to this commit: b020891, which references PR #20583 (in short: there is no reindexing of categoricals now, so empty groups are omitted). This appears to be intentional behaviour wherein empty groups are dropped in a groupby, which affects other operations that use groupby under the hood.

They don't mention crosstab specifically, but they do deal with a very similar issue in pivot_table. The discussion is quite lengthy, so I don't know what the outcome is precisely (maybe one of those participants will shed some light on this matter), but I managed to reproduce the desired output using pandas 0.23.3, by adding dropna=False to the crosstab call.

In [1]: import numpy as np
   ...: import pandas as pd
   ...: from pandas import crosstab
   ...: 
   ...: a = np.array(["foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "foo", "foo", "foo"], dtype=object)
   ...: b = np.array(["one", "one", "one", "two", "one", "one", "one", "two", "two", "two", "one"], dtype=object)
   ...: c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny", "shiny", "dull", "shiny", "shiny", "shiny"], dtype=object)
   ...: foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
   ...: bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

In [2]: crosstab(foo, bar)
Out[2]: 
col_0  d  e
row_0      
a      1  0
b      0  1

In [3]: crosstab(foo, bar, dropna=False)
Out[3]: 
col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0
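
The groupby behaviour described above (empty categorical groups being dropped) can be illustrated in isolation. This is only a minimal sketch: it assumes the `observed` keyword of `groupby` is the mechanism in question, and the frame `df` and its column names are made up for illustration. It does not show what crosstab does internally.

```python
import pandas as pd

# Hypothetical frame: category "c" is declared but never observed in the data.
df = pd.DataFrame({
    "key": pd.Categorical(["a", "b"], categories=["a", "b", "c"]),
    "value": [1, 2],
})

# observed=False keeps every declared category, including the empty group "c".
all_groups = df.groupby("key", observed=False).size()

# observed=True keeps only the categories that actually occur in the data.
observed_only = df.groupby("key", observed=True).size()

print(list(all_groups.index))
print(list(observed_only.index))
```

Any operation built on a groupby that keeps only observed groups would lose the empty category in the same way, which matches the missing c row and f column in Out[2].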

kokes · 2018-08-28T07:15:08Z

By the way, not only does the example not work, but the documentation preceding it does not reflect the current functionality either:

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

This was already pointed out in a related (open!) issue, #16367.
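
If the full category grid is needed regardless of how crosstab treats unused categories, one option is to reindex the result against the declared categories. This is a sketch of a possible workaround, not something proposed in this thread; the names `table` and `full` are illustrative.

```python
import pandas as pd

foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

# The result may or may not include the unused categories, depending on
# the pandas version and the dropna setting.
table = pd.crosstab(foo, bar)

# Force every declared category onto both axes; missing cells become 0.
full = table.reindex(index=foo.categories, columns=bar.categories, fill_value=0)

print(full)
```

The reindex is a no-op when crosstab already returns all categories, so the same code should behave consistently across versions.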

MarcoGorelli · 2021-02-22T15:27:14Z

Pretty sure this is the same as #16367, so closing in favour of that one (though please do let me know if I've misunderstood).

WillAyd added the Needs Info label Aug 21, 2018

jorisvandenbossche added Bug and removed Needs Info labels Aug 22, 2018

jorisvandenbossche added Regression and removed Bug labels Aug 22, 2018

jorisvandenbossche added this to the 0.23.5 milestone Aug 22, 2018

jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018

jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018

jbrockmendel added Categorical and Reshaping labels Oct 28, 2019

mroeschke added the Bug label May 11, 2020

MarcoGorelli closed this as completed Feb 22, 2021
