DataFrame.describe(include=['O']) should include categorical columns #16722

Closed
tdpetrou opened this issue Jun 19, 2017 · 7 comments · Fixed by #17789
Labels
Categorical (Categorical Data Type), Docs, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@tdpetrou
Contributor

Code Sample, a copy-pastable example if possible

In [1]: df = pd.DataFrame({'cat': pd.Categorical(['a', 'b', 'c']),
                           'obj': ['d', 'e', 'f'],
                           'num': [1, 2, 3],
                           'bool': [True, False, True],
                           'dt': [pd.Timestamp('2010-1-1'), pd.Timestamp('2010-2-1'), pd.Timestamp('2010-3-1')]})

In [2]: df.describe(include=['O'])
Out[2]:        
       obj
count    3
unique   3
top      f
freq     1

Problem description

For me it makes sense that when using describe on object data types, categorical data types should be included as well in the output. The docstrings even use the word categorical: "To limit it instead to categorical objects submit the numpy.object data type." It might make sense to add booleans and datetimes as well.

Expected Output

The result should mimic the output of df.describe(include=['O', 'category'])

       cat obj
count    3   3
unique   3   3
top      c   f
freq     1   1

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.13.0
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.3.0.post

@jreback
Contributor

jreback commented Jun 19, 2017

why? you are mixing 2 distinct and different data types. Better to be much more explicit.

@tdpetrou
Contributor Author

The types are very closely related and they have the same output (count, unique, top and freq). Perhaps the bigger issue is with the wording of the docstrings. It literally uses the word 'categorical'. Maybe there could be an addition of a new word for 'category-like' data which would include object, category, bool and datetime.

@jreback
Contributor

jreback commented Jun 19, 2017

yeah a re-wording of the doc-string with an example (similar to yours) would be helpful. a PR would be great.

jreback added the Categorical, Difficulty Novice, Docs, and Reshaping labels on Jun 19, 2017
jreback added this to the Next Major Release milestone on Jun 19, 2017
@tdpetrou
Contributor Author

There is still some inconsistency. np.number refers to multiple distinct and different data types. I was hoping there was a similar reference that would capture both categorical and object since they are essentially the same.

Also, if the DataFrame consists of only categorical and object columns, then describe will output both by default.
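The default behavior mentioned above can be checked with a small sketch. The column names and values are made up for illustration, and the object column is built with an explicit `dtype=object` so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Illustrative frame with only categorical and object columns (no numeric ones).
df = pd.DataFrame({
    "cat": pd.Categorical(["a", "b", "c"]),
    "obj": pd.Series(["d", "e", "f"], dtype=object),
})

# With no `include` argument and no numeric columns to fall back on,
# describe() summarizes every column in the frame.
default_summary = df.describe()
print(default_summary)
```

Both columns appear in the result, each with the count, unique, top, and freq rows.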

@jreback
Contributor

jreback commented Jun 19, 2017

np.number is a composite dtype. There is nothing that captures both np.object and category.

@TomAugspurger
Contributor

Agreed that we should clarify the docs here, and not change the behavior.

It is inconsistent that np.number selects all its subtypes, but object doesn't, but that's an unfortunate consequence of the type system we have to work with. I think the ability to select object and categorical dtypes independently is more useful.
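A minimal sketch of the inconsistency described here, using `DataFrame.select_dtypes` to show which columns each selector picks up. The frame and its column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame: two numeric subtypes, one categorical, one object column.
df = pd.DataFrame({
    "i": np.array([1, 2, 3], dtype="int64"),
    "f": np.array([1.0, 2.0, 3.0], dtype="float64"),
    "cat": pd.Categorical(["a", "b", "c"]),
    "obj": pd.Series(["d", "e", "f"], dtype=object),
})

# np.number is an abstract parent type, so it matches both int64 and float64.
numeric_cols = list(df.select_dtypes(include=[np.number]).columns)

# "object" matches only the object column; the categorical one is left out.
object_cols = list(df.select_dtypes(include=["object"]).columns)

# Categorical columns have to be requested by name.
category_cols = list(df.select_dtypes(include=["category"]).columns)

print(numeric_cols, object_cols, category_cols)
```

Selecting the two dtypes independently, as this comment prefers, means a caller who wants both has to list both.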

@tdpetrou
Contributor Author

Agreed, behavior should stay the same. Still wish there was something analogous to np.number.
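Since pandas has no built-in selector covering both dtypes, one user-side workaround is to keep the list in a constant of your own. The name `CATEGORY_LIKE` below is hypothetical and not part of pandas; this is only a sketch of the explicit `include` list discussed in this thread.

```python
import pandas as pd

# Hypothetical user-defined constant; pandas does not provide this name.
CATEGORY_LIKE = ["object", "category"]

df = pd.DataFrame({
    "cat": pd.Categorical(["a", "b", "c"]),
    "obj": pd.Series(["d", "e", "f"], dtype=object),
    "num": [1, 2, 3],
})

# Explicitly request both dtypes; the numeric column is excluded.
summary = df.describe(include=CATEGORY_LIKE)
print(summary)
```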


3 participants