ENH: support .astype('category') on DataFrame / aka co-factorization #12860

jreback · 2016-04-11T14:50:42Z

We don't allow an astype of a DataFrame to category directly:

In [44]: df.astype('category')
NotImplementedError: > 1 ndim Categorical are not supported at this time

Instead you can apply the astype per-column.

In [35]: df = DataFrame({'A' : list('aabcda'), 'B' : list('bcdaae')})

In [36]: df
Out[36]: 
   A  B
0  a  b
1  a  c
2  b  d
3  c  a
4  d  a
5  a  e

In [37]: df.apply(lambda x: x.astype('category'))
Out[37]: 
   A  B
0  a  b
1  a  c
2  b  d
3  c  a
4  d  a
5  a  e

In [38]: df.apply(lambda x: x.astype('category')).B
Out[38]: 
0    b
1    c
2    d
3    a
4    a
5    e
Name: B, dtype: category
Categories (5, object): [a, b, c, d, e]

In [39]: df.apply(lambda x: x.astype('category')).A
Out[39]: 
0    a
1    a
2    b
3    c
4    d
5    a
Name: A, dtype: category
Categories (4, object): [a, b, c, d]

But if the columns have 'similar' categories, then you would usually want to do this instead,
automatically astyping every column with the same uniques.

In [41]: uniques = np.sort(pd.unique(df.values.ravel()))

In [42]: df.apply(lambda x: x.astype('category', categories=uniques)).A
Out[42]: 
0    a
1    a
2    b
3    c
4    d
5    a
Name: A, dtype: category
Categories (5, object): [a, b, c, d, e]

In [43]: df.apply(lambda x: x.astype('category', categories=uniques)).B
Out[43]: 
0    b
1    c
2    d
3    a
4    a
5    e
Name: B, dtype: category
Categories (5, object): [a, b, c, d, e]

This is fairly straightforward to actually implement, and I think it is a nice, easy way of coding this without having to support 2D categoricals internally (and we are moving away from internal 2-D structures anyhow).

jreback · 2016-04-11T14:51:08Z

@TomAugspurger @sinhrks @shoyer @jorisvandenbossche
cc @JanSchulz

jankatins · 2016-04-11T17:52:40Z

I'm not so sure what you are proposing here? That df.astype('category' , ...) would internally be mapped to df.apply(lambda x: x.astype('category', ...))?

For my use case df.astype(...) is not necessary: I usually have different types of columns, and doing an astype on the complete df would just destroy the df... But if others have the need...

My usecase is more:

lickert_columns= [...] # a few of the columns in my df
for col in lickert_columns:
    df[col] = df[col].astype("category", categories=lickert_scale, ordered=True)

jreback · 2016-04-11T18:32:44Z

well, you would oftentimes do this on a sub-set I think, e.g. df[['A','B']].astype(...)

the reason I bring this up is whether we should form the uniques FIRST before conversions, IOW

if categories=None is passed (which is the default), then we would create the categories explicitly from ALL the passed values.

As opposed to creating them individually per-column.
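
For illustration, a minimal sketch of the "uniques first" behaviour described above. It is written against the current public API (`pandas.api.types.CategoricalDtype`) rather than the `categories=` keyword used elsewhere in this thread, and it is not the implementation being proposed; the variable names `shared` and `converted` are only for this example.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': list('aabcda'), 'B': list('bcdaae')})

# Form the uniques FIRST, from every value in the frame.
uniques = np.sort(pd.unique(df.to_numpy().ravel()))
shared = CategoricalDtype(categories=uniques)

# Then convert each column with the same dtype, so that a given value
# gets the same code in every column.
converted = df.apply(lambda col: col.astype(shared))

print(converted['A'].cat.categories.tolist())
print(converted['A'].cat.codes.tolist())
print(converted['B'].cat.codes.tolist())
```

With per-column categories, column A would only know about a, b, c and d; here both columns share all five.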

jankatins · 2016-04-11T18:54:13Z

IMO constructing the categories from all uniques makes sense.

[How would one merge these subsets back into the original DF? Dropping the old columns and merging the new ones back in? Sounds like a lot of work, which ends up as long as the for loop?]
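
One possible way to convert only a subset without dropping and re-merging columns, sketched under the assumption that a dict of column name to dtype is passed to `DataFrame.astype`. The column `n` and the list `cols` are hypothetical and only there to show that other columns are left alone.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    'A': list('aabcda'),
    'B': list('bcdaae'),
    'n': range(6),  # a non-categorical column that must stay untouched
})

cols = ['A', 'B']  # hypothetical subset to convert

# Shared categories built from the subset only.
shared = CategoricalDtype(sorted(pd.unique(df[cols].to_numpy().ravel())))

# astype with a mapping converts the named columns and leaves the rest as-is.
df = df.astype({col: shared for col in cols})

print(df.dtypes)
```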

jreback · 2017-10-16T10:07:36Z

example from SO

Here is a complete example

In [9]: np.random.seed(1234)

In [10]: import string

In [11]: df = pd.DataFrame([np.random.choice(list(string.ascii_lowercase), 10) for i in range(5)])

In [12]: df
Out[12]: 
   0  1  2  3  4  5  6  7  8  9
0  p  t  g  v  m  u  y  z  p  r
1  x  j  l  m  w  y  q  f  q  j
2  w  p  s  q  m  f  c  g  d  h
3  l  a  j  l  q  d  c  t  m  b
4  l  t  l  r  o  t  h  k  l  o

In [14]: b = pd.unique(df.values.T.reshape(-1, ))
    ...: df.apply(lambda x: pd.Categorical(x, b).codes)
Out[14]: 
   0  1  2   3   4   5   6   7   8   9
0  0  4  7   9  10  14  15  20   0  12
1  1  5  3  10   2  15  11  16  11   5
2  2  0  8  11  10  16  18   7  17  19
3  3  6  5   3  11  17  18   4  10  22
4  3  4  3  12  13   4  19  21   3  13

jreback · 2017-10-16T10:09:25Z

Note this can actually be implemented in a more performant way via https://github.com/pandas-dev/pandas/blob/master/pandas/core/reshape/merge.py#L1453
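
As a rough sketch of what co-factorization means here, using only the public `pd.factorize` and not the internal routine linked above: factorize all values in one pass, then reshape the codes back to the frame's layout. Flattening in column-major order is an assumption chosen so that codes follow first appearance down each column, as in the `pd.unique(df.values.T.reshape(-1, ))` example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('aabcda'), 'B': list('bcdaae')})

# Flatten column by column (Fortran order) and factorize once.
flat = df.to_numpy().ravel(order='F')
codes, uniques = pd.factorize(flat)

# Put the codes back into the original shape, again column by column.
coded = pd.DataFrame(codes.reshape(df.shape, order='F'),
                     index=df.index, columns=df.columns)

print(coded)
print(list(uniques))
```

This avoids building one Categorical per column, since every value is hashed a single time.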

jreback added the Enhancement, Categorical (Categorical Data Type), and Difficulty Intermediate labels Apr 11, 2016

jreback added this to the Next Major Release milestone Apr 11, 2016

jorisvandenbossche mentioned this issue Sep 18, 2016

Recoding as numerical categories with multiple columns #14242

Closed

jreback changed the title ~~ENH: support .astype('category') on DataFrame~~ ENH: support .astype('category') on DataFrame / aka co-factorization Oct 16, 2017

jreback mentioned this issue Oct 28, 2017

ERR: Fix segfault with .astype('category') on empty DataFrame #18015

Merged

4 tasks

jschendel mentioned this issue Nov 3, 2017

ENH: Implement DataFrame.astype('category') #18099

Merged

4 tasks

jreback modified the milestones: Next Major Release, 0.23.0 Feb 24, 2018

jreback closed this as completed in #18099 Mar 1, 2018

simonjayhawkins mentioned this issue Sep 9, 2018

ENH: pd.factorize to accept a Dataframe #8819

Closed
