ENH: support .astype('category') on DataFrame / aka co-factorization #12860

Closed
jreback opened this issue Apr 11, 2016 · 6 comments · Fixed by #18099
Labels
Categorical Categorical Data Type Enhancement

@jreback
Contributor

jreback commented Apr 11, 2016

xref to #10696, #8709

We don't allow an astype of a DataFrame to category directly:

In [44]: df.astype('category')
NotImplementedError: > 1 ndim Categorical are not supported at this time

Instead you can apply the astype per-column.

In [35]: df = DataFrame({'A' : list('aabcda'), 'B' : list('bcdaae')})

In [36]: df
Out[36]: 
   A  B
0  a  b
1  a  c
2  b  d
3  c  a
4  d  a
5  a  e

In [37]: df.apply(lambda x: x.astype('category'))
Out[37]: 
   A  B
0  a  b
1  a  c
2  b  d
3  c  a
4  d  a
5  a  e

In [38]: df.apply(lambda x: x.astype('category')).B
Out[38]: 
0    b
1    c
2    d
3    a
4    a
5    e
Name: B, dtype: category
Categories (5, object): [a, b, c, d, e]

In [39]: df.apply(lambda x: x.astype('category')).A
Out[39]: 
0    a
1    a
2    b
3    c
4    d
5    a
Name: A, dtype: category
Categories (4, object): [a, b, c, d]

But if the columns have 'similar' categories, you would usually do the following;
the proposal is to astype with the same uniques automatically.

In [41]: uniques = np.sort(pd.unique(df.values.ravel()))

In [42]: df.apply(lambda x: x.astype('category', categories=uniques)).A
Out[42]: 
0    a
1    a
2    b
3    c
4    d
5    a
Name: A, dtype: category
Categories (5, object): [a, b, c, d, e]

In [43]: df.apply(lambda x: x.astype('category', categories=uniques)).B
Out[43]: 
0    b
1    c
2    d
3    a
4    a
5    e
Name: B, dtype: category
Categories (5, object): [a, b, c, d, e]

This is fairly straightforward to actually implement, and I think it is a nice, easy way of coding, without having to support 2-D categoricals internally (and we are moving away from internal 2-D structures anyhow).
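For readers on a newer pandas, where passing `categories=` to `Series.astype` is no longer accepted, the same shared-uniques conversion can be sketched with `CategoricalDtype`. This is only an illustration of the workaround above, not the implementation that closed this issue; the names `shared_dtype` and `converted` are made up for the example.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': list('aabcda'), 'B': list('bcdaae')})

# Collect the uniques across ALL columns, as in In [41] above.
uniques = np.sort(pd.unique(df.values.ravel()))

# One dtype object carries the shared categories (illustrative name).
shared_dtype = CategoricalDtype(categories=uniques)

# Convert column by column so every column uses the same categories.
converted = pd.DataFrame({col: df[col].astype(shared_dtype) for col in df.columns})

print(converted['A'].cat.categories.tolist())
print(converted['B'].cat.categories.tolist())
```

Because both columns share one dtype, a given label maps to the same integer code in `A` and in `B`.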

@jreback jreback added this to the Next Major Release milestone Apr 11, 2016
@jreback
Contributor Author

jreback commented Apr 11, 2016

@jankatins
Contributor

I'm not sure what you are proposing here. Is it that df.astype('category', ...) would internally be mapped to df.apply(lambda x: x.astype('category', ...))?

For my use case, df.astype(...) is not necessary: I usually have different types of columns, and doing an astype on the complete df would just destroy the df... But if others have the need...

My use case is more:

lickert_columns= [...] # a few of the columns in my df
for col in lickert_columns:
    df[col] = df[col].astype("category", categories=lickert_scale, ordered=True)
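A runnable variant of that loop, as a sketch only: the survey data, column names, and scale labels below are invented for illustration, and `CategoricalDtype` stands in for the `categories=`/`ordered=` keywords, which newer pandas versions no longer accept on `astype`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical survey data: two Likert items plus an unrelated numeric column.
df = pd.DataFrame({
    'q1': ['agree', 'disagree', 'neutral'],
    'q2': ['neutral', 'agree', 'agree'],
    'age': [31, 45, 27],
})

likert_scale = ['disagree', 'neutral', 'agree']   # assumed ordering, low to high
likert_dtype = CategoricalDtype(categories=likert_scale, ordered=True)

likert_columns = ['q1', 'q2']   # only a few of the columns in the frame
for col in likert_columns:
    df[col] = df[col].astype(likert_dtype)

# Ordered categories support comparisons against a label on the scale.
print(df['q1'] >= 'neutral')
```

The `age` column is left untouched, which is the point of converting a chosen subset rather than the whole frame.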

@jreback
Contributor Author

jreback commented Apr 11, 2016

Well, you would oftentimes do this on a subset, I think, e.g. df[['A','B']].astype(...)

The reason I bring this up is the question of whether we should form the uniques FIRST, before conversion. IOW, if categories=None is passed (which is the default), then we would create the categories explicitly from ALL the passed values, as opposed to creating them individually per column.
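To make the difference concrete, here is a minimal sketch of the two behaviours. The helper `astype_shared_category` is hypothetical (it is not a pandas function), and the two-column frame is invented so that the per-column codes visibly disagree.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def astype_shared_category(frame, categories=None):
    """Hypothetical helper: convert every column to one shared categorical dtype."""
    if categories is None:
        # Form the uniques FIRST, from ALL values in the frame.
        categories = sorted(pd.unique(frame.values.ravel()))
    dtype = CategoricalDtype(categories=categories)
    return pd.DataFrame(
        {col: frame[col].astype(dtype) for col in frame.columns},
        index=frame.index,
    )


df = pd.DataFrame({'A': ['x', 'y'], 'B': ['y', 'z']})

# Per-column conversion: each column only knows its own labels.
per_column = {col: df[col].astype('category').cat.codes.tolist() for col in df.columns}

# Shared conversion: one category list built from the whole frame.
shared = astype_shared_category(df)
shared_codes = {col: shared[col].cat.codes.tolist() for col in shared.columns}

print(per_column)
print(shared_codes)
```

With per-column categories the label 'y' gets code 1 in `A` but code 0 in `B`; with shared categories it gets the same code in both columns.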

@jankatins
Contributor

IMO constructing the categories from all uniques makes sense.

[How would one merge these subsets back into the original DF? By dropping the old columns and merging the new ones back in? That sounds like a lot of work, which ends up as long as the for loop?]

@jreback
Contributor Author

jreback commented Oct 16, 2017

example from SO

Here is a complete example

In [9]: np.random.seed(1234)

In [10]: import string

In [11]: df = pd.DataFrame([np.random.choice(list(string.ascii_lowercase), 10) for i in range(5)])

In [12]: df
Out[12]: 
   0  1  2  3  4  5  6  7  8  9
0  p  t  g  v  m  u  y  z  p  r
1  x  j  l  m  w  y  q  f  q  j
2  w  p  s  q  m  f  c  g  d  h
3  l  a  j  l  q  d  c  t  m  b
4  l  t  l  r  o  t  h  k  l  o

In [14]: b = pd.unique(df.values.T.reshape(-1, ))
    ...: df.apply(lambda x: pd.Categorical(x, b).codes)
    ...: 
Out[14]: 
   0  1  2   3   4   5   6   7   8   9
0  0  4  7   9  10  14  15  20   0  12
1  1  5  3  10   2  15  11  16  11   5
2  2  0  8  11  10  16  18   7  17  19
3  3  6  5   3  11  17  18   4  10  22
4  3  4  3  12  13   4  19  21   3  13
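The same codes can also be obtained in a single pass by factorizing the flattened values and reshaping, which is one way to read "co-factorization". This is a sketch under that assumption, using `pd.factorize` on a tiny hand-made frame; it is not the code path that pandas itself uses.

```python
import pandas as pd

df = pd.DataFrame([['p', 't'],
                   ['x', 'p']])

# Flatten column by column (transpose, then ravel), matching the order
# used for `b` above, and factorize all values at once.
flat = df.values.T.ravel()
codes, uniques = pd.factorize(flat)

# Reshape the flat codes back to the frame's layout.
n_rows, n_cols = df.shape
cofactorized = pd.DataFrame(
    codes.reshape(n_cols, n_rows).T,
    index=df.index,
    columns=df.columns,
)

# Cross-check against the per-column Categorical approach from the thread.
b = pd.unique(flat)
via_categorical = df.apply(lambda x: pd.Categorical(x, categories=b).codes)

print(cofactorized)
```

Both routes assign codes in order of first appearance when reading the frame column by column, so the two results agree value for value.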

@jreback
Contributor Author

jreback commented Oct 16, 2017

Note this can actually be implemented in a more performant way via https://github.com/pandas-dev/pandas/blob/master/pandas/core/reshape/merge.py#L1453

@jreback jreback changed the title ENH: support .astype('category') on DataFrame ENH: support .astype('category') on DataFrame / aka co-factorization Oct 16, 2017
@jreback jreback modified the milestones: Next Major Release, 0.23.0 Feb 24, 2018
2 participants