ENH: pd.factorize to accept a Dataframe #8819

Closed
jreback opened this issue Nov 14, 2014 · 3 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Closing Candidate May be closeable, needs more eyeballs Enhancement

Comments

@jreback
Contributor

jreback commented Nov 14, 2014

from #8626 discussion
cc @fkaufer

Allow pd.factorize to take a DataFrame and process its rows as tuples of column values.
Further, allow DataFrame.factorize with a subset argument, which just calls pd.factorize.

The implementation is below (i.e. simple fast-zipping, then factorizing after dense conversion);
it just needs some tests.

In [41]: df = pd.DataFrame({'A':['a1','a1','a2','a2','a1'], 'B':['b1','b2','b1','b2','b1']})

In [42]: df
Out[42]: 
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
4  a1  b1

In [43]: cols_as_tuples = pd.lib.fast_zip([df[col].get_values() for col in df.columns])

In [44]: cols_as_tuples 
Out[44]: array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2'), ('a1', 'b1')], dtype=object)

In [47]: pd.factorize(cols_as_tuples)
Out[47]: 
(array([0, 1, 2, 3, 0]),
 array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2')], dtype=object))

In [48]: pd.Categorical(cols_as_tuples)
Out[48]: 
[(a1, b1), (a1, b2), (a2, b1), (a2, b2), (a1, b1)]
Categories (4, object): [(a1, b1) < (a1, b2) < (a2, b1) < (a2, b2)]

In [59]: pd.Categorical(df.to_records(index=False))
Out[59]: 
[(a1, b1), (a1, b2), (a2, b1), (a2, b2), (a1, b1)]
Categories (4, object): [(a1, b1) < (a1, b2) < (a2, b1) < (a2, b2)]
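
For readers on current pandas: `pd.lib.fast_zip` and `get_values()` from the session above are no longer available. The following is a minimal sketch of the same row-wise factorization using only public API. The helper name `factorize_rows` and its `subset` parameter are hypothetical, chosen to mirror the proposal; this is not an existing pandas function. It assumes the key columns contain no missing values.

```python
import pandas as pd


def factorize_rows(df, subset=None):
    """Hypothetical helper: label each distinct row (over `subset` columns).

    Returns (codes, uniques), where codes[i] is the position in `uniques`
    of row i. Assumes no missing values in the key columns.
    """
    cols = list(df.columns) if subset is None else list(subset)
    # sort=False numbers the groups in order of first appearance
    codes = df.groupby(cols, sort=False).ngroup().to_numpy()
    # distinct rows, also in order of first appearance
    uniques = df[cols].drop_duplicates().reset_index(drop=True)
    return codes, uniques


df = pd.DataFrame({'A': ['a1', 'a1', 'a2', 'a2', 'a1'],
                   'B': ['b1', 'b2', 'b1', 'b2', 'b1']})
codes, uniques = factorize_rows(df)
print(codes)    # one integer label per row
print(uniques)  # the distinct (A, B) combinations
```

Unlike the tuple-based version, the uniques come back as a DataFrame rather than an object array of tuples, which avoids the dense conversion step.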
@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design Categorical Categorical Data Type labels Nov 14, 2014
@jreback jreback added this to the 0.16.0 milestone Nov 14, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@simonjayhawkins
Member

@jreback: just a thought. After reading #12860: if factorize were to be implemented on a DataFrame, it would be necessary to distinguish between sharing a category between columns and creating a category across columns. Using a subset argument alone may be insufficient.
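
One possible reading of that distinction, sketched with made-up data (the frame and variable names below are illustrative assumptions, not taken from #12860): "sharing" builds a single code table used by every column, while "across" assigns one code per distinct row.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({'A': ['x', 'y', 'x'], 'B': ['y', 'z', 'z']})

# Shared between columns: factorize all values together, so 'y' gets
# the same code whether it appears in column A or column B.
flat_codes, shared_uniques = pd.factorize(df.to_numpy().ravel())
shared = pd.DataFrame(flat_codes.reshape(df.shape),
                      index=df.index, columns=df.columns)

# Across columns: one code per distinct (A, B) combination,
# numbered in order of first appearance.
across = df.groupby(list(df.columns), sort=False).ngroup()

print(shared)
print(across)
```

The first result has the same shape as the frame, the second has one label per row, so a single `subset` argument cannot select between the two behaviours.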

@mroeschke mroeschke added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement and removed Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 28, 2020
@mroeschke mroeschke removed API Design Categorical Categorical Data Type labels Apr 11, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel
Member

Not obvious to me what the use case is here.

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Feb 11, 2023
@mroeschke
Member

Yeah, agreed. Especially since factorize has centralized around 1D input, I think this wouldn't fit well. Closing.


4 participants