ENH: pd.factorize to accept a DataFrame #8819

jreback · 2014-11-14T19:36:58Z

from #8626 discussion
cc @fkaufer

Allow pd.factorize to take a DataFrame and process its rows as tuples of column values.
Further, allow DataFrame.factorize with a subset argument, which just calls pd.factorize.

The implementation is below (i.e. simple fast-zipping of the columns, then factorizing after dense conversion).
It just needs some tests.

In [41]: df = pd.DataFrame({'A':['a1','a1','a2','a2','a1'], 'B':['b1','b2','b1','b2','b1']})

In [42]: df
Out[42]: 
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
4  a1  b1

In [43]: cols_as_tuples = pd.lib.fast_zip([df[col].get_values() for col in df.columns])

In [44]: cols_as_tuples 
Out[44]: array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2'), ('a1', 'b1')], dtype=object)

In [47]: pd.factorize(cols_as_tuples)
Out[47]: 
(array([0, 1, 2, 3, 0]),
 array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2')], dtype=object))

In [48]: pd.Categorical(cols_as_tuples)
Out[48]: 
[(a1, b1), (a1, b2), (a2, b1), (a2, b2), (a1, b1)]
Categories (4, object): [(a1, b1) < (a1, b2) < (a2, b1) < (a2, b2)]

In [59]: pd.Categorical(df.to_records(index=False))
Out[59]: 
[(a1, b1), (a1, b2), (a2, b1), (a2, b2), (a1, b1)]
Categories (4, object): [(a1, b1) < (a1, b2) < (a2, b1) < (a2, b2)]
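
pd.lib.fast_zip and Series.get_values, as used in the session above, are not part of the public API in recent pandas releases. A minimal sketch of the same idea using only public calls follows. The helper name factorize_rows and its subset parameter are hypothetical, chosen to mirror the proposal; they are not pandas APIs.

```python
import numpy as np
import pandas as pd


def factorize_rows(df, subset=None):
    """Factorize the rows of df, treating each row as a tuple of column values.

    Hypothetical helper, not a pandas API. `subset` optionally restricts
    the columns that form each tuple.
    """
    cols = list(df.columns) if subset is None else list(subset)
    # Fill a 1-D object array element by element so that NumPy keeps each
    # tuple as a single element instead of building a 2-D array.
    rows = np.empty(len(df), dtype=object)
    for i, row in enumerate(df[cols].itertuples(index=False, name=None)):
        rows[i] = row
    return pd.factorize(rows)


df = pd.DataFrame({"A": ["a1", "a1", "a2", "a2", "a1"],
                   "B": ["b1", "b2", "b1", "b2", "b1"]})

# Codes over full rows, and over column A only.
codes, uniques = factorize_rows(df)
codes_a, uniques_a = factorize_rows(df, subset=["A"])
```

For the frame above, codes and uniques should match Out[47]. The Python-level loop is slower than a compiled zip, so this is an illustration of the semantics rather than a tuned implementation.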


simonjayhawkins · 2018-09-09T11:00:07Z

@jreback: just a thought. After reading #12860, if factorize were to be implemented on a DataFrame, it would be necessary to distinguish between sharing a category between columns and creating a category across columns. Using a subset argument alone may be insufficient.
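
One way to read that distinction, as a small sketch (the frame and variable names are made up for illustration): sharing a category between columns means a single code space covers the values of every column, while a category across columns means each row's combination of values gets its own code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["x", "y", "x"], "B": ["y", "x", "x"]})

# Shared between columns: factorize all cell values together, so "x" maps
# to the same code whether it appears in A or in B.
flat_codes, shared_uniques = pd.factorize(df.to_numpy(dtype=object).ravel())
shared_codes = flat_codes.reshape(df.shape)

# Across columns: factorize each row as a tuple, so a code identifies the
# combination (A, B) rather than an individual value.
rows = np.empty(len(df), dtype=object)
for i, row in enumerate(df.itertuples(index=False, name=None)):
    rows[i] = row
row_codes, row_uniques = pd.factorize(rows)
```

The first result has the same shape as the frame (one code per cell), while the second has one code per row, which is why a single subset argument does not say which of the two is wanted.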

jbrockmendel · 2023-02-11T21:26:10Z

Not obvious to me what the use case is here.

mroeschke · 2023-03-29T00:27:37Z

Yeah, agreed. I think, especially since factorize has centralized around 1D input, this wouldn't fit well. Closing.

jreback added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), API Design, and Categorical (Categorical Data Type) labels Nov 14, 2014

jreback added this to the 0.16.0 milestone Nov 14, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

mroeschke added the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and Enhancement labels and removed the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) label Jun 28, 2020

mroeschke removed the API Design and Categorical (Categorical Data Type) labels Apr 11, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jbrockmendel added the Closing Candidate (May be closeable, needs more eyeballs) label Feb 11, 2023

mroeschke closed this as completed Mar 29, 2023
