Easy function for making dummy variable matrices #955

wesm · 2012-03-22T18:28:21Z

there are already a few things floating around, but having something more structured, with more options, and in the pandas namespace would be nice

from an e-mail on the statsmodels mailing list

Here's a quick hack at it (not too dissimilar to Aman's code it looks
like)-- I should find a place in the library to put this:

import numpy as np
from pandas import DataFrame, Factor

def make_dummies(data, cat_variables):
    # start from the non-categorical columns
    result = data.drop(cat_variables, axis=1)

    # append one block of dummy columns per categorical variable
    for variable in cat_variables:
        dummies = _get_dummy_frame(data, variable)
        result = result.join(dummies)
    return result

def _get_dummy_frame(data, column):
    factor = Factor(data[column])
    # one row of the identity matrix per observation, selected by level label
    dummy_mat = np.eye(len(factor.levels)).take(factor.labels, axis=0)
    dummy_cols = ['%s.%s' % (column, v) for v in factor.levels]
    dummies = DataFrame(dummy_mat, index=data.index,
                        columns=dummy_cols)

    return dummies


In [29]: df
Out[29]:
  gender  hand   color  height  age
0  male    right  green  5.75    23
1  female  right  brown  5.42    27
2  female  left   green  5.58    31
3  male    right  brown  5.92    39
4  male    right  blue   5.83    33

In [30]: make_dummies(df, ['gender', 'hand', 'color']).T
Out[30]:
              0     1     2     3     4
height         5.75  5.42  5.58  5.92  5.83
age            23    27    31    39    33
gender.female  0     1     1     0     0
gender.male    1     0     0     1     1
hand.left      0     0     1     0     0
hand.right     1     1     0     1     1
color.blue     0     0     0     0     1
color.brown    0     1     0     1     0
color.green    1     0     1     0     0

(BTW I read in that data using df = read_clipboard(sep=','))
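
For anyone trying this on a recent pandas, here is a minimal sketch of the same hack. It assumes `pd.Categorical` (with its `categories` and `codes` attributes) as a stand-in for the old `Factor` class, and it assumes the categorical columns contain no missing values. It is not what ended up in the library, just the snippet above made self-contained, with the example frame typed in by hand:

```python
import numpy as np
import pandas as pd


def _get_dummy_frame(data, column):
    # pd.Categorical stands in for the old Factor (assumption for this sketch)
    cat = pd.Categorical(data[column])
    if (cat.codes == -1).any():
        raise ValueError("missing values in %r are not handled here" % column)
    # One row of the identity matrix per observation, picked by category code
    dummy_mat = np.eye(len(cat.categories), dtype=int).take(cat.codes, axis=0)
    dummy_cols = ['%s.%s' % (column, v) for v in cat.categories]
    return pd.DataFrame(dummy_mat, index=data.index, columns=dummy_cols)


def make_dummies(data, cat_variables):
    # Keep the non-categorical columns, then append one dummy block per variable
    result = data.drop(cat_variables, axis=1)
    for variable in cat_variables:
        result = result.join(_get_dummy_frame(data, variable))
    return result


df = pd.DataFrame({
    'gender': ['male', 'female', 'female', 'male', 'male'],
    'hand': ['right', 'right', 'left', 'right', 'right'],
    'color': ['green', 'brown', 'green', 'brown', 'blue'],
    'height': [5.75, 5.42, 5.58, 5.92, 5.83],
    'age': [23, 27, 31, 39, 33],
})
dummies = make_dummies(df, ['gender', 'hand', 'color'])
print(dummies.T)
```

The explicit check on `codes == -1` is there because `take` with a `-1` index would silently pick the last category instead of flagging the missing value.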


changhiskhan · 2012-03-24T14:30:28Z

Not sure what people would want but in the absence of a strong reason to do otherwise, I would prefer to not transpose the axes.

wesm · 2012-03-24T14:55:11Z

I only transposed there to make it output to the console (lot of long-ish columns)

changhiskhan · 2012-03-24T15:15:34Z

got it.

wesm · 2012-03-24T15:17:21Z

i mean, you see the example above, right? You have multiple columns and you want to produce dummy columns for each combination of a set of factors

wesm · 2012-03-29T03:25:22Z

related: http://scipy-central.org/item/35/1/convert-categorical-data-in-a-structure-numpy-array-to-boolean-fields

cpcloud · 2013-07-29T05:54:07Z

i think this machinery might already be in patsy...might be possible to lift it from there

jreback · 2013-09-28T19:09:48Z

looks pretty covered by get_dummies
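
For context, a small sketch of what `get_dummies` does for a single column. The `gender` Series is made up for illustration (it mirrors the example frame at the top of the thread), and `prefix` is the same keyword used later in this thread:

```python
import pandas as pd

# Hypothetical single categorical column
gender = pd.Series(['male', 'female', 'female', 'male', 'male'], name='gender')

# One indicator column per distinct value, named <prefix>_<value>
dummies = pd.get_dummies(gender, prefix='gender')
print(dummies.columns.tolist())
```

This covers the one-column case; joining the result back onto the remaining columns is still left to the caller.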

TomAugspurger · 2014-08-28T01:55:46Z

@jreback any opinion on reopening this so get_dummies can handle DataFrames?

',PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S\n1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C\n'

We could replace this

features = pd.concat([data.get(['Fare', 'Age']),
                      pd.get_dummies(data.Sex, prefix='Sex'),
                      pd.get_dummies(data.Pclass, prefix='Pclass'),
                      pd.get_dummies(data.Embarked, prefix='Embarked')],
                     axis=1)

with this

features = pd.get_dummies(data, include=['Sex', 'Pclass', 'Embarked'], exclude=['Fare', 'Age'])

Or we can check the dtypes on the DataFrame to see that [Fare, Age] are numeric and automatically leave them out of the dummy encoding, so you can drop the exclude parameter. The current way seems a bit verbose, especially when you have a mixture of categorical columns that need dummies and numerical columns that don't.
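
A sketch of how the one-call version reads in pandas releases where `get_dummies` accepts a DataFrame. Assumptions to note: the keyword in released pandas is `columns`, not the `include`/`exclude` pair proposed above, and the two-row frame here is rebuilt by hand from a few fields of the sample shown earlier.

```python
import pandas as pd

# Two rows copied from the sample above, restricted to the columns of interest
data = pd.DataFrame({
    'Pclass': [3, 1],
    'Sex': ['male', 'female'],
    'Age': [22.0, 38.0],
    'Fare': [7.25, 71.2833],
    'Embarked': ['S', 'C'],
})

# Explicitly name the columns to encode; Fare and Age pass through untouched
features = pd.get_dummies(data, columns=['Sex', 'Pclass', 'Embarked'])

# With no `columns` argument, only object/string/category columns are encoded,
# so the integer-coded Pclass stays a plain numeric column
auto = pd.get_dummies(data)

print(features.columns.tolist())
print(auto.columns.tolist())
```

The dtype-based default is why `Pclass` still has to be listed explicitly: it is stored as integers, so nothing marks it as categorical.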

jorisvandenbossche · 2014-08-28T06:36:42Z

+1

jreback · 2014-08-28T13:18:35Z

@TomAugspurger nice idea. pls open a new issue for this though.

enmanuelsg · 2017-04-17T01:17:03Z

Here is another technique to automatically create dummies: http://python-apuntes.blogspot.com.ar/2017/04/creacion-de-variables-de-grupo.html

jreback closed this as completed Sep 28, 2013

jorisvandenbossche mentioned this issue Aug 29, 2014

ENH: let get_dummies take a DataFrame #8140

Merged
