Easy function for making dummy variable matrices #955

Closed
wesm opened this issue Mar 22, 2012 · 11 comments

@wesm
Member

wesm commented Mar 22, 2012

There are already a few things floating around, but it would be nice to have something more structured, with more options, in the pandas namespace.

From an e-mail on the statsmodels mailing list:

Here's a quick hack at it (not too dissimilar to Aman's code it looks
like)-- I should find a place in the library to put this:

import numpy as np
from pandas import DataFrame

def make_dummies(data, cat_variables):
    # Keep the non-categorical columns, then join one dummy frame per variable
    result = data.drop(cat_variables, axis=1)

    for variable in cat_variables:
        dummies = _get_dummy_frame(data, variable)
        result = result.join(dummies)
    return result

def _get_dummy_frame(data, column):
    from pandas import Factor
    factor = Factor(data[column])
    # Row i of the identity matrix is the indicator vector for level i
    dummy_mat = np.eye(len(factor.levels)).take(factor.labels, axis=0)
    dummy_cols = ['%s.%s' % (column, v) for v in factor.levels]
    dummies = DataFrame(dummy_mat, index=data.index,
                        columns=dummy_cols)

    return dummies


In [29]: df
Out[29]:
   gender  hand   color  height  age
0  male    right  green  5.75    23
1  female  right  brown  5.42    27
2  female  left   green  5.58    31
3  male    right  brown  5.92    39
4  male    right  blue   5.83    33

In [30]: make_dummies(df, ['gender', 'hand', 'color']).T
Out[30]:
               0     1     2     3     4
height         5.75  5.42  5.58  5.92  5.83
age            23    27    31    39    33
gender.female  0     1     1     0     0
gender.male    1     0     0     1     1
hand.left      0     0     1     0     0
hand.right     1     1     0     1     1
color.blue     0     0     0     0     1
color.brown    0     1     0     1     0
color.green    1     0     1     0     0

(BTW I read in that data using df = read_clipboard(sep=','))
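
For anyone reading this later: the `Factor` class used above is no longer available in current pandas, as far as I know. Below is a minimal sketch of the same identity-matrix trick using `pd.factorize` instead. The swap to `pd.factorize`, the integer dtype, and the hand-typed copy of the example frame are my assumptions, not part of the original e-mail, and the sketch assumes the categorical columns have no missing values (a missing value would get code -1 and silently pick the last level).

```python
import numpy as np
import pandas as pd


def make_dummies(data, cat_variables):
    """Replace each categorical column with one indicator column per level."""
    result = data.drop(columns=cat_variables)
    for variable in cat_variables:
        result = result.join(_get_dummy_frame(data, variable))
    return result


def _get_dummy_frame(data, column):
    # pd.factorize returns integer codes plus the distinct levels;
    # sort=True orders the levels so the column order is deterministic.
    # Assumption: no missing values in the column (they would get code -1).
    codes, levels = pd.factorize(data[column], sort=True)
    # Row i of the identity matrix is the indicator vector for level i.
    dummy_mat = np.eye(len(levels), dtype=int).take(codes, axis=0)
    dummy_cols = ['%s.%s' % (column, v) for v in levels]
    return pd.DataFrame(dummy_mat, index=data.index, columns=dummy_cols)


# The example frame from above, typed in by hand.
df = pd.DataFrame({
    'gender': ['male', 'female', 'female', 'male', 'male'],
    'hand': ['right', 'right', 'left', 'right', 'right'],
    'color': ['green', 'brown', 'green', 'brown', 'blue'],
    'height': [5.75, 5.42, 5.58, 5.92, 5.83],
    'age': [23, 27, 31, 39, 33],
})

dummies = make_dummies(df, ['gender', 'hand', 'color'])
print(dummies.T)
```

Each categorical variable is encoded independently, so the result has one column per level of each variable rather than one per combination of levels.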
@changhiskhan
Contributor

Not sure what people would want but in the absence of a strong reason to do otherwise, I would prefer to not transpose the axes.

@wesm
Member Author

wesm commented Mar 24, 2012

I only transposed there to make the output fit in the console (lots of long-ish columns)

@changhiskhan
Contributor

got it.

@wesm
Member Author

wesm commented Mar 24, 2012

i mean, you see the example above, right? You have multiple columns and you want to produce dummy columns for each combination of a set of factors

@wesm
Member Author

wesm commented Mar 29, 2012

@cpcloud
Member

cpcloud commented Jul 29, 2013

i think this machinery might already be in patsy...might be possible to lift it from there

@jreback
Contributor

jreback commented Sep 28, 2013

looks pretty well covered by get_dummies

@jreback jreback closed this as completed Sep 28, 2013
@TomAugspurger
Contributor

@jreback any opinion on reopening this so get_dummies can handle DataFrames?

',PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S\n1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C\n'

We could replace this

features = pd.concat([data.get(['Fare', 'Age']),
                      pd.get_dummies(data.Sex, prefix='Sex'),
                      pd.get_dummies(data.Pclass, prefix='Pclass'),
                      pd.get_dummies(data.Embarked, prefix='Embarked')],
                     axis=1)

with this

features = pd.get_dummies(data, include=['Sex', 'Pclass', 'Embarked'], exclude=['Fare', 'Age'])

Or we could check the dtypes on the DataFrame, see that [Fare, Age] are numeric, and automatically skip them, so you can leave off the exclude parameter. The current way seems a bit verbose, especially when you have a mixture of categorical columns that need dummies and numerical columns that don't.
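
For reference, here is a hedged sketch of how this looks with the `columns` parameter that later pandas releases of `get_dummies` accept for DataFrame input. Note that `include`/`exclude` above are only a proposal in this thread; the sketch uses `columns` instead. The tiny frame is built by hand from the two sample rows, keeping only the columns used here.

```python
import pandas as pd

# Two rows taken from the sample above; only the columns used here.
data = pd.DataFrame({
    'Pclass': [3, 1],
    'Sex': ['male', 'female'],
    'Age': [22.0, 38.0],
    'Fare': [7.25, 71.2833],
    'Embarked': ['S', 'C'],
})

# Explicit selection: only the listed columns are encoded, and the
# remaining columns (Age, Fare) pass through unchanged.
features = pd.get_dummies(data, columns=['Sex', 'Pclass', 'Embarked'])

# dtype-driven selection: with no `columns` argument, only string-like or
# categorical columns are encoded, so the integer Pclass column stays as-is.
auto = pd.get_dummies(data)

print(features.columns.tolist())
print(auto.columns.tolist())
```

The dtype-driven call leaves `Pclass` alone because it is stored as integers, so a numerically coded categorical still has to be listed explicitly (or converted to a categorical dtype first).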

@jorisvandenbossche
Member

+1

@jreback
Contributor

jreback commented Aug 28, 2014

@TomAugspurger nice idea. pls open a new issue for this though.

@enmanuelsg

Here is another technique for creating dummies automatically: http://python-apuntes.blogspot.com.ar/2017/04/creacion-de-variables-de-grupo.html
