ENH: GH12042 Add parameter `drop_first` to get_dummies to get n-1 variables out of n levels. #12092

BranYang · 2016-01-19T18:47:48Z

closes #12042

Sometimes it's useful to keep only n-1 dummy variables out of n categorical levels.
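To illustrate the request, here is a minimal sketch. The Series `s` is made up for this example and is not taken from the PR. With three levels, the default call returns three indicator columns, while `drop_first=True` drops the column for the first level in sorted order and keeps the other two.

```python
import pandas as pd

# Hypothetical sample data with three levels: 'a', 'b', 'c'.
s = pd.Series(list("abca"))

full = pd.get_dummies(s)                      # one column per level
reduced = pd.get_dummies(s, drop_first=True)  # first level ('a') dropped

print(list(full.columns))     # ['a', 'b', 'c']
print(list(reduced.columns))  # ['b', 'c']
```

A row of all zeros in `reduced` then stands for the dropped level `'a'`.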

jreback · 2016-01-19T19:34:23Z

pandas/core/reshape.py

@@ -974,6 +975,9 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
        Whether the dummy columns should be sparse or not.  Returns
        SparseDataFrame if `data` is a Series or if all columns are included.
        Otherwise returns a DataFrame with some SparseBlocks.
+    remove_first : bool, defalt False


not sure I like this parameter name, we have other places where we use keep. Can you recast to this?

According to @TomAugspurger 's comment, I will change this option to drop_first.

TomAugspurger · 2016-01-19T19:45:32Z

@jreback it might be tough to find a better name than drop_first, though agreed we usually go with keep_ parameters...

We should define what happens when drop_first is True and there's a single level. Right now it fails with an IndexError I think (haven't tested). I'm not sure whether we should raise or not. Any idea what R does?

TomAugspurger · 2016-01-19T19:50:18Z

Looks like R raises

> year <- c(1, 1, 1, 1)
> year.f
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
> year.f <- factor(year)
> model.matrix(~year.f)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

although since pandas is more of a data manipulation library than a stats library, we might want to not raise an error. That said, I'm not sure what we'd return if we do catch the error, the single level or no levels at all?

jreback · 2016-01-19T19:59:59Z

I would raise a ValueError; that case doesn't really make any sense.

BranYang · 2016-01-20T01:53:25Z

When drop_first is turned on and there is only one level, currently it will return an empty DataFrame:

In [1]: import pandas as pd

In [2]: pd.get_dummies(pd.DataFrame(list('aaaaa')), drop_first=True)
Out[2]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

This way, I think the user doesn't have to check the levels contained in a column before using drop_first.
Otherwise, if we raise an error, it will be the caller's responsibility to check the levels of every column they want to get dummies from, which may be painful.

If you think raising an error is preferred, please comment and I will change the behavior.
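A small sketch of the behavior described above, assuming a pandas version where `drop_first` is available. It uses a Series rather than the DataFrame from the session, purely to keep the example short. The single-level input yields a frame with no columns that still carries the original row index.

```python
import pandas as pd

# One categorical level only; drop_first removes its sole indicator column.
single_level = pd.Series(list("aaaaa"))
result = pd.get_dummies(single_level, drop_first=True)

print(result.shape)        # (5, 0): five rows, no columns
print(list(result.index))  # [0, 1, 2, 3, 4]
```

Because no exception is raised, a caller can apply `drop_first=True` to many columns without first counting the levels in each one.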

jreback · 2016-01-20T19:01:56Z

doc/source/reshaping.rst

+
+    pd.get_dummies(s, drop_first=True)
+
+    df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})


you only need 1 of these examples (you can just use the Series one)

jreback · 2016-01-27T01:40:18Z

@TomAugspurger any comments?

TomAugspurger · 2016-01-27T02:00:48Z

doc/source/reshaping.rst

+.. versionadded:: 0.18.0
+
+Sometimes it will be useful to only keep k-1 levels of a categorical
+variable to avoid collinearity when feed the result to statistical models.


Typo: feed -> feeding
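The collinearity mentioned in the doc text can be checked numerically. The sketch below is illustrative only: the data and the `intercept` column name are assumptions, not part of the PR. With an intercept column, the full set of dummies has more columns than its rank, because the dummy columns sum to the intercept. Dropping the first level restores full column rank.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abcab"))  # hypothetical data with three levels

# Design matrix with an intercept plus ALL dummy columns.
full = pd.get_dummies(s).astype(float)
full.insert(0, "intercept", 1.0)

# Design matrix with an intercept plus k-1 dummy columns.
reduced = pd.get_dummies(s, drop_first=True).astype(float)
reduced.insert(0, "intercept", 1.0)

full_rank = np.linalg.matrix_rank(full.to_numpy())
reduced_rank = np.linalg.matrix_rank(reduced.to_numpy())

print(full.shape[1], full_rank)        # 4 columns, rank 3: collinear
print(reduced.shape[1], reduced_rank)  # 3 columns, rank 3: full column rank
```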

TomAugspurger · 2016-01-27T02:02:32Z

Just the minor doc typo. Looks good otherwise.

kawochen · 2016-01-27T02:04:51Z

pandas/core/reshape.py

@@ -971,6 +971,11 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
        Otherwise returns a DataFrame with some SparseBlocks.

        .. versionadded:: 0.16.1
+    drop_first : bool, defalt False


jreback · 2016-01-27T02:08:32Z

pls squash when fixed

kawochen · 2016-01-27T02:09:41Z

doc/source/reshaping.rst

+
+Sometimes it will be useful to only keep k-1 levels of a categorical
+variable to avoid collinearity when feed the result to statistical models.
+You can switch to this mode by turn on ``drop_first``.


jreback · 2016-01-27T15:29:53Z

@BranYang did you mean to close this?

BranYang · 2016-01-27T15:34:54Z

@jreback I found that it's too messy when I try to squash my commits. So I just revert to a previous state and re-commit it.

jreback · 2016-01-27T15:39:21Z

that's fine. I will take care of it on merge anyhow.

jreback · 2016-01-27T15:40:54Z

pandas/tests/test_reshape.py

+
+        result = get_dummies(s_series, sparse=self.sparse, drop_first=True)
+        assert_frame_equal(result, expected)
+


do we have the empty case tested somewhere?

You mean the case that we only have 1 level in a categorical variable? I will add a case to test this - the result should be empty.

yep. also test a completely empty frame as well.

@jreback By "empty frame", do you mean this case

In [1]: import pandas as pd

In [2]: pd.get_dummies(pd.Series())
Out[2]:
Empty DataFrame
Columns: []
Index: []

In [3]: pd.get_dummies(pd.Series(), drop_first=True)
Out[3]:
Empty DataFrame
Columns: []
Index: []

Or this case

In [4]: pd.get_dummies(pd.DataFrame())

But this will raise an error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-059be6e25969> in <module>()
----> 1 pd.get_dummies(pd.DataFrame())

C:\D\Projects\Github\pandas\pandas\core\reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first)
   1083                                     drop_first=drop_first)
   1084             with_dummies.append(dummy)
-> 1085         result = concat(with_dummies, axis=1)
   1086     else:
   1087         result = _get_dummies_1d(data, prefix, prefix_sep, dummy_na,

C:\D\Projects\Github\pandas\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    833                        keys=keys, levels=levels, names=names,
    834                        verify_integrity=verify_integrity,
--> 835                        copy=copy)
    836     return op.get_result()
    837

C:\D\Projects\Github\pandas\pandas\tools\merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    866
    867         if len(objs) == 0:
--> 868             raise ValueError('No objects to concatenate')
    869
    870         if keys is None:

ValueError: No objects to concatenate
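The traceback ends in `concat` being handed an empty list, which can be reproduced on its own. This sketch only exercises `pd.concat`. It makes no claim about what `pd.get_dummies(pd.DataFrame())` does in pandas versions other than the one shown in the traceback.

```python
import pandas as pd

message = None
try:
    # An empty list of objects is what get_dummies passed along
    # when the input frame had no columns to encode.
    pd.concat([])
except ValueError as err:
    message = str(err)

print(message)  # No objects to concatenate
```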

jreback · 2016-01-29T14:29:32Z

pandas/tests/test_reshape.py

+    # Test the case that categorical variable only has one level.
+    def test_basic_drop_first_one_level(self):
+        result = get_dummies(list('aaa'), sparse=self.sparse, drop_first=True)
+        self.assertEqual(result.empty, True)


compare this with an actual constructed empty frame as this will verify that the indexes are correct.
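A sketch of why the comparison is stricter than checking `.empty`. It assumes `pandas.testing.assert_frame_equal` is available (the PR's tests imported an equivalent helper from pandas' internal test utilities), and the variable names are illustrative. Two frames can both report `empty` while having different indexes.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

result = pd.get_dummies(pd.Series(list("aaa")), drop_first=True)

# Expected: no columns, but the three original rows are still indexed.
expected = pd.DataFrame(index=range(3))
assert_frame_equal(result, expected)

# A frame with no rows at all is also "empty", so .empty cannot tell
# it apart from the expected result.
no_rows = pd.DataFrame()
both_empty = result.empty and no_rows.empty

try:
    assert_frame_equal(result, no_rows)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

print(both_empty, mismatch_detected)  # True True
```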

jreback · 2016-01-29T14:32:52Z

pandas/tests/test_reshape.py

+                               2: 0.0}})
+        assert_frame_equal(res, exp)
+
+        # Sparse dataframes do not allow nan labelled columns, see #GH8822


so what does sparse do in that case? pls test that as well

jreback · 2016-02-01T20:22:53Z

pandas/tests/test_reshape.py

+
+        res_just_na = get_dummies([nan], dummy_na=True, sparse=self.sparse,
+                                  drop_first=True)
+        tm.assert_numpy_array_equal(res_just_na.empty, True)


this doesn't make any sense. again compare against an expected frame.
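For context, a sketch of the case under test, written against a recent pandas and using a Series instead of the bare list in the diff (an assumption made here for robustness, not the PR's test code). With `dummy_na=True` the only level is the NaN indicator, so `drop_first=True` removes it and leaves a frame with one row and no columns. That frame is what an expected frame would need to describe.

```python
import numpy as np
import pandas as pd

just_na = pd.Series([np.nan])

# dummy_na=True adds a NaN indicator column, the only level here.
with_na = pd.get_dummies(just_na, dummy_na=True)

# drop_first=True then drops that single level again.
dropped = pd.get_dummies(just_na, dummy_na=True, drop_first=True)

print(with_na.shape)  # (1, 1)
print(dropped.shape)  # (1, 0)
```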

BranYang · 2016-02-03T03:01:21Z

@jreback pls kindly review the tests..

jreback · 2016-02-08T15:28:59Z

thanks @BranYang

…iables out of n levels.

closes pandas-dev#12042

Some times it's useful to only accept n-1 variables out of n categorical levels.

Author: Bran Yang <yangbo.84@gmail.com>

Closes pandas-dev#12092 from BranYang/master and squashes the following commits:

0528c57 [Bran Yang] Compare with empty DataFrame, not just check empty
0d99c2a [Bran Yang] Test the case that `drop_first` is on and categorical variable only has one level.
45f14e8 [Bran Yang] ENH: GH12042 Add parameter `drop_first` to get_dummies to get k-1 variables out of n levels.

jreback reviewed Jan 19, 2016

jreback added Stats Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 19, 2016

BranYang changed the title ~~ENH: GH12042 Pandas get_dummies() and n-1 Categorical Encoding Option…~~ ENH: GH12042 Add parameter drop_first to get_dummies to get n-1 variables out of n levels. Jan 20, 2016

jreback reviewed Jan 20, 2016

TomAugspurger reviewed Jan 27, 2016

kawochen reviewed Jan 27, 2016

BranYang closed this Jan 27, 2016

BranYang reopened this Jan 27, 2016

jreback reviewed Jan 27, 2016

jreback added this to the 0.18.0 milestone Jan 27, 2016

jreback reviewed Jan 29, 2016

BranYang added 2 commits February 1, 2016 10:43

45f14e8 ENH: GH12042 Add parameter drop_first to get_dummies to get k-1 variables out of n levels.

0d99c2a Test the case that drop_first is on and categorical variable only has one level.

jreback reviewed Feb 1, 2016

0528c57 Compare with empty DataFrame, not just check empty

jreback closed this in 62363d2 Feb 8, 2016
