ENH: GH12042 Add parameter drop_first to get_dummies to get n-1 variables out of n levels. #12092
@@ -974,6 +975,9 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
Whether the dummy columns should be sparse or not. Returns
SparseDataFrame if `data` is a Series or if all columns are included.
Otherwise returns a DataFrame with some SparseBlocks.
remove_first : bool, defalt False
not sure I like this parameter name, we have other places where we use `keep`. Can you recast to this?
According to @TomAugspurger's comment, I will change this option to `drop_first`.
@jreback it might be tough to find a better name than We should define what happens when
Looks like R raises:
> year <- c(1, 1, 1, 1)
> year.f
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
> year.f <- factor(year)
> model.matrix(~year.f)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
although since pandas is more of a data manipulation library than a stats library, we might want to not raise an error. That said, I'm not sure what we'd return if we do catch the error: the single level or no levels at all?
I would raise a |
When
In [1]: import pandas as pd
In [2]: pd.get_dummies(pd.DataFrame(list('aaaaa')), drop_first=True)
Out[2]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
By doing this I think that the user doesn't have to check the levels contained in a column before using
If you guys think raising an error is preferred, please comment and I will change the behavior.
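For reference, a minimal sketch of the behavior under discussion, assuming a pandas release that already includes `drop_first`. The sample values are illustrative and not taken from the PR.

```python
import pandas as pd

# Three levels: a, b, c.
s = pd.Series(list("abca"))

full = pd.get_dummies(s)                      # one column per level
reduced = pd.get_dummies(s, drop_first=True)  # first level ("a") dropped

# A single-level column leaves nothing to keep once the first level is
# dropped, so the result has no columns but keeps the original index.
one_level = pd.get_dummies(pd.Series(list("aaaaa")), drop_first=True)

print(list(full.columns))     # ['a', 'b', 'c']
print(list(reduced.columns))  # ['b', 'c']
print(one_level.shape)        # (5, 0)
```

Returning an empty frame rather than raising means callers can pass columns through without checking the number of levels first, which is the trade-off the comment above argues for.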
pd.get_dummies(s, drop_first=True)

df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})
you only need 1 of these examples (you can just use the Series one)
@TomAugspurger any comments?
.. versionadded:: 0.18.0

Sometimes it will be useful to only keep k-1 levels of a categorical
variable to avoid collinearity when feed the result to statistical models.
Typo: feed -> feeding
Just the minor doc typo. Looks good otherwise.
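A small sketch of the collinearity that the docs paragraph refers to. It assumes NumPy is available and that the design matrix carries an intercept column; both are assumptions made for illustration and are not part of this PR.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abcab"))
intercept = np.ones((len(s), 1))

# All k dummy columns sum to 1 in every row, duplicating the intercept.
full = np.hstack([intercept, pd.get_dummies(s).to_numpy(dtype=float)])

# With k-1 dummies the columns are linearly independent.
reduced = np.hstack(
    [intercept, pd.get_dummies(s, drop_first=True).to_numpy(dtype=float)]
)

print(full.shape[1], np.linalg.matrix_rank(full))        # 4 3
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 3
```

The full encoding has four columns but rank three, so a model with an intercept cannot identify all coefficients; the reduced encoding is full rank.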
@@ -971,6 +971,11 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
Otherwise returns a DataFrame with some SparseBlocks.

.. versionadded:: 0.16.1
drop_first : bool, defalt False
default pls
Sometimes it will be useful to only keep k-1 levels of a categorical
variable to avoid collinearity when feed the result to statistical models.
You can switch to this mode by turn on ``drop_first``.
turning
@BranYang did you mean to close this?
@jreback I found it too messy to squash my commits, so I just reverted to a previous state and re-committed.
that's fine. I will take care of it on merge anyhow.
result = get_dummies(s_series, sparse=self.sparse, drop_first=True)
assert_frame_equal(result, expected)
do we have the empty case tested somewhere?
You mean the case that we only have 1 level in a categorical variable? I will add a case to test this - the result should be empty.
yep. also test a completely empty frame as well.
@jreback By "empty frame", do you mean this case
In [1]: import pandas as pd
In [2]: pd.get_dummies(pd.Series())
Out[2]:
Empty DataFrame
Columns: []
Index: []
In [3]: pd.get_dummies(pd.Series(), drop_first=True)
Out[3]:
Empty DataFrame
Columns: []
Index: []
Or this case
In [4]: pd.get_dummies(pd.DataFrame())
But this will raise an error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-059be6e25969> in <module>()
----> 1 pd.get_dummies(pd.DataFrame())
C:\D\Projects\Github\pandas\pandas\core\reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first)
1083 drop_first=drop_first)
1084 with_dummies.append(dummy)
-> 1085 result = concat(with_dummies, axis=1)
1086 else:
1087 result = _get_dummies_1d(data, prefix, prefix_sep, dummy_na,
C:\D\Projects\Github\pandas\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
833 keys=keys, levels=levels, names=names,
834 verify_integrity=verify_integrity,
--> 835 copy=copy)
836 return op.get_result()
837
C:\D\Projects\Github\pandas\pandas\tools\merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
866
867 if len(objs) == 0:
--> 868 raise ValueError('No objects to concatenate')
869
870 if keys is None:
ValueError: No objects to concatenate
# Test the case that categorical variable only has one level.
def test_basic_drop_first_one_level(self):
    result = get_dummies(list('aaa'), sparse=self.sparse, drop_first=True)
    self.assertEqual(result.empty, True)
compare this with an actual constructed empty frame as this will verify that the indexes are correct.
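A sketch of what such a comparison could look like, assuming a current pandas where `assert_frame_equal` is importable from `pandas.testing` (the PR's own tests use the test-suite helpers of the time). The `sparse` argument is left out for brevity, and the `wrong_index` frame is a hypothetical counterexample.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

values = pd.Series(list("aaa"))
result = pd.get_dummies(values, drop_first=True)

# Expected: no columns, but the same index as the input.
expected = pd.DataFrame(index=values.index)
assert_frame_equal(result, expected)

# Checking only `.empty` cannot catch a wrong index:
wrong_index = pd.DataFrame(index=[10, 11, 12])
print(wrong_index.empty)  # True, even though the index differs
```

Comparing against a constructed frame checks the index and columns, whereas `.empty` is true for any frame with a zero-length axis.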
2: 0.0}})
assert_frame_equal(res, exp)

# Sparse dataframes do not allow nan labelled columns, see #GH8822
so what does sparse do in that case? pls test that as well
…iables out of n levels.
res_just_na = get_dummies([nan], dummy_na=True, sparse=self.sparse,
                          drop_first=True)
tm.assert_numpy_array_equal(res_just_na.empty, True)
this doesn't make any sense. again compare against an expected frame.
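One way the NaN-only case might be checked against an expected frame, sketched under the assumption of a current pandas with `pandas.testing.assert_frame_equal`. The `sparse` argument is omitted and a Series is used as input so the expected index is explicit; both choices are for illustration only.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

just_na = pd.Series([np.nan])

# With dummy_na=True the only level is NaN itself ...
with_na = pd.get_dummies(just_na, dummy_na=True)
print(with_na.shape)  # (1, 1)

# ... and drop_first=True removes that single level again,
# leaving a frame with one row and no columns.
result = pd.get_dummies(just_na, dummy_na=True, drop_first=True)
expected = pd.DataFrame(index=just_na.index)
assert_frame_equal(result, expected)
```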
@jreback pls kindly review the tests.
thanks @BranYang
…iables out of n levels.
closes pandas-dev#12042
Some times it's useful to only accept n-1 variables out of n categorical levels.
Author: Bran Yang <yangbo.84@gmail.com>
Closes pandas-dev#12092 from BranYang/master and squashes the following commits:
0528c57 [Bran Yang] Compare with empty DataFrame, not just check empty
0d99c2a [Bran Yang] Test the case that `drop_first` is on and categorical variable only has one level.
45f14e8 [Bran Yang] ENH: GH12042 Add parameter `drop_first` to get_dummies to get k-1 variables out of n levels.
closes #12042
Sometimes it's useful to only accept n-1 variables out of n categorical levels.