ENH: GH12042 Add parameter drop_first to get_dummies to get n-1 variables out of n levels. #12092

Closed
wants to merge 3 commits into from

BranYang
Contributor

closes #12042

Sometimes it is useful to keep only n-1 variables out of n categorical levels.
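As a minimal sketch of what the proposed option does (the Series contents here are made up for illustration and are not part of the PR), the snippet below encodes a three-level variable with and without `drop_first`:

```python
import pandas as pd

# Illustrative data only: three levels, 'a', 'b' and 'c'.
s = pd.Series(list("abcaa"))

# Default behaviour: one indicator column per level (n columns).
full = pd.get_dummies(s)

# With drop_first=True the first level ('a') is dropped (n-1 columns).
reduced = pd.get_dummies(s, drop_first=True)

print(list(full.columns))     # ['a', 'b', 'c']
print(list(reduced.columns))  # ['b', 'c']
```

A row where both remaining indicators are zero then stands for the dropped level.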

@@ -974,6 +975,9 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
Whether the dummy columns should be sparse or not. Returns
SparseDataFrame if `data` is a Series or if all columns are included.
Otherwise returns a DataFrame with some SparseBlocks.
remove_first : bool, defalt False
Contributor

not sure I like this parameter name, we have other places where we use keep. Can you recast to this?

Contributor Author

According to @TomAugspurger 's comment, I will change this option to drop_first.

@jreback jreback added Stats Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 19, 2016
@TomAugspurger
Contributor

@jreback it might be tough to find a better name than drop_first, though agreed we usually go with keep_ parameters...

We should define what happens when drop_first is True and there's a single level. Right now it fails with an IndexError I think (haven't tested). I'm not sure whether we should raise or not. Any idea what R does?

@TomAugspurger
Contributor

Looks like R raises

> year <- c(1, 1, 1, 1)
> year.f
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
> year.f <- factor(year)
> model.matrix(~year.f)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

Although, since pandas is more of a data manipulation library than a stats library, we might not want to raise an error. That said, I'm not sure what we'd return if we do catch the error: the single level, or no levels at all?

@jreback
Contributor

jreback commented Jan 19, 2016

I would raise a ValueError; that case doesn't really make any sense.

@BranYang
Contributor Author

When drop_first is turned on and there is only one level, currently it will return an empty DataFrame:

In [1]: import pandas as pd

In [2]: pd.get_dummies(pd.DataFrame(list('aaaaa')), drop_first=True)
Out[2]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

This way, I think the user doesn't have to check the levels contained in a column before using drop_first.
Otherwise, if we raise an error, it becomes the caller's responsibility to check the levels of the columns they want to get dummies from, which may be painful.

If you think raising an error is preferred, please comment and I will change the behavior.
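To make the two options concrete, here is a rough sketch. The first part shows the lenient behaviour described above. The second part is a hypothetical caller-side wrapper, `get_dummies_strict`, which is not part of pandas or of this PR and only illustrates what raising would look like:

```python
import pandas as pd

single = pd.Series(list("aaaaa"))

# Lenient behaviour: a single level with drop_first=True yields a frame
# with the original index and no columns.
lenient = pd.get_dummies(single, drop_first=True)
print(lenient.shape)  # (5, 0)


def get_dummies_strict(values, **kwargs):
    """Hypothetical helper (an assumption, not a pandas API): raise when
    fewer than two levels are present instead of returning an empty frame."""
    n_levels = pd.Series(values).nunique()
    if n_levels < 2:
        raise ValueError(
            f"drop_first needs at least 2 levels, found {n_levels}"
        )
    return pd.get_dummies(values, drop_first=True, **kwargs)
```

With the lenient behaviour the check above stays optional; with a raising `get_dummies`, every caller would need something like it.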

@BranYang BranYang changed the title ENH: GH12042 Pandas get_dummies() and n-1 Categorical Encoding Option… ENH: GH12042 Add parameter drop_first to get_dummies to get n-1 variables out of n levels. Jan 20, 2016

pd.get_dummies(s, drop_first=True)

df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})
Contributor

you only need 1 of these examples (you can just use the Series one)
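For reference, a small sketch of how the DataFrame variant quoted in the diff is expected to behave. The frame is the one from the diff; the printed column names reflect my understanding of the default prefixing, not output taken from the PR:

```python
import pandas as pd

# Same frame as in the quoted doc example.
df = pd.DataFrame({"key": list("bbacab"), "data1": range(6)})

# Only the object column 'key' is encoded; 'data1' passes through unchanged.
encoded = pd.get_dummies(df, drop_first=True)

print(list(encoded.columns))  # ['data1', 'key_b', 'key_c']
```

The Series example already shows the dropped level, so the DataFrame case mainly adds the column prefix.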

@jreback
Contributor

jreback commented Jan 27, 2016

@TomAugspurger any comments?

.. versionadded:: 0.18.0

Sometimes it will be useful to only keep k-1 levels of a categorical
variable to avoid collinearity when feed the result to statistical models.
Contributor

Typo: feed -> feeding

@TomAugspurger
Contributor

Just the minor doc typo. Looks good otherwise.
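The collinearity mentioned in the doc text can be checked numerically. The following is a sketch under the assumption that the model includes an intercept column; the data and variable names are illustrative only:

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abcab"))
intercept = np.ones((len(s), 1))

full = pd.get_dummies(s).to_numpy(dtype=float)
reduced = pd.get_dummies(s, drop_first=True).to_numpy(dtype=float)

# Intercept plus all k indicator columns: the indicators sum to the
# intercept, so the design matrix is rank deficient.
X_full = np.hstack([intercept, full])
# Intercept plus k-1 indicator columns: full column rank.
X_reduced = np.hstack([intercept, reduced])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 4 columns, rank 3
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

Without an intercept the full set of indicators is not collinear, which is one reason the option defaults to False.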

@@ -971,6 +971,11 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
Otherwise returns a DataFrame with some SparseBlocks.

.. versionadded:: 0.16.1
drop_first : bool, defalt False
Contributor

default

@jreback
Contributor

jreback commented Jan 27, 2016

pls squash when fixed


Sometimes it will be useful to only keep k-1 levels of a categorical
variable to avoid collinearity when feed the result to statistical models.
You can switch to this mode by turn on ``drop_first``.
Contributor

turning

@BranYang BranYang closed this Jan 27, 2016
@jreback
Contributor

jreback commented Jan 27, 2016

@BranYang did you mean to close this?

@BranYang BranYang reopened this Jan 27, 2016
@BranYang
Contributor Author

@jreback I found it too messy to squash my commits, so I just reverted to a previous state and re-committed.

@jreback
Contributor

jreback commented Jan 27, 2016

that's fine. I will take care of it on merge anyhow.


result = get_dummies(s_series, sparse=self.sparse, drop_first=True)
assert_frame_equal(result, expected)

Contributor

do we have the empty case tested somewhere?

Contributor Author

You mean the case where we only have 1 level in a categorical variable? I will add a test for this case; the result should be empty.

Contributor

yep. also test a completely empty frame as well.

Contributor Author

@jreback By "empty frame", do you mean this case:

In [1]: import pandas as pd

In [2]: pd.get_dummies(pd.Series())
Out[2]:
Empty DataFrame
Columns: []
Index: []

In [3]: pd.get_dummies(pd.Series(), drop_first=True)
Out[3]:
Empty DataFrame
Columns: []
Index: []

Or this case:

In [4]: pd.get_dummies(pd.DataFrame())

But this will raise an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-059be6e25969> in <module>()
----> 1 pd.get_dummies(pd.DataFrame())

C:\D\Projects\Github\pandas\pandas\core\reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first)
   1083                                     drop_first=drop_first)
   1084             with_dummies.append(dummy)
-> 1085         result = concat(with_dummies, axis=1)
   1086     else:
   1087         result = _get_dummies_1d(data, prefix, prefix_sep, dummy_na,

C:\D\Projects\Github\pandas\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    833                        keys=keys, levels=levels, names=names,
    834                        verify_integrity=verify_integrity,
--> 835                        copy=copy)
    836     return op.get_result()
    837

C:\D\Projects\Github\pandas\pandas\tools\merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    866
    867         if len(objs) == 0:
--> 868             raise ValueError('No objects to concatenate')
    869
    870         if keys is None:

ValueError: No objects to concatenate
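The traceback ends inside `concat`, which is handed an empty list because an empty DataFrame has no columns to encode. As a small sketch, the same error can be reproduced directly, independent of `get_dummies`:

```python
import pandas as pd

# concat refuses an empty list of objects; this is the error that surfaces
# in the traceback above when there is nothing to encode.
try:
    pd.concat([], axis=1)
except ValueError as err:
    message = str(err)
else:
    message = None

print(message)
```

Whether `get_dummies` should guard against this case itself is a separate question from `drop_first`, since the traceback is for a call without that option.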

@jreback jreback added this to the 0.18.0 milestone Jan 27, 2016
# Test the case that categorical variable only has one level.
def test_basic_drop_first_one_level(self):
    result = get_dummies(list('aaa'), sparse=self.sparse, drop_first=True)
    self.assertEqual(result.empty, True)
Contributor

compare this with an actual constructed empty frame as this will verify that the indexes are correct.
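A sketch of what such a comparison could look like, written against the public `pd.testing.assert_frame_equal` rather than the test suite's own helper, and without the `sparse` argument used in the test class. The `wrong_index` frame is made up to show why checking `.empty` alone is weak:

```python
import pandas as pd

result = pd.get_dummies(list("aaa"), drop_first=True)

# Expected: no columns, but the index still has one entry per input value.
expected = pd.DataFrame(index=range(3))
pd.testing.assert_frame_equal(result, expected)

# A frame with a different index is also "empty", so an .empty check
# would not notice that the index is wrong.
wrong_index = pd.DataFrame(index=[10, 11])
print(wrong_index.empty)  # True
```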

2: 0.0}})
assert_frame_equal(res, exp)

# Sparse dataframes do not allow nan labelled columns, see #GH8822
Contributor

so what does sparse do in that case? pls test that as well


res_just_na = get_dummies([nan], dummy_na=True, sparse=self.sparse,
                          drop_first=True)
tm.assert_numpy_array_equal(res_just_na.empty, True)
Contributor

this doesn't make any sense. again compare against an expected frame.

@BranYang
Contributor Author

BranYang commented Feb 3, 2016

@jreback pls kindly review the tests..

@jreback jreback closed this in 62363d2 Feb 8, 2016
@jreback
Contributor

jreback commented Feb 8, 2016

thanks @BranYang

cldy pushed a commit to cldy/pandas that referenced this pull request Feb 11, 2016
…iables out of n levels.

closes pandas-dev#12042     Some times it's useful to only accept n-1 variables
out of n categorical levels.

Author: Bran Yang <yangbo.84@gmail.com>

Closes pandas-dev#12092 from BranYang/master and squashes the following commits:

0528c57 [Bran Yang] Compare with empty DataFrame, not just check empty
0d99c2a [Bran Yang] Test the case that `drop_first` is on and categorical variable only has one level.
45f14e8 [Bran Yang] ENH: GH12042 Add parameter `drop_first` to get_dummies to get k-1 variables out of n levels.
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Development

Successfully merging this pull request may close these issues.

Pandas get_dummies() and n-1 Categorical Encoding Option to avoid Collinearity?
4 participants