ENH: allow get_dummies to accept dtype argument #18330
Codecov Report
@@ Coverage Diff @@
## master #18330 +/- ##
==========================================
- Coverage 91.35% 91.33% -0.02%
==========================================
Files 163 163
Lines 49714 49719 +5
==========================================
- Hits 45415 45410 -5
- Misses 4299 4309 +10
pandas/core/reshape/reshape.py
Outdated
@@ -725,6 +725,8 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, | |||
drop_first : bool, default False | |||
Whether to get k-1 dummies out of k categorical levels by removing the | |||
first level. | |||
dtype : dtype, default np.uint8 |
add a versionadded tag
pandas/tests/reshape/test_reshape.py
Outdated
@@ -217,34 +217,36 @@ def test_multiindex(self): | |||
|
|||
|
|||
class TestGetDummies(object): |
instead of doing this define a fixture that returns the various dtypes that you are testing
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -140,7 +140,7 @@ Sparse | |||
Reshaping | |||
^^^^^^^^^ | |||
|
|||
- | |||
- :func:`get_dummies` now supports ``dtype`` argument |
add a little more expl, add the PR number as the issue number. Move to Other Enhancements section.
All done. Also updated sparse tests to use fixtures as well. And added one test to verify effective dtype is uint8 when dtype argument is None. |
looks good generally. thanks for parametrizing the tests!
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -24,7 +24,17 @@ Other Enhancements | |||
|
|||
- Better support for :func:`Dataframe.style.to_excel` output with the ``xlsxwriter`` engine. (:issue:`16149`) | |||
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`) | |||
- |
make a separate sub-section for this
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -24,7 +24,17 @@ Other Enhancements | |||
|
|||
- Better support for :func:`Dataframe.style.to_excel` output with the ``xlsxwriter`` engine. (:issue:`16149`) | |||
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`) | |||
- | |||
- :func:`pandas.get_dummies` now supports ``dtype`` argument, which forces specific dtype for new columns. (:issue:`18330`) |
say default is the same (uint8)
doc/source/whatsnew/v0.22.0.txt
Outdated
- | ||
- :func:`pandas.get_dummies` now supports ``dtype`` argument, which forces specific dtype for new columns. (:issue:`18330`) | ||
|
||
.. code-block:: ipython |
use an ipython block, show the original as well (first)
pandas/tests/reshape/test_reshape.py
Outdated
self.df = DataFrame({'A': ['a', 'b', 'a'], | ||
'B': ['b', 'b', 'c'], | ||
'C': [1, 2, 3]}) | ||
@pytest.fixture(params=['uint8', 'float64']) |
cycle thru more dtypes here that are valid (doesn't have to be all, but include int64, bool, IOW valid for both sparse/dense)
add None as well
we should prob raise on object dtype I think.
pandas/tests/reshape/test_reshape.py
Outdated
expected = DataFrame({'a': [1, 0, 0], | ||
'b': [0, 1, 0], | ||
'c': [0, 0, 1]}, dtype=dtype) | ||
result = get_dummies(s_list, **kwargs) |
I wouldn't construct the kwargs, actually just pass directly
pandas/tests/reshape/test_reshape.py
Outdated
assert_frame_equal(res, exp) | ||
|
||
# Sparse dataframes do not allow nan labelled columns, see #GH8822 | ||
res_na = get_dummies(s, dummy_na=True, sparse=self.sparse) | ||
res_na = get_dummies(s, dummy_na=True, **kwargs) |
see my comment above, don't use kwargs generally for passing args to test functions, rather pass directly
3d504fb
to
d93ee28
Compare
Hello @Scorpil! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on November 22, 2017 at 13:15 Hours UTC |
0850ac7
to
cb3156a
Compare
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -24,7 +24,27 @@ Other Enhancements | |||
|
|||
- Better support for :func:`Dataframe.style.to_excel` output with the ``xlsxwriter`` engine. (:issue:`16149`) | |||
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`) | |||
- | |||
|
|||
``get_dummies`` now supports ``dtype`` argument |
add a ref here as well.
doc/source/whatsnew/v0.22.0.txt
Outdated
``get_dummies`` now supports ``dtype`` argument | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
:func:`get_dummies` function now accepts ``dtype`` argument, which forces specific dtype for new columns. When ``dtype`` is not specified or equals to ``None``, new columns will have dtype ``uint8`` (as before), so this change is backwards compatible. (:issue:`18330`) |
The :func:`get_dummies` function now accepts a ``dtype`` argument, which forces a specific dtype for the new columns. The default is ``uint8`` if ``dtype`` is not specified or ``None``.
pandas/core/reshape/reshape.py
Outdated
if dtype is None: | ||
dtype = np.uint8 | ||
|
||
if np.dtype(dtype) is np.dtype('O'): |
use is_object_dtype
pandas/core/reshape/reshape.py
Outdated
|
||
if np.dtype(dtype) is np.dtype('O'): | ||
raise TypeError("'object' is not a valid type for get_dummies") | ||
|
this could be a ValueError; also say "dtype=object is not a valid dtype for get_dummies"
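A small sketch of how the behavior requested here looks from the caller's side, assuming the check lands as a `ValueError` (the exact message text is an assumption and may differ between pandas versions):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a']})

# Assumption: get_dummies rejects dtype=object with a ValueError,
# as suggested in the review comment above.
try:
    pd.get_dummies(df, dtype=object)
except ValueError as err:
    message = str(err)
else:
    message = None

print(message)
```

Catching `ValueError` rather than `TypeError` matches the reviewer's point that the argument has the right type but an unsupported value.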
pandas/core/reshape/reshape.py
Outdated
return result | ||
|
||
|
||
def _get_dummies_1d(data, prefix, prefix_sep='_', dummy_na=False, | ||
sparse=False, drop_first=False): | ||
sparse=False, drop_first=False, dtype=np.uint8): |
should this be passthru?
?
Right, missed this one. The idea was to treat this one as internal and let the "wrapper" set the dtype, but passthru has its advantages and I don't mind either way, so I'll move the dtype-related conversions to this method.
Thanks.
I'll finish taking a look later but my only real concern is how many times you parametrize by the dtype. It's great to do that for some tests like test_basic_dtype and a few others, but I'm not sure about all the prefix / sep tests. Do you have specific concerns that you're trying to test there?
doc/source/whatsnew/v0.22.0.txt
Outdated
``get_dummies`` now supports ``dtype`` argument | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
:func:`get_dummies` function now accepts ``dtype`` argument, which forces specific dtype for new columns. When ``dtype`` is not specified or equals to ``None``, new columns will have dtype ``uint8`` (as before), so this change is backwards compatible. (:issue:`18330`) |
- Can remove "function" after get_dummies.
- "now accepts a dtype argument"
- replace "forces" with "specifies a"

Replace the second sentence with:
When ``dtype`` is not specified, the dtype will be ``uint8`` as before.
doc/source/whatsnew/v0.22.0.txt
Outdated
.. ipython:: python | ||
|
||
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}) | ||
pd.get_dummies(df, columns=['c']) |
I think you can remove the "Previous behavior" section since this is backwards compatible.
I'd just do
pd.get_dummies(df, columns=['c']).dtypes
pd.get_dummies(df, columns=['c'], dtype=bool).dtypes
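For readers who want to try the suggested doc example outside the Sphinx build, here is a self-contained sketch. The frame is the one from the diff above; the dtype of the dummy columns when `dtype` is omitted depends on the pandas version (`uint8` at the time of this PR), so the sketch does not rely on it.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Dummy columns are named '<prefix>_<value>', here 'c_5' and 'c_6'.
default_dtypes = pd.get_dummies(df, columns=['c']).dtypes
bool_dtypes = pd.get_dummies(df, columns=['c'], dtype=bool).dtypes
float_dtypes = pd.get_dummies(df, columns=['c'], dtype=float).dtypes

print(default_dtypes)  # version-dependent for the dummy columns
print(bool_dtypes)
print(float_dtypes)
```

Only the new dummy columns take the requested dtype; the untouched columns `a` and `b` keep their original integer dtype.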
@@ -697,7 +697,7 @@ def _convert_level_number(level_num, columns): | |||
|
|||
|
|||
def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, | |||
columns=None, sparse=False, drop_first=False): | |||
columns=None, sparse=False, drop_first=False, dtype=None): |
Any reason not to use 'uint8' or np.uint8 here?
I've tried to mirror the API of DataFrame, Series, Panel, etc., where passing None explicitly is allowed and means "dtype will be inferred".
@jreback @TomAugspurger So this is the last question to answer. Do you accept my argument about None or should I change it to np.uint8?
this is ok here, it follows a similar style elsewhere
@@ -728,6 +728,11 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, | |||
|
|||
.. versionadded:: 0.18.0 | |||
|
|||
dtype : dtype, default np.uint8 |
Can we also accept arguments to np.dtype like the string 'i8', and handle those appropriately?
he is using np.dtype()?
Yes, that should work already, I'll add it to the tests.
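To illustrate the point about `np.dtype` arguments, a short sketch showing that several spellings of the same dtype give the same result. The input series is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(list('abca'))

# Different ways of spelling a 64-bit integer dtype.
specs = ['i8', 'int64', np.int64, np.dtype('int64')]

# Collect the distinct column dtypes produced for each spelling.
dummy_dtypes = [pd.get_dummies(s, dtype=spec).dtypes.unique() for spec in specs]

for spec, found in zip(specs, dummy_dtypes):
    print(repr(spec), '->', list(found))
```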
# e.g. TestGetDummies::test_basic[uint8-sparse] instead of [uint8-True] | ||
return request.param == 'sparse' | ||
|
||
def effective_dtype(self, dtype): |
If we make the default np.uint8 you can remove this.
pandas/tests/reshape/test_reshape.py
Outdated
'C': [1, 2, 3]}) | ||
|
||
@pytest.fixture(params=['uint8', 'int64', np.float64, bool, None]) | ||
def dtype(self, request): |
If we use this fixture in many places it's going to add a ton of tests.
these are all quick, so this is ok
Initially I only had 'uint8' and 'float64', but @jreback reasonably suggested adding some more. What would be a good balance here? If I remove this fixture from all the unrelated tests, like the prefix / separator tests, and move None to a separate stand-alone test, would ['uint8', 'i8', np.float64, bool] be OK? That is still 4x the number of tests, but each item uses a different way to specify the dtype, so I think it's a meaningful set of fixtures.
i think toms point is to not apply this fixture to every test (just relevant ones)
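A rough sketch of the arrangement being discussed: one parametrized `dtype` fixture that only the dtype-relevant tests request. The test body is a simplified stand-in, not the code from this PR.

```python
import numpy as np
import pandas as pd
import pytest


# Each param spells a dtype differently: string name, short code,
# NumPy scalar type, and Python builtin.
@pytest.fixture(params=['uint8', 'i8', np.float64, bool])
def dtype(request):
    return np.dtype(request.param)


def test_basic_dtype(dtype):
    # Only tests that take `dtype` as an argument are multiplied by the fixture.
    result = pd.get_dummies(pd.Series(list('abc')), dtype=dtype)
    expected = pd.DataFrame({'a': [1, 0, 0],
                             'b': [0, 1, 0],
                             'c': [0, 0, 1]}, dtype=dtype)
    pd.testing.assert_frame_equal(result, expected)
```

Tests about prefixes or separators simply leave `dtype` out of their signature, so they run once rather than four times.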
pandas/tests/reshape/test_reshape.py
Outdated
def test_dataframe_dummies_all_obj(self): | ||
df = self.df[['A', 'B']] | ||
result = get_dummies(df, sparse=self.sparse) | ||
def test_dataframe_dummies_all_obj(self, df, sparse, dtype): |
What's the benefit of parametrizing by dtype here?
pandas/tests/reshape/test_reshape.py
Outdated
assert_frame_equal(result, expected) | ||
|
||
def test_dataframe_dummies_prefix_str(self): | ||
def test_dataframe_dummies_prefix_str(self, df, sparse, dtype): |
Same question. Is there any relationship between the prefix and dtype? They seem orthogonal.
pandas/tests/reshape/test_reshape.py
Outdated
assert_frame_equal(result, expected) | ||
|
||
def test_dataframe_dummies_subset(self): | ||
df = self.df | ||
def test_dataframe_dummies_subset(self, df, sparse, dtype): |
Same question: Any interaction between subset and dtype?
pandas/tests/reshape/test_reshape.py
Outdated
def test_dataframe_dummies_prefix_sep(self): | ||
df = self.df | ||
result = get_dummies(df, prefix_sep='..', sparse=self.sparse) | ||
def test_dataframe_dummies_prefix_sep(self, df, sparse, dtype): |
Same question :)
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -27,6 +27,17 @@ Other Enhancements | |||
- :class:`pandas.io.formats.style.Styler` now has method ``hide_index()`` to determine whether the index will be rendered in ouptut (:issue:`14194`) | |||
- :class:`pandas.io.formats.style.Styler` now has method ``hide_columns()`` to determine whether columns will be hidden in output (:issue:`14194`) | |||
|
can you add a ref here
I'm sorry, I'm not used to working with Sphinx. Do you mean something like this here:
- :func:`get_dummies` now supports ``dtype`` argument, see :ref:`here <whatsnew_0220.enhancements.get_dummies_dtype>` for more (:issue: `18330`)
and then this before the actual description block:
.. _whatsnew_0220.enhancements.get_dummies_dtype
?
pandas/core/reshape/reshape.py
Outdated
See Also | ||
-------- | ||
Series.str.get_dummies | ||
""" | ||
from pandas.core.reshape.concat import concat | ||
from itertools import cycle | ||
|
||
if dtype is None: | ||
dtype = np.uint8 | ||
|
need a dtype = np.dtype(dtype)
afbf368
to
6d447c3
Compare
Gave another quick glance, and things look good here.
pandas/tests/reshape/test_reshape.py
Outdated
# not that you should do this... | ||
df = self.df | ||
result = get_dummies(df, prefix='bad', sparse=self.sparse) | ||
df[['C']] = df[['C']].astype(np.uint8) |
Why the change here?
This test is a bit weird... Having 2 columns with identical names caused a ValueError when I tried expected['C'] = expected['C']. Now that you mention it, I see in the diff that expected = expected.astype({"C": np.int64}) should work, so I'll put it back.
df.loc[3, :] = [np.nan, np.nan, np.nan] | ||
result = get_dummies(df, dummy_na=True, sparse=self.sparse) | ||
result = get_dummies(df, dummy_na=True, |
Did you add the sorting because the output changed, or to make the test easier to write?
I slightly prefer the explicit ordering rather than sorting, though that'll be covered elsewhere so changing it isn't a huge deal.
Just a bit easier to write.
lgtm. small doc changes. have a look in reshaping.rst if any doc updates are needed.
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -28,6 +28,20 @@ Other Enhancements | |||
- :class:`pandas.io.formats.style.Styler` now has method ``hide_index()`` to determine whether the index will be rendered in ouptut (:issue:`14194`) | |||
- :class:`pandas.io.formats.style.Styler` now has method ``hide_columns()`` to determine whether columns will be hidden in output (:issue:`14194`) | |||
- Improved wording of ``ValueError`` raised in :func:`to_datetime` when ``unit=`` is passed with a non-convertible value (:issue:`14350`) | |||
- :func:`get_dummies` now supports ``dtype`` argument, see :ref:`here <whatsnew_0220.enhancements.get_dummies_dtype>` for more (:issue:`18330`) |
you can remove this line, its already covered in the sub-section. move the sub-section before other enhancements
pandas/tests/reshape/test_reshape.py
Outdated
return np.uint8 | ||
return dtype | ||
|
||
def test_throws_on_dtype_object(self, df): |
throws -> raises
8ab9859
to
a1de373
Compare
|
||
pd.get_dummies(df, dtype=bool).dtypes | ||
|
||
.. versionadded:: 0.22.0 |
can you move to before the example
doc/source/whatsnew/v0.22.0.txt
Outdated
``get_dummies`` now supports ``dtype`` argument | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
The :func:`get_dummies` now accepts a ``dtype`` argument, which specifies a specific dtype for the new columns. When ``dtype`` is not specified or ``None``, the dtype will be ``uint8`` as before. (:issue:`18330`) |
just say the default remains uint8
Done. Also removed useless 'specific' in "specifies a specific dtype".
@@ -240,7 +240,7 @@ values will be set to ``NaN``. | |||
df3 | |||
df3.unstack() | |||
|
|||
.. versionadded: 0.18.0 | |||
.. versionadded:: 0.18.0 |
This was a typo, right? There are a couple more places where the second colon is missing:
pandas/core/frame.py
4516: .. versionadded: 0.18.0
4679: .. versionadded: 0.16.1
pandas/core/generic.py
968: .. versionadded: 0.21.0
pandas/core/series.py
1629: .. versionadded: 0.19.0
2216: .. versionadded: 0.18.0
pandas/core/tools/datetimes.py
117: .. versionadded: 0.18.1
143: .. versionadded: 0.16.1
181: .. versionadded: 0.20.0
187: .. versionadded: 0.22.0
pandas/tseries/offsets.py
778: .. versionadded: 0.16.1
882: .. versionadded: 0.18.1
hmm yes looks that way. would be great if you can update those! (if you really want to could also add a lint rule to search for these and fail the build if they are found) (also in doc dir too).
Separate PR or this will do?
separate PR prob better. (the one you changed already is fine). I think we DO want to add some more generic checks for these formatting tags, I guess sphinx doesn't complain
Yeah, it's just comments for sphinx. I'll create an issue then, and see what I can do when I have time to look into it. Or somebody will pick it up before that, which is also fine :)
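The lint idea floated in this thread could be sketched roughly as follows. The function name and the regex are hypothetical, not an existing pandas CI check; the pattern flags `.. versionadded:` when it is followed by a single colon instead of two.

```python
import re

# Matches '.. versionadded:' that is NOT followed by a second colon.
MALFORMED_VERSIONADDED = re.compile(r"\.\.\s+versionadded:(?!:)")


def find_malformed_versionadded(text):
    """Return 1-based line numbers with a single-colon versionadded tag.

    Hypothetical helper for illustration only.
    """
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if MALFORMED_VERSIONADDED.search(line)
    ]


sample = "\n".join([
    "    .. versionadded: 0.18.0",   # malformed: Sphinx treats it as a comment
    "    .. versionadded:: 0.18.0",  # well-formed directive
])
print(find_malformed_versionadded(sample))
```

A CI step could run such a check over the source and doc directories and fail the build on any hit.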
pandas/core/reshape/reshape.py
Outdated
# Series avoids inconsistent NaN handling | ||
codes, levels = _factorize_from_iterable(Series(data)) | ||
|
||
if dtype is None: | ||
dtype = np.uint8 | ||
else: |
if dtype is None:
    dtype = np.uint8
dtype = np.dtype(dtype)

a bit more idiomatic
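Putting the review suggestions together, the dtype handling might look roughly like the sketch below. `resolve_dummy_dtype` is a hypothetical stand-alone helper for illustration, not the code merged into pandas, and the `uint8` default reflects this PR rather than any later pandas release.

```python
import numpy as np


def resolve_dummy_dtype(dtype=None):
    """Hypothetical helper: normalize the dtype argument for dummy columns."""
    # Default first, so that None never reaches np.dtype().
    if dtype is None:
        dtype = np.uint8
    # Accept anything np.dtype understands: 'i8', 'float64', bool, np.int64, ...
    dtype = np.dtype(dtype)
    # Object dtype makes no sense for 0/1 indicator columns.
    if dtype == np.dtype(object):
        raise ValueError("dtype=object is not a valid dtype for get_dummies")
    return dtype


print(resolve_dummy_dtype())       # default
print(resolve_dummy_dtype('i8'))   # string spelling
print(resolve_dummy_dtype(bool))   # Python builtin
```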
small comments. ping on green.
Use pytest fixtures. Add test for dtype=None.
fecb047
to
d19d81f
Compare
@jreback it's green. |
thanks @Scorpil nice patch! keep 'em coming! |
git diff upstream/master -u -- "*.py" | flake8 --diff
The update in version 0.19.0 made get_dummies return uint8 values instead of floats (#8725). While I agree with the argument that get_dummies should output integers by default (to save some memory), in many cases it would be beneficial for the user to choose another dtype.

In my case there was a serious performance degradation between versions 0.18 and 0.19. After investigation, the reason behind it turned out to be the change to the get_dummies output type. A DataFrame with dummy values was used as an argument to np.dot in an optimization function (the second argument was a matrix of floats). Since there were lots of iterations involved, and on each iteration np.dot was converting all uint8 values to float64, the conversion overhead took an unreasonably long time. It is possible to work around this issue by converting the dummy columns "manually" afterwards, but that adds unnecessary complexity to the code and is clearly less convenient than calling get_dummies with dtype=float.

Apart from performance considerations, I can imagine dtype=bool to be a common use case.

get_dummies(data, dtype=None) is allowed and will return uint8 values to match the DataFrame interface (where None allows inferring the datatype, which is the default behavior).

I've extended the test suite to run all the get_dummies tests (except for those that don't deal with internal dtypes, like test_just_na) twice, once with uint8 and once with float64.
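A minimal sketch of the use case described above, with made-up data and weights: requesting float dummies up front so that the matrix handed to `np.dot` is already `float64`, compared with the "manual" conversion workaround.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'a'])
weights = np.array([[0.5], [2.0]])  # illustrative float matrix, one row per level

# With the new argument: dummies are created as float64 directly.
dummies = pd.get_dummies(s, dtype=float)

# The workaround mentioned above: convert after the fact.
converted = pd.get_dummies(s).astype(float)

# Both operands are float64, so np.dot does not need to upcast on each call.
product = np.dot(dummies.to_numpy(), weights)
print(product)
```

In an optimization loop the dummy matrix is typically built once and reused, so either approach removes the per-iteration conversion; `dtype=float` just avoids the extra step.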