[DEPR]: Deprecate setting nans in categories #10929

TomAugspurger · 2015-08-29T13:22:00Z

WIP still

I have to run for now, but will pick this up later today.
I think I'm missing a few in the tests, since the warning is showing up in a bunch of places (there's a way to convert those to errors for testing right?)

I had to refactor a couple of functions that were setting ._categories directly instead of using Categorical._set_categories. I kept that in a separate commit. Could do a bit more refactoring with the validate_categories stuff, but that can be separate.

And I need to figure out the proper stacklevel for this warning. I think I used 3 for now.
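
On the two open questions above (turning warnings into errors while testing, and picking a stacklevel), the standard library's `warnings` module covers both. A minimal sketch, assuming a hypothetical stand-in function `set_categories_with_nan` rather than any real pandas code; `stacklevel=2` fits this toy call depth only, and the right value in the PR depends on how many internal frames sit between the user's call and `warnings.warn`:

```python
import warnings


def set_categories_with_nan(categories):
    """Hypothetical stand-in for code that emits the deprecation warning."""
    if any(c != c for c in categories):  # NaN is the only value not equal to itself
        # stacklevel=2 attributes the warning to the caller of this function
        warnings.warn(
            "Setting NaNs in `categories` is deprecated",
            FutureWarning,
            stacklevel=2,
        )
    return list(categories)


def raises_future_warning(categories):
    """Return True if the call warns, by promoting FutureWarning to an error."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", FutureWarning)
        try:
            set_categories_with_nan(categories)
        except FutureWarning:
            return True
    return False
```

The same promotion can be applied to a whole test run with `python -W error::FutureWarning`.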

jreback · 2015-08-29T13:36:24Z

pandas/core/categorical.py

@@ -443,12 +443,18 @@ def _validate_categories(cls, categories):
            raise ValueError('Categorical categories must be unique')
        return categories

-    def _set_categories(self, categories):
+    def _set_categories(self, categories, validate=True):
        """ Sets new categories """


need to change the setting of categories in the fastpath section (near top of __init__). and add a fastpath to avoid this check.

fastpath here goes thru set_categories which calls ._set_categories. It still needs to go through the validate part so I'm not really sure what it's fastpathing.

All the warnings are caught in the test_categorical. There were quite a few.

that's my point - if I am fast pathing I don't need any validation at all (so I want to call _set_categories) to assign them but not validate

rather than

self._categories = ...
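
A minimal sketch of the pattern being asked for here, using a toy class rather than the real `Categorical`: the fast path still goes through `_set_categories`, but passes `validate=False` so the uniqueness check is skipped. Only the `validate` parameter and the error message come from the diff above; `ToyCategorical` and the method bodies are illustrative assumptions.

```python
class ToyCategorical:
    """Illustrative only; not the pandas implementation."""

    def __init__(self, codes, categories, fastpath=False):
        self._codes = list(codes)
        # fastpath callers promise the categories are already valid
        self._set_categories(categories, validate=not fastpath)

    @staticmethod
    def _validate_categories(categories):
        categories = list(categories)
        if len(set(categories)) != len(categories):
            raise ValueError('Categorical categories must be unique')
        return categories

    def _set_categories(self, categories, validate=True):
        """Sets new categories, optionally skipping validation."""
        if validate:
            categories = self._validate_categories(categories)
        self._categories = list(categories)
```

With this shape, no caller assigns `self._categories` directly, and the cost of validation is paid only when `validate` is true.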

I think fastpath is being misused here. Some tests failed when I changed the constructor to do things properly (fastpath=True -> set categories directly, no validation). For one example (from the tests):

factor = Categorical([0,1,2,0,1,2]*100, ['a', 'b', 'c'], name='cat', fastpath=True)

breaks since it's assumed elsewhere that categories is always an Index. Followup in a separate issue?

fast path is purely internal - the point is that you already have things set up correctly and simply want to call the constructor

your example is not valid

My example came from the test suite :)

There's about 8 tests that fail / error if I change the constructor to skip category validation when fastpath is true. Not sure if I'll be able to get to those today.

hmm this extra validation is a problem
fast path creation of categoricals is already slow

TomAugspurger · 2015-08-31T12:19:13Z

@jreback cherry picked your PR, thanks for that. I can squash down to one commit if you want.

I'll try to do a perf test today.

jreback · 2015-08-31T12:28:42Z

squash up to you. yeh just do a quick perf test. merge when green.

jorisvandenbossche · 2015-08-31T13:22:38Z

doc/source/categorical.rst


 .. ipython:: python

-    s = pd.Series(["a","b",np.nan,"a"], dtype="category")
+    s = pd.Series(["a","b",np.nan,"a"], dtype="category"
+    )


I think this should be on the previous line?

TomAugspurger · 2015-08-31T14:18:46Z

@jorisvandenbossche Addressed your 3 points. I also

- Changed the level heading for Differences to R's factors to move it out one (it was under the Missing Data section)
- Added a note to the Differences section saying that R does allow missing levels / categories, while pandas doesn't.

running asv now. The only categorical test is in concat (not sure if this touches the fastpath constructor). Might add another test.

TomAugspurger · 2015-09-01T12:51:40Z

@jreback

In [14]: %timeit Categorical(codes, cat_idx, fastpath=True)
100 loops, best of 3: 4.01 ms per loop

In [15]: %timeit Categorical(values, cat_idx)
1 loops, best of 3: 279 ms per loop

👍 I've got an asv benchmark here. Just wrote and pushed an asv test for the Categorical constructor. Running it before and after your commit now. I'm guessing it should be similar to those numbers though since fastpath wasn't really skipping much before.

jreback · 2015-09-01T12:59:15Z

@TomAugspurger awesome

yeh in dask Categoricals are created a lot and just want to fastpath them (e.g. with from_codes), but the is_unique check is not necessary (but is currently done) on a fair amount. So this is good.

TomAugspurger · 2015-09-01T13:45:00Z

asv doesn't show much of a difference (assuming I ran this correctly).

asv run 30f672c..8d87f3b --bench=categorical_constructor
· Fetching recent changes
· Creating environments
· Discovering benchmarks
·· Uninstalling from py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt.
·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt...............................
·· Installing into py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit hash 8d87f3be:
[  0.00%] ·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt...................................
[  0.00%] ·· Benchmarking py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt
[ 25.00%] ··· Running categoricals.categorical_constructor.time_fastpath                         5.84ms
[ 50.00%] ··· Running categoricals.categorical_constructor.time_regular_constructor            250.19ms
[ 50.00%] · For pandas commit hash e757e8a8:
[ 50.00%] ·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt......................................
[ 50.00%] ·· Benchmarking py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt
[ 75.00%] ··· Running categoricals.categorical_constructor.time_fastpath                         4.83ms
[100.00%] ··· Running categoricals.categorical_constructor.time_regular_constructor            246.23ms

Those two hashes are

8d87f3b move NaN deprecation warning to _validate_categories, cleanup a bit
e757e8a DEPR: No NaNs in categories

jreback · 2015-09-01T14:19:37Z

ok, merge away then

TomAugspurger · 2015-09-01T19:18:25Z

Merged, thanks.

Closes dask#1565. For compatibility with pandas-dev/pandas#10929, where it was decided that `pd.Categorical(['a', np.nan], categories=['a', np.nan])` should emit a `FutureWarning`. Now we just drop missing values before computing the distincts for the categories.
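
A rough sketch of "drop missing values before computing the distincts", in plain Python rather than the dask implementation. The helper name `distinct_categories` is hypothetical, and only `None` and float NaN are treated as missing here:

```python
import math


def distinct_categories(values):
    """Return sorted distinct values with None / NaN removed (illustrative helper)."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return sorted({v for v in values if not is_missing(v)})
```

Because the missing values never reach the category list, nothing downstream needs to pass NaN as a category.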


jreback added the Categorical, Deprecate, and Missing-data labels Aug 29, 2015

jreback added this to the 0.17.0 milestone Aug 29, 2015

jreback reviewed Aug 29, 2015

TomAugspurger force-pushed the depr-categorical-nans branch 2 times, most recently from 668ef9f to 3745ca4 Compare August 31, 2015 12:17

jorisvandenbossche reviewed Aug 31, 2015

TomAugspurger force-pushed the depr-categorical-nans branch from 3745ca4 to ac84968 Compare August 31, 2015 14:17

TomAugspurger and others added 2 commits September 1, 2015 07:50

DEPR: No NaNs in categories

e757e8a

move NaN deprecation warning to _validate_categories, cleanup a bit

8d87f3b

TomAugspurger force-pushed the depr-categorical-nans branch from ac84968 to 8d87f3b Compare September 1, 2015 12:50

TomAugspurger pushed a commit that referenced this pull request Sep 1, 2015

Merge pull request #10929 from TomAugspurger/depr-categorical-nans

f784e9a

[DEPR]: Deprecate setting nans in categories

TomAugspurger merged commit f784e9a into pandas-dev:master Sep 1, 2015

TomAugspurger mentioned this pull request Sep 24, 2016

COMPAT/API: DataFrame.categorize missing values dask/dask#1578

Merged

TomAugspurger deleted the depr-categorical-nans branch April 5, 2017 02:06
