[DEPR]: Deprecate setting nans in categories #10929

Merged

TomAugspurger
Contributor

WIP still

Closes #10748

I have to run for now, but will pick this up later today.
I think I'm missing a few in the tests, since the warning is showing up in a bunch of places (there's a way to convert those to errors for testing, right?)

I had to refactor a couple of functions that were setting ._categories directly instead of using Categorical._set_categories. I kept that in a separate commit. Could do a bit more refactoring with the validate_categories stuff, but that can be separate.

And I need to figure out the proper stacklevel for this warning. I think I used 3 for now.
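
For reference, a minimal sketch of both points above, using only the standard `warnings` module: how a `FutureWarning` can be escalated to an error inside a test, and what `stacklevel=3` means when the warning is raised two calls below the user's code. The helper names `_validate_categories`, `set_categories`, and `raises_when_escalated` are hypothetical stand-ins here, not the pandas functions.

```python
import warnings


def _validate_categories(categories):
    # Hypothetical stand-in for the validation step: warn on NaN categories.
    # NaN is the only value that is not equal to itself.
    if any(c != c for c in categories):
        # stacklevel=3 skips this helper (1) and set_categories (2), so the
        # warning is attributed to whoever called set_categories (3).
        warnings.warn("Setting NaNs in `categories` is deprecated",
                      FutureWarning, stacklevel=3)
    return list(categories)


def set_categories(categories):
    # Hypothetical public entry point that delegates to the validator.
    return _validate_categories(categories)


def raises_when_escalated(categories):
    """Return True if the call warns once FutureWarning is made an error."""
    with warnings.catch_warnings():
        # Inside this block any FutureWarning is raised as an exception,
        # so a test fails loudly wherever the warning is not handled.
        warnings.simplefilter("error", FutureWarning)
        try:
            set_categories(categories)
        except FutureWarning:
            return True
    return False
```

The filter is scoped to the `with` block, so it does not leak into other tests.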

@jreback added the Categorical (Categorical Data Type), Deprecate (Functionality to remove in pandas), and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Aug 29, 2015
@jreback added this to the 0.17.0 milestone Aug 29, 2015
@@ -443,12 +443,18 @@ def _validate_categories(cls, categories):
             raise ValueError('Categorical categories must be unique')
         return categories

-    def _set_categories(self, categories):
+    def _set_categories(self, categories, validate=True):
         """ Sets new categories """
Contributor

need to change the setting of categories in the fastpath section (near top of __init__). and add a fastpath to avoid this check.

Contributor Author

fastpath here goes thru set_categories which calls ._set_categories. It still needs to go through the validate part so I'm not really sure what it's fastpathing.

All the warnings are caught in the test_categorical. There were quite a few.

Contributor

that's my point - if I am fast pathing I don't need any validation at all (so I want to call _set_categories to assign them but not validate)

rather than

self._categories = ...
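
A minimal sketch of the pattern under discussion, assuming a toy class rather than the real pandas `Categorical`: the fastpath still assigns through `_set_categories`, but passes `validate=False` so the uniqueness check is skipped. `ToyCategorical` and its attributes are hypothetical names used only for illustration.

```python
class ToyCategorical:
    """Hypothetical, simplified stand-in; not the pandas implementation."""

    def __init__(self, codes, categories, fastpath=False):
        self._codes = list(codes)
        # fastpath: the caller promises the categories are already valid,
        # so assign through the same setter but skip the checks.
        self._set_categories(categories, validate=not fastpath)

    @staticmethod
    def _validate_categories(categories):
        # Normalise to a list and reject duplicates.
        categories = list(categories)
        if len(set(categories)) != len(categories):
            raise ValueError('Categorical categories must be unique')
        return categories

    def _set_categories(self, categories, validate=True):
        """Sets new categories, optionally skipping validation."""
        if validate:
            categories = self._validate_categories(categories)
        # Without validation the object is stored exactly as passed in.
        self._categories = categories
```

Note that with `validate=False` nothing is normalised: whatever the caller passes is stored as-is, so the fastpath is only safe when the input is already in its final form.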

Contributor Author

I think fastpath is being misused here. Some tests failed when I changed the constructor to do things properly (fastpath=True -> set categories directly, no validation). For one example (from the tests):

factor = Categorical([0,1,2,0,1,2]*100, ['a', 'b', 'c'], name='cat', fastpath=True)

breaks since it's assumed elsewhere that categories is always an Index. Followup in a separate issue?

Contributor

fast path is purely internal - the point is you already have things set up correctly and simply need to call the constructor

your example is not valid

Contributor Author

My example came from the test suite :)

There are about 8 tests that fail / error if I change the constructor to skip category validation when fastpath is true. Not sure if I'll be able to get to those today.

Contributor

hmm this extra validation is a problem
fast path creation of categoricals is already slow

@TomAugspurger force-pushed the depr-categorical-nans branch 2 times, most recently from 668ef9f to 3745ca4 August 31, 2015 12:17
@TomAugspurger
Contributor Author

@jreback cherry picked your PR, thanks for that. I can squash down to one commit if you want.

I'll try to do a perf test today.

@jreback
Contributor

jreback commented Aug 31, 2015

squash up to you. yeh just do a quick perf test. merge when green.


.. ipython:: python

-    s = pd.Series(["a","b",np.nan,"a"], dtype="category")
+    s = pd.Series(["a","b",np.nan,"a"], dtype="category"
+    )
Member

I think this should be on the previous line?

@TomAugspurger
Contributor Author

@jorisvandenbossche Addressed your 3 points. I also

  • Changed the heading level of "Differences to R's factors" to move it out one (it was under the Missing Data section)
  • Added a note to the Differences section saying that R does allow missing levels / categories, while pandas doesn't.

running asv now. The only categorical test is in concat (not sure if this touches the fastpath constructor). Might add another test.

@TomAugspurger
Contributor Author

@jreback

In [14]: %timeit Categorical(codes, cat_idx, fastpath=True)
100 loops, best of 3: 4.01 ms per loop

In [15]: %timeit Categorical(values, cat_idx)
1 loops, best of 3: 279 ms per loop

👍 I've got an asv benchmark here. Just wrote and pushed an asv test for the Categorical constructor. Running it before and after your commit now. I'm guessing it should be similar to those numbers though since fastpath wasn't really skipping much before.

@jreback
Contributor

jreback commented Sep 1, 2015

@TomAugspurger awesome

yeh in dask Categoricals are created a lot and we just want to fastpath them (e.g. with from_codes); the is_unique check is not necessary for a fair amount of them (but is currently done). So this is good.

@TomAugspurger
Contributor Author

asv doesn't show much of a difference (assuming I ran this correctly).

asv run 30f672c..8d87f3b --bench=categorical_constructor
· Fetching recent changes
· Creating environments
· Discovering benchmarks
·· Uninstalling from py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt.
·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt...............................
·· Installing into py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit hash 8d87f3be:
[  0.00%] ·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt...................................
[  0.00%] ·· Benchmarking py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt
[ 25.00%] ··· Running categoricals.categorical_constructor.time_fastpath                         5.84ms
[ 50.00%] ··· Running categoricals.categorical_constructor.time_regular_constructor            250.19ms
[ 50.00%] · For pandas commit hash e757e8a8:
[ 50.00%] ·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt......................................
[ 50.00%] ·· Benchmarking py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt
[ 75.00%] ··· Running categoricals.categorical_constructor.time_fastpath                         4.83ms
[100.00%] ··· Running categoricals.categorical_constructor.time_regular_constructor            246.23ms

Those two hashes are

8d87f3b move NaN deprecation warning to _validate_categories, cleanup a bit
e757e8a DEPR: No NaNs in categories

@jreback
Contributor

jreback commented Sep 1, 2015

ok, merge away then

TomAugspurger pushed a commit that referenced this pull request Sep 1, 2015
[DEPR]: Deprecate setting nans in categories
@TomAugspurger merged commit f784e9a into pandas-dev:master Sep 1, 2015
@TomAugspurger
Contributor Author

Merged, thanks.

TomAugspurger added a commit to TomAugspurger/dask that referenced this pull request Sep 24, 2016
Closes dask#1565

For compatibility with pandas-dev/pandas#10929
where it was decided that

`pd.Categorical(['a', np.nan], categories=['a', np.nan])`

Should raise a `FutureWarning`. Now we just drop missing values
before computing the distincts for the categories.
jcrist pushed a commit to dask/dask that referenced this pull request Sep 24, 2016
Closes #1565

For compatibility with pandas-dev/pandas#10929
where it was decided that

`pd.Categorical(['a', np.nan], categories=['a', np.nan])`

Should raise a `FutureWarning`. Now we just drop missing values
before computing the distincts for the categories.
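
As an illustration of the idea in that commit message (not dask's actual code), a minimal pure-Python sketch that drops missing values before collecting the distinct values to use as categories. The function name `distinct_categories` and the choice to treat both `None` and float NaN as missing are assumptions made for this example.

```python
import math


def distinct_categories(values):
    """Return the distinct non-missing values, in first-seen order."""
    def is_missing(v):
        # Assumption: None and float NaN both count as missing here.
        return v is None or (isinstance(v, float) and math.isnan(v))

    seen = []
    for v in values:
        # Skip missing values first so NaN never becomes a category.
        if not is_missing(v) and v not in seen:
            seen.append(v)
    return seen
```

For example, `distinct_categories(['a', float('nan'), 'b', 'a'])` gives `['a', 'b']`, so the resulting categories never contain NaN and the deprecation warning is not triggered.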
@TomAugspurger deleted the depr-categorical-nans branch April 5, 2017 02:06