-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DEPR]: Deprecate setting nans in categories #10929
[DEPR]: Deprecate setting nans in categories #10929
Conversation
@@ -443,12 +443,18 @@ def _validate_categories(cls, categories): | |||
raise ValueError('Categorical categories must be unique') | |||
return categories | |||
|
|||
def _set_categories(self, categories): | |||
def _set_categories(self, categories, validate=True): | |||
""" Sets new categories """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to change the setting of categories in the fastpath section (near top of __init__
). and add a fastpath
to avoid this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fastpath
here goes thru set_categories
which calls ._set_categories
. It still needs to go through the validate
part so I'm not really sure what it's fastpathing.
All the warnings are caught in the test_categorical
. There were quite a few.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's my point - if I am fast pathing I don't need any validation at all (so I want to call _set_categories) to assign them but not validate
rather than
self._categories = ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think fastpath is being misused here. Some tests failed when I changed the constructor to do things properly (fastpath=True
-> set categories directly, no validation). For one example (from the tests):
factor = Categorical([0,1,2,0,1,2]*100, ['a', 'b', 'c'], name='cat', fastpath=True)
breaks since it's assumed elsewhere that categories
is always an Index. Followup in a separate issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fast path is purely internal - the point is you already have things setup correctly and simply easy to fall the constructor
your example is not valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My example came from the test suite :)
There's about 8 tests that fail / error if I change the constructor to skip category validation when fastpath
is true. Not sure if I'll be able to get to those today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm this extra validation is a problem
fast path creation of categorically is already slow
668ef9f
to
3745ca4
Compare
@jreback cherry picked your PR, thanks for that. I can squash down to one commit if you want. I'll try to do a perf test today. |
squash up to you. yeh just do a quick perf test. merge when green. |
|
||
.. ipython:: python | ||
|
||
s = pd.Series(["a","b",np.nan,"a"], dtype="category") | ||
s = pd.Series(["a","b",np.nan,"a"], dtype="category" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be on the previous line?
3745ca4
to
ac84968
Compare
@jorisvandenbossche Addressed your 3 points. I also
running asv now. The only categorical test is in |
ac84968
to
8d87f3b
Compare
In [14]: %timeit Categorical(codes, cat_idx, fastpath=True)
100 loops, best of 3: 4.01 ms per loop
In [15]: %timeit Categorical(values, cat_idx)
1 loops, best of 3: 279 ms per loop 👍 I've got an asv benchmark here. Just wrote and pushed an |
@TomAugspurger awesome yeh in dask |
asv doesn't show much of a difference (assuming I ran this correctly). asv run 30f672c..8d87f3b --bench=categorical_constructor
· Fetching recent changes
· Creating environments
· Discovering benchmarks
·· Uninstalling from py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt.
·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt...............................
·· Installing into py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For pandas commit hash 8d87f3be:
[ 0.00%] ·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt...................................
[ 0.00%] ·· Benchmarking py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt
[ 25.00%] ··· Running categoricals.categorical_constructor.time_fastpath 5.84ms
[ 50.00%] ··· Running categoricals.categorical_constructor.time_regular_constructor 250.19ms
[ 50.00%] · For pandas commit hash e757e8a8:
[ 50.00%] ·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt......................................
[ 50.00%] ·· Benchmarking py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-scipy-sqlalchemy-tables-xlrd-xlwt
[ 75.00%] ··· Running categoricals.categorical_constructor.time_fastpath 4.83ms
[100.00%] ··· Running categoricals.categorical_constructor.time_regular_constructor 246.23ms Those two hashes are
|
ok, merge away then |
[DEPR]: Deprecate setting nans in categories
Merged, thanks. |
Closes dask#1565 For compatability with pandas-dev/pandas#10929 where it was decided that `pd.Categorical(['a', np.nan], categories=['a', np.nan])` Should raise a `FutureWarning`. Now we just drop missing values before computing the distincts for the categories.
Closes #1565 For compatability with pandas-dev/pandas#10929 where it was decided that `pd.Categorical(['a', np.nan], categories=['a', np.nan])` Should raise a `FutureWarning`. Now we just drop missing values before computing the distincts for the categories.
WIP still
Closes #10748
I have to run for now, but will pick this up later today.
I think I'm missing a few in the tests, since the warning is showing up in a bunch of places (there's a way to convert those to errors for testing right?)
I had to refactor a couple function that were setting
._categories
directly instead of usingCategorical._set_categories
. I kept that in a separate commit. Could do a bit more refactoring with thevalidate_categories
stuff, but that can be separate.And I need to figure out the proper
stacklevel
for this warning. I think I used 3 for now.