[DEPR]: Deprecate setting nans in categories #10929

Merged
20 changes: 19 additions & 1 deletion asv_bench/benchmarks/categoricals.py
@@ -1,5 +1,5 @@
from .pandas_vb_common import *

import string

class concat_categorical(object):
    goal_time = 0.2
@@ -25,3 +25,21 @@ def time_value_counts(self):

    def time_value_counts_dropna(self):
        self.ts.value_counts(dropna=True)

class categorical_constructor(object):
    goal_time = 0.2

    def setup(self):
        n = 5
        N = 1e6
        self.categories = list(string.ascii_letters[:n])
        self.cat_idx = Index(self.categories)
        self.values = np.tile(self.categories, N)
        self.codes = np.tile(range(n), N)

    def time_regular_constructor(self):
        Categorical(self.values, self.categories)

    def time_fastpath(self):
        Categorical(self.codes, self.cat_idx, fastpath=True)
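
The benchmark above compares building a Categorical from raw values with building it from precomputed integer codes. The following is a minimal sketch of the same comparison, not the benchmark itself. It assumes a current pandas, where the public `Categorical.from_codes` stands in for the `fastpath=True` constructor call used in this diff. It also uses a smaller, integer repeat count, since `np.tile` expects integer repetitions.

```python
import string
import timeit

import numpy as np
import pandas as pd

n = 5
N = 10_000  # assumption: integer and much smaller than the benchmark's size

categories = list(string.ascii_letters[:n])
values = np.tile(categories, N)     # raw values: 'a', 'b', ... repeated N times
codes = np.tile(np.arange(n), N)    # integer positions into `categories`

# Regular path: every value has to be matched against the categories.
from_values = pd.Categorical(values, categories=categories)

# Codes path: from_codes is used here as a stand-in for fastpath=True.
from_codes = pd.Categorical.from_codes(codes, categories=categories)

# Both routes describe the same data.
same = bool((from_values.codes == from_codes.codes).all())

t_values = timeit.timeit(
    lambda: pd.Categorical(values, categories=categories), number=3)
t_codes = timeit.timeit(
    lambda: pd.Categorical.from_codes(codes, categories=categories), number=3)
print(f"from values: {t_values:.4f}s, from codes: {t_codes:.4f}s")
```

The timings depend on the machine and library versions, so no expected numbers are given here.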

29 changes: 13 additions & 16 deletions doc/source/categorical.rst
@@ -632,41 +632,35 @@ Missing Data

pandas primarily uses the value `np.nan` to represent missing data. It is by
default not included in computations. See the :ref:`Missing Data section
<missing_data>`
<missing_data>`.

There are two ways a `np.nan` can be represented in categorical data: either the value is not
available ("missing value") or `np.nan` is a valid category.
Missing values should **not** be included in the Categorical's ``categories``,
only in the ``values``.
Instead, it is understood that NaN is different, and is always a possibility.
When working with the Categorical's ``codes``, missing values will always have
a code of ``-1``.

.. ipython:: python

s = pd.Series(["a","b",np.nan,"a"], dtype="category")
# only two categories
s
Review comment (Member): and maybe also show s.cat.codes? (to show the -1)

s2 = pd.Series(["a","b","c","a"], dtype="category")
s2.cat.categories = [1,2,np.nan]
# three categories, np.nan included
s2
s.cat.codes
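
As a small illustration of the rule described above (missing values live in the values, never in the categories, and carry a code of ``-1``), here is a sketch assuming a current pandas. On a Series the codes are reached through the `.cat` accessor.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

# NaN is not a category: only "a" and "b" are listed.
categories = list(s.cat.categories)

# Codes are exposed through the .cat accessor; the missing entry is -1.
codes = s.cat.codes.tolist()

print(categories)  # ['a', 'b']
print(codes)       # [0, 1, -1, 0]
```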

.. note::
As integer `Series` can't include NaN, the categories were converted to `object`.

.. note::
Missing value methods like ``isnull`` and ``fillna`` will take both missing values as well as
`np.nan` categories into account:
Methods for working with missing data, e.g. :meth:`~Series.isnull`, :meth:`~Series.fillna`,
:meth:`~Series.dropna`, all work normally:

.. ipython:: python

c = pd.Series(["a","b",np.nan], dtype="category")
c.cat.set_categories(["a","b",np.nan], inplace=True)
# will be inserted as a NA category:
c[0] = np.nan
s = pd.Series(c)
s
pd.isnull(s)
s.fillna("a")
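
A short sketch of the missing-data methods on a categorical Series, assuming a current pandas (where `isna` is the preferred spelling and `isnull` is kept as an alias). The fill value is chosen to be an existing category.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan], dtype="category")

missing_mask = s.isna().tolist()   # isnull is an alias of isna
filled = s.fillna("a")             # "a" is already one of the categories
dropped = s.dropna()

print(missing_mask)      # [False, False, True]
print(filled.tolist())   # ['a', 'b', 'a']
print(dropped.tolist())  # ['a', 'b']
```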

Differences to R's `factor`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
---------------------------

The following differences to R's factor functions can be observed:

@@ -677,6 +671,9 @@ The following differences to R's factor functions can be observed:
* In contrast to R's `factor` function, using categorical data as the sole input to create a
new categorical series will *not* remove unused categories but create a new categorical series
which is equal to the passed in one!
* R allows for missing values to be included in its `levels` (pandas' `categories`). Pandas
does not allow `NaN` categories, but missing values can still be in the `values`.


Gotchas
-------
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.17.0.txt
@@ -652,6 +652,7 @@ Deprecations
===================== =================================

- ``Categorical.name`` was deprecated to make ``Categorical`` more ``numpy.ndarray`` like. Use ``Series(cat, name="whatever")`` instead (:issue:`10482`).
- Setting missing values (NaN) in a ``Categorical``'s ``categories`` will issue a warning (:issue:`10748`). You can still have missing values in the ``values``.
- ``drop_duplicates`` and ``duplicated``'s ``take_last`` keyword was deprecated in favor of ``keep``. (:issue:`6511`, :issue:`8505`)
- ``Series.nsmallest`` and ``nlargest``'s ``take_last`` keyword was deprecated in favor of ``keep``. (:issue:`10792`)
- ``DataFrame.combineAdd`` and ``DataFrame.combineMult`` are deprecated. They
1 change: 1 addition & 0 deletions pandas/core/base.py
@@ -392,6 +392,7 @@ def argmin(self, axis=None):
"""
return nanops.nanargmin(self.values)

@cache_readonly
def hasnans(self):
""" return if I have any nans; enables various perf speedups """
return com.isnull(self).any()
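
Caching `hasnans` means the NaN scan runs once per object rather than on every check. The sketch below shows the general idea with the standard library's `functools.cached_property` as a stand-in. The `Values` class and its `scans` counter are hypothetical, and this is not how pandas' own `cache_readonly` is implemented.

```python
from functools import cached_property
import math


class Values:
    """Hypothetical container used only for illustration; not a pandas class."""

    def __init__(self, data):
        self._data = list(data)
        self.scans = 0  # counts how often the data is actually scanned

    @cached_property
    def hasnans(self):
        # Computed on first access, then stored on the instance.
        self.scans += 1
        return any(isinstance(x, float) and math.isnan(x) for x in self._data)


v = Values([1.0, float("nan"), 3.0])
first = v.hasnans
second = v.hasnans  # served from the cache, no second scan
print(first, second, v.scans)
```

A cache like this is only safe when the underlying data does not change after the first access.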
59 changes: 44 additions & 15 deletions pandas/core/categorical.py
@@ -207,7 +207,7 @@ def __init__(self, values, categories=None, ordered=False, name=None, fastpath=F
if fastpath:
# fast path
self._codes = _coerce_indexer_dtype(values, categories)
self.categories = categories
self._categories = self._validate_categories(categories, fastpath=isinstance(categories, ABCIndexClass))
self._ordered = ordered
return

@@ -274,6 +274,8 @@ def __init__(self, values, categories=None, ordered=False, name=None, fastpath=F
### FIXME ####
raise NotImplementedError("> 1 ndim Categorical are not supported at this time")

categories = self._validate_categories(categories)

else:
# there were two ways if categories are present
# - the old one, where each value is a int pointer to the levels array -> not anymore
@@ -282,7 +284,6 @@ def __init__(self, values, categories=None, ordered=False, name=None, fastpath=F

# make sure that we always have the same type here, no matter what we get passed in
categories = self._validate_categories(categories)

codes = _get_codes_for_values(values, categories)

# TODO: check for old style usage. These warnings should be removes after 0.18/ in 2016
@@ -295,7 +296,7 @@ def __init__(self, values, categories=None, ordered=False, name=None, fastpath=F
"'Categorical.from_codes(codes, categories)'?", RuntimeWarning, stacklevel=2)

self.set_ordered(ordered or False, inplace=True)
self.categories = categories
self._categories = categories
self._codes = _coerce_indexer_dtype(codes, categories)

def copy(self):
@@ -421,9 +422,15 @@ def _get_labels(self):
_categories = None

@classmethod
def _validate_categories(cls, categories):
def _validate_categories(cls, categories, fastpath=False):
"""
Validates that we have good categories

Parameters
----------
fastpath : boolean (default: False)
Don't perform validation of the categories for uniqueness or nulls

"""
if not isinstance(categories, ABCIndexClass):
dtype = None
@@ -439,16 +446,40 @@ def _validate_categories(cls, categories):

from pandas import Index
categories = Index(categories, dtype=dtype)
if not categories.is_unique:
raise ValueError('Categorical categories must be unique')

if not fastpath:

# check properties of the categories
# we don't allow NaNs in the categories themselves

if categories.hasnans:
# NaNs in cats deprecated in 0.17, remove in 0.18 or 0.19 GH 10748
msg = ('\nSetting NaNs in `categories` is deprecated and '
'will be removed in a future version of pandas.')
warn(msg, FutureWarning, stacklevel=5)

# categories must be unique

if not categories.is_unique:
raise ValueError('Categorical categories must be unique')

return categories

def _set_categories(self, categories):
""" Sets new categories """
categories = self._validate_categories(categories)
if not self._categories is None and len(categories) != len(self._categories):
def _set_categories(self, categories, fastpath=False):
""" Sets new categories

Parameters
----------
fastpath : boolean (default: False)
Don't perform validation of the categories for uniqueness or nulls

"""

categories = self._validate_categories(categories, fastpath=fastpath)
if not fastpath and not self._categories is None and len(categories) != len(self._categories):
raise ValueError("new categories need to have the same number of items than the old "
"categories!")

self._categories = categories

def _get_categories(self):
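
The validation in the hunk above boils down to three rules: warn (but still accept) when the categories contain NaN, raise when they are not unique, and skip both checks on the fast path. A minimal standalone sketch of those rules follows. `validate_categories` is a hypothetical helper working on plain lists, not the pandas method, and its `stacklevel` is chosen for this sketch only.

```python
import math
import warnings


def validate_categories(categories, fastpath=False):
    """Hypothetical stand-in for the checks in the diff; not the pandas code."""
    categories = list(categories)
    if not fastpath:
        # NaN categories are deprecated: warn, but keep accepting them.
        if any(isinstance(c, float) and math.isnan(c) for c in categories):
            warnings.warn(
                "Setting NaNs in `categories` is deprecated and "
                "will be removed in a future version of pandas.",
                FutureWarning,
                stacklevel=2,
            )
        # Duplicate categories are an error.
        if len(set(categories)) != len(categories):
            raise ValueError("Categorical categories must be unique")
    return categories


# Record warnings so the deprecation can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validate_categories(["a", "b", float("nan")])

nan_warned = any(issubclass(w.category, FutureWarning) for w in caught)
print(nan_warned)
```

With `fastpath=True` the caller takes responsibility for passing categories that are already known to be valid, which is why the checks can be skipped.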
@@ -581,11 +612,10 @@ def set_categories(self, new_categories, ordered=None, rename=False, inplace=Fal
if not cat._categories is None and len(new_categories) < len(cat._categories):
# remove all _codes which are larger and set to -1/NaN
self._codes[self._codes >= len(new_categories)] = -1
cat._categories = new_categories
else:
values = cat.__array__()
cat._codes = _get_codes_for_values(values, new_categories)
cat._categories = new_categories
cat._categories = new_categories

if ordered is None:
ordered = self.ordered
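
The two branches in `set_categories` treat codes differently: with `rename=True` the positions are kept and codes that point past the end of a shorter category list become ``-1``, while otherwise each value is looked up again in the new categories. A rough sketch of that distinction on plain lists follows. `recode` is a hypothetical helper, not part of pandas.

```python
def recode(codes, old_categories, new_categories, rename=False):
    """Hypothetical helper mirroring the two branches; not the pandas code."""
    if rename:
        # Positions are kept; codes pointing past the new list become missing.
        return [c if c < len(new_categories) else -1 for c in codes]
    # Otherwise look each value up again in the new categories.
    position = {cat: i for i, cat in enumerate(new_categories)}
    return [
        position.get(old_categories[c], -1) if c != -1 else -1
        for c in codes
    ]


codes = [0, 1, 2, 1, -1]
old = ["a", "b", "c"]

renamed = recode(codes, old, ["x", "y"], rename=True)
relooked = recode(codes, old, ["c", "a"])

print(renamed)   # [0, 1, -1, 1, -1]
print(relooked)  # [1, -1, 0, -1, -1]
```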
@@ -706,9 +736,8 @@ def add_categories(self, new_categories, inplace=False):
msg = "new categories must not include old categories: %s" % str(already_included)
raise ValueError(msg)
new_categories = list(self._categories) + list(new_categories)
new_categories = self._validate_categories(new_categories)
cat = self if inplace else self.copy()
cat._categories = new_categories
cat._categories = self._validate_categories(new_categories)
cat._codes = _coerce_indexer_dtype(cat._codes, new_categories)
if not inplace:
return cat
@@ -1171,7 +1200,7 @@ def order(self, inplace=False, ascending=True, na_position='last'):
Category.sort
"""
warn("order is deprecated, use sort_values(...)",
FutureWarning, stacklevel=2)
FutureWarning, stacklevel=3)
return self.sort_values(inplace=inplace, ascending=ascending, na_position=na_position)

def sort(self, inplace=True, ascending=True, na_position='last'):