Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: For GH23814, return early in Categorical.__init__ #23888

Merged
merged 15 commits into from
Nov 30, 2018
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions asv_bench/benchmarks/categoricals.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@ def setup(self):
self.values_some_nan = list(np.tile(self.categories + [np.nan], N))
self.values_all_nan = [np.nan] * len(self.values)
self.values_all_int8 = np.ones(N, 'int8')
self.categorical = pd.Categorical(self.values, self.categories)

def time_regular(self):
pd.Categorical(self.values, self.categories)
Expand All @@ -68,6 +69,9 @@ def time_all_nan(self):
def time_from_codes_all_int8(self):
pd.Categorical.from_codes(self.values_all_int8, self.categories)

def time_existing_categorical(self):
pd.Categorical(self.categorical)


class ValueCounts(object):

Expand Down
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.24.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1150,7 +1150,7 @@ Performance Improvements
- Improved performance of :func:`pd.concat` for `Series` objects (:issue:`23404`)
- Improved performance of :meth:`DatetimeIndex.normalize` and :meth:`Timestamp.normalize` for timezone naive or UTC datetimes (:issue:`23634`)
- Improved performance of :meth:`DatetimeIndex.tz_localize` and various ``DatetimeIndex`` attributes with dateutil UTC timezone (:issue:`23772`)

- Improved performance of :meth:`Categorical.__init__` (:issue:`23814`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say constructor rather than referring to __init__

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback , updated doc string, and also added asv test that exercises the code (first one didn't, but left it since still useful) (you can see my comment about asv results to gfyoung)


.. _whatsnew_0240.docs:

Expand Down
10 changes: 10 additions & 0 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -314,6 +314,16 @@ class Categorical(ExtensionArray, PandasObject):
def __init__(self, values, categories=None, ordered=None, dtype=None,
fastpath=False):

# GH23814, for perf, if no optional params used and values already an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can just move this down to where the fastpath check is now; you can add this on i think. this constructor is already amazing too complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at that point, the arg dtype, and maybe categories, will be set. I wanted to only use this early return if none of the optional args were specified (I believe @TomAugsperger was suggesting this in the issue thread).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eoveson I would still like to investiagte consolidating some of this code. This is a very complicated constructor and more code is not great here. See if you can add it lower down, even if its slightly lower perf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback , Ok let me look into this and see if I can consolidate some of the code..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, please check it out when you get a chance

# instance of Categorical, simply return the same dtype/codes
if categories is None and ordered is None and dtype is None:
if isinstance(values, (ABCSeries, ABCIndexClass)):
values = values._values
if isinstance(values, type(self)):
self._dtype = values.dtype
self._codes = values.codes.copy()
return

# Ways of specifying the dtype (prioritized ordered)
# 1. dtype is a CategoricalDtype
# a.) with known categories, use dtype.categories
Expand Down