Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename categories with Series #17982

Merged
merged 9 commits into from
Oct 26, 2017

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Oct 25, 2017

Closes #17981

head:

[ 50.00%] ··· Running categoricals.Categoricals3.time_rank_string_cat         6.39ms
[100.00%] ··· Running categoricals.Categoricals3.time_rank_string_cat_ordered 4.81ms

master:

[ 50.00%] ··· Running categoricals.Categoricals3.time_rank_string_cat         137ms
[100.00%] ··· Running categoricals.Categoricals3.time_rank_string_cat_ordered 5.14ms

HEAD:

```
[ 50.00%] ··· Running categoricals.Categoricals3.time_rank_string_cat          6.63ms
[ 50.00%] ·····
[100.00%] ··· Running categoricals.Categoricals3.time_rank_string_cat_ordered  4.85ms
```

Closes pandas-dev#17981
@TomAugspurger TomAugspurger added the Categorical Categorical Data Type label Oct 25, 2017
@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Oct 25, 2017
@TomAugspurger TomAugspurger added the Performance Memory or execution speed performance label Oct 25, 2017
@codecov
Copy link

codecov bot commented Oct 25, 2017

Codecov Report

Merging #17982 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17982      +/-   ##
==========================================
- Coverage   91.23%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       50113    50117       +4     
==========================================
- Hits        45723    45718       -5     
- Misses       4390     4399       +9
Flag Coverage Δ
#multiple 89.03% <100%> (ø) ⬆️
#single 40.3% <0%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/categorical.py 95.75% <100%> (+0.02%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/core/ops.py 91.77% <0%> (ø) ⬆️
pandas/core/reshape/concat.py 97.58% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36c309e...0e086ca. Read the comment docs.

@@ -879,8 +874,22 @@ def rename_categories(self, new_categories, inplace=False):

Parameters
----------
new_categories : Index-like or dict-like (>=0.21.0)
The renamed categories.
new_categories : array-like or dict-like
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of a nitpick: new_categories is listed as "array-like" here, but in the description below the term "list-like" is used instead, and then it's switched back to "array-like" further down in the examples section. Might be nice to make the terminology consistent, but certainly not something that's super critical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

@jreback
Copy link
Contributor

jreback commented Oct 25, 2017

lgtm. merge away.

@jreback
Copy link
Contributor

jreback commented Oct 25, 2017

you migh want to add the issue number to the original issue number in the whatsnew (#17336)

Categories (2, object): [A, b]

Series are considered list-like here, so the *values* are used
instead of the *index*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually want this behaviour?
Eg for Series.rename, a Series is seen as a dict-like ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. It’ll be a backwards incompatible change if we don’t treat Series as arrays so I think we should at least do this for now, maybe with a warning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I think this is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback which part to you agree with? Warning that it'll change to dict-like in the future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I agree the current behavior is correct. we handle list-like the same. no warning is needed as this is expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, looking at this again.

a Series should be just like a dict. This is a perf issue yes?
do this. (once Index.map works we could simplify a bit)

In [12]: cat = pd.Categorical(['a', 'b', 'c', 'd'])
    ...: res = cat.rename_categories(pd.Series({'a': 4, 'b': 3, 'c': 2, 'd': 1}))
    ...: 
    ...: 

In [13]: cat
Out[13]: 
[a, b, c, d]
Categories (4, object): [a, b, c, d]

In [14]: res
Out[14]: 
[4, 3, 2, 1]
Categories (4, int64): [4, 3, 2, 1]

In [17]: pd.Series(cat.categories).map({'a': 4, 'b': 3, 'c': 2, 'd': 1}).values
Out[17]: array([4, 3, 2, 1])

In [19]: pd.Series(cat.categories).map({'a': 4, 'b': 3, 'c': 2, 'd': 1}).values
Out[19]: array([4, 3, 2, 1])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert the warning

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain?
If we go for "Series -> dict-like" behaviour, this is a breaking change, and we need to use a warning for that.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

@TomAugspurger
Copy link
Contributor Author

It's not really a perf issue, that's just the reason we caught it. The main issue is the API change from

c = pd.Categorical(['a', 'b'])
c.rename(pd.Series([10, 20)

Before #17586, the values would be used, and so the categories would be [10, 20]. Afterwards, the series is considered dict-like, so the index -> value mapping is used.

I really don't think we should just break API here. It'll be very surprising and difficult to debug. But in most cases Series is considered dict-like, so I think a warning is the best option here.

@jreback
Copy link
Contributor

jreback commented Oct 26, 2017

why don’t you just make th change to fix this back to how it was; this PR was in 0.21 so it’s not externally visible

warnings are a terrible things and should rarely be used

@TomAugspurger
Copy link
Contributor Author

Just pushed a WIP adding the warning so that we're talking about the same thing, will clean up a bit further.

@TomAugspurger
Copy link
Contributor Author

why don’t you just make th change to fix this back to how it was;

Not sure what you mean here, revert the entire PR? That's a useful feature, it just (unexpectedly) changes the behavior for Series. We didn't have any tests covering that, so it wasn't caught at the time.

"For dict-like, use 'new_categories.to_dict()'\n"
"For list-like, use 'new_categories.values'.")
warn(msg, FutureWarning, stacklevel=2)
if is_dict_like(new_categories) and not is_series:
cat.categories = [new_categories.get(item, item)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use map

new_categories : Index-like or dict-like (>=0.21.0)
The renamed categories.
new_categories : list-like or dict-like
The categories end up with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like an incomplete sentence ?

categories.

If dict-like, categories not contained in the mapping are passed
through.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this a bit unstructured with the versionchanged and the further explanation splitted.

Suggestion (just copy paste, not edited sentences):

- list-like: If it is list-like, all items must be unique and the number of
  items in the new categories must match the existing number of
  categories.
- .. versionadded:: 0.21.1 dict-like, in which case it specifies a mapping from old-categories to new. Categories not contained in the mapping are passed through.

.. warning::

   about series

(only not fully sure how the versionchanges works in a list)

Categories (2, object): [A, b]

Series are considered list-like here, so the *values* are used
instead of the *index*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain?
If we go for "Series -> dict-like" behaviour, this is a breaking change, and we need to use a warning for that.

"treat Series like a dictionary.\n"
"For dict-like, use 'new_categories.to_dict()'\n"
"For list-like, use 'new_categories.values'.")
warn(msg, FutureWarning, stacklevel=2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe convert the series to array (list-like), so then the rest of the code does not need to take care of it being a series or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go for "Series -> dict-like" behaviour, this is a breaking change, and we need to use a warning for that.

Sorry I think that was an example I added in the first commit of this PR, before we decided to treat Series as list-like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a breaking change at all? we simply did not support this before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old behavior required a list-like, and Series are list like. It's not unreasonable for a user to expect

cat.rename(Series([0, 1]))

to work, since it did! But we have a new feature that changes the behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I c, I think this was accidently supported before. ok so fine on the FutureWarning.

@jorisvandenbossche
Copy link
Member

I find the current PR fine: warn for series to say that in future it will use dict-like behaviour, but for now keep list-like behaviour

@TomAugspurger
Copy link
Contributor Author

@jreback are you OK with the FutureWarning for now?

@TomAugspurger TomAugspurger mentioned this pull request Oct 26, 2017
64 tasks
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from some typos, looks good to me

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`~Series.cat.rename_categories` now accepts a dict-like argument for
``new_categories``. The previous categories are lookup up in the dictionary's
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"lookup up" -> ""looked up"? (not sure what is the correct english conjugation)

* dict-like: specifies a mapping from
old categories to new. Categories not contained in the mapping
are passed through and extra categories in the mapping are
ignored. *New in verison 0.21.0*.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verison -> version

"treat Series like a dictionary.\n"
"For dict-like, use 'new_categories.to_dict()'\n"
"For list-like, use 'new_categories.values'.")
warn(msg, FutureWarning, stacklevel=2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I c, I think this was accidently supported before. ok so fine on the FutureWarning.

@jreback jreback merged commit 7aeccd3 into pandas-dev:master Oct 26, 2017
@TomAugspurger TomAugspurger deleted the rename-categories-bug branch October 27, 2017 11:43
peterpanmj pushed a commit to peterpanmj/pandas that referenced this pull request Oct 31, 2017
* PERF/API: Treat series as array-like for rename_categories

HEAD:

```
[ 50.00%] ··· Running categoricals.Categoricals3.time_rank_string_cat          6.63ms
[ 50.00%] ·····
[100.00%] ··· Running categoricals.Categoricals3.time_rank_string_cat_ordered  4.85ms
```

Closes pandas-dev#17981

* Redo docstring

* Use list-like

* Warn

* Fix doc indent

* Doc cleanup

* More doc cleanup

* Fix API reference

* Typos
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
* PERF/API: Treat series as array-like for rename_categories

HEAD:

```
[ 50.00%] ··· Running categoricals.Categoricals3.time_rank_string_cat          6.63ms
[ 50.00%] ·····
[100.00%] ··· Running categoricals.Categoricals3.time_rank_string_cat_ordered  4.85ms
```

Closes pandas-dev#17981

* Redo docstring

* Use list-like

* Warn

* Fix doc indent

* Doc cleanup

* More doc cleanup

* Fix API reference

* Typos
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants