-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Categorical: don't sort the categoricals if Categorical(..., ordered=False) #9347
Conversation
…False) In mwaskom/seaborn#361 it was discussed that lexicographical sorting the categories is only appropiate if an order is specified/implied. If this is explicitly not done, e.g. with `Categorical(..., ordered=False)` then the order should be taken from the order of appearance, similar to the current `Series.unique()` implementation.
Note this should only be taken if I'm also not so sure what is best here: if a categorical is changed from
|
IMO |
@shoyer: can you elaborate? :-) This is modeled after R's |
@shoyer the more I think about it, the more sense it makes to not let If that's desirable, we need a @jreback @jorisvandenbossche any comments? |
@@ -268,7 +268,7 @@ def __init__(self, values, categories=None, ordered=None, name=None, fastpath=Fa | |||
|
|||
if categories is None: | |||
try: | |||
codes, categories = factorize(values, sort=True) | |||
codes, categories = factorize(values, sort=ordered if not ordered is None else True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More succinctly, this could be ordered is None or ordered
I don't think we necessarily need a change ordered method; it's straightforward enough to use By "part of the dtype" I'm referring to categorical data as it's defined, e.g., in dynd -- everything that's not the specific values of an array is part of the dtype. From a simplicity perspective, it's nice if that cannot change. |
But right now it can change, as it is "only inplace", as it is a setter: E.g. df["cat"] = ...
df.cat.cat.ordered = False will change the dataframe itself For all else, you have to assign again or use inplace=True: df["cat"] = df.cat.cat.reorder_categories([...])
df.cat.cat.reorder_categories([...], inplace=True) |
The current method to change the order is not friendly and requires more knowledge than most users would probably have. Of course, many people would just use ordered = pd.Categorical(['a','b','c']),
unordered = pd.Categorical(ordered, ordered=False) which is a little wasteful but simple. I could see something like unordered = ordered.swap_ordering() or ordered.swap_ordering(inplace=True) making it more obvious how to remove (or add) ordering. One of the dangers of the current approach to ordering is unordered = pd.Categorical(['a','c','d','b'], ordered=False)
unordered.ordered=True which seems to work but only by accident. |
actually we tried to hide |
@JanSchulz I think this PR is fine. separate issue is whether to change what |
@JanSchulz pls add a release note (API section, maybe add a small example to be clear). |
As a comparison, R does lexicographical sort the categories for unordered factors (but that is the default). And apart from that, I don't know if I find it more logical or not to sort the categories or not. @JanSchulz As you said in another discussion (about the sorting I think), that often 'the order of appearance' in a dataset does not really mean something. So I don't know if I find it worth changing. |
@JanSchulz can you add a release note (and maybe a short example of the behavior change to 0.16.0) |
@JanSchulz can you add a release note explaining this change? otherwise looks gtg. |
@jreback @JanSchulz I am not really sure I think this is a good idea. I think having it sorted lexicographically is more logical, as there is no inherent order in the categories. But you still want to have this somewhat consistent. In that way, sorting it lexicographically makes the most sense I think. This is also what R does (but there the default is unordered), and I think also what dynd does in |
So think the intent is to have the 2nd section (where no ordering is specified explicity) be turned into the first section by default (for all cases). They will still be ordered/unordered as a Categorical type as indicated by the This will effectively remove the default ordering from lexiographic to the discovery order (which is how a couple of options I see
aside from this, I think we need for #9190
and then raise on I think 4) is the most logical here. Basically turning default |
Yes, I also wanted to raise the idea to have unordered as a default (wanted to make an issue for that, but started with reraising in this issue). Do we discuss that here or open a new one? I also think
|
I can live with making it unordered as default. Regarding the default order on the categories: I actually don't mind much, if you want to have it ordered in most of the case you have to reorder anyway... i would lean slightly to let it be as it is now (lexi sorting and not order of appearance). Someone who is willing to make a decision should do it by either pulling this PR or closing it :-) -> In line with option 4 above (default to unordered but lexi-sort the categories) Regarding the method which changes the "ordered" property: maybe a Not sure about deprecation or not. If deprecation, just one cycle? |
closing in favor of #9611 |
In mwaskom/seaborn#361 it was discussed that lexicographical sorting the categories is only appropiate if an order is specified/implied. If this is explicitly not done, e.g. with
Categorical(..., ordered=False)
then the order should be taken from the order of appearance, similar to the currentSeries.unique()
implementation.