Seaborn should respect categorical order when sorting pd.Categorical objects #361
Comments
I think it should probably respect the index order by default, e.g. if I have a sorted-ascending DataFrame, I should have a sorted-ascending bar chart. |
Generally I agree, but I was assuming @mwaskom had his reasons. If this is just a side effect of the fact that he's using np.unique, he should try the pandas unique method instead, which does not sort (and is also a little faster). On Mon, Nov 10, 2014 at 10:10 PM, Rob Story notifications@github.com
|
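The difference between the two functions mentioned above can be seen in a minimal sketch. The `months` data is made up for illustration and is not taken from seaborn's code.

```python
import numpy as np
import pandas as pd

# Hypothetical data: month labels in the order they appear in a column
months = pd.Series(["Mar", "Jan", "Feb", "Jan", "Mar"])

# np.unique returns the distinct values sorted (lexicographically for strings)
sorted_levels = list(np.unique(months))

# pd.unique returns the distinct values in order of first appearance
appearance_levels = list(pd.unique(months))

print(sorted_levels)
print(appearance_levels)
```

With the sorted result, the calendar order of the months is lost; with the order-of-appearance result, the plot order depends on how the rows happen to be arranged.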
I definitely agree with @shoyer about categorical types. I'm not sure I feel all that strongly about alphabetical sorting (though to @wrobstory's point, without having any way to know whether a column is intentionally sorted, it seemed best for the default to do something predictable). But it sounds easy to just change To go along with this I think the test datasets should be updated to use |
I'm in favor of this, but I'm gonna kick it to the 0.6 cycle because it will take a little thinking. I want to figure out the best way to make this behavior as consistent as possible across the package. |
@olgabot The sorting order for bins will also be fixed upstream when I finish adding an Interval type to pandas (which will sort properly), instead of using string labels. |
Hi all, is there a workaround for this currently? I love the power of the col= parameter, where you can create a graph for all instances of a column, but I want to be able to plot Jan, Feb, Mar in order. |
@yoshiserry Yes, use the |
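For the month-ordering question, one possible approach on the pandas side (a sketch, and not necessarily the workaround referred to above) is to store the intended order in an ordered categorical. The column names and values here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with months in arbitrary row order
df = pd.DataFrame({"month": ["Mar", "Jan", "Feb", "Jan"],
                   "sales": [3, 1, 2, 4]})

# Declare the calendar order explicitly instead of relying on sorting
month_order = ["Jan", "Feb", "Mar"]
df["month"] = pd.Categorical(df["month"], categories=month_order, ordered=True)

# Sorting now follows the declared category order, not the alphabet
sorted_df = df.sort_values("month")
print(list(sorted_df["month"]))
```

The same `month_order` list could also be handed to whatever explicit ordering argument a plotting function accepts.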
Do the changes in #409 look reasonable? It provides a compatibility function that uses pandas sort and unique, which 1) handles NA in a consistent manner because Want to get some feedback before adding a few more tests and touching the code base in other parts. Should this approach replace all the |
Thanks for taking a crack at it. The open issue is really whether things should be lexicographically sorted or in the order that they appear in the dataframe (so just what the straight pandas I originally had a mild preference for consistent behavior (sorts are at least predictable once you expect them), but if it requires a fair amount of complexity to make sorting work correctly across a range of pandas versions, it's possible that might weigh in favor of not sorting. |
I'd been letting this issue fester as it required a hard decision so thanks for poking at it :] |
Yep- I would not expect seaborn to sort the data for me unless explicitly asked to do so. I think there are lots of cases where I've already munged the dataframe to get the exact ordering I want, and expect it to be plotted 1:1. |
I think that's probably fair in some cases, though I don't really see at first glance how a sorted index would affect the order in a boxplot, e.g. Would it? Is the suggestion to keep the order of the first instances of each level in a factor over observations? This would require some serious doing to work around behavior of These issues are kind of unrelated though. The status quo right now is to sort. This should be done correctly. Then there's the question of whether or not to sort at all by default and in which cases it makes sense not to, right? |
I don't think I follow, the default behavior of

    In [23]: pd.Series(["foo", "bar", "buz"]).unique()
    Out[23]: array(['foo', 'bar', 'buz'], dtype=object)

Sure, but I definitely want to deal with this for the 0.6 release, so it doesn't make sense for you to put a lot of work into a good solution to preserve the status quo if it's just gonna get stripped out for a bunch of simpler code that just calls |
On Fri, Dec 26, 2014 at 2:57 PM, Michael Waskom notifications@github.com
|
So, concrete steps for #409: change np.sort to the pandas compatibility sort and preserve np.unique vs. pandas.unique? I think this preserves the status quo and makes sort work as you'd expect it to. I'd prefer to punt on sorting vs. index preservation because I don't yet have my head around the current code base well enough. |
Now that we have ordered categoricals in pandas, I think automatically sorting would be OK for Seaborn. But generally I would agree with @wrobstory that respecting input order is less surprising. It's also certainly much less awkward to manually sort a column with pandas if desired than to tell Seaborn not to sort. So I'm +0 for |
The strange behavior of unordered pandas categoricals sort of defeats the utility of relying on |
IMO if categorical variables (dtype "category") are to be plotted, the categories should be used directly instead of |
Interesting, I think that's a reasonable point @JanSchulz |
So to make explicit what we want to happen on a categorical axis:
Also this has to happen by inspecting the object attributes, not with any special pandas functions, because seaborn has to run on pandas < 0.15. Does that sound right? If so, smarter pandas folk, what is the cleanest way to go about doing this? |
@mwaskom Here is my suggested implementation:

    def get_categories(values):
        if hasattr(values, 'categories'):
            # values is a pd.Categorical
            return np.asarray(values.categories)
        else:
            return pd.unique(values)

This satisfies your conditions 1 and 3, but not 2: unordered Categoricals will still display values in the order of the categories. It's definitely possible to fix that case, but it's also trickier and the order is somewhat ambiguous if not all categories appear in the data (I guess those could go to the end?). Might not be worth worrying about. |
I guess it would be good to be consistent with Pandas, but I think it's better to use the DataFrame order and drop categories with no observations than the reverse (exactly for the reason you say: where to put them is undefined). |
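The trade-off in the last two comments can be sketched as follows. `get_categories` repeats the suggestion above; `levels_by_appearance` is a hypothetical helper, named here only for illustration, that uses the data order and drops categories with no observations.

```python
import numpy as np
import pandas as pd

def get_categories(values):
    # Suggested implementation: trust the stored categories if present
    if hasattr(values, "categories"):
        return np.asarray(values.categories)
    return pd.unique(values)

def levels_by_appearance(values):
    # Hypothetical alternative: ignore stored categories, keep data order
    return pd.unique(np.asarray(values, dtype=object))

# "c" is a declared category that never occurs in the data
cat = pd.Categorical(["b", "a", "b"], categories=["a", "b", "c"])
plain = np.array(["b", "a", "b"], dtype=object)

from_categories = list(get_categories(cat))
from_plain = list(get_categories(plain))
from_appearance = list(levels_by_appearance(cat))
print(from_categories, from_plain, from_appearance)
```

The first approach keeps the unused level `"c"` and the category order; the second follows row order and silently drops `"c"`.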
So I've been playing with scatter plots of categorical numeric data today (pretty easy when combined with Here's a synthetic version of my data:

    import matplotlib.pyplot as plt  # needed for plt.scatter below
    import numpy as np
    import pandas as pd
    import seaborn as sns

    def cut_diverging(array, n=9):
        # Bin into n equal-width intervals symmetric about zero
        mag = max(-array.min(), array.max())
        return pd.cut(array, np.linspace(-mag, mag, num=(n + 1)))

    rs = np.random.RandomState(0)
    df = pd.DataFrame({'x': rs.rand(100),
                       'y': rs.rand(100),
                       'z': 1 + rs.randn(100)})
    df['z_cat'] = cut_diverging(df.z, 9)
    categories = df.z_cat.cat.categories
    palette = sns.color_palette('RdBu_r', 9)
    g = sns.FacetGrid(df, hue='z_cat', hue_order=categories,
                      palette=palette, aspect=1.3, size=3)
    g.map(plt.scatter, 'x', 'y', s=50)
    g.add_legend()

Two issues are evident in this plot:
Plotting all categories (but not bothering to order them) would solve each of these problems. As a side note, perhaps a utility function like |
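The gap between declared and observed levels that this comment runs into can be reproduced with `pd.cut` alone. This is a small sketch with made-up values, chosen so that one bin stays empty.

```python
import numpy as np
import pandas as pd

# Hypothetical values: nothing falls into the (-0.5, 0.0] bin
values = pd.Series([-0.9, 0.1, 0.2, 0.95])
binned = pd.cut(values, np.linspace(-1, 1, num=5))

# Every bin is a category, whether or not it contains data
all_levels = list(binned.cat.categories)

# Only the bins that actually occur, in order of appearance
observed_levels = list(binned.unique())

print(len(all_levels), len(observed_levels))
```

A palette sized for all bins lines up with `all_levels`, but not with `observed_levels`, which is one way colors and legend entries can drift apart.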
I think this is actually just a bug in |
@mwaskom Not sure what you expected, but IMO these are the main advantages for plotting (and maybe statistical libs):
|
Sure, I mean I get that. Like I said, I'm excited about them! I think my reservations have to do with the distinction between ordered and unordered categoricals, which seems to add a fair amount of complexity and confusion (I guess mostly I don't understand the point of unordered categoricals -- all of the examples you mention have to do with the ordered kind). |
Let's fix this: pandas-dev/pandas#9346 |
Maybe what's confusing is that "unordered" categoricals aren't really unordered, they're just default (lex sort) ordered. |
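A short sketch of that point, using made-up labels: when no categories are given, pandas infers them in sorted order even though the categorical is flagged as unordered.

```python
import pandas as pd

# No categories given: they are inferred and stored in sorted order
unordered = pd.Categorical(["b", "c", "a", "b"])
print(list(unordered.categories), unordered.ordered)

# Explicit categories plus ordered=True define a meaningful order
ordered = pd.Categorical(["b", "c", "a", "b"],
                         categories=["c", "b", "a"], ordered=True)
print(list(ordered.categories), ordered.min())
```

So `ordered=False` says that comparisons such as `min` are not meaningful; it does not say that the categories are stored in order of appearance.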
I will prepare the change discussed above (e.g. if one constructs a categorical with examples for unordered categoricals: countries, treatment vs non treatment,... I'm undecided if lexi sorting in this case (at least initially) would be helpful or not. IMO it would not harm :-) Anyway: I think seaborn will need a doc-sentence on this in any case because in most cases the workflow wouldn't use the
What will IMO never happen is that if you have an ordered categorical and take away the order, the categories will now be sorted "as appearing", as this would mean that the categoricals have to be computed on the fly as the order depends on the current order of the rows. |
IMO the docs should read something like this: The categories are either ordered like "var.unique()" or, in case the variable is of dtype "category", in the same order as the categories ("var.cat.categories"). You can change the order by resorting (all but dtype category), by changing the order of the categories variable of dtype category (var.cat.reorder_categories([....neworder...]) or by supplying [... kword args...]. |
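The two cases in the proposed doc sentence can be sketched with pandas alone. The `low`/`mid`/`high` labels are made up for illustration.

```python
import pandas as pd

# Non-categorical column: levels come from unique(), in order of appearance
plain = pd.Series(["low", "high", "mid", "low"])
levels_plain = list(plain.unique())

# dtype "category": levels come from the categories (sorted when inferred)
cat = plain.astype("category")
levels_default = list(cat.cat.categories)

# Changing the order means reordering the categories, not the rows
reordered = cat.cat.reorder_categories(["low", "mid", "high"])
levels_reordered = list(reordered.cat.categories)

print(levels_plain, levels_default, levels_reordered)
```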
…False) In mwaskom/seaborn#361 it was discussed that lexicographically sorting the categories is only appropriate if an order is specified/implied. If this is explicitly not done, e.g. with `Categorical(..., ordered=False)` then the order should be taken from the order of appearance, similar to the current `Series.unique()` implementation.
Just something I wrote in one of the linked bug reports: I think that "order of appearance" doesn't make sense as a default for plotting categoricals: I can't see many cases where the "order of appearance" of a variable has any logical meaning, apart from when one has explicitly sorted that variable (in which case it is easier to sort the unique values) or for the date case below:
IMO the last is much less often the case than the "it's random" case above. So, this would be my vote: for all cases where a categorical variable is expected:
|
This will include levels that appear in the `category` list, but that do not appear in the data. See #361
For the record here's the function I ended up using to determine a list of category levels from an arbitrary vector object:

    def categorical_order(values, order=None):
        """Return a list of unique data values.

        Determine an ordered list of levels in ``values``.

        Parameters
        ----------
        values : list, array, Categorical, or Series
            Vector of "categorical" values
        order : list-like, optional
            Desired order of category levels to override the order determined
            from the ``values`` object.

        Returns
        -------
        order : list
            Ordered list of category levels

        """
        if order is None:
            if hasattr(values, "categories"):
                order = values.categories
            else:
                try:
                    order = values.cat.categories
                except (TypeError, AttributeError):
                    try:
                        order = values.unique()
                    except AttributeError:
                        order = pd.unique(values)
        return list(order)

IMHO there has to be a lot of complexity in this little function to handle the various options and failure modes of working with pandas "categorical" data, but it appears to get the job done... |
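To see which branch of the function handles which input, here is a usage sketch. It repeats the same logic with the docstring omitted, and the inputs are made up for illustration.

```python
import numpy as np
import pandas as pd

def categorical_order(values, order=None):
    # Same logic as the function above, docstring omitted
    if order is None:
        if hasattr(values, "categories"):
            order = values.categories            # pd.Categorical
        else:
            try:
                order = values.cat.categories    # Series with category dtype
            except (TypeError, AttributeError):
                try:
                    order = values.unique()      # other Series
                except AttributeError:
                    order = pd.unique(values)    # plain arrays
    return list(order)

raw = np.array(["b", "a", "b"], dtype=object)

from_array = categorical_order(raw)
from_series = categorical_order(pd.Series(raw))
from_cat_series = categorical_order(pd.Series(raw, dtype="category"))
from_categorical = categorical_order(
    pd.Categorical(raw, categories=["c", "b", "a"]))
overridden = categorical_order(raw, order=["a", "b"])
print(from_array, from_series, from_cat_series, from_categorical, overridden)
```

Non-categorical inputs keep their order of appearance, categorical inputs follow their categories (including unused ones), and an explicit `order` always wins.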
This is indeed a helpful reference -- thanks @mwaskom! On my TODO list is cleaning that up a little bit, at least removing that last |
The thing I got stuck on for a while was figuring out that |
Or maybe it was that |
I will double check, but I'm pretty sure this is at least consistent now between Python 2 and 3 on master. Any invalid use should now raise TypeError. I suppose there is a reasonable case that that should be AttributeError instead. On Sun, Mar 8, 2015 at 4:18 PM, Michael Waskom notifications@github.com
|
Sorry, took a little bit, but I've reproduced the issue. The following code returns

    import pandas as pd
    x = ["a", "c", "c", "b", "a", "d"]
    hasattr(pd.Series(x), "cat")

|
`AttributeError` is really the appropriate error to raise for an invalid attribute. In particular, it is necessary to ensure that tests like `hasattr(s, 'cat')` work consistently on Python 2 and 3: on Python 2, `hasattr(s, 'cat')` will return `False` even if a `TypeError` was raised, but Python 3 more strictly requires `AttributeError`. This is an unfortunate trap that we should avoid. See this discussion in Seaborn for a full report: mwaskom/seaborn#361 (comment) Note that technically, this is an API change, since these accessors (all but `.str`, I think) raised TypeError in the last release. This also suggests another possibility for testing for Series with a Categorical dtype (GH8814): just use `hasattr(s, 'cat')` (at least for Python 2 or pandas >=0.16). CC mwaskom jorisvandenbossche JanSchulz
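The Python 3 side of the `hasattr` behavior described above can be reproduced without pandas. The two classes below are hypothetical stand-ins for an accessor that raises on invalid use; they are not pandas code.

```python
class RaisesTypeError:
    """Stand-in for an accessor that raises TypeError on invalid use."""
    @property
    def cat(self):
        raise TypeError("not a categorical")

class RaisesAttributeError:
    """Stand-in for an accessor that raises AttributeError instead."""
    @property
    def cat(self):
        raise AttributeError("not a categorical")

# On Python 3, hasattr only swallows AttributeError; TypeError propagates
try:
    hasattr(RaisesTypeError(), "cat")
    type_error_propagated = False
except TypeError:
    type_error_propagated = True

# With AttributeError, hasattr simply reports that the attribute is missing
attribute_error_result = hasattr(RaisesAttributeError(), "cat")

print(type_error_propagated, attribute_error_result)
```

This is why a `hasattr(s, 'cat')` check is only a dependable test when the accessor raises `AttributeError`.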
Alright with #548 I think categorical variables in seaborn should work as articulated in this thread and uniformly across the package. Please open an issue if you find something that does not behave as expected. |
For example, this adaptation of the "Grouped boxplots" example should work (if using pandas 0.15 or higher) even without specifying x_order:
If you are using a pandas method to do the sorting, then this is a pandas bug.