Seaborn should respect categorical order when sorting pd.Categorical objects #361

shoyer · 2014-11-11T05:51:24Z

For example, this adaption of the "Grouped boxplots" example should work (if using pandas 0.15 or higher) even without specifying x_order:

import pandas as pd
import seaborn as sns
sns.set(style="ticks")

tips = sns.load_dataset("tips")
days = ["Thur", "Fri", "Sat", "Sun"]
# Store the categorical under the same column name that is plotted below.
tips['day'] = pd.Categorical(tips['day'], days)

g = sns.factorplot("day", "total_bill", "sex", tips, kind="box",
                   palette="PRGn", aspect=1.25)
g.despine(offset=10, trim=True)
g.set_axis_labels("Day", "Total Bill")

If you are using a pandas method to do the sorting, then this is a pandas bug.

wrobstory · 2014-11-11T06:09:57Z

I think it should probably respect the index order by default, e.g. if I have a sorted-ascending DataFrame, I should have a sorted-ascending bar chart.

shoyer · 2014-11-11T06:15:44Z

Generally I agree, but I was assuming @mwaskom had his reasons. If this is just a side effect of the fact that he's using np.unique, he should try the pandas unique method instead, which does not sort (and is also a little faster).
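To make the difference concrete, here is a minimal sketch comparing the two calls. The sample values are made up for illustration and are not taken from seaborn's code:

```python
import numpy as np
import pandas as pd

# Made-up sample data, deliberately not in alphabetical order.
days = pd.Series(["Thur", "Sun", "Sat", "Thur", "Fri", "Sun"])

# np.unique always returns the unique values sorted.
sorted_levels = list(np.unique(days))

# Series.unique() returns them in order of first appearance.
appearance_levels = list(days.unique())

print(sorted_levels)      # ['Fri', 'Sat', 'Sun', 'Thur']
print(appearance_levels)  # ['Thur', 'Sun', 'Sat', 'Fri']
```

Neither result is the calendar order, which is why an explicit order or a Categorical is still needed for data like this.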

mwaskom · 2014-11-12T22:38:38Z

I definitely agree with @shoyer about categorical types. I'm not sure I feel all that strongly about alphabetical sorting (though to @wrobstory's point, without having any way to know whether a column is intentionally sorted, it seemed best for the default to do something predictable). But it sounds easy to just change np.unique(Series) to Series.unique() and pick up the main goal here (categorical columns) and maybe also reduce some surprise.

To go along with this I think the test datasets should be updated to use Categorical where appropriate to demonstrate the functionality.

mwaskom · 2014-11-14T05:55:08Z

I'm in favor of this, but I'm gonna kick it to the 0.6 cycle because it will take a little thinking. I want to figure out the best way to make this behavior as consistent as possible across the package.

olgabot · 2014-12-04T17:56:23Z

Just wanted to chime in here in support of sticking to the categorical order, for example here:

The alphabetical sort doesn't make sense because then the length bins are out of order, putting the bins with 0's and 5's first, rather than the natural size ordering.

shoyer · 2014-12-04T21:17:29Z

@olgabot The sorting order for bins will also be fixed upstream when I finish adding an Interval type to pandas (which will sort properly), instead of using string labels.

yoshiserry · 2014-12-05T01:13:02Z

hi all, is there a workaround for this currently? i love the power of the col= parameter, where you can create a graph for all instances of a column but I want to be able to plot Jan, Feb, Mar in order.

shoyer · 2014-12-05T03:01:52Z

@yoshiserry Yes, use the col_order argument.
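A minimal sketch of how such an order could be built for month labels, assuming a hypothetical DataFrame with a string column named "month" (the data and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical data: month labels stored as plain strings.
df = pd.DataFrame({"month": ["Mar", "Jan", "Feb", "Jan"],
                   "sales": [3, 1, 2, 4]})

calendar_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Keep only the months that actually occur, in calendar order.
present = set(df["month"])
col_order = [m for m in calendar_order if m in present]

print(col_order)  # ['Jan', 'Feb', 'Mar']
```

The resulting list can then be passed as col_order alongside col="month" in the plotting call.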

jseabold · 2014-12-26T15:26:49Z

Do the changes in #409 look reasonable?

It provides a compatibility function that uses pandas sort and unique, which 1) handles NA in a consistent manner, because np.sort will fail for object dtype Series with NA, and 2) handles ordered and unordered Categorical. If an ordered factor/Categorical is passed in, then it sorts on that order. If not, it does a lexicographical sort bc. of how unique works.

Want to get some feedback before adding a few more tests and touching the code base in other parts. Should this approach replace all the np.sort calls right now? I don't have a good sense of whether this would ever not be desired.
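The behavior described above can be sketched as follows. This is not the code from #409: it uses current pandas method names (sort_values, dropna), which is an assumption about the reader's pandas version, and the sample data is made up:

```python
import numpy as np
import pandas as pd

days = ["Thur", "Fri", "Sat", "Sun"]
ordered = pd.Series(pd.Categorical(["Sat", "Thur", "Sun", "Fri", "Thur"],
                                   categories=days, ordered=True))

# Sorting an ordered Categorical follows the category order, not the alphabet.
ordered_levels = list(ordered.sort_values().unique())

# Object-dtype strings with a missing value: drop the NA, then sort lexically.
with_na = pd.Series(["b", np.nan, "a", "b"], dtype=object)
plain_levels = list(with_na.dropna().sort_values().unique())

print(ordered_levels)  # ['Thur', 'Fri', 'Sat', 'Sun']
print(plain_levels)    # ['a', 'b']
```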

mwaskom · 2014-12-26T16:52:33Z

Thanks for taking a crack at it.

The open issue is really whether things should be lexicographically sorted or in the order that they appear in the dataframe (so just what the straight pandas .unique() method returns). I think that @wrobstory finding the sorting surprising was what originally motivated this chain of issues.

I originally had a mild preference for consistent behavior (sorts are at least predictable once you expect them), but if it requires a fair amount of complexity to make sorting work correctly across a range of pandas versions, it's possible that might weigh in favor of not sorting.

mwaskom · 2014-12-26T16:55:17Z

I'd been letting this issue fester as it required a hard decision so thanks for poking at it :]

wrobstory · 2014-12-26T17:31:09Z

Yep- I would not expect seaborn to sort the data for me unless explicitly asked to do so. I think there are lots of cases where I've already munged the dataframe to get the exact ordering I want, and expect it to be plotted 1:1.

jseabold · 2014-12-26T18:20:39Z

I think that's probably fair in some cases, though I don't really see at first glance how a sorted index would affect the order in a boxplot, for example. Would it? Is the suggestion to keep the order of the first instances of each level in a factor over observations? This would require some serious doing to work around the behavior of unique, no?

These issues are kind of unrelated though. The status quo right now is to sort. This should be done correctly. Then there's the question of whether or not to sort at all by default and in which cases it makes sense not to, right?

mwaskom · 2014-12-26T19:57:02Z

> Is the suggestion to keep the order of the first instances of each level in a factor over observations? This would require some serious doing to work around behavior of unique no?

I don't think I follow, the default behavior of Series.unique() differs from np.unique() in that it doesn't sort:

In [23]: pd.Series(["foo", "bar", "buz"]).unique()
Out[23]: array(['foo', 'bar', 'buz'], dtype=object)

> These issues are kind of unrelated though. The status quo right now is to sort. This should be done correctly.

Sure, but I definitely want to deal with this for the 0.6 release, so it doesn't make sense for you to put a lot of work into a good solution to preserve the status quo if it's just gonna get stripped out for a bunch of simpler code that just calls .unique().

jseabold · 2014-12-26T22:18:15Z

Oh ok. That's news to me. Should have checked my priors.

jseabold · 2014-12-26T22:40:03Z

So concrete steps for #409: change np.sort to the pandas compatibility sort and preserve np.unique vs. pandas.unique? I think this preserves the status quo and makes sort work as you'd expect it to. I'd prefer to punt on sorting vs. index preservation bc. I don't have my head around the current code base enough.

shoyer · 2014-12-29T12:58:25Z

Now that we have ordered categoricals in pandas, I think automatically sorting would be OK for Seaborn. But generally I would agree with @wrobstory that respecting input order is less surprising. It's also certainly much less awkward to manually sort a column with pandas if desired than to tell Seaborn not to sort. So I'm +0 for .unique() rather than sorting.

mwaskom · 2015-01-21T18:43:58Z

The strange behavior of unordered pandas categoricals sort of defeats the utility of relying on .unique(), @shoyer, cf this issue thread: pandas-dev/pandas#9148

jankatins · 2015-01-22T10:33:55Z

IMO if categorical variables (dtype "category") are to be plotted, the categories should be used directly instead of unique(). E.g. if I plot a Likert scale via bar plots, I expect the complete scale to be shown, not only the bars for categories which are used. That was also, once upon a time, the reasoning why unique returned (all) categories in the order they were specified (but this was changed in pandas-dev/pandas#8559 (comment) to only return unique values because that's what the contract on unique said...).
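A small sketch of the distinction, using a made-up five-point scale in which two levels were never chosen. It assumes a current pandas, where value_counts on a categorical Series reports every category:

```python
import pandas as pd

scale = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
answers = pd.Series(pd.Categorical(
    ["agree", "neutral", "agree", "strongly agree"],
    categories=scale, ordered=True))

# unique() only reports the levels that occur, in order of appearance.
observed = list(answers.unique())

# value_counts on a categorical keeps every category, including empty ones;
# sort=False leaves the result in category order.
full_counts = answers.value_counts(sort=False)

print(observed)              # ['agree', 'neutral', 'strongly agree']
print(full_counts.tolist())  # [0, 0, 1, 2, 1]
```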

mwaskom · 2015-01-22T16:20:53Z

Interesting, I think that's a reasonable point @JanSchulz

mwaskom · 2015-01-22T22:46:00Z

So to make explicit what we want to happen on a categorical axis:

1. For objects with ordered categorical datatype, show all categories in the correct category order
2. For objects with unordered categorical datatype, show all categories in the order they appear in the Series
3. Otherwise, show all unique values in the order they appear in the Series/array/list

Also this has to happen by inspecting the object attributes, not with any special pandas functions, because seaborn has to run on pandas < 0.15.

Does that sound right? If so, smarter pandas folk, what is the cleanest way to go about doing this?

shoyer · 2015-01-22T23:21:35Z

@mwaskom Here is my suggested implementation:

import numpy as np
import pandas as pd

def get_categories(values):
    if hasattr(values, 'categories'):
        # values is a pd.Categorical
        return np.asarray(values.categories)
    else:
        return pd.unique(values)

This satisfies your conditions 1 and 3, but not 2: unordered Categoricals will still display values in the order of the categories. It's definitely possible to fix that case, but it's also trickier and the order is somewhat ambiguous if not all categories appear in the data (I guess those could go to the end?). Might not be worth worrying about.
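To make the trade-off concrete, here is a minimal sketch with made-up data: an unordered categorical with one category that never occurs. It compares taking the categories directly with taking the order of first appearance:

```python
import pandas as pd

# Unordered categorical with one category ("d") that never occurs.
values = pd.Categorical(["c", "a", "c", "b"], categories=["a", "b", "c", "d"])

# Taking the categories directly: category order, unused levels included.
from_categories = list(values.categories)

# Order of first appearance: unused levels are dropped.
from_appearance = list(pd.Series(values).unique())

print(from_categories)  # ['a', 'b', 'c', 'd']
print(from_appearance)  # ['c', 'a', 'b']
```

With the second option there is no obvious place to put "d", which is the ambiguity mentioned above.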

mwaskom · 2015-01-22T23:27:28Z

I guess it would be good to be consistent with Pandas, but I think it's better to use the DataFrame order and drop categories with no observations than the reverse (exactly for the reason you say: where to put them is undefined).

shoyer · 2015-01-23T02:53:59Z

So I've been playing with scatter plots of categorical numeric data today (pretty easy when combined with hue on FacetGrid), and I do agree pretty strongly with @JanSchulz that all categories should be plotted.

Here's a synthetic version of my data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def cut_diverging(array, n=9):
    mag = max(-array.min(), array.max())
    return pd.cut(array, np.linspace(-mag, mag, num=(n + 1)))

rs = np.random.RandomState(0)
df = pd.DataFrame({'x': rs.rand(100),
                   'y': rs.rand(100),
                   'z': 1 + rs.randn(100)})
df['z_cat'] = cut_diverging(df.z, 9)

categories = df.z_cat.cat.categories
palette = sns.color_palette('RdBu_r', 9)

g = sns.FacetGrid(df, hue='z_cat', hue_order=categories,
                  palette=palette, aspect=1.3, size=3)
g.map(plt.scatter, 'x', 'y', s=50)
g.add_legend()

Two issues are evident in this plot:

The labels in the legend are wrong (notice that the distribution should be mostly positive numbers). This is because I'm using labels in hue_order that don't appear in the data. (This should probably fail more loudly, but that's a separate issue.)
Even though I went to the trouble of obtaining centered divergent categories, the color map is not centered because not every category is found.

Plotting all categories (but not bothering to order them) would solve each of these problems.

As a side note, perhaps a utility function like cut_diverging belongs in seaborn?
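The centering problem can be reproduced without plotting. The sketch below reuses the cut_diverging helper on a small made-up input (not the random data above) and looks only at the category codes:

```python
import numpy as np
import pandas as pd

def cut_diverging(array, n=9):
    # Same helper as above: n equal-width bins centered on zero.
    mag = max(-array.min(), array.max())
    return pd.cut(array, np.linspace(-mag, mag, num=(n + 1)))

# Made-up, all-positive input, so only the upper bins are populated.
z = pd.Series([0.5, 1.0, 2.0, 3.0, 4.0])
z_cat = cut_diverging(z, 9)

n_bins = len(z_cat.cat.categories)
observed_codes = sorted({int(c) for c in z_cat.cat.codes})
center_code = n_bins // 2  # the bin that straddles zero

print(n_bins)          # 9
print(observed_codes)  # [5, 6, 7, 8]
print(center_code)     # 4
```

If colors are assigned only to the four observed bins, the middle of the palette no longer lines up with the bin around zero. Indexing a nine-color palette by category code would keep it centered.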

mwaskom · 2015-01-23T03:03:41Z

I think this is actually just a bug in FacetGrid.add_legend and isn't specifically related to the category issue.

jankatins · 2015-01-23T20:51:22Z

@mwaskom Not sure what you expected, but IMO these are the main advantages for plotting (and maybe statistical libs):

I can easily change the order of displayed categorical data (e.g. try to order a variable with values like "one", "two", "three" or Likert scales -> already possible with kwords in seaborn)
I can include empty categories at the right place (not sure if that's possible in seaborn yet).
In ggplot I was also looking forward to using it for facets: instead of passing around the full labels (which would need an additional "context" or so), each facet would simply take this information from the categorical variable itself. This would then solve the up-to-now problem in ggplot, that displaying a categorical (e.g. string) variable in a faceted plot would omit bars in some facets, where they were not present in the groupby dataframe.

mwaskom · 2015-01-23T20:55:23Z

Sure, I mean I get that. Like I said, I'm excited about them! I think my reservations have to do with the distinction between ordered and unordered categoricals, which seems to add a fair amount of complexity and confusion (I guess mostly I don't understand the point of unordered categoricals -- all of the examples you mention have to do with the ordered kind).

shoyer · 2015-01-23T20:58:45Z

AFAIK unique makes no guarantee that unique() outputs data in any particular order, so this could change (the docstring doesn't mention any order).

Let's fix this: pandas-dev/pandas#9346

mwaskom · 2015-01-23T21:00:45Z

Maybe what's confusing is that "unordered" categoricals aren't really unordered, they're just default (lex sort) ordered.

jankatins · 2015-01-23T21:09:55Z

I will prepare the change discussed above (e.g. if one constructs a categorical with order=False the categories are not ordered), but I suspect that we understand something different by "a categorical is ordered": the only thing this means is that values are sortable (according to the order specified in the categories), not anything about the order in the categories. It is more like there being an order on int (e.g. 1 < 2); if you could take this away (order=False), 1 < 2 would throw an exception, as it does when comparing two objects.

Examples of unordered categoricals: countries, treatment vs. non-treatment,... I'm undecided if lexi sorting in this case (at least initially) would be helpful or not. IMO it would not harm :-)

Anyway: I think seaborn will need a doc sentence on this in any case, because in most cases the workflow wouldn't use the Categorical constructor directly but

df = ...
df["vcat"] = df.string_var.astype("category") # sorts because categorical defaults to "ordered=True"
df.vcat.cat.ordered = False # categories are not reevaluated/sorted in order of appearance
[plot it...]

What will IMO never happen is that if you have an ordered categorical and take away the order, the categories will now be sorted "as appearing", as this would mean that the categories have to be computed on the fly, since the order depends on the current order of the rows.
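A minimal sketch of that distinction with made-up data. It uses as_unordered() from current pandas rather than the attribute assignment in the snippet above, which is an assumption about the reader's pandas version:

```python
import pandas as pd

sizes = pd.Categorical(["medium", "small", "large", "small"],
                       categories=["small", "medium", "large"],
                       ordered=True)

# With ordered=True the values are comparable, so min() is defined.
smallest = sizes.min()

# Removing the "ordered" flag leaves the list of categories untouched...
unordered = sizes.as_unordered()
same_categories = list(unordered.categories)

# ...but order-dependent operations are no longer allowed.
try:
    unordered.min()
    min_failed = False
except TypeError:
    min_failed = True

print(smallest)         # small
print(same_categories)  # ['small', 'medium', 'large']
print(min_failed)       # True
```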

jankatins · 2015-01-23T21:39:40Z

IMO the docs should read something like this:

The categories are either ordered like "var.unique()" or, in case the variable is of dtype "category", in the same order as the categories ("var.cat.categories"). You can change the order by resorting (all but dtype category), by changing the order of the categories variable of dtype category (var.cat.reorder_categories([....neworder...])) or by supplying [... kword args...].
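For the dtype "category" case, the reordering step might look like this (a small sketch with made-up values):

```python
import pandas as pd

s = pd.Series(["two", "three", "one", "two"]).astype("category")

# Categories inferred by astype("category") come out in lexical order.
default_order = list(s.cat.categories)

# reorder_categories takes the same set of categories in a new order.
s = s.cat.reorder_categories(["one", "two", "three"])
new_order = list(s.cat.categories)

print(default_order)  # ['one', 'three', 'two']
print(new_order)      # ['one', 'two', 'three']
```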

…False) In mwaskom/seaborn#361 it was discussed that lexicographical sorting the categories is only appropriate if an order is specified/implied. If this is explicitly not done, e.g. with `Categorical(..., ordered=False)` then the order should be taken from the order of appearance, similar to the current `Series.unique()` implementation.

jankatins · 2015-01-26T15:19:45Z

Just something I wrote in one of the linked bug reports: I think that "order of appearance" doesn't make sense as a default for plotting categoricals. I can't see many cases where the "order of appearance" of a variable has any logical meaning, apart from when one has explicitly sorted that variable (in which case it is easier to sort the unique values) or for the date case below:

If it is sorted by the categorical variable, then that order has a meaning, but I could have done that much more cheaply by sorting the unique values themselves or supplying a custom ordering by kwargs
If the frame is "unsorted" (or by some ID/Timestamp), the order of (almost) any other var including the categorical one is random and has no meaning in a categorical plot.
The only case is when you sort by date and have a categorical "month", which has the month as a name ("Jan","Apr",...). But that would be much better handled by converting to a Categorical, where the order of the categories has a defined meaning.

IMO the last is much less often the case than the "it's random" case above

So, this would be my vote: for all cases where a categorical variable is expected:

If it is dtype category, take the categories and plot "as is" (order and including unused)
if it is a string/int: take unique and sort it, [maybe convert to categorical for easier codepaths?,] then plot it. Add a warning that string was converted to categorical and one can do that by hand and change the order there.
If one supplies an ordered list of values, overwrite both cases.
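The three rules in this vote could be sketched roughly as below. The function name proposed_levels is hypothetical, the suggested warning is left out, and this is not what seaborn ended up doing:

```python
import pandas as pd

def proposed_levels(values, order=None):
    """Hypothetical sketch of the voting rules above (not seaborn code)."""
    if order is not None:
        # An explicit order overrides everything else.
        return list(order)
    values = pd.Series(values)
    if isinstance(values.dtype, pd.CategoricalDtype):
        # dtype category: use the categories as-is, unused ones included.
        return list(values.cat.categories)
    # Strings/ints: unique values, sorted.
    return sorted(values.dropna().unique())

cat = pd.Categorical(["Sat", "Thur"], categories=["Thur", "Fri", "Sat"])
print(proposed_levels(cat))                         # ['Thur', 'Fri', 'Sat']
print(proposed_levels(["b", "a", "b"]))             # ['a', 'b']
print(proposed_levels(["b", "a"], order=["b", "a"]))  # ['b', 'a']
```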

mwaskom · 2015-03-08T19:59:41Z

For the record here's the function I ended up using to determine a list of category levels from an arbitrary vector object:

def categorical_order(values, order=None):
    """Return a list of unique data values.

    Determine an ordered list of levels in ``values``.

    Parameters
    ----------
    values : list, array, Categorical, or Series
        Vector of "categorical" values
    order : list-like, optional
        Desired order of category levels to override the order determined
        from the ``values`` object.

    Returns
    -------
    order : list
        Ordered list of category levels

    """
    if order is None:
        if hasattr(values, "categories"):
            order = values.categories
        else:
            try:
                order = values.cat.categories
            except (TypeError, AttributeError):
                try:
                    order = values.unique()
                except AttributeError:
                    order = pd.unique(values)

    return list(order)

IMHO there has to be a lot of complexity in this little function to handle the various options and failure modes of working with pandas "categorical" data, but it appears to get the job done...

shoyer · 2015-03-08T20:15:03Z

This is indeed a helpful reference -- thanks @mwaskom!

On my TODO list is cleaning that up a little bit, at least removing that last try/except clause -- pd.unique should be able to handle pandas categoricals (and other types with a .unique method) directly.
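As a rough idea of what a condensed version might look like, here is a sketch under two assumptions: a pandas version in which an invalid `.cat` access raises AttributeError (so hasattr returns False), and a hypothetical name, categorical_order_simplified. It is not seaborn's or pandas's code:

```python
import pandas as pd

def categorical_order_simplified(values, order=None):
    """Hypothetical, condensed variant of categorical_order."""
    if order is not None:
        return list(order)
    if hasattr(values, "categories"):   # a pd.Categorical
        return list(values.categories)
    if hasattr(values, "cat"):          # a Series with category dtype
        return list(values.cat.categories)
    # Everything else: unique values in order of first appearance.
    return list(pd.Series(values).unique())

cat = pd.Categorical(["b", "a"], categories=["a", "b", "c"])
print(categorical_order_simplified(cat))              # ['a', 'b', 'c']
print(categorical_order_simplified(pd.Series(cat)))   # ['a', 'b', 'c']
print(categorical_order_simplified(["b", "a", "b"]))  # ['b', 'a']
```

Wrapping the fallback in pd.Series is a design choice made here so that lists and arrays take the same path. It trades a copy for fewer branches.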

mwaskom · 2015-03-08T20:17:13Z

The thing I got stuck on for a while was figuring out that obj.cat raises TypeError if obj is a Series but AttributeError otherwise (and apparently that distinction only takes effect on Python 3). Streamlining that might be helpful to others.

mwaskom · 2015-03-08T20:18:44Z

Or maybe it was that hasattr(obj, "cat") raises a TypeError only on Python 3 -- anyway, "determine if obj has categories" remains a bit fraught.

shoyer · 2015-03-08T23:29:33Z

I will double check, but I'm pretty sure this is at least consistent now between Python 2 and 3 on master. Any invalid use should now raise TypeError. I suppose there is a reasonable case that that should be AttributeError instead.

mwaskom · 2015-03-09T02:59:18Z

Sorry, took a little bit, but I've reproduced the issue. The following code returns False on Python 2.7 but raises TypeError on Python 3.4:

import pandas as pd
x = ["a", "c", "c", "b", "a", "d"]
hasattr(pd.Series(x), "cat")

`AttributeError` is really the appropriate error to raise for an invalid attribute. In particular, it is necessary to ensure that tests like `hasattr(s, 'cat')` work consistently on Python 2 and 3: on Python 2, `hasattr(s, 'cat')` will return `False` even if a `TypeError` was raised, but Python 3 more strictly requires `AttributeError`. This is an unfortunate trap that we should avoid. See this discussion in Seaborn for a full report: mwaskom/seaborn#361 (comment) Note that technically, this is an API change, since these accessors (all but `.str`, I think) raised TypeError in the last release. This also suggests another possibility for testing for Series with a Categorical dtype (GH8814): just use `hasattr(s, 'cat')` (at least for Python 2 or pandas >=0.16). CC mwaskom jorisvandenbossche JanSchulz
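The hasattr behavior described here can be shown without pandas. In this sketch the two classes are made up purely for illustration: on Python 3, hasattr only swallows AttributeError, and any other exception raised by the attribute lookup propagates:

```python
class RaisesAttributeError:
    @property
    def cat(self):
        raise AttributeError("no .cat here")

class RaisesTypeError:
    @property
    def cat(self):
        raise TypeError("no .cat here")

# AttributeError is swallowed: hasattr simply reports False.
has_cat = hasattr(RaisesAttributeError(), "cat")

# On Python 3 any other exception propagates out of hasattr.
try:
    hasattr(RaisesTypeError(), "cat")
    propagated = False
except TypeError:
    propagated = True

print(has_cat)     # False
print(propagated)  # True
```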

mwaskom · 2015-05-10T01:50:05Z

Alright with #548 I think categorical variables in seaborn should work as articulated in this thread and uniformly across the package.

Please open an issue if you find something that does not behave as expected.

mwaskom added question plots labels Nov 12, 2014

mwaskom mentioned this issue Dec 24, 2014

ENH: Pairwise dropna in PairGrid. Closes #407. #409

Closed

jseabold mentioned this issue Dec 24, 2014

Ordered vs. Unordered Categoricals pandas-dev/pandas#9148

Closed

jankatins mentioned this issue Jan 22, 2015

BUG: don't sort unique values from categoricals pandas-dev/pandas#9331

Merged

shoyer mentioned this issue Jan 23, 2015

DOC/TST: is pd.unique and the order it returns API? pandas-dev/pandas#9346

Closed

jankatins mentioned this issue Jan 23, 2015

Categorical: don't sort the categoricals if Categorical(..., ordered=False) pandas-dev/pandas#9347

Closed

mwaskom mentioned this issue Feb 11, 2015

Do not attempt ordering unless it's needed #447

Closed

mwaskom added a commit that referenced this issue Mar 8, 2015

Respect pandas categorical order in category plots

8583391

This will include levels that appear in the `category` list, but that do not appear in the data. See #361

shoyer mentioned this issue Mar 9, 2015

BUG/API: Accessors like .cat raise AttributeError when invalid pandas-dev/pandas#9617

Merged

mwaskom mentioned this issue Mar 9, 2015

Unify categorical plots #466

Merged

4 tasks

mwaskom added a commit that referenced this issue Mar 13, 2015

Respect pandas categorical order in category plots

52081b5

This will include levels that appear in the `category` list, but that do not appear in the data. See #361

mwaskom added a commit that referenced this issue May 9, 2015

Fix hue_order in PairGrid

b0a975e

This fixes #472. This also changes the default `hue_order` to use the same `category_order` rules as elsewhere in seaborn (cf #361).

This was referenced May 9, 2015

Fix hue_order in PairGrid #547

Merged

Extend new categorical ordering rules to FacetGrid #548

Merged

mwaskom closed this as completed in #548 May 10, 2015

mwaskom mentioned this issue Aug 22, 2016

FacetGrid with pd.Categorical data plots also empty categories #997

Closed

mwaskom mentioned this issue Sep 7, 2017

factorplot does not sort entries alpha-numerically by default #1274

Closed
