Represent pandas ordered categoricals as ordinal data #2522

joelostblom · 2021-11-18T03:19:55Z

A suggestion for how Altair could support ordered pandas categoricals. This doesn't deal with the order of segments in a stacked bar chart, but I think that is something that should be fixed in VL (vega/vega-lite#1734) rather than here.

close #245, ref #2170

joelostblom · 2021-11-18T03:20:46Z

altair/utils/core.py

@@ -193,8 +193,6 @@ def infer_vegalite_type(data):
    # Otherwise, infer based on the dtype of the input
    typ = infer_dtype(data)

-    # TODO: Once this returns 'O', please update test_select_x and test_select_y in test_api.py


I assume this refers to ordinal data. I could not find these specific tests, so I updated the one test that seemed relevant.

joelostblom · 2021-11-18T03:56:55Z

altair/vegalite/v4/tests/test_api.py

+        .encode(
+            alt.X("x", type="nominal"),
+            alt.Y("y", type="ordinal"),
+            alt.Size("s", type="nominal", sort=None),


It is a little annoying that sort has to be set to None manually when changing the data type, but if the shorthand 's:O' is used, this happens automatically (and the shorthands are probably the most commonly used syntax for this).

I reviewed the PR and it looks good to me! The only part where it could be worth digging a bit more is this one. I think this could be solved with custom logic in https://github.com/altair-viz/altair/blob/master/tools/schemapi/schemapi.py#L365. Pseudocode (sorry, I didn't have time to test this more):

if "sort" in parsed_shorthand and kwds.get("type") != "ordinal":
    parsed_shorthand.pop("sort")

Given that "sort" is only added in parsed_shorthand if an ordered pandas categorical was used, it seems safe to remove it here if the type is specified already.

Not sure if there are other places which would need to be modified. Btw, this code is only in schemapi.py after you merged in the latest changes from master due to #2813

Great, thanks @binste! I had to add Undefined to your suggested solution; otherwise it would incorrectly remove the sorting for categorical data where the type was not manually specified. Does that make sense to you? I rebased on master and pushed the latest changes.

Ah yes, that makes sense!
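
The check discussed above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in schemapi.py: the function name drop_inferred_sort and the stand-in Undefined sentinel are assumptions made for this example; only parsed_shorthand, kwds, and the "sort" and "type" keys come from the discussion.

```python
class _UndefinedType:
    """Stand-in for a 'not specified' sentinel (an assumption for this sketch)."""

    def __repr__(self):
        return "Undefined"


Undefined = _UndefinedType()


def drop_inferred_sort(parsed_shorthand, kwds):
    """Return a copy of parsed_shorthand without the inferred "sort"
    when the caller explicitly asked for a non-ordinal type."""
    result = dict(parsed_shorthand)
    explicit_type = kwds.get("type", Undefined)
    # Keep the inferred sort when no type was given or when it is ordinal.
    if "sort" in result and explicit_type not in (Undefined, "ordinal"):
        result.pop("sort")
    return result


# What parsing an ordered categorical column might produce (illustrative values).
parsed = {"field": "s", "type": "ordinal", "sort": ["low", "mid", "high"]}

kept_default = drop_inferred_sort(parsed, {})                    # type not specified
kept_ordinal = drop_inferred_sort(parsed, {"type": "ordinal"})   # explicit ordinal
dropped = drop_inferred_sort(parsed, {"type": "nominal"})        # explicit non-ordinal
```

Without the Undefined case, the first call would lose its sort order even though the user never overrode the type, which is the problem described above.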

joelostblom · 2022-03-30T18:26:42Z

Sorry I noticed that I forgot to click "Request review" here initially.

nkrishnaswami · 2022-05-17T16:04:38Z

Would love to see support for pandas ordinals! @jakevdp, ptal?

mattijn · 2022-12-27T21:31:45Z

I noticed that the changes to the tests target vegalite/v4; you probably want to change this to vegalite/v5 if you think this PR should be merged, @joelostblom.

joelostblom · 2023-01-06T11:39:54Z

@mattijn I moved the tests to v5 instead of v4 and also updated the docs. Feel free to give this a review if you have the time.

binste

See my comment as a reply to yours.

mattijn · 2023-01-24T12:29:43Z

I looked at the code changes and at the related issues and PRs, but I still feel I am missing a bit of context. Would it be possible to include a comment with a very simple example of the behavior before this PR and the behavior with the changes from this PR?

binste · 2023-01-25T06:15:28Z

Example

Based on this comment.

Data and imports:

import altair as alt
import pandas as pd
from vega_datasets import data

source = data.barley()

If a field is passed in as an ordered pd.Categorical such as

site_lst = ['Crookston', 'Morris', 'University Farm','Duluth', 'Grand Rapids', 'Waseca']
source.site = pd.Categorical(source.site, site_lst, ordered = True)

So far, Altair has ignored this ordering and treated the field as nominal, as if it were a column of strings.

This means that in the current master branch, the colors are not sorted according to site_lst and the type of site is inferred as nominal:

alt.Chart(source).mark_bar().encode(
    x='sum(yield)',
    y='variety',
    color="site",
)

Running the same code again on the branch of this PR, Altair does two things differently:

it will choose ordinal as the type for site
and it will set the sort argument to the order defined in site_lst

So this PR basically expands color='site' to color=alt.Color('site:O', sort=alt.Sort(site_lst)) (note the :O as well as sort).
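
The two steps above can be sketched with pandas alone. This is a hedged illustration of the idea and not the code in altair/utils/core.py: the helper name infer_type_and_sort is hypothetical, and it only distinguishes ordered categoricals from everything else.

```python
import pandas as pd


def infer_type_and_sort(column):
    """Hypothetical helper: return (vega_lite_type, sort_order_or_None)
    for a string-like pandas Series."""
    if isinstance(column.dtype, pd.CategoricalDtype) and column.cat.ordered:
        # An ordered categorical carries its own order, so reuse it as the sort.
        return "ordinal", column.cat.categories.tolist()
    # Plain strings and unordered categoricals stay nominal with no sort.
    return "nominal", None


site_lst = ["Crookston", "Morris", "University Farm", "Duluth", "Grand Rapids", "Waseca"]
values = ["Morris", "Duluth", "Waseca"]

ordered = pd.Series(pd.Categorical(values, categories=site_lst, ordered=True))
unordered = pd.Series(pd.Categorical(values, categories=site_lst, ordered=False))
plain = pd.Series(values)

ordered_result = infer_type_and_sort(ordered)
unordered_result = infer_type_and_sort(unordered)
plain_result = infer_type_and_sort(plain)
```

Only the ordered categorical yields "ordinal" together with the category order; the other two columns fall back to "nominal" with no sort, matching the behaviour on master.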

To sort the stacked bar charts as well, one still needs to use the trick documented by @mattijn here. Note that this may not be the best example of ordinal data, since this data is probably best represented as nominal, to get a qualitative color scheme, with a sort attribute on Color to get the desired sorting. But I hope this helps in reviewing.

binste · 2023-01-25T06:21:56Z

A better example is probably the one in this comment.

import altair as alt
import pandas as pd

dfdict = {
    "Questions": {
        0: "Question Text",
        1: "Question Text",
        2: "Question Text",
        3: "Question Text",
        4: "Question Text",
        5: "Question Text",
    },
    "level": {
        0: "1 - Strongly Disagree",
        1: "2 - Disagree",
        2: "3 - Neutral",
        3: "4 - Agree",
        4: "5 - Strongly Agree",
        5: "N/A",
    },
    "value": {0: 1.4, 1: 5.7, 2: 10.0, 3: 32.9, 4: 47.1, 5: 2.9},
}

df = pd.DataFrame(dfdict)

sort_order = [
    "N/A",
    "5 - Strongly Agree",
    "4 - Agree",
    "3 - Neutral",
    "2 - Disagree",
    "1 - Strongly Disagree",
]

# This doesn't seem to be needed
df["level"] = pd.Categorical(df["level"], sort_order, ordered=True)

chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X("value"),
        y=alt.Y("Questions", title=""),
        color="level",
        # Vega-Lite names the generated field <channel>_<field>_sort_index
        order=alt.Order("color_level_sort_index:Q"),
    )
)

chart

Behaviour with this PR:

You can see that color is represented as ordinal and that the colors are sorted according to sort_order.

mattijn

I understand the PR now. I think it is useful: if you prepare your pandas dataframe carefully using ordered categoricals, it is nice if Altair respects that.

It would, however, be the first time that the sort option of an encoding channel is set based on the type of a pandas data frame column.

I found one more related comment to this suggested change (#899 (comment)):

Yeah, the fundamental issue is that JSON does not have the concept of ordered categoricals, and Vega ingests data as JSON. We could try to do something like infer the sort order from pandas dataframes the same way we infer the data type and then add sort info to the schema in the appropriate place, but that seems a bit too magic to me, and quite likely to lead to unintended side-effects.
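
The point in the quote can be seen in a small pandas-only sketch. It uses pandas' own to_json as a stand-in for whatever serialization Altair performs, which is an assumption made for illustration; the column name and values are made up.

```python
import json

import pandas as pd

levels = ["low", "mid", "high"]
df = pd.DataFrame(
    {"level": pd.Categorical(["mid", "low", "high"], categories=levels, ordered=True)}
)

# Serialize to JSON records: each value becomes a plain string.
records = json.loads(df.to_json(orient="records"))

# The category order only exists on the pandas side.
order_from_pandas = df["level"].cat.categories.tolist()
```

Nothing in records says that "low" comes before "mid", so any ordering has to be passed along separately, for example through a sort property in the chart specification.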

What could the unintended side-effects be? E.g., if I have a column with ordered categorical values, what will happen if I define it as type :N? Would Altair still respect the sorting in that case?

binste · 2023-01-25T18:19:26Z

If a different data type is specified such as :N then sort is not set and the behaviour is the same as on the master branch. This happens here https://github.com/altair-viz/altair/pull/2522/files#diff-25906c2570b127f0bb9cff4550bcc153006b2fdb36aea55ae804ca9a74eb4404R367. Just tested and verified this with an example.

I guess it can be surprising to people who are used to only the data types being inferred. But converting a pandas column to an ordered categorical is usually a rather conscious decision, so I'd expect them either to be happy with the choices Vega-Lite then makes for the color scale and ordering, or at least to figure out how to fix it. Certainly a feature that would be great to have in the release candidate.

mattijn · 2023-01-25T19:43:01Z

I agree. Since the introduced behaviour only targets pandas data frames with ordered categorical columns, the unintended consequences are, as far as I can see, limited.
The behaviour introduced by this PR will probably be appreciated by the limited user base that makes use of ordered categorical columns. All tests pass, so it's merged.
Thanks @joelostblom for this PR, thanks @binste for the review!

joelostblom force-pushed the ordered-categories branch from fabcbf9 to 3cff3b4 Compare November 18, 2021 03:47

joelostblom force-pushed the ordered-categories branch from 3cff3b4 to 8554d2a Compare November 18, 2021 03:56

joelostblom mentioned this pull request Mar 30, 2022

Support Vega-Lite's optional encoding types #2584

Open

joelostblom requested a review from jakevdp March 30, 2022 18:26

joelostblom force-pushed the ordered-categories branch from 8554d2a to b0641aa Compare January 6, 2023 10:51

joelostblom requested a review from mattijn January 6, 2023 11:38

binste requested changes Jan 18, 2023

joelostblom added 5 commits January 19, 2023 13:56

9e35bce Represent pandas ordered categoricals as ordinal data
7d42367 Move new test to v5 from v4
3159663 Add notes about categorical sorting to the docs
54f5b0a Note that specifying the type explicitly remove the autodetection of the order
9ad570b Remove automatic sort order of categorical data if a non-ordinal type is specified

joelostblom force-pushed the ordered-categories branch from c878869 to 9ad570b Compare January 19, 2023 13:50

binste approved these changes Jan 20, 2023

mattijn reviewed Jan 25, 2023

mattijn added 2 commits January 25, 2023 20:34

dc74f9e Update altair/utils/core.py
8efc09e Merge branch 'master' into ordered-categories

mattijn merged commit 97ff1eb into vega:master Jan 25, 2023

mattijn mentioned this pull request Feb 12, 2023

tooltip throws error for Categorical variable #2879

Closed

joelostblom mentioned this pull request Feb 13, 2023

Remove the automatic sort of categoricals for channels that do not support sorting #2885

Merged
