-
Notifications
You must be signed in to change notification settings - Fork 793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Represent pandas ordered categoricals as ordinal data #2522
Conversation
@@ -193,8 +193,6 @@ def infer_vegalite_type(data): | |||
# Otherwise, infer based on the dtype of the input | |||
typ = infer_dtype(data) | |||
|
|||
# TODO: Once this returns 'O', please update test_select_x and test_select_y in test_api.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this refers to ordinal data, but I could not find these specific tests, but I updated the one test that seemed relevant.
fabcbf9
to
3cff3b4
Compare
3cff3b4
to
8554d2a
Compare
altair/vegalite/v4/tests/test_api.py
Outdated
.encode( | ||
alt.X("x", type="nominal"), | ||
alt.Y("y", type="ordinal"), | ||
alt.Size("s", type="nominal", sort=None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a little annoying that the sort categories has to be set to None manually when changing the data type, but if the shorthand is used 's:O' then this happens automatically (and the shorthands are probably the most commonly used syntax for this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the PR and it looks good to me! Only thing where it could be nice to dig some more is this part. I think this could be solved by a custom logic in https://github.com/altair-viz/altair/blob/master/tools/schemapi/schemapi.py#L365. Pseudocode (sorry didn't have time to test this more):
if "sort" in parsed_shorthand and kwds.get("type") != "ordinal":
parsed_shorthand.pop("sort")
Given that "sort" is only added in parsed_shorthand
if an ordered pandas categorical was used, it seems safe to remove it here if the type is specified already.
Not sure if there are other places which would need to be modified. Btw, this code is only in schemapi.py
after you merged in the latest changes from master due to #2813
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks @binste ! I had to add Undefined
to your suggested solution, otherwise it would incorrectly remove the sorting for categorical data where the type was not manually specified, does that makes sense to you? I rebased on master and pushed the latest changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, that makes sense!
Sorry I noticed that I forgot to click "Request review" here initially. |
Would love to see support for pandas ordinals! @jakevdp, ptal? |
I noticed that the changes to the tests are aiming vegalite/v4, you probably want to change this to vegalite/v5 if you think this PR should be merged @joelostblom. |
8554d2a
to
b0641aa
Compare
@mattijn I moved the tests to v5 instead of v4 and also updated the docs. Feel free to give this a review if you have the time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment as a reply to yours.
c878869
to
9ad570b
Compare
I looked to the code changes and looked to the related issues PRs, but I still feel I miss a bit of context. Would it be possible to include a comment with a very simple example of the behavior before this PR and the behavior incorporating the changes of this PR? |
ExampleBased on this comment. Data and imports: import altair as alt
import pandas as pd
from vega_datasets import data
source = data.barley() If a field is passed in as an ordered site_lst = ['Crookston', 'Morris', 'University Farm','Duluth', 'Grand Rapids', 'Waseca']
source.site = pd.Categorical(source.site, site_lst, ordered = True) Altair so far ignored this ordering and just treated it as a nominal field as if it would be a column of strings. This means that in the current master branch, the colors are not sorted according to alt.Chart(source).mark_bar().encode(
x='sum(yield)',
y='variety',
color="site",
) Running the same code again on the branch of this PR, Altair does two things differently:
So this PR basically expands To sort the stacked bar charts as well, one still needs to use the trick documented by @mattijn here. Note that this is maybe not a good example to show the use of ordinal data as the data is probably best represented as |
A better example is probably the one in this comment. import altair as alt
import pandas as pd
dfdict = {
"Questions": {
0: "Question Text",
1: "Question Text",
2: "Question Text",
3: "Question Text",
4: "Question Text",
5: "Question Text",
},
"level": {
0: "1 - Strongly Disagree",
1: "2 - Disagree",
2: "3 - Neutral",
3: "4 - Agree",
4: "5 - Strongly Agree",
5: "N/A",
},
"value": {0: 1.4, 1: 5.7, 2: 10.0, 3: 32.9, 4: 47.1, 5: 2.9},
}
df = pd.DataFrame(dfdict)
sort_order = [
"N/A",
"5 - Strongly Agree",
"4 - Agree",
"3 - Neutral",
"2 - Disagree",
"1 - Strongly Disagree",
]
# This doesn't seem to be needed
df["level"] = pd.Categorical(df["level"], sort_order, ordered=True)
chart = (
alt.Chart(df)
.mark_bar()
.encode(
x=alt.X("value"),
y=alt.Y("Questions", title=""),
color="level",
order=alt.Order("color_site_sort_index:Q"),
)
)
chart You can see that color is represented as ordinal and with the colors sorted according to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can understand the PR. I think it is useful, if you prepare your pandas dataframe nicely using ordered categoricals it is nice if that would be respected by Altair.
Although it would be a first that the sort
encoding channel option would be set based on the type of a pandas data frame column.
I found one more related comment to this suggested change (#899 (comment)):
Yeah, the fundamental issue is that JSON does not have the concept of ordered categoricals, and Vega ingests data as JSON. We could try to do something like infer the sort order from pandas dataframes the same way we infer the data type and then add sort info to the schema in the appropriate place, but that seems a bit too magic to me, and quite likely to lead to unintended side-effects.
The unintended side-effects.. what can it be? Eg. if I have a column with ordered categorical values. What will happen if I define it as type :N
. Would Altair in that case still respect the sorting?
If a different data type is specified such as I guess it can be surprising to people who are used to that only the datatypes are inferred. But converting a column in pandas to an ordered categorical one is usually a rather conscious decision so I'd expect them to either be happy about the choice that Vega-Lite then makes in terms of color scale and ordering or that they at least figure out how to fix it. Certainly a feature which would be great to have in the release candidate. |
I agree. Since the introduced behaviour is only targeting pandas data frames with ordered categorical columns the unintended consequences are, as far I can see, limited. |
A suggestion for how Altair could support ordered pandas categoricals. This doesn't deal with the order of segments in a stacked bar chart, but I think that is something that should be fixed in VL (vega/vega-lite#1734) rather than here.
close #245, ref #2170