-
Notifications
You must be signed in to change notification settings - Fork 793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove deep validation and instead use error hierarchy to improve error messages #2842
Conversation
… in jsonschema 2.3 or later. Altair requires > 3
…ma package to build more informative error messages
Great work @binste! @joelostblom is probably more capable in reviewing this than myself. 'asdf' is not of type 'null' In Python Altair, |
@mattijn Is this dictionary possibly relevant to your comment? (In another version, in |
I think translating those types would be very helpful to users! Thanks @ChristopherDavisUCI, definitely looks relevant. There is also https://github.com/altair-viz/altair/blob/a8172b7134ed262224e5dfb4e0c99d4bdc55bcb0/tools/schemapi/utils.py#L183 which is only slightly different. I'm happy to tackle this as part of a larger effort around types as mentioned in #2846 (comment) but if anyone gets a chance earlier to replace the types in the error messages then please, don't hold back! :) |
In short, this is looking fantastic! Thank you so much for digging into the depths of this @binste and sorry for the delay in reviewing. I am going to get to this in the next couple of days. On brief comment now, do you think it is confusing to see three error messages like in your example above? I don't know what would be a feasible solution technically here, but I am thinking that instead of
the message would be clearer as:
This is a minor comment though |
No worries and thank you for the kind words! :) Indeed, it would be a nice enhancement if the error messages can be merged into a one-liner. For that we probably need to check how consistent these strings are and if that changes across versions for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said above, this PR is fantastic and makes both the code logic and the error message much more straightforward! One additional benefit of this PR is that it makes the traceback much shorter. Before, it was common to have around 120 lines or so of traceback; now that is down to 20, which makes it much easier to find the error message without having to scroll so much.
Also a big thanks for the details comments in the code and explanations in the PR which made it much easier to review. I have a few minor comments and then we're good to merge.
Thank you @joelostblom for the review and the great suggestions! Just fyi, I'll probably get to implementing them in a week or so. Same for any other open PRs I'm involved with. |
All comments except two are implemented. For those I'll wait for your feedback. |
Co-authored-by: Joel Ostblom <joelostblom@users.noreply.github.com>
Thanks for the review! I also added a line to the changelog. Not sure if this is the preferred way moving forward to keep it up-to-date? Could be nice to include a line for every PR which is relevant for the changelog but let me know if I should remove it again. |
Yup I agree, I think each notable PR should come with a changelog entry so that we can easier keep track. Will merge when checks go green. |
Whoop whoop. Congrats @binste for this one! |
Happy that this approach worked out, thanks! :) |
Background
Many exception messages that a user sees in Altair are based on the validation of the chart specification against the vega-lite schema. This validation with the package
jsonschema
produces error messages on multiple level of the schema, e.g. if an error happens in a subschema, it produces errors for the subschema as well as the schemas above. For the user, the most informative errors are the one directly pinpointing the root cause, i.e. the errors on the lowest level where the wrong specification occurs.To generate and raise these "deep" errors, #555 introduced "deep validation". This meant that a chart is validated once using
jsonschema
, if a validation error is raised, the chart is validated again but this time the validation is performed iteratively on all individual subschemas in a recursive fashion. This was a great improvement and I'm not sure if there was a better way at that time. However, these subschemas by themselves are not always valid, see #2609 for an example, which in turn leads to new confusing error messages which have nothing to do with the original issue.Proposed solution
This PR removes the deep validation again. Instead, it only validates the chart specifications once and then uses the exception hierarchy which is returned by
jsonschema
to find the root cause. See the "Walkthrough" section below for more details. With this approach, we can still raise the exception which is at the root of the issue, but there is no need to validate individual schemas and therefore issues such as #2609 are solved:raises correctly:
Furthermore, we can combine exceptions on the same "hierarchy level" in the schema to provide even more messages:
Previously, the error only showed the line
'asdf' is not one of ['zero', 'center', 'normalize']
, althoughstack
could also be a boolean or null.I already added these examples and a few more to the tests to make sure that informative exceptions are raised. Let me know if you have more that I can test against!
Fixes #2175, fixes #2609, fixes #2744, related PRs which are probably no longer needed: #2841, #2835
I tested the code also against jsonschema 3.0 which is the minimum version required by Altair.
Walkthrough
To help with the review process, I try to outline below how we get from an invalid chart specification to the final error message which is raised. I can't put in links to all the places in the code I refer to as they are not modified in the PR. Let me know if you need more details.
Let's start with the example from above:
TopLevelMixin.to_dict
method which then ends up callingsuper(...).to_dict
here which propagates toSchemaBase.to_dict
SchemaBase.to_dict
callsself.validate
which uses thevalidate_jsonschema
function to validate the specification. This raises ajsonschema.ValidationError
. This error is catched inSchemaBase.to_dict
and turned into a customSchemaValidationError
which has better error messages and then raised again{'aggregate': 'sum', 'field': 'yield', 'stack': 'asdf', 'type': 'quantitative'} is not valid under any of the given schemas
which is not very informative. This is why the deep validation was originally introduced.TopLevelMixin.to_dict
and then the deep validation is run iterating through all objects in the specification and validating them individually. This raises the final error'asdf' is not one of ['zero', 'center', 'normalize']
to the user which is much better.With the new approach implemented in this PR, we use the error hierarchy which is in the
context
attribute of the originaljsonschema.ValidationError
which was raised. "If the error was caused by errors in subschemas, the list of errors from the subschemas will be available on this property.", see the jsonschema docs for all the details. These errors from the subschemas have again acontext
attribute which allows to find the root error. I added a new function _get_most_relevant_errors which tries to find the most relevant errors based on this hierarchy. For the example it looks like this:ValidationError: "{'aggregate': 'sum', 'field': 'yield', 'stack': 'asdf', 'type': 'quantitative'} is not valid under any of the given schemas
and from now on referenced withtop_error
top_error.context
givescontext
or if itsNone
. In this case, there is more.top_error.context[0].context
gives:.context
again on the first validation error in this list shows that it is None and therefore we reached the bottom of the hierarchy._get_most_relevant_errors
will return all 3 errors.validate_jsonschema
will then raise the first of them (ValidationError: "'asdf' is not one of ['zero', 'center', 'normalize']"
) and add the other two as_additional_errors
attribute to the first. For errors of typeadditionalProperties
only the first error on the lowest level is enough as the others do not contain more information, they are instead more verbose and somewhat confusing messages.SchemaValidationError
so it should not break any users code if they haveexcept
statements. We use the 2 additional errors to craft the final error message