Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a better abstraction layer for efficiently handling annotations #1065

Open
jviotti opened this issue Aug 29, 2024 · 11 comments
Open

Comments

@jviotti
Copy link
Member

jviotti commented Aug 29, 2024

In JSON Schema, annotations are JSON values associated to a given instance location (a JSON Pointer) and evaluation path (a JSON Pointer). A container to hold annotations, in theory could be something like this:

Map<Pair<Pointer, Pointer>, Set<JSON>> annotations;

Note that there can be multiple JSON values associated with a given instance location and evaluation path, but duplicates are not taken into account.

Write Operations

Insertion: You need to be able to insert a JSON document into a given instance location and evaluation path. The challenge here is to avoid copying the given pointers and the JSON value.

insert(const Pointer &instance_location, const Pointer &evaluate_path, const JSON &value) -> void;

Removal: You need to be able to remove all annotations produced at a given evaluation path or its children, independently of its instance location. For example, if you want to remove annotations at evaluate path /properties/not, you also remove any annotation produced at evaluate path /properties/not/foo and /properties/not/foo/bar/baz.

remove(const Pointer &evaluate_path) -> void;

Read Operations

Adjacent containment: You need to be able to check if a given set of keywords (OR) emitted the given annotation value for a given instance location, sibling to the given evaluation path.

defines_adjacent(const Pointer &instance_location, const Pointer &evaluate_path, const Set<String> &keywords, const JSON &value) -> bool;

For example, you might be looking for the annotation value "foo" at instance location /myObject, from keywords bar or baz, sibling to evaluation path /one/two/three. That means the following combinations are valid:

  • /myObject / /one/two/bar / "foo"
  • /myObject / /one/two/baz / "foo"

General containment: You need to be able to check if a given set of keywords (OR) emitted the given annotation value for a given instance location, sibling to the given evaluation path or on any of its children

defines(const Pointer &instance_location, const Pointer &evaluate_path, const Set<String> &keywords, const JSON &value) -> bool;

For example, you might be looking for the annotation value "foo" at instance location /myObject, from keywords bar or baz, sibling to evaluation path /one/two/three. That means all the following combinations are valid:

  • /myObject / /one/two/bar / "foo"
  • /myObject / /one/two/baz / "foo"
  • /myObject / /one/two/whatever/a/b/c/bar / "foo"

Real Example

Consider the following schema. If the instance is an object, if it defines the property foo, the latter must be a string. If it defines any other property that is not foo, that property must be an integer.

{
  "properties": {
    "foo": { "type": "string" }
  },
  "additionalProperties": {
    "type": "integer"
  }
}

Now consider the following instance:

{ "foo": "bar", "baz": 1 }

Relying on annotations, the evaluation goes as follows:

  • First, evaluate properties. The foo property is defined and it validates against the given subschema
  • Emit the annotation value "" (instance location, the root of the document) / "/properties" (evaluation path) / "foo" (value)
  • Then, evaluate additionalProperties
  • This keyword will check any property in the instance that is not defined as an annotation at "" (instance location) / "/properties" OR "" (instance location) / "/patternProperties"
  • For the ones that don't match any, validate
@jviotti
Copy link
Member Author

jviotti commented Aug 29, 2024

@michaelmior I just added a real example based on properties and additionalProperties, but yes:

  • Instance Location: The Pointer to the part of the instance being evaluated
  • Evaluation Path: The Pointer to the schema that shows the "call stack" of keywords evaluated

On evaluation path, you might start evaluating /properties, which then takes you to /properties/foo/type, which then takes you to /properties/bar/type, etc. It is where you are in the evaluation process with regards to the schema.

@jviotti
Copy link
Member Author

jviotti commented Aug 29, 2024

In the current evaluator implementation, we directly keep the instance location and evaluation paths as pointers that we push to and pop from when executing every individual instruction:

Pointer evaluate_path_;
Pointer instance_location_;

@michaelmior
Copy link
Contributor

Perfect! That clarifies things. Is it true that the actual value of annotations never need to be returned in the current implementation?

@jviotti
Copy link
Member Author

jviotti commented Aug 29, 2024

Correct. You never need to "retrieve" the annotation values, just check containment given the adjacent and non-adjacent algorithms. That said, we do pass references to the value at the time of emission to the evaluation callback to the user for them to hook on them if needed (for whatever they are doing or for reporting reasons). But internally, the evaluator doesn't need them.

@tony-go
Copy link
Contributor

tony-go commented Aug 30, 2024

hey @jviotti :)

I gave a first look at this document.

Here are my questions mark:

  • I struggle to understand why do we have two pointers here. Not why we pass two pointers, because I guess that it is due consumer constraints. But rather, why should we store them for annotations. The way that I understand it is that evaluate_path is just an extension of instance_location. Then I wonder if we could not have a path splited as tree.
    Or maybe do you prefer to use a Map because its way more easier to implement?

  • Regarding the defines_adjacent and defines APIs. When you said "from keywords bar or baz": Is that mean that const Set<String> &keywords contains keyword that have a "OR" relationship? What if we want a "AND" relationship?

@jviotti
Copy link
Member Author

jviotti commented Aug 30, 2024

@tony-go

I struggle to understand why do we have two pointers here. Not why we pass two pointers, because I guess that it is due consumer constraints. But rather, why should we store them for annotations. The way that I understand it is that evaluate_path is just an extension of instance_location. Then I wonder if we could not have a path splited as tree.
Or maybe do you prefer to use a Map because its way more easier to implement?

They are not related to each other. Each represent an orthogonal path that might change on its own. For example, consider this schema and instance based on the one I showed above:

// Schema
{
  "properties": {
    "foo": { "type": "string" }
  },
  "additionalProperties": {
    "type": "integer"
  }
}

// Instance
{ "foo": "bar", "one": 1, "two": 2 }

The evaluation with regards to those pointers goes like this :

  • Evaluation path "" (the root of the schema) to instance location "" (the root of the instance)
  • Evaluation path "/properties" to instance location "", as we evaluate properties against the top level object
  • Evaluation path "/properties/foo/type" to instance location "/foo", as we evaluate the foo subschema against foo
  • Evaluation path "/additionalProperties" to instance location "/one", the first additional property
  • Evaluation path "/additionalProperties" to instance location "/bar", the second additional property

As you see, each moves pretty much independently and you cannot consider one as an extension of the other.

I suggested a map above as a theoretical representation, but if we can do a tree, that sounds good too. Whatever satisfies those requirements.

Regarding the defines_adjacent and defines APIs. When you said "from keywords bar or baz": Is that mean that const Set &keywords contains keyword that have a "OR" relationship? What if we want a "AND" relationship?

It is an OR. I'll clarify!

@jviotti
Copy link
Member Author

jviotti commented Aug 30, 2024

@tony-go @michaelmior One thing that could help with allocations is the observation that for the evaluation path, its JSON Pointer tokens almost always correspond to JSON Schema keywords (i.e. /propreties/foo/items/type). In that way, the tokens are not completely arbitrary and repeat quite often. For example, most schemas use properties and type.

One interesting thing we could do is have a variant of the sourcemeta::jsontoolkit::Pointer class (or extend that one?) so that you can instantiate it based on a "pool" of potential property tokens, so that instead of allocating the strings every time, we reference them from the pool?

i.e. imagine you can construct the pointer with a second argument that is the pool or something like that

@michaelmior
Copy link
Contributor

so that you can instantiate it based on a "pool" of potential property tokens

I think that makes sense. I suspect that you'll have a lot of repetition even among tokens that aren't keywords, so the pool approach makes sense to me even without pre-populating the pool.

@tony-go
Copy link
Contributor

tony-go commented Aug 30, 2024

@jviotti

As you see, each moves pretty much independently and you cannot consider one as an extension of the other.

Yeah, way more clear now! Also I wonder if we could not rename them to:

  • shcema_location / instance_location
  • schema_pointer / instance_pointer

I think it would be less confusing if it was like that. WDYT?

It is an OR. I'll clarify!

❤️

@jviotti
Copy link
Member Author

jviotti commented Aug 31, 2024

I think it would be less confusing if it was like that. WDYT?

We shouldn't be inventing our own names. instance_location and evaluate_path are the right terms defined by the JSON Schema specification itself.

@jviotti
Copy link
Member Author

jviotti commented Aug 31, 2024

@michaelmior

Perfect! That clarifies things. Is it true that the actual value of annotations never need to be returned in the current implementation?

Actually, I'm realising there is one case where we do need this: prefixItems emits an index of the last array position covered by it, which items needs to get. So I guess you do need to get all of the annotations produced for a certain location after all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants