Determine behavior of $ref #66

Closed
awwright opened this issue Sep 21, 2016 · 74 comments

@awwright
Member

"$ref" is causing a lot of problems because it's been inconsistently implemented. Determine:

  1. Proper URI base
  2. How to validate instances that are literally {"$ref":"some string"}
  3. Support for constants/non-schema references

@sgpinkus

sgpinkus commented Sep 22, 2016

@sam-at-github I'm proposing a behavior that falls in line with a lot of current implementations, where you can only use $ref in places where a schema is expected, meaning you can use "$ref" literally in places that expect a literal value (like "enum" and "properties")

So yeah, my general objection is that you're taking something that makes sense standalone and making its behaviour dependent on JSON Schema. For example, if $ref is independent of JSON Schema one does this:

   dereferenced_schema_doc = JSONDereferencer.deref(some_doc)
   validation_results = validate(dereferenced_schema_doc, some_doc)

Step one and step two are independent. You're proposing that step one has to know about the structure of JSON Schema.
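
A minimal sketch of what such a schema-agnostic "step one" could look like, assuming only same-document references written as JSON Pointer fragments. The names resolve_pointer and deref are hypothetical stand-ins for JSONDereferencer.deref and are not taken from any library; percent-decoding and remote documents are left out.

```python
def resolve_pointer(doc, fragment):
    """Resolve a '#/a/b' style JSON Pointer fragment against doc."""
    if not fragment.startswith("#"):
        raise ValueError("this sketch only handles same-document references")
    node = doc
    # "#" alone yields no tokens and therefore returns the whole document
    for token in fragment[1:].split("/")[1:]:
        # RFC 6901 unescaping: "~1" first, then "~0"
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def deref(doc):
    """Eagerly replace every {"$ref": "<fragment>"} object, wherever it occurs."""
    def walk(node):
        if isinstance(node, dict):
            if isinstance(node.get("$ref"), str):
                return walk(resolve_pointer(doc, node["$ref"]))
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(item) for item in node]
        return node
    return walk(doc)


doc = {
    "definitions": {"name": {"type": "string"}},
    "properties": {"first": {"$ref": "#/definitions/name"}},
}
print(deref(doc)["properties"]["first"])
```

Because it knows nothing about JSON Schema, this version also rewrites a {"$ref": ...} object that appears as a literal value (for example inside "enum"), and it fails with a RecursionError on a recursive schema.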

@awwright
Member Author

$ref already seems to be dependent on JSON Schema behavior, because "id"
sets the base URI, and it has to be late bound to support recursive schemas.

What I figure is we're defining a new media type anyways, we can get to
define how we represent hyperlinks.

In any event, we'll have to write in the behavior if there's no
standards-level spec to reference. The question is, what's the behavior.

@sgpinkus

sgpinkus commented Sep 22, 2016

$ref already seems to be dependent on JSON Schema behavior, because "id" sets the base URI, and it has to be late bound to support recursive schemas.

id is supposed to establish a base URI for relative URI resolution. It's a JSON schema specific way of doing Base URI resolution. The JSON Reference spec leaves how it is done undefined, just like URI spec does:

It is beyond the scope of this specification to specify how, for each
media type, a base URI can be embedded.

You're right, JSON Schema imposes an especially complex method that requires tight coupling... But note it's still a qualitatively different type of dependency on JSON Schema than restricting where a $ref can occur and what the target must be based on a JSON Schema structural definition.

If you want to talk about features that should be dropped because they are practically useless and not widely implemented, I would say id base URI resolution is at the top of the list!

@handrews
Contributor

handrews commented Sep 23, 2016

I am strongly in favor of obliterating id from the specification. It's exceptionally confusing and in the course of publishing nearly 20 service definitions involving over 600 JSON Schemas we did not find any use for it whatsoever.

$ref on the other hand is extremely useful and easy to understand, completely independent of JSON Schema.

@awwright
Member Author

@handrews You should post this critique in the appropriate issue, and reference the part of the current 'master' draft that's too confusing.

It serves the same purpose as "base" and a rel=self link in HTML, which shouldn't be a confusing concept at all: it gives the document a base to resolve URI references against, and it lets you bundle multiple 3rd party schemas into a "definitions" section without needing to make any changes to them.

@handrews
Contributor

@awwright Will do. From what I've seen the current "master" is substantially less confusing. There are sensible uses but it opens up a lot of complexity if it still has the same capabilities as in v4.

@sgpinkus

It serves the same purpose as "base" and a rel=self link in HTML, which shouldn't be a confusing concept at all: it gives the document a base to resolve URI references against,

Serves the same broad purpose, yes, but in a much more complicated way. HTML4 BASE occurs once per document. Plus, in HTML you don't actually need to dereference every reference to read a given document.

When present, the BASE element must appear in the HEAD section of an HTML document, before any element that refers to an external source. The path information specified by the BASE element only affects URIs in the document where the element appears.

@awwright
Member Author

@sam-at-github There's precedent for this in other technologies though, RFC3986 describes how this works in general, XML has xml:base (which works in application/xhtml+xml), and Atom I think too, and HTML has iframes.

In HTML you do have to parse the entire document starting from the top, URI references are resolved at the same time. (And if you use the DOM to change the base, they all have to be re-computed!)

Most of the issue is that JSON Schema is context-free, so we can't enforce a restriction that a keyword must only appear in a root schema, any root schema must also be a valid subschema, and behave the same way. And this is intentional, so that you can bundle together third party schemas into a "definitions" section.

@sgpinkus

sgpinkus commented Sep 24, 2016

OK, so the general argument for id is embedding schema within schema. I understand that id is providing a technical capability here. I just don't think it's a) practically useful or b) widely implemented (TODO: survey), and as such it's an unnecessary barrier to conformance.

The recurring example of where this might actually be practically useful seems to be:

so that you can bundle together third party schemas into a "definitions" section.

Is that actually done anywhere in the wild though? The alternative is to just not embed independent schema in your schema and to use absolute URLs (or paths). That is what I've been doing. Works fine. What is the argument for why this is so much less appealing than embedding schema in your schema? Something to do with efficiency, right? I don't buy it.

@awwright You say that JSON Schemas are "context free". But all JSON Schema are loaded from a resource location. They have this implicit context. This is the base URI one resolves against in the absence of an id. There is also "$schema" which - "MUST be located at the root of a JSON Schema".

@handrews
Contributor

@awwright I think the use case for "definitions"-based re-use needs some thought, at which point the desirability of id as the mechanism for this will be more clear. I'm basing this, of course, on the 600+ schema system I mentioned, which did not use id. What we did isn't an ideal solution either (combine a bunch of schemas in a JSON document which is not itself a schema; there were reasons but I'm not going into it right now). But it indicates to me that there are viable alternatives that are less problematic than id.

Actually I think multi-schema integration in general needs some thought. I will give it some and file new issues / comment on existing issues appropriately. I understand the precedents and mechanisms (I had to in order to explain to people writing schemas what it did, which was not what they thought it did, and why they should never use it). But even if/when URI resolution needs adjustment, this has never felt like a good way to accomplish it.

@awwright
Member Author

awwright commented Sep 24, 2016

Is that actually done anywhere in the wild though??

I've done it on a few cases! Primarily now, though, I'm storing schemas I use (including cached 3rd party schemas) in a document database, and looking them up by "id".

They have this implicit context.

There is a context, but it only has two properties, both strings: (1) the URI base (set by "id"), and (2) the schema vocabulary (set by "$schema"). This is why JSON Schema suggests every root schema set these, so it doesn't exhibit unknown behavior.

These keywords can be found in sub-schemas too, where they set a new "context" in the same fashion. So it's not an issue that "$schema" MUST be present in a root schema (though this isn't the case in the current 'master' draft, where it's merely SHOULD). Any root schema (following the suggested behavior) can be embedded as a sub-schema without any changes and without a change in behavior.

This is kkiiiind of getting off track though. What does this have to do with $ref?

@handrews
Contributor

handrews commented Sep 26, 2016

This is kkiiiind of getting off track though. What does this have to do with $ref?

@awwright you earlier justified tying $ref to JSON Schema by saying:

$ref already seems to be dependent on JSON Schema behavior, because "id"
sets the base URI, and it has to be late bound to support recursive schemas.

and that seems to have prompted a side discussion about killing off id. Which I agree should not be going on in this issue and I apologize for my part in the derailment.

I'm still generally a bit puzzled by this concern over $ref, as it was always one of the least problematic things about working with JSON Schema for me and the teams I worked with. At least once people understood how JSON Pointers work as URI fragments, which also wasn't hard.

@epoberezkin
Member

epoberezkin commented Oct 11, 2016

I think killing ID altogether is a terrible idea: you need a way to link schemas in multiple files, and within the file you want to use IDs as the base for resolution so you don't have to write the whole URI, only the file name.

At the same time the embedding argument is weird - I think it simply should be not allowed to use root schema as a subschema (and it's easy to have a meta-schema that would respect that distinction).

However much time I invested into correctly managing $ref resolution in Ajv (it seems to be the only validator that fully supports the spec), I like the idea of renaming id to $id (for consistency) and restricting it to the top level and using JSON-pointers for everything else.

@fge was expressing similar views repeatedly, and the only use case which I didn't like losing at the time was "named dependencies", where you give shorter ids to them. But I would happily trade this convenience for the simplification and the consistency of support in all validators.

At the same time, I would say that validators MUST support recursive and mutually recursive references (both within a single root schema and between root schemas) as it is the only way to define recursive data structures - trees, graphs, etc. It also means that $refs cannot be resolved in all cases and the "final"/"resolved" schema cannot be generated.

@awwright
Member Author

After implementing my "jsonschema" package, I'm having a hard time imagining how limiting the functionality would make implementation or usage any easier, since the "context free" paradigm is so central to schemas (that the behavior for a schema and a subschema is the same).

But you're using an entirely different approach than I am.

Performance is explicitly not the top priority for "jsonschema" the way it is for you; being reference-quality and customizable is. I recall mine being the first ECMAScript implementation to fully pass the JSON Schema Test Suite (right now it appears to pass all 824 tests from the test suite, though it skips network tests and, of course, bigint/arbitrary-precision tests).

Presently, I'm working on a brand new implementation from scratch that takes a JSON document as a stream, and reports errors with line numbers, that does support arbitrary precision numbers, and generally has better error reporting, especially for very large (possibly indefinitely large) JSON documents. I'll report back how that goes.


Anyways, I'm primarily trying to figure out here, since it's important that JSON Schema has no edge cases, how do we handle validation of instances that look literally like {"$ref": "some string", "$refb": "more strings, ..."}

As of the current master branch, "$ref" is only interpreted as a reference where a schema is expected. Does anyone have a list of cases in the wild where it's being used otherwise?

@handrews
Contributor

The only requirement for instance validation is that "$ref" be treated as a literal key name in properties, correct? Any other location, whether it expects a schema or not, is unambiguous, so it could be allowed anywhere else.

@awwright
Member Author

Possibly also as a value for "enum", and proposed stuff like "constant" or custom properties.

@epoberezkin
Member

Also dependencies and patternProperties. I think allowing $ref only in places where a schema is expected is a saner approach than allowing $ref to be used anywhere else.
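
To make "only where a schema is expected" concrete, here is a hedged sketch of a schema-aware walker. The keyword tables are an assumption based on the draft-04 vocabulary, find_refs is a hypothetical helper rather than part of any validator, and the emitted paths are not JSON Pointer escaped.

```python
# Keywords whose value is a single subschema
SCHEMA_KEYWORDS = {"additionalItems", "additionalProperties", "items", "not"}
# Keywords whose value is a list of subschemas ("items" has both forms)
SCHEMA_ARRAY_KEYWORDS = {"allOf", "anyOf", "oneOf", "items"}
# Keywords whose value maps arbitrary names to subschemas
SCHEMA_MAP_KEYWORDS = {"properties", "patternProperties", "definitions", "dependencies"}


def find_refs(schema, path="#"):
    """Yield (location, target) for each "$ref" found in a schema position."""
    if not isinstance(schema, dict):
        return  # booleans, string arrays in "dependencies", and so on
    if isinstance(schema.get("$ref"), str):
        yield path, schema["$ref"]
        return  # siblings of "$ref" are not inspected
    for key, value in schema.items():
        if key in SCHEMA_MAP_KEYWORDS and isinstance(value, dict):
            # The names inside the map are literal, even if one is "$ref"
            for name, sub in value.items():
                yield from find_refs(sub, f"{path}/{key}/{name}")
        elif key in SCHEMA_ARRAY_KEYWORDS and isinstance(value, list):
            for index, sub in enumerate(value):
                yield from find_refs(sub, f"{path}/{key}/{index}")
        elif key in SCHEMA_KEYWORDS:
            yield from find_refs(value, f"{path}/{key}")
        # Anything else ("enum", "default", unknown keywords) is treated as data


schema = {
    "properties": {
        "$ref": {"type": "string"},                # a property literally named "$ref"
        "child": {"$ref": "#/definitions/node"},   # an actual reference
    },
    "enum": [{"$ref": "#/not/a/reference"}],       # literal value, left alone
    "definitions": {"node": {"type": "object"}},
}
print(list(find_refs(schema)))
```

Only the reference under "child" is reported; the property named "$ref" and the object inside "enum" are left as literal data.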

@epoberezkin
Member

epoberezkin commented Oct 16, 2016

@handrews I am bringing here the conversation regarding how the $ref should be treated: inclusion vs validation (from issues #85 and #98).

Treating $ref as inclusion has two problems:

  1. recursive schemas
  2. $ref resolution inside referenced subschemas

Problem 1: recursive schemas

@awwright wrote above:

$ref already seems to be dependent on JSON Schema behavior, because "id"
sets the base URI, and it has to be late bound to support recursive schemas.

I am not sure what "late bound" means here if not "executing validation on the current part of the data instance using the referenced schema". @awwright could you explain what else could an implementation do if not validation?

The recursive data structures, and therefore recursive schemas that reference one another are very common - trees and graphs are used to represent many real world objects. I can point to some examples if necessary, but it seems quite obvious.

If we treat "$ref" as inclusion/structural manipulation, how does it work with recursion? Please bear in mind that in case you have mutual recursion between different files you cannot determine whether the $ref is recursive from the format of the URI.

I was relatively recently addressing issues with mutual recursion - see ajv-validator/ajv#210 (comment) and ajv-validator/ajv#240 . I am only posting these links as an illustration that a lot of people use recursive schemas in the wild, so we can't simply ignore this issue.

Problem 2: $ref resolution inside referenced subschemas

Another issue is reference resolution that would work differently, depending on whether you treat $ref as inclusion or as validation.

@awwright writes about it:

The only time there would be a difference is if the base URI changes. Which isn't a problem if your root schemas always have an absolute-URI "id" like JSON Schema recommends.

But it only solves the problem if you include the whole root schema that has ID. If you include the fragment, this fragment usually won't have id (or will have a relative id) to correctly change resolution scope. So if this fragment contains relative $ref to the schema from which it is included, the reference will not correctly resolve.

There is a test case in JSON-Schema-Test-Suite that illustrates this problem. If you treat $ref as inclusion the test will fail. I will post a slightly modified version here, so it is simpler to understand the problem (It is only modified to not rely on some assumptions that test-suite makes about schema IDs, the structure is the same).

Main schema:

{
    "id": "http://localhost:1234/schema.json",
    "properties": {
        "int": {
            "$ref": "definitions.json#/refToInteger"
        }
    }
}

definitions.json:

{
    "id": "http://localhost:1234/definitions.json",
    "integer": {
        "type": "integer"
    }, 
    "refToInteger": {
        "$ref": "#/integer"
    }
}

It all seems clear - property int points to "definitions.json#/refToInteger" which in its turn points to "#/integer" (that is a relative reference to "definitions.json#/integer"). If "$ref" is an instruction to validate referenced schema there is no problem. If "$ref" is an inclusion, then the main schema should be equivalent to this schema:

{
    "id": "http://localhost:1234/schema.json",
    "properties": {
        "int": {
            "$ref": "#/integer"
        }
    }
}

But the problem here is that this schema contains "$ref" that points to "#/integer" that is undefined in this schema. It was obviously present in "definitions.json", but as soon as we've included the fragment into the main schema we have lost that context.

That use case is very common in real world. When you define the collection of schemas in some domain space, it is a common practice to group many definitions in one file, so other schemas can reference them. And some definitions are usually referring to others, like in this example. So if these definitions were simply included they would not work.
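
The difference can be reproduced in a few lines. This is a sketch under stated assumptions: the two documents above are held in an in-memory dictionary keyed by their "id", and lookup and follow are hypothetical helper names, not functions from Ajv or the test suite.

```python
from urllib.parse import urldefrag, urljoin

DOCS = {
    "http://localhost:1234/schema.json": {
        "id": "http://localhost:1234/schema.json",
        "properties": {"int": {"$ref": "definitions.json#/refToInteger"}},
    },
    "http://localhost:1234/definitions.json": {
        "id": "http://localhost:1234/definitions.json",
        "integer": {"type": "integer"},
        "refToInteger": {"$ref": "#/integer"},
    },
}


def lookup(base_uri, ref):
    """Resolve ref against base_uri; return (document URI, target schema)."""
    doc_uri, fragment = urldefrag(urljoin(base_uri, ref))
    node = DOCS[doc_uri]
    for token in fragment.split("/")[1:]:
        node = node[token]
    return doc_uri, node


def follow(base_uri, schema):
    """Follow a chain of references, switching the base URI at every hop."""
    while isinstance(schema, dict) and "$ref" in schema:
        base_uri, schema = lookup(base_uri, schema["$ref"])
    return schema


root = "http://localhost:1234/schema.json"
start = DOCS[root]["properties"]["int"]

# "$ref" as delegation: the second hop is resolved against definitions.json
print(follow(root, start))

# "$ref" as plain inclusion: "#/integer" is resolved against schema.json instead
try:
    lookup(root, "#/integer")
except KeyError as missing:
    print("unresolvable after inclusion:", missing)
```

Carrying the base URI along with each hop is what keeps "#/integer" pointing into definitions.json; copying the fragment into the main schema drops that information.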

Conclusion

I understand that historically $ref started as a separate thing, based on another standard. But the spec, the official test suite and usage practice have all made $ref evolve and essentially become a special validation keyword, at least in some cases that are too important to ignore...

@awwright @handrews I am looking forward to your suggestions how these problems can be addressed in any simpler way (!) than treating "$ref" as a special validation keyword.

@epoberezkin
Member

epoberezkin commented Oct 16, 2016

And, by the way, if we decide to acknowledge that $ref is a special validation keyword, as I believe it deserves :), we can also drop the requirement to have it as the ONLY keyword in the schema and ignore everything else. We would finally be able to stop dancing around $ref with clunky allOf to do what we need.

If usage practice is any indication, people do mix $refs with other keywords, it seems natural. When I relatively recently introduced the option in Ajv to ignore other keywords used with $ref (it will be the default behaviour in the next version as per spec, but now it's an option) and added a warning that you should not be mixing them, I immediately got an issue asking to be able to suppress the warning.

I think ignoring other keywords with "$ref" is the worst thing we can do - it's unexpected and confusing when some keywords do not apply. Also it's quite difficult to detect false positives in validation - very few people add enough fail tests to their schemas to understand that some keywords are ignored. I think we should either allow mixing (to acknowledge existing usage practice and to de-clunkify compliant schemas) or make the schemas where "$ref" has siblings invalid (to avoid confusion and surprises).
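
The three options can be written down as a small normalization step. This is only an illustrative sketch: normalize_ref_siblings and its policy names are hypothetical and do not correspond to an Ajv option or to spec text.

```python
def normalize_ref_siblings(schema, policy):
    """Rewrite one schema object whose "$ref" has sibling keywords.

    policy is one of:
      "ignore" - keep only "$ref" and silently drop the siblings
      "allOf"  - treat the siblings as an additional constraint
      "reject" - refuse the schema outright
    """
    if "$ref" not in schema or len(schema) == 1:
        return schema  # nothing to normalize
    siblings = {key: value for key, value in schema.items() if key != "$ref"}
    if policy == "ignore":
        return {"$ref": schema["$ref"]}
    if policy == "allOf":
        return {"allOf": [{"$ref": schema["$ref"]}, siblings]}
    if policy == "reject":
        raise ValueError(f"keywords next to $ref: {sorted(siblings)}")
    raise ValueError(f"unknown policy: {policy}")


mixed = {"$ref": "#/definitions/x", "maxLength": 10}
print(normalize_ref_siblings(mixed, "ignore"))
print(normalize_ref_siblings(mixed, "allOf"))
```

Under "ignore" the "maxLength" constraint disappears without any error, which is the kind of false positive described above.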

@handrews
Contributor

@epoberezkin thanks for moving this here, and even more for expanding on your concerns in detail. I see what is going on now.

I am not sure what "late bound" means here if not "executing validation on the current part of the data instance using the referenced schema".

"late bound" just means that you only dereference the references as needed during the process of validation:

{
    "definitions": {
        "foo": {"properties": {"bar": {"$ref": "#/definitions/bar"}}},
        "bar": {"properties": {"foo": {"$ref": "#/definitions/foo"}}}
    },
    "type": "object",
    "properties": {"foo": {"$ref": "#/definitions/foo"}}
}

This schema validates instances like:
{"foo": {"bar": {"foo": {"bar": {"foo": {}}}}}}

and so on.

It works because, as it validates each child value, it goes through just the one reference needed to do that. Eventually it gets down to that innermost foo, which has no properties, so it doesn't need to dereference anything else, and validation passes. Which means the whole thing passes. No recursion problems.

We had a lot of recursive or mutually recursive situations in my last project and it was not a problem- this worked just fine.
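
A toy validator shows the late binding in action. It is a sketch that assumes only three keywords ("type": "object", "properties" and same-document "$ref"); is_valid and resolve_local are hypothetical names and not any real validator's API.

```python
SCHEMA = {
    "definitions": {
        "foo": {"properties": {"bar": {"$ref": "#/definitions/bar"}}},
        "bar": {"properties": {"foo": {"$ref": "#/definitions/foo"}}},
    },
    "type": "object",
    "properties": {"foo": {"$ref": "#/definitions/foo"}},
}


def resolve_local(root, ref):
    """Resolve a '#/a/b' reference inside the root schema."""
    node = root
    for token in ref.split("/")[1:]:
        node = node[token]
    return node


def is_valid(instance, schema, root):
    if "$ref" in schema:
        # Late binding: the reference is followed only when a value reaches it
        return is_valid(instance, resolve_local(root, schema["$ref"]), root)
    if schema.get("type") == "object" and not isinstance(instance, dict):
        return False
    if isinstance(instance, dict):
        for name, subschema in schema.get("properties", {}).items():
            # Properties absent from the instance never trigger a dereference
            if name in instance and not is_valid(instance[name], subschema, root):
                return False
    return True


instance = {"foo": {"bar": {"foo": {"bar": {"foo": {}}}}}}
print(is_valid(instance, SCHEMA, SCHEMA))
```

The recursion depth follows the depth of the instance rather than the schema, so the mutually recursive definitions never need to be expanded up front.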

So it's not "inclusion" in the sense of the C pre-processor where all of the inclusion happens before you run validation, and it would be possible to write out an equivalent schema with no "$ref". It's only "inclusion" in the sense that, at each level, one at a time, it is as if you have included that level.

In other words, it's just a difference of how the inclusion is implemented.

This does, of course, involve validating against the referenced schema, but that doesn't make "$ref" a validation keyword. Here are my definitions:

  • A validation keyword potentially changes the outcome of validation
  • A structural keyword changes how the schema is expressed without changing the outcome

To illustrate, I'll unroll the references enough to validate the same instance without further "$ref" dereferencing. This means I've applied the minimum transformations specified by the structural keyword "$ref", and I have not impacted the validation outcomes against any possible instance in any way.

This is the unrolled schema:

{
    "definitions": {
        "foo": {"properties": {"bar": {"$ref": "#/definitions/bar"}}},
        "bar": {"properties": {"foo": {"$ref": "#/definitions/foo"}}}
    },
    "type": "object",
    "properties": {
        "foo": {
            "properties": {
                "bar": {
                    "properties": {
                        "foo": {
                            "properties": {
                                "bar": {
                                    "properties": {
                                        "foo": {
                                            "properties": {
                                                "bar": {"$ref": "#/definitions/bar"}
                                             }  
                                        }
                                    }   
                                }   
                            }   
                        }   
                    }   
                }   
            }   
        }   
    }   
}

which also validates
{"foo": {"bar": {"foo": {"bar": {"foo": {}}}}}}
in the exact same way as the reference-only one does. The validator just doesn't need to do any dereferencing here as we have done it manually.

I'll address the id / resolution problem 2 in another comment. I just wanted to handle the easy case first and see if we could agree on this part.

@epoberezkin
Member

@handrews your approach essentially means that the schema with all "$refs" included depends on the data being validated. I.e. for each data instance you will have different equivalent schema without "$refs". I specifically was asking for a simpler way than treating "$ref" as a special validation keyword. Creating a new schema for each data instance kind of solves the problem, but seems more complex - you could have avoided this issue altogether.

A validation keyword potentially changes the outcome of validation

That depends on the point of view and on the definition of what the $ref is, not the other way around. If you consider "the result of the validation of the data against referenced subschema" to be the result of the validation of $ref keyword, then $ref keyword satisfies the definition. If you consider $ref to be a structural transformation, then it would satisfy the second clause.

@handrews
Contributor

we can also drop the requirement to have it as the ONLY keyword in the schema and ignore everything else

I don't think this works the way you think it does. If it works just like "allOf", we're not gaining anything except a tiny streamlining of syntax, and no difference in behavior (which makes it irrelevant to #98 where overwriting is needed).

If it is not just a shorthand for "allOf", that just turns it into $merge with less clear semantics, where

{
    "$ref": "#/definitions/x",
    "properties": {"y": {"type": "boolean"}}
}

is roughly equivalent to

{
    "$merge": {
        "source": {"$ref": "#/definitions/x"},
        "with": {"properties": {"y": {"type": "boolean"}}}
    }
}

Except without the clarity of application/merge-patch+json semantics. So either you specify them (in which case they are exactly equivalent) or you're making up yet another set of merge semantics (which seems like a bad idea, we've got two already with "$merge/$patch").

So if we want $merge/$patch, and then want to declare $ref plus other keywords to be a shorthand for $merge, that's fine. But the same objections from issue #15 apply whether we spell it as $merge or as this expanded $ref.
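
For reference, the application/merge-patch+json semantics (RFC 7386) mentioned above fit in a few lines. This sketch assumes the "source" reference has already been dereferenced; the schema x is made up for illustration and merge_patch is a hypothetical helper name.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) and return the patched value."""
    if not isinstance(patch, dict):
        return patch  # non-object patches replace the target wholesale
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null in the patch removes the member
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Illustrative stand-in for whatever "#/definitions/x" points at
x = {
    "type": "object",
    "properties": {"y": {"type": "string"}, "z": {"type": "integer"}},
}
merged = merge_patch(x, {"properties": {"y": {"type": "boolean"}}})
print(merged)
```

Here "y" ends up as a boolean and "z" is untouched, whereas combining the same two schemas with "allOf" would require "y" to be a string and a boolean at once. That overwriting is the behavioral difference at stake.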

@epoberezkin
Member

I'd rather we have a separate keyword for inclusion ($merge would do or any other) and treat $ref as validation - it would simplify things and also allow polymorphism.

@awwright
Member Author

@handrews Note JSON Schema already segregates keywords into classes, Core has "core keywords", JSON Schema Validation has "validation keywords" and "metadata keywords", Hyper-schema has, well, there's no name, but we can just call them "Hypermedia keywords"

@epoberezkin
Member

epoberezkin commented Oct 16, 2016

Let's not diverge to $merge/inclusion here, I will tomorrow post another suggestion to $merge #15.

@epoberezkin
Member

@awwright sorry, I just noticed this definition:

Since I filed this issue, we posted the new draft https://tools.ietf.org/html/draft-wright-json-schema-00.
So as I interpret it, a schema with a "$ref" property means two things: First, set the URI base to the target of the $ref. Then, substitute the keywords of the target schema into the $ref object.
That is to say, it should always be the same as a simple substitution, except the addition the URI base changes to the remote document's URI.

It's definitely a step in the right direction, although I think it does still have issues. I will write tomorrow, it is getting late here, sorry...

@handrews
Contributor

handrews commented Oct 16, 2016

@awwright , @epoberezkin : I went back and looked at problem 2 and I think that @awwright 's "First, set the URI base to the target of the $ref" produces the correct behavior.

I have always thought of it a bit differently, but with the same outcome.

Rather than copying over the id needed to change things and then substituting the values, I have always thought of it as the validator simply "running" the validation from the referenced location. So the id and resolution stuff work just fine without needing to use id to mess with it. This is why I've never had any use for id and just find it confusing. "$ref" is just "go over there and continue validating, and when you're done come back here and keep going as if nothing unusual happened."

I have not worked through whether that conceptual view causes problems elsewhere- I'd be interested in that but not enough to push anyone else through another long discussion. As long as the specification is focused on the outcome rather than implementation I am happy.

[EDIT]:
@epoberezkin while walking to and from the grocery store just now your rationale for describing "$ref" as involved in validation finally clicked. The way I think about "$ref" working is very similar to your description of "$ref" being defined to return the validation result of the thing being referenced. I still wouldn't call "$ref" a validation keyword, but I get what you meant now.

@handrews
Contributor

Literal "$ref" values

This was one of the original points of this issue, and I wanted to put a proposal on record even though I know there is movement to avoid the problem by restricting where "$ref" can be used (which I dislike).

Here is a proposal for how to have a literal "$ref" property name. Basically, you take the object that should have a literal "$ref" key and wrap it in another "$ref". So instead of a Reference object taking only a URI string value, it can either take a URI (current behavior) or an object (replace the reference with the literal object).

Only one level of literal escaping happens at a time. This is analogous to backslash escaping in many string formats- \ is a backslash token for escaping, \\ is a literal backslash, \\\ is a literal backslash followed by a backslash token for escaping, and so on.

The simplest form- strip off an outer "$ref" and treat the inner one as literal:
{"$ref": {"$ref": "foo"}} => {"$ref": "foo"}

While proper "$ref" objects should not have other properties, the object including the literal "$ref" may:
{"$ref": {"$ref": "foo", "x": "bar"}} => {"$ref": "foo", "x": "bar"}

Multiple levels evaluate one level at a time, so odd numbers leave the innermost reference as an actual reference:
{"$ref": {"$ref": {"$ref": "#/foo"}}} => {"$ref": {"$ref": "#/foo"}}
where the result is an object with a literal "$ref" property, the value of which is a reference to "#/foo".

So if you want to define a property called "$ref" you do so like this (including additionalProperties-false to show that interaction since it is so often problematic):

{
    "type": "object",
    "properties": {
        "$ref": {
            "$ref": {"type": "string"},
            "stuff": {"type": "boolean"}
        }
    },
    "additionalProperties": false,
    "required": ["$ref"]
}

Note that the "$ref" in the required array is not a problem because only objects can be references.

This would translate to (with the remaining "$ref" now considered a literal property name):

{
    "type": "object",
    "properties": {
        "$ref": {"type": "string"},
        "stuff": {"type": "boolean"}
    },
    "additionalProperties": false,
    "required": ["$ref"]
}

The following instances validate against that schema:

{"$ref": "foo"}
{"$ref": "bar", "stuff": true}

The following would not:

{"$ref": "foo", "x": 42}
{"stuff": true}
{}
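
The translation step of this proposal could be sketched as follows. This is one possible reading of the rules above, not an existing implementation: unescape_refs is a hypothetical name, and the sketch assumes an escape wrapper carries no keys other than "$ref".

```python
def unescape_refs(node):
    """Strip one level of "$ref" escaping wherever a "$ref" value is an object."""
    if isinstance(node, list):
        return [unescape_refs(item) for item in node]
    if not isinstance(node, dict):
        return node
    inner = node.get("$ref")
    if isinstance(inner, dict):
        # Escape wrapper: drop the outer "$ref"; the inner object is literal,
        # so its own "$ref" key is not examined again, only its values are.
        node = inner
    return {key: unescape_refs(value) for key, value in node.items()}


escaped = {
    "type": "object",
    "properties": {
        "$ref": {
            "$ref": {"type": "string"},
            "stuff": {"type": "boolean"},
        }
    },
    "additionalProperties": False,
    "required": ["$ref"],
}
print(unescape_refs(escaped))
```

The pass is meant to run exactly once over a schema: its output uses "$ref" as a literal key, so running it a second time would misread that key as another wrapper.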

@epoberezkin
Member

epoberezkin commented Oct 17, 2016

@awwright:

So as I interpret it, a schema with a "$ref" property means two things: First, set the URI base to the target of the $ref. Then, substitute the keywords of the target schema into the $ref object.
That is to say, it should always be the same as a simple substitution, except the addition the URI base changes to the remote document's URI.

Several problems here.

Base URI for $refs inside referenced schema

"First, set the URI base to the target of the $ref." - that is simply incorrect. The base URI for the included schema cannot be derived from $ref URI (which is what I assume you mean by "the target of $ref"). The base URI is determined by the included schema context in the source schema - it is the nearest resolved (in the context of source schema) ID attribute, starting from the root of the included schema. This can form the part of the definition, whatever $ref mechanism is agreed:

The base URI for the $ref keywords inside the referenced schema (or schema fragment) should be the resolved id attribute (resolved in the context of the source schema from which the schema fragment is included) that is present at the top level of the included schema, or else the nearest id attribute above it (up to the root level).

It is complex and that's where the motivation to only allow id attributes in the root schemas comes from - it would substantially simplify many things, this definition for example.

Recursion

"Then, substitute the keywords of the target schema into the $ref object." - this says nothing about when it should be done and how to deal with recursion.

Possible definition suggesting the inclusion paradigm could be:

The referenced schema (or schema fragment) should replace the schema object containing the $ref attribute. This replacement should only happen during validation, and only if there is data available to validate against this schema object, in order to allow for recursive and mutually recursive schemas.

This definition would work, but it's more complex than necessary and suggests the wrong paradigm. It is like defining a function call as "copying the function code in place and executing it". That's possible, but it both complicates things and limits future potential (e.g. passing parameters to the referenced schemas: instead of arguing about $ref, why not extend it to support parameters that can be used in the included schema? Just an idea).

Also with this definition the above notion of base URI simply doesn't make sense for any new user - one should ask, if we are copying the schema here why the hell are we resolving IDs based on its location in the source schema?

@handrews wrote:

I have always thought of it as the validator simply "running" the validation from the referenced location.

That's exactly what I am insisting on (although I don't understand how it correlates with @handrews insisting that $ref is a structural manipulation - if $ref "is running validation" using another schema, then $ref is a special validation keyword). If we agree that $ref has nothing to do with copying anything, two good things happen:

1. The definition becomes simpler:

$ref should validate the current part of the data instance using referenced schema (or schema fragment). It should support recursion

No need to say that it happens at validation time - it is obvious.

2. The logic for base URI also becomes sensible - because the schema is not copied anywhere, we obviously use its lexical context to resolve refs inside it. If $ref means copying then using dynamic context makes more sense and we have to make sure to include convoluted explanations how it works and why it is the case.
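To make the "delegation, not copying" reading concrete, here is a minimal sketch. It is an assumption-laden toy, not any real validator: it only handles `type` (object/string), `properties`, and same-document `$ref` values of the form "#" or "#/a/b", and the names `resolve_pointer`, `validate` and `LINKED_LIST` are invented for the illustration. A `$ref` simply hands the current piece of data to the referenced schema, so a self-referencing schema needs no special handling: the recursion is bounded by the depth of the data.

```python
def resolve_pointer(root, ref):
    """Resolve a same-document reference such as "#" or "#/properties/next"."""
    node = root
    for token in ref[1:].split("/")[1:]:
        node = node[token]
    return node


def validate(schema, instance, root):
    """Validate `instance`; a "$ref" delegates to the referenced schema."""
    if "$ref" in schema:
        # Nothing is copied: the same data is validated by the target schema.
        return validate(resolve_pointer(root, schema["$ref"]), instance, root)
    if schema.get("type") == "object" and not isinstance(instance, dict):
        return False
    if schema.get("type") == "string" and not isinstance(instance, str):
        return False
    if isinstance(instance, dict):
        for name, subschema in schema.get("properties", {}).items():
            if name in instance and not validate(subschema, instance[name], root):
                return False
    return True


# A recursive schema: "next" refers back to the root schema.
LINKED_LIST = {
    "type": "object",
    "properties": {
        "value": {"type": "string"},
        "next": {"$ref": "#"},
    },
}
```

With this, `validate(LINKED_LIST, {"value": "a", "next": {"value": "b"}}, LINKED_LIST)` succeeds, while a nested node whose "value" is a number fails.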

There is still a place for the actual inclusion that would neither support recursion nor use lexical context and instead use current context (similar to pre-compiler includes) - we can define an additional keyword for it or use $merge.

@epoberezkin
Member

epoberezkin commented Oct 17, 2016

@handrews:

Here is a proposal for how to have a literal "$ref" property name.

It is already supported, what is the problem here? All keys inside properties object are treated as property names. Why would you want to include internals of properties from elsewhere?

As I said repeatedly, I agree that there is a place for a simple inclusion, that can support non schema values as well (but not the internals of properties - that makes little sense tbh). Let's just define an additional thing called $include (or $merge) that would copy things in place and use current context for ref resolution (if the included thing is a schema).

@handrews
Contributor

It is already supported, what is the problem here?

There are cases suggested in various places (I'll have to look up the exact issues later) where the old broader "$ref is independent of schema syntax" would be useful. Or for that matter, just being able to $ref in some property names would be useful. And this issue brought up the literal case. So I proposed something for the record. I brought it up in some other issue already where it is more obviously applicable, but I wanted to consolidate the $ref stuff here.

@epoberezkin
Member

epoberezkin commented Oct 17, 2016

I see. I think given the conversation we are having above the idea of "$ref" being independent of schema should be dead by now regardless of which approach we decide on - it is quite dependent anyway...

@awwright
Member Author

"First, set the URI base to the target of the $ref." - that is simply incorrect. The base URI for the included schema cannot be derived from $ref URI

The thing I'm trying to get at is that there are multiple mechanisms that can change the base URI. RFC 3986 Section 5.1 describes how the base URI is established for resolving URI references, and JSON Schema implements this by normative reference. The process is:

  1. There's an application-specific default base URI (for example, a UUID).
  2. Then, if the document was retrieved at some URI, it is set to that (this includes reference by $ref, or telling a validator "validate this instance against the <http://example.com/foo> schema").
  3. Then the URI sent back by the server, if any, e.g. Content-Location
  4. Finally, apply any base URI embedded in the document (i.e. "id")

At any step, URI References are resolved against the URI base thus far to get the new URI base.
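A rough sketch of that layering, assuming Python's standard `urllib.parse` is an acceptable stand-in for RFC 3986 resolution (the function name `effective_base` and its parameters are hypothetical, chosen only for this illustration):

```python
from urllib.parse import urldefrag, urljoin


def effective_base(default_base, retrieval_uri=None,
                   content_location=None, embedded_id=None):
    """Layer each base-URI source on top of the previous one.

    Order follows the list above: application default, retrieval URI,
    URI reported by the server, then the "id" embedded in the document.
    """
    base = default_base
    for reference in (retrieval_uri, content_location, embedded_id):
        if reference:
            # Resolve against the base so far; a base URI carries no fragment.
            base = urldefrag(urljoin(base, reference)).url
    return base
```

One caveat with this stand-in: `urljoin` only resolves relative references against schemes it knows to be hierarchical, so a relative "id" on top of a bare `urn:uuid:` default comes back unresolved.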

@handrews
Contributor

I see. I think given the conversation we are having above the idea of "$ref" being independent of schema should be dead by now regardless of which approach we decide on - it is quite dependent anyway...

I don't follow this at all. Is this more about your conceptualization of $ref "returning" a validation result of the thing being referenced? That is part of why I say it is not a validation operation. It's just a reference. It's neither physical copy-over inclusion nor is it a function call in the overall validation equation.

@handrews
Contributor

@awwright I think I follow your URI resolution stuff. I am definitely with you on the base URI of the referenced schema being the URI by which it was identified (in the case of actually fetching the referenced schema, this is the request-URI for that retrieval, which as you note is standard behavior under RFC 3986).

Basically, the behavior should be the same whether you fetch it from the URI or load it from a cache indexed by the URI. If I understand correctly this is:

(the terminology may not be RFC-perfect, but please try to see the point here rather than jumping on terminology)

  • Load the referenced schema document from the server/cache (this ignores the fragment), and set the base to that reference URI (temporarily ignoring the fragment)
  • Apply the fragment to the reference schema document, applying any $schema and id keywords that you find along the way. If there's no id in the way, then by the end of applying the fragment, you're back to the full referencing URI that was in the "$ref"
  • Evaluate the referenced JSON found at the end of the fragment using whatever base URI has been built up to this point.
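The first bullet, and the "same behavior from server or cache" point, could be sketched like this. It assumes documents are cached under their fragment-less URI; `load_referenced_document` and the `fetch` callback are hypothetical names for the illustration, not part of any existing library.

```python
from urllib.parse import urldefrag


def load_referenced_document(ref_uri, cache, fetch):
    """Return (document, base_uri, fragment) for a resolved "$ref" URI.

    The fragment is set aside while loading, so a cache hit and a real
    fetch yield the same document and the same starting base URI.
    """
    doc_uri, fragment = urldefrag(ref_uri)
    if doc_uri not in cache:
        cache[doc_uri] = fetch(doc_uri)
    return cache[doc_uri], doc_uri, fragment
```

The fragment returned here is what then gets applied to the document in the second bullet.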

@epoberezkin
Member

epoberezkin commented Oct 18, 2016

@awwright wrote:

At any step, URI References are resolved against the URI base thus far to get the new URI base.

And that is not correct. You cannot, in general case, use resolved $ref URI as the base URI for the references inside the referenced fragment. The definition I wrote is the simplest one that is covering all cases.

Consider this example:

Source schema

{
  "id": "http://localhost:1234/schema1.json#",
  "properties": {
    "foo": {
      "id": "schema2.json#",
      "properties": {
        "bar": { "$ref": "#/baz" }
      }
    }
  }
}

Target schema, where a fragment from source schema is included:

{
  "id": "http://localhost:1234/schema3.json#",
  "properties": {
    "boo": { "$ref": "schema1.json#/properties/foo/properties/bar" }
  }
}

Now, following your logic the base URI for the innermost $ref "#/baz" will be the URI of the resolved "$ref" that included that schema: "http://localhost:1234/schema1.json#/properties/foo/properties/bar" and resolved $ref should be "http://localhost:1234/schema1.json#/baz" (which is not correct)

And that is incorrect, because there is a resolution scope change in the source schema that changes the base URI to "http://localhost:1234/schema2.json#" so the correct resolved "$ref" should be "http://localhost:1234/schema2.json#/baz"

EDIT: Supporting inline references by spec also means that instead of "schema1.json#/properties/foo/properties/bar" you should be able to use "schema2.json#/properties/foo/properties/bar" which makes it even messier.

I completely agree with @fge on this point - by allowing the resolution scope to be changed we have introduced a substantial mess here and made it very difficult to comply with the spec (at least judging by the fact that almost no validator does).

@handrews wrote that you should:

Apply the fragment to the reference schema document, applying any $schema and id keywords that you find along the way. If there's no id in the way, then by the end of applying the fragment, you're back to the full referencing URI that was in the "$ref"

Which is almost correct apart from $schema has nothing to do with it, only id attributes matter. That's where the motivation to only allow id in the root schema comes from.

@handrews
Contributor

Which is almost correct apart from $schema has nothing to do with it, only id attributes matter.

$schema sets the meta-schema, which could in theory instruct a client to recognize custom schema keywords. It wouldn't have anything to do with finding the referenced schema unless the custom properties are specified to change resolution scope. I'm not suggesting that as a typical thing, or even a good idea, but it's certainly possible.

Anyway, that's not all that interesting except that $schema and id are the only contextual keywords. $schema doesn't change resolution, but if it is different in the source and destination documents, the referenced schema should be considered by the meta-schema that its containing document specified. I think.

@handrews
Contributor

and resolved $ref should be "http://localhost:1234/schema1.json#/baz" (which is not correct)

No, this is what I was talking about when I said "applying any '$schema' and 'id' keywords along the way". The first $ref (to bar) is processed something like this:

  1. resolve http://localhost:1234/schema1.json#/properties/foo/properties/bar
  2. get http://localhost:1234/schema1.json (possibly from local cache)
  3. set scope to its top-level id which matches the request URI
  4. Apply the fragment JSON pointer of /properties/foo/properties/bar
  5. After applying /properties/foo set the scope to the id in the foo object, schema2.json
  6. resolve #/baz to http://localhost:1234/schema2.json#/baz
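Those six steps can be sketched in a few lines, using the source schema from the example above. This is only an illustration under assumptions: `urllib.parse.urljoin` stands in for RFC 3986 resolution, `base_at_pointer` is a made-up name, and the walk treats any string-valued "id" it passes as a base URI change, so it does not guard against a property that happens to be named "id".

```python
from urllib.parse import urldefrag, urljoin

# Source schema from the example earlier in the thread.
SCHEMA1 = {
    "id": "http://localhost:1234/schema1.json#",
    "properties": {
        "foo": {
            "id": "schema2.json#",
            "properties": {"bar": {"$ref": "#/baz"}},
        }
    },
}


def base_at_pointer(document, retrieval_uri, pointer):
    """Walk a JSON Pointer from the root, applying each "id" met on the way.

    Returns (base_uri, node) for the location the pointer ends at.
    """
    base = urldefrag(retrieval_uri).url
    node = document
    tokens = [t.replace("~1", "/").replace("~0", "~")
              for t in pointer.split("/")[1:]]
    # None stands for the root itself, which is visited before any token.
    for token in [None] + tokens:
        if token is not None:
            node = node[int(token)] if isinstance(node, list) else node[token]
        if isinstance(node, dict) and isinstance(node.get("id"), str):
            base = urldefrag(urljoin(base, node["id"])).url
    return base, node


base, bar = base_at_pointer(
    SCHEMA1,
    "http://localhost:1234/schema1.json",
    "/properties/foo/properties/bar",
)
resolved = urljoin(base, bar["$ref"])
```

Here `base` ends up as the schema2.json URI set by the id in foo, so `resolved` is the schema2.json#/baz URI from step 6 rather than the schema1.json one.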

@epoberezkin
Member

No, this is what I was talking about when I said "applying any '$schema' and 'id' keywords along the way".

Not sure what you mean by "no", but we are arriving to the same result, if you read what I wrote in the context of the whole comment.

@handrews
Contributor

@epoberezkin I will attempt to clarify :-)

Now, following your logic the base URI for the innermost $ref "#/baz" will be the URI of the resolved "$ref" that included that schema: "http://localhost:1234/schema1.json#/properties/foo/properties/bar" and resolved $ref should be "http://localhost:1234/schema1.json#/baz" (which is not correct)

And that is incorrect, because there is a resolution scope change in the source schema that changes the base URI to "http://localhost:1234/schema2.json#" so the correct resolved "$ref" should be "http://localhost:1234/schema2.json#/baz"

I read the above two paragraphs as you thinking that the correct resolved reference should be "http://localhost:1234/schema2.json#/baz" (we agree on that), but that you think @awwright's described algorithm would produce "http://localhost:1234/schema1.json#/baz" instead (we do not agree on this - hence the "no"). I was intending to show how @awwright's algorithm would be applied to produce the correct answer (assuming I'm understanding the algorithm correctly).

Does that make more sense or am I still missing the point of your objection?

@epoberezkin
Member

epoberezkin commented Oct 18, 2016

@handrews I understand. Your algorithm would produce correct results, that's what I wrote.
@awwright's description would produce an incorrect result, as it does not take the source context into account, only the resolved $ref URI (which is resolved in the target context).

I think we are both saying the same thing really, about what the base URI should be.

@handrews
Contributor

handrews commented Oct 18, 2016

I think we are both saying the same thing really, about what the base URI should be.

I guess we'll have to wait for @awwright to weigh in. I think I'm saying the same thing as him, you think I'm saying the same thing as you, but you also think that disagrees with him, so... let's see what he says.

@awwright
Member Author

awwright commented Oct 18, 2016

@epoberezkin Both schemas have absolute "id"s so any URI Reference will be resolved relative to that. For the distinction I'm making to matter, all URI References must be relative URI references.

Note there's no such thing as a "resolution scope" anymore, and note base URIs (and probably root schemas in general) do not have fragments, since fragments are always completely replaced with the one in the URI Reference. (Fun historical tidbit, fragments were a late addition to the URI grammar, subsumed from HTML).

And, apologies if the URI vocabulary this far is a bit confusing, but to reiterate:

URI = a full URI, including scheme, hier-part, an optional query, and an optional fragment
URI Reference = a full URI, or a relative URI. URI References are "resolved" into URIs against a base URI.
Absolute URI = a full URI, but without a fragment
Base URI = Special context-dependent Absolute URI that URI References are resolved against

Here's an example of the only case where it would matter. Suppose I have a schema

{ "allOf": [ {"$ref": "http://example.com/schema.json"} ] }

In order to validate an instance, I need to dereference the http://example.com/schema.json schema. I grab it from my database and it looks like this:

{
  "id": "/schemas/foo.json",
  "items": {"$ref": "item.json"}
}

Ok, so I see there's one URI reference in here. We need to know the URI of it in order to get it from the database. We figure out the base URI by working our way down to this URI reference:

  1. The application-dependent default URI, which might be e.g. <urn:uuid:aa95e0bb-2133-4bce-9542-d32f100a7b5b> (randomly generated per document)
  2. The URI the current document was retrieved at, if known. This changes the base URI to <http://example.com/schema.json>
  3. The "id" </schemas/foo.json> is resolved against the previous step, and we get <http://example.com/schemas/foo.json>. This forms our URI base.
  4. We resolve the <item.json> against the URI base to get <http://example.com/schemas/item.json>

I grab <http://example.com/schemas/item.json> from my database, and the process repeats itself...

The only effect this has is that, if I store the same copy of a JSON Schema at different paths, the URI References inside might get resolved to different URIs depending on which URI I used to look the document up.

So if instead my original schema looked like this:

{ "allOf": [ {"$ref": "https://example.net:8443/schema.json"} ] }

But the document was identical, I would end up resolving the <item.json> reference into <https://example.net:8443/schemas/item.json>.

Sometimes this is a feature -- so I can change port numbers or change between "http" and "https" on my website. But not always, so be careful.
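The two outcomes can be reproduced with a couple of `urljoin` calls from Python's standard library, used here as a stand-in for RFC 3986 resolution (the helper name `resolve_item_ref` is just for this illustration):

```python
from urllib.parse import urljoin


def resolve_item_ref(retrieval_uri):
    """Resolve the "item.json" reference for a given retrieval URI."""
    # The document's "id" is resolved against the URI it was retrieved at.
    base = urljoin(retrieval_uri, "/schemas/foo.json")
    # The "$ref" inside "items" is then resolved against that base.
    return urljoin(base, "item.json")


http_result = resolve_item_ref("http://example.com/schema.json")
https_result = resolve_item_ref("https://example.net:8443/schema.json")
```

The same stored document gives an http://example.com/schemas/item.json target in the first case and an https://example.net:8443/schemas/item.json target in the second.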

@epoberezkin
Member

epoberezkin commented Oct 18, 2016

@awwright I understand all that you are writing above and that is correct. Nevertheless, your definition above doesn't work in the example I posted in my comment (and your example in the previous comment is different).

For the avoidance of any doubt, I am referring here to this definition suggested by you:

  1. There's an application-specific default base URI (for example, a UUID).
  2. Then, if the document was retrieved at some URI, it is set to that (this includes reference by $ref, or telling a validator "validate this instance against the http://example.com/foo schema").
  3. Then the URI sent back by the server, if any, e.g. Content-Location
  4. Finally, apply any base URI embedded in the document (i.e. "id")

At any step, URI References are resolved against the URI base thus far to get the new URI base.

And I infer from this definition that by "document" in "Finally" step you mean the loaded document.

In your previous example the schema you load has its own id keyword, which is resolved in the context of the URI that was used to load the schema, so your definition works.

In my example the id attribute does not exist in the loaded schema, it exists in the parent object, and your definition does not take it into account (or at least I think that it doesn't).

@handrews's algorithm seems to be taking it into account if I understand it correctly. It's not very precisely formulated, but it says:

Apply the fragment to the reference schema document, applying any $schema and id keywords that you find along the way. If there's no id in the way, then by the end of applying the fragment, you're back to the full referencing URI that was in the $ref

So I infer from it that it will be taking all IDs it meets on all levels while traversing the document into account.

Does it make sense?

@epoberezkin
Member

epoberezkin commented Oct 18, 2016

@awwright maybe in your definition you mean the same, but in this case it is not sufficiently clear - I am reading it differently (and I hope that my previous comment makes it easy to understand what difference I am talking about).

If it means the same, it is better to say it directly rather than imply based on some other definitions.

@awwright
Member Author

@epoberezkin My explanation is only informative, RFC3986 is normative here... URI references are resolved, eventually, against the URI of the document used to dereference them.

The <#/baz> URI reference is going to be resolved into <http://localhost:1234/schema2.json#/baz> because that's the URI base it is found inside.

@epoberezkin
Member

RFC3986 is about URIs. It doesn't take ID attributes into account at all. URI base is not defined by this RFC in this case I think. So as long as we document how "URI base is found inside" it's all ok. Anyway, it seems we all agree on what we mean, at the very least, even if we explain it differently.

@awwright
Member Author

@epoberezkin RFC3986 does, indeed, specify all the vocabulary I just defined, and exactly how to resolve URI references. JSON Schema defines "id" as (among other things) a "Base URI embedded in content", as seen in https://tools.ietf.org/html/rfc3986#section-5.1

(Specifically: The "id" keyword defines... the base URI that other URI references within the schema are resolved against.)

@awwright
Member Author

And I should probably mention at this point, I understand what you're saying about my explanation, yeah, I was trying far too hard to word that well.

I don't have any outstanding problems with how $ref works. If anyone else has any outstanding problems that might have been covered by this, feel free to open a new issue, or find an existing one that maybe works better.

@handrews
Contributor

handrews commented Oct 19, 2016

@awwright : Just to note for the record, you closing this definitively indicates that we are going with forbidding "$ref" anywhere but where a schema is allowed? In which case the meta-schema should be updated to be oneOf (the current definition or a single property of "$ref")

(I am fine with this btw, just put my proposal for escaping "$ref" in to have it written down)

@awwright
Member Author

The latest I-D has that language - $ref is only allowed wherever you can find a schema.

@ruifortes

ruifortes commented Oct 25, 2016

Hello.
Would you care to try a json dereferencer I'm putting together and present some test cases?
See here
I still haven't done much on the path resolving part though.
Thanks
