Skip to content
This repository has been archived by the owner on Nov 18, 2024. It is now read-only.

add notes on schema generation strategies #2

Conversation

gregsdennis
Copy link

@gregsdennis gregsdennis commented Jun 5, 2024

I've added a bunch of notes regarding the decisions made for this POC. Mostly they boil down to:

  1. format isn't a validating keyword by default. For the purposes of the tests, they can be treated as such in my lib by setting RequireFormatValidation = true. Other validators might have similar settings, but they'd be opt-in.
  2. It's considered bad practice to $ref directly into the schema structure. It's recommended to extract common subschemas to a $defs keyword.
  3. The true schema is preferred over the empty schema {} (especially as a subschema for additionalProperties, which historically has accepted booleans even before booleans were valid schemas).

There are a few other more targeted notes and corrections as well.

I don't intend for this PR to merge. It's just a vehicle for notes.

@eiriktsarpalis
Copy link
Owner

Thanks for the feedback. A few remarks on my side:

format isn't a validating keyword by default. For the purposes of the tests, they can be treated as such in my lib by setting RequireFormatValidation = true. Other validators might have similar settings, but they'd be opt-in.

If I'm understanding the issue correctly, it's not about the generated schema not being correct, but rather that the validation tests haven't been configured correctly?

It's considered bad practice to $ref directly into the schema structure. It's recommended to extract common subschemas to a $defs keyword.

That wouldn't address recursive types though, right?

The true schema is preferred over the empty schema {} (especially as a subschema for additionalProperties, which historically has accepted booleans even before booleans were valid schemas).

I did this intentionally because I wanted to ensure that the schema corresponding to each generated type is JsonObject to allow user modifications and additions via the OnSchemaGenerated callback. So for particular types {} is used instead of true and { "not" : "true" } is used instead of false.

@@ -61,21 +62,29 @@ public static IEnumerable<ITestData> GetTestDataCore()
yield return new TestData<DateTimeOffset>(
Value: new(new DateTime(2021, 1, 1), TimeSpan.Zero),
AdditionalValues: [DateTimeOffset.MinValue, DateTimeOffset.MaxValue],
ExpectedJsonSchema: """{"type":"string","format": "date-time"}""");
// `date-time` format is RFC3339; not sure about DTO, but DT serializes as IS08601.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. For similar reasons we replaced "format" : "time" with a pattern keyword for the case of TimeSpan. We might want to do the same here.

cc @eerhardt

@@ -84,14 +93,18 @@ public static IEnumerable<ITestData> GetTestDataCore()

// Enum types
yield return new TestData<IntEnum>(IntEnum.A, ExpectedJsonSchema: """{"type":"integer"}""");
yield return new TestData<StringEnum>(StringEnum.A, ExpectedJsonSchema: """{"type": "string", "enum": ["A","B","C"]}""");
yield return new TestData<FlagsStringEnum>(FlagsStringEnum.A, ExpectedJsonSchema: """{"type":"string"}""");
// `type` is redundant here since `enum` is exclusive
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We added the type keyword to enums following feedback from the Semantic Kernel team that this is required by the Gemini APIs (google's LLM).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I would say that the Gemini LLM is wrong. It's not hurting anything; it's redundant and a lot of people come to us asking about it.

There is an argument that type in this case can serve as an annotation for code gen scenarios for languages which don't support enums directly.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's wrong, however supporting function calling in all LLMs that we support is what funds this project. As such, I can't see how we could remove it at this point. I think we could remove it in a future revision though, we should be free to evolve the generated schemas without it being considered a breaking change.

@@ -303,7 +319,10 @@ public static IEnumerable<ITestData> GetTestDataCore()
}
""");

// would be neat to have the converter expose or be attributed with a schema somehow
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's something we should be exposing in converters once we have a proper exchange type.

@gregsdennis
Copy link
Author

format isn't a validating keyword by default. For the purposes of the tests, they can be treated as such in my lib by setting RequireFormatValidation = true. Other validators might have similar settings, but they'd be opt-in.

If I'm understanding the issue correctly, it's not about the generated schema not being correct, but rather that the validation tests haven't been configured correctly?

Not just the validation tests. Any validator would be expected to ignore assertions for format. This is of primary concern when a generated schema is shared with another team which may use a different validator implementation.

It's considered bad practice to $ref directly into the schema structure. It's recommended to extract common subschemas to a $defs keyword.

That wouldn't address recursive types though, right?

Recursive types work just fine. Ideally, you can reference the root schema $ref: # or a subschema of $defs.

Linked list

{
  "type": "object",
  "properties": {
    "value": { "type": "integer" },
    "next": { "$ref": "#" }
  }
}

Object with linked list property

{
  "type": "object",
  "properties": {
    "list": { "$ref": "#/$defs/linked-list" }
  },
  "$defs": {
    "linked-list": {
      "type": "object",
      "properties": {
        "value": { "type": "integer" },
        "next": { "$ref": "#/$defs/linked-list" }
      }
    }
  }
}

You could also give the linked list an $id which would define that subschema as a resource. Then the $ref pointer could stay relative to that resource.

Object with identified linked list property

{
  "$id": "https://dotnet.org/schemas/linked-list-container",
  "type": "object",
  "properties": {
    "list": { "$ref": "linked-list" }  // resolves to https://dotnet.org/schemas/linked-list
  },
  "$defs": {
    "linked-list": {
      "$id": "linked-list",  // https://dotnet.org/schemas/linked-list
      "type": "object",
      "properties": {
        "value": { "type": "integer" },
        "next": { "$ref": "#" }  // resolves back to https://dotnet.org/schemas/linked-list
      }
    }
  }
}

Comment on lines +98 to +99
// why not use `enum`?
yield return new TestData<FlagsStringEnum>(FlagsStringEnum.A, ExpectedJsonSchema: """{"type":"string"}""");
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple flags-enum-to-string approaches:

  • array of flags
  • comma- or pipe-delimited

(My comment actually probably isn't a good idea since you can have 2^n options.)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The converter uses comma separated. The number of valid inputs is actually infinite because for example "A, A, A, A, A, A, A, A, A" is accepted.

@eiriktsarpalis
Copy link
Owner

You could also give the linked list an $id which would define that subschema as a resource.

One concern I had with using $id or $anchor is that from the perspective a general-purpose type-to-schema exporter there's no good way to generate non-conflicting identifiers without effectively embedding the fully qualified name of the type. I discussed that possibility with @captainsafia and we came to the conclusion that such an approach would be leaking implementation detail from the .NET type system into the generated schema.

@gregsdennis
Copy link
Author

You could take the fully-qualified type name, hash it to a 128-bit int, and render it as a GUID URI. That would abstract it sufficiently from .Net while still providing a consistent map. Collisions could still occur, but would be extremely rare.

@eiriktsarpalis
Copy link
Owner

You could take the fully-qualified type name, hash it to a 128-bit int, and render it as a GUID URI.

Wouldn't that make it look like a solution file? Wouldn't get many points for human readability.

@captainsafia
Copy link

At least for the purposes of OpenAPI, there's a requirement for reference IDs to be human readable and semantically meaningful. The reference ID also has to be unique to the schema, and not the .NET type. For example, sequence-types that effectively map to the same schema (e.g. all types that would be represented by { type: "array", items: { type: "string" } } would be identified by the same reference ID in the top-level schema store (List<string>, IList<string>)

@gregsdennis
Copy link
Author

there's a requirement for reference IDs to be human readable and semantically meaningful

OpenAPI v3.0- actually doesn't allow identifiers in schemas.

I've searched through 3.1, and I see no such requirement.

@gregsdennis
Copy link
Author

The reference ID also has to be unique to the schema, and not the .NET type.

Then you'd have to identify unique subschemas and extract them. That's actually what I do in JsonSchema.Net.Generation. I find this approach works better because a type on its own will have a schema, but a property of that type could have additional requirements imposed by property-level attributes which would yield a different schema.

Scanning the generated schema structure (or tracking it during generation) for common subschemas (not common JSON data; they must be subschemas) will yield a good result.

@captainsafia
Copy link

OpenAPI v3.0- actually doesn't allow identifiers in schemas.

The "reference IDs" I'm referring to here are specifically in reference to the identifier used when constructing the url in the $ref property in a schema (e.g. ref: "#/components/schemas/ThisIsTheIdToConstruct).

@gregsdennis
Copy link
Author

gregsdennis commented Jun 5, 2024

I recommend you have two processes:

  1. generate the schema, with all references intact
  2. embed the schema in some other kind of document, which performs pointer transformations as necessary

For this effort, don't worry about (2).

FWIW, OpenAPI is the reason my SchemaRegistry accepts and returns IBaseDocument instead of just JsonSchema. It recognizes that the root document may not itself be a schema. Everything needs an ID, though. If I get a schema that doesn't have one, I generate one.

The thing is, you're not going to get around needing to use $ref. It's the only way to solve recursion. $anchor isn't a bad option, actually.

{
  "type": "object",
  "properties": {
    "list": { "$ref": "#linked-list" }
  },
  "$defs": {
    "linked-list": {
      "$anchor": "linked-list",
      "type": "object",
      "properties": {
        "value": { "type": "integer" },
        "next": { "$ref": "#linked-list" }
      }
    }
  }
}

You just need to be sure your anchor is unique. Again, this is the problem you're facing, which is why I recommend less-readable GUIDs. (Editors generally have "find" functions.)

@eiriktsarpalis
Copy link
Owner

This repo is now being archived.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants