add notes on schema generation strategies #2

gregsdennis · 2024-06-05T20:00:56Z

I've added a bunch of notes regarding the decisions made for this POC. Mostly they boil down to:

format isn't a validating keyword by default. For the purposes of the tests, they can be treated as such in my lib by setting RequireFormatValidation = true. Other validators might have similar settings, but they'd be opt-in.
It's considered bad practice to $ref directly into the schema structure. It's recommended to extract common subschemas to a $defs keyword.
The true schema is preferred over the empty schema {} (especially as a subschema for additionalProperties, which historically has accepted booleans even before booleans were valid schemas).

There are a few other more targeted notes and corrections as well.

I don't intend for this PR to merge. It's just a vehicle for notes.

eiriktsarpalis · 2024-06-05T20:35:52Z

Thanks for the feedback. A few remarks on my side:

format isn't a validating keyword by default. For the purposes of the tests, they can be treated as such in my lib by setting RequireFormatValidation = true. Other validators might have similar settings, but they'd be opt-in.

If I'm understanding the issue correctly, it's not about the generated schema not being correct, but rather that the validation tests haven't been configured correctly?

It's considered bad practice to $ref directly into the schema structure. It's recommended to extract common subschemas to a $defs keyword.

That wouldn't address recursive types though, right?

The true schema is preferred over the empty schema {} (especially as a subschema for additionalProperties, which historically has accepted booleans even before booleans were valid schemas).

I did this intentionally because I wanted to ensure that the schema corresponding to each generated type is JsonObject to allow user modifications and additions via the OnSchemaGenerated callback. So for particular types {} is used instead of true and { "not" : "true" } is used instead of false.

eiriktsarpalis · 2024-06-05T20:38:22Z

tests/JsonSchemaMapper.Tests/TestTypes.cs

@@ -61,21 +62,29 @@ public static IEnumerable<ITestData> GetTestDataCore()
        yield return new TestData<DateTimeOffset>(
            Value: new(new DateTime(2021, 1, 1), TimeSpan.Zero), 
            AdditionalValues: [DateTimeOffset.MinValue, DateTimeOffset.MaxValue],
-            ExpectedJsonSchema: """{"type":"string","format": "date-time"}""");
+			// `date-time` format is RFC3339; not sure about DTO, but DT serializes as IS08601.


Thanks. For similar reasons we replaced "format" : "time" with a pattern keyword for the case of TimeSpan. We might want to do the same here.

cc @eerhardt

eiriktsarpalis · 2024-06-05T20:40:22Z

tests/JsonSchemaMapper.Tests/TestTypes.cs

@@ -84,14 +93,18 @@ public static IEnumerable<ITestData> GetTestDataCore()

        // Enum types
        yield return new TestData<IntEnum>(IntEnum.A, ExpectedJsonSchema: """{"type":"integer"}""");
-        yield return new TestData<StringEnum>(StringEnum.A, ExpectedJsonSchema: """{"type": "string", "enum": ["A","B","C"]}""");
-        yield return new TestData<FlagsStringEnum>(FlagsStringEnum.A, ExpectedJsonSchema: """{"type":"string"}""");
+        // `type` is redundant here since `enum` is exclusive


We added the type keyword to enums following feedback from the Semantic Kernel team that this is required by the Gemini APIs (google's LLM).

Then I would say that the Gemini LLM is wrong. It's not hurting anything; it's redundant and a lot of people come to us asking about it.

There is an argument that type in this case can serve as an annotation for code gen scenarios for languages which don't support enums directly.

I agree that it's wrong, however supporting function calling in all LLMs that we support is what funds this project. As such, I can't see how we could remove it at this point. I think we could remove it in a future revision though, we should be free to evolve the generated schemas without it being considered a breaking change.

eiriktsarpalis · 2024-06-05T20:42:09Z

tests/JsonSchemaMapper.Tests/TestTypes.cs

@@ -303,7 +319,10 @@ public static IEnumerable<ITestData> GetTestDataCore()
            }
            """);

+		// would be neat to have the converter expose or be attributed with a schema somehow


Yeah, it's something we should be exposing in converters once we have a proper exchange type.

gregsdennis · 2024-06-05T20:53:48Z

format isn't a validating keyword by default. For the purposes of the tests, they can be treated as such in my lib by setting RequireFormatValidation = true. Other validators might have similar settings, but they'd be opt-in.

If I'm understanding the issue correctly, it's not about the generated schema not being correct, but rather that the validation tests haven't been configured correctly?

Not just the validation tests. Any validator would be expected to ignore assertions for format. This is of primary concern when a generated schema is shared with another team which may use a different validator implementation.

It's considered bad practice to $ref directly into the schema structure. It's recommended to extract common subschemas to a $defs keyword.

That wouldn't address recursive types though, right?

Recursive types work just fine. Ideally, you can reference the root schema $ref: # or a subschema of $defs.

Linked list

{
  "type": "object",
  "properties": {
    "value": { "type": "integer" },
    "next": { "$ref": "#" }
  }
}

Object with linked list property

{
  "type": "object",
  "properties": {
    "list": { "$ref": "#/$defs/linked-list" }
  },
  "$defs": {
    "linked-list": {
      "type": "object",
      "properties": {
        "value": { "type": "integer" },
        "next": { "$ref": "#/$defs/linked-list" }
      }
    }
  }
}

You could also give the linked list an $id which would define that subschema as a resource. Then the $ref pointer could stay relative to that resource.

Object with identified linked list property

{
  "$id": "https://dotnet.org/schemas/linked-list-container",
  "type": "object",
  "properties": {
    "list": { "$ref": "linked-list" }  // resolves to https://dotnet.org/schemas/linked-list
  },
  "$defs": {
    "linked-list": {
      "$id": "linked-list",  // https://dotnet.org/schemas/linked-list
      "type": "object",
      "properties": {
        "value": { "type": "integer" },
        "next": { "$ref": "#" }  // resolves back to https://dotnet.org/schemas/linked-list
      }
    }
  }
}

gregsdennis · 2024-06-05T20:58:35Z

tests/JsonSchemaMapper.Tests/TestTypes.cs

+		// why not use `enum`?
+		yield return new TestData<FlagsStringEnum>(FlagsStringEnum.A, ExpectedJsonSchema: """{"type":"string"}""");


There are multiple flags-enum-to-string approaches:

array of flags

comma- or pipe-delimited

(My comment actually probably isn't a good idea since you can have 2^n options.)

The converter uses comma separated. The number of valid inputs is actually infinite because for example "A, A, A, A, A, A, A, A, A" is accepted.

eiriktsarpalis · 2024-06-05T21:00:57Z

You could also give the linked list an $id which would define that subschema as a resource.

One concern I had with using $id or $anchor is that from the perspective a general-purpose type-to-schema exporter there's no good way to generate non-conflicting identifiers without effectively embedding the fully qualified name of the type. I discussed that possibility with @captainsafia and we came to the conclusion that such an approach would be leaking implementation detail from the .NET type system into the generated schema.

gregsdennis · 2024-06-05T21:12:04Z

You could take the fully-qualified type name, hash it to a 128-bit int, and render it as a GUID URI. That would abstract it sufficiently from .Net while still providing a consistent map. Collisions could still occur, but would be extremely rare.

eiriktsarpalis · 2024-06-05T21:16:26Z

You could take the fully-qualified type name, hash it to a 128-bit int, and render it as a GUID URI.

Wouldn't that make it look like a solution file? Wouldn't get many points for human readability.

captainsafia · 2024-06-05T21:16:36Z

At least for the purposes of OpenAPI, there's a requirement for reference IDs to be human readable and semantically meaningful. The reference ID also has to be unique to the schema, and not the .NET type. For example, sequence-types that effectively map to the same schema (e.g. all types that would be represented by { type: "array", items: { type: "string" } } would be identified by the same reference ID in the top-level schema store (List<string>, IList<string>)

gregsdennis · 2024-06-05T21:19:55Z

there's a requirement for reference IDs to be human readable and semantically meaningful

OpenAPI v3.0- actually doesn't allow identifiers in schemas.

I've searched through 3.1, and I see no such requirement.

gregsdennis · 2024-06-05T21:28:20Z

The reference ID also has to be unique to the schema, and not the .NET type.

Then you'd have to identify unique subschemas and extract them. That's actually what I do in JsonSchema.Net.Generation. I find this approach works better because a type on its own will have a schema, but a property of that type could have additional requirements imposed by property-level attributes which would yield a different schema.

Scanning the generated schema structure (or tracking it during generation) for common subschemas (not common JSON data; they must be subschemas) will yield a good result.

captainsafia · 2024-06-05T21:31:08Z

OpenAPI v3.0- actually doesn't allow identifiers in schemas.

The "reference IDs" I'm referring to here are specifically in reference to the identifier used when constructing the url in the $ref property in a schema (e.g. ref: "#/components/schemas/ThisIsTheIdToConstruct).

gregsdennis · 2024-06-05T21:50:16Z

I recommend you have two processes:

generate the schema, with all references intact
embed the schema in some other kind of document, which performs pointer transformations as necessary

For this effort, don't worry about (2).

FWIW, OpenAPI is the reason my SchemaRegistry accepts and returns IBaseDocument instead of just JsonSchema. It recognizes that the root document may not itself be a schema. Everything needs an ID, though. If I get a schema that doesn't have one, I generate one.

The thing is, you're not going to get around needing to use $ref. It's the only way to solve recursion. $anchor isn't a bad option, actually.

{
  "type": "object",
  "properties": {
    "list": { "$ref": "#linked-list" }
  },
  "$defs": {
    "linked-list": {
      "$anchor": "linked-list",
      "type": "object",
      "properties": {
        "value": { "type": "integer" },
        "next": { "$ref": "#linked-list" }
      }
    }
  }
}

You just need to be sure your anchor is unique. Again, this is the problem you're facing, which is why I recommend less-readable GUIDs. (Editors generally have "find" functions.)

eiriktsarpalis · 2024-11-18T14:41:42Z

This repo is now being archived.

add notes on schema generation strategies

5089921

gregsdennis mentioned this pull request Jun 5, 2024

[API Proposal]: Add a JSON schema exporting component for STJ contracts dotnet/runtime#102788

Closed

gregsdennis marked this pull request as draft June 5, 2024 20:01

eiriktsarpalis reviewed Jun 5, 2024

View reviewed changes

gregsdennis commented Jun 5, 2024

View reviewed changes

eiriktsarpalis closed this Nov 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add notes on schema generation strategies #2

add notes on schema generation strategies #2

gregsdennis commented Jun 5, 2024 •

edited

Loading

eiriktsarpalis commented Jun 5, 2024

eiriktsarpalis Jun 5, 2024

eiriktsarpalis Jun 5, 2024

gregsdennis Jun 5, 2024

eiriktsarpalis Jun 5, 2024

eiriktsarpalis Jun 5, 2024

gregsdennis commented Jun 5, 2024

gregsdennis Jun 5, 2024

eiriktsarpalis Jun 5, 2024

eiriktsarpalis commented Jun 5, 2024

gregsdennis commented Jun 5, 2024

eiriktsarpalis commented Jun 5, 2024

captainsafia commented Jun 5, 2024

gregsdennis commented Jun 5, 2024

gregsdennis commented Jun 5, 2024

captainsafia commented Jun 5, 2024

gregsdennis commented Jun 5, 2024 •

edited

Loading

eiriktsarpalis commented Nov 18, 2024

		// why not use `enum`?
		yield return new TestData<FlagsStringEnum>(FlagsStringEnum.A, ExpectedJsonSchema: """{"type":"string"}""");

add notes on schema generation strategies #2

add notes on schema generation strategies #2

Conversation

gregsdennis commented Jun 5, 2024 • edited Loading

eiriktsarpalis commented Jun 5, 2024

eiriktsarpalis Jun 5, 2024

Choose a reason for hiding this comment

eiriktsarpalis Jun 5, 2024

Choose a reason for hiding this comment

gregsdennis Jun 5, 2024

Choose a reason for hiding this comment

eiriktsarpalis Jun 5, 2024

Choose a reason for hiding this comment

eiriktsarpalis Jun 5, 2024

Choose a reason for hiding this comment

gregsdennis commented Jun 5, 2024

gregsdennis Jun 5, 2024

Choose a reason for hiding this comment

eiriktsarpalis Jun 5, 2024

Choose a reason for hiding this comment

eiriktsarpalis commented Jun 5, 2024

gregsdennis commented Jun 5, 2024

eiriktsarpalis commented Jun 5, 2024

captainsafia commented Jun 5, 2024

gregsdennis commented Jun 5, 2024

gregsdennis commented Jun 5, 2024

captainsafia commented Jun 5, 2024

gregsdennis commented Jun 5, 2024 • edited Loading

eiriktsarpalis commented Nov 18, 2024

gregsdennis commented Jun 5, 2024 •

edited

Loading

gregsdennis commented Jun 5, 2024 •

edited

Loading