New validator instances are built for each instancefile, discarding any internal caching #463

sirosen · 2024-07-16T00:44:02Z

First reported via #451

For each file, we get a validator from the "schema loader" objects. This interface is correct for the top-level checker. It allows, for example, metaschema checking to proceed with different schemas for different files.

However, the same per-file behavior is also applied in SchemaLoader, the class which handles remote and local schemas.
As a result, a new validator is built for each instance file, and any caching done internally by the validator is discarded.

To resolve, one of two approaches should be used:

create exactly one validator per remote schema (or per (schema, settings) pair, although there may not be any settings at present)
create a new validator per call, but reuse the components from any matching schemas
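
As a rough illustration of the first approach, here is a minimal sketch of a loader which builds at most one validator per distinct settings value. Every name in it (`Settings`, `FakeValidator`, `CachingLoader`) is hypothetical and stands in for the real check-jsonschema and jsonschema objects; it only assumes that the settings can be made hashable.

```python
from __future__ import annotations

import dataclasses
import typing as t


@dataclasses.dataclass(frozen=True)
class Settings:
    # Hypothetical, hashable stand-in for options such as format checking
    # and default filling.
    check_formats: bool = True
    fill_defaults: bool = False


class FakeValidator:
    # Stand-in for a real validator; counts how many have been constructed.
    instances_built = 0

    def __init__(self, schema: dict[str, t.Any], settings: Settings) -> None:
        type(self).instances_built += 1
        self.schema = schema
        self.settings = settings


class CachingLoader:
    # Hypothetical loader which owns one schema and memoizes its validators.
    def __init__(self, schema: dict[str, t.Any]) -> None:
        self._schema = schema
        self._validators: dict[Settings, FakeValidator] = {}

    def get_validator(self, settings: Settings) -> FakeValidator:
        # Build at most one validator per distinct settings value.
        if settings not in self._validators:
            self._validators[settings] = FakeValidator(self._schema, settings)
        return self._validators[settings]


loader = CachingLoader({"type": "object"})
first = loader.get_validator(Settings())
second = loader.get_validator(Settings())  # same settings: reused
third = loader.get_validator(Settings(fill_defaults=True))  # new settings: built
```

With this shape, any state a validator accumulates (for example, resolved references) survives across instance files, because the same object is handed back each time.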


sirosen · 2024-07-16T01:05:23Z

There's a secondary bug here, which is that the retrieve callable's in-memory caching is not working correctly because it's mixing absolute and relative URIs.
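
To make the URI mismatch concrete, here is a small sketch of a retrieve callable which resolves every reference to an absolute URI and uses that one form for both the lookup and the store. This is not the project's actual code: `create_retrieve` and `fake_fetch` are hypothetical names, and resolving with `urllib.parse.urljoin` is an assumption about how the full URI might be computed.

```python
from __future__ import annotations

import typing as t
import urllib.parse


def create_retrieve(
    base_uri: str, fetch: t.Callable[[str], t.Any]
) -> t.Callable[[str], t.Any]:
    # In-memory cache, keyed only by absolute URIs.
    cache: dict[str, t.Any] = {}

    def retrieve(uri: str) -> t.Any:
        # Resolve relative references against the base; absolute URIs
        # pass through urljoin unchanged.
        full_uri = urllib.parse.urljoin(base_uri, uri)
        # Look up and store under the same key. Checking one form of the
        # URI while storing under another means lookups never hit.
        if full_uri in cache:
            return cache[full_uri]
        cache[full_uri] = fetch(full_uri)
        return cache[full_uri]

    return retrieve


fetched: list[str] = []


def fake_fetch(full_uri: str) -> dict[str, str]:
    # Hypothetical fetcher which records each download it performs.
    fetched.append(full_uri)
    return {"$id": full_uri}


retrieve = create_retrieve("https://example.com/schemas/main.json", fake_fetch)
a = retrieve("defs.json")  # relative form
b = retrieve("https://example.com/schemas/defs.json")  # absolute form
```

Both calls refer to the same resource, so only the first one triggers a fetch and `a is b` holds.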

I have a potential fix for this pair of issues which I need to think about for a little before I proceed:

$ git diff HEAD
diff --git a/src/check_jsonschema/schema_loader/main.py b/src/check_jsonschema/schema_loader/main.py
index 4ce95c9..88ff6a0 100644
--- a/src/check_jsonschema/schema_loader/main.py
+++ b/src/check_jsonschema/schema_loader/main.py
@@ -1,5 +1,6 @@
 from __future__ import annotations
 
+import functools
 import pathlib
 import typing as t
 import urllib.error
@@ -130,11 +131,21 @@ class SchemaLoader(SchemaLoaderBase):
         instance_doc: dict[str, t.Any],
         format_opts: FormatOptions,
         fill_defaults: bool,
+    ) -> jsonschema.protocols.Validator:
+        return self._get_validator(format_opts, fill_defaults)
+
+    @functools.lru_cache
+    def _get_validator(
+        self,
+        format_opts: FormatOptions,
+        fill_defaults: bool,
     ) -> jsonschema.protocols.Validator:
         retrieval_uri = self.get_schema_retrieval_uri()
         schema = self.get_schema()
 
         schema_dialect = schema.get("$schema")
+        if schema_dialect is not None and not isinstance(schema_dialect, str):
+            schema_dialect = None
 
         # format checker (which may be None)
         format_checker = make_format_checker(format_opts, schema_dialect)
diff --git a/src/check_jsonschema/schema_loader/resolver.py b/src/check_jsonschema/schema_loader/resolver.py
index c63b7bb..5084328 100644
--- a/src/check_jsonschema/schema_loader/resolver.py
+++ b/src/check_jsonschema/schema_loader/resolver.py
@@ -79,8 +79,8 @@ def create_retrieve_callable(
         else:
             full_uri = uri
 
-        if full_uri in cache._cache:
-            return cache[uri]
+        if full_uri in cache:
+            return cache[full_uri]
 
         full_uri_scheme = urllib.parse.urlsplit(full_uri).scheme
         if full_uri_scheme in ("http", "https"):
@@ -100,8 +100,8 @@ def create_retrieve_callable(
         else:
             parsed_object = get_local_file(full_uri)
 
-        cache[uri] = parsed_object
-        return cache[uri]
+        cache[full_uri] = parsed_object
+        return cache[full_uri]
 
     return retrieve_reference

This makes validator creation cached internally for the main SchemaLoader, and fixes the remote resource caching bug. Between these two, we get much improved performance in the "many files with many refs" case.
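
For readers unfamiliar with the pattern in the diff, the sketch below shows how `functools.lru_cache` behaves on a method, using a hypothetical `Loader` class in place of SchemaLoader. Two caveats apply, which is part of why the change needs some thought: every argument (including `self`) must be hashable, so the unhashable `instance_doc` dict is dropped before the cached call, and the cache holds a reference to each instance it has seen.

```python
import functools


class Loader:
    # Hypothetical stand-in for a schema loader.
    def __init__(self) -> None:
        self.builds = 0

    def get_validator(self, instance_doc: dict, fill_defaults: bool) -> object:
        # instance_doc is a dict and therefore unhashable, so it is not
        # forwarded to the cached method.
        return self._get_validator(fill_defaults)

    @functools.lru_cache
    def _get_validator(self, fill_defaults: bool) -> object:
        # The cache key is (self, fill_defaults), so each loader instance
        # gets its own entries.
        self.builds += 1
        return object()  # placeholder for an expensive validator build


loader = Loader()
v1 = loader.get_validator({"a": 1}, False)
v2 = loader.get_validator({"b": 2}, False)  # different file, same settings
v3 = loader.get_validator({"a": 1}, True)  # different settings

other = Loader()
v4 = other.get_validator({"a": 1}, False)  # different loader, separate entry
```

Here `v1 is v2`, while `v3` and `v4` are separate objects, and `loader.builds` ends at 2.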

That said, I need to at the very least devise some tests for these to ensure there are no regressions in the future.

alex1701c · 2024-07-17T16:42:18Z

It would be amazing to get this fixed - then I can integrate it efficiently in the CI setup I am planning to :)

sirosen · 2024-07-27T22:16:23Z

I wasn't able to work on this last week, but I've just put the above into a PR with a test to run against. I should be able to get this all merged and released soon, assuming that there are no surprises when CI runs.

sirosen · 2024-07-27T22:36:09Z

v0.29.1 has the above fix and is freshly available.

I tested it against the usage described in #451 and saw execution times of around 0.8s on average. Please let me know (comment or fresh issue) if the new version is still performing in an unexpectedly bad way!

alex1701c · 2024-07-28T05:41:27Z

I can confirm the execution time is very fast now - thanks!

sirosen added the bug label Jul 16, 2024

sirosen mentioned this issue Jul 27, 2024

Fix caching issues which render remote ref caching ineffective #466

Merged

sirosen closed this as completed in #466 Jul 27, 2024
