Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json.match_schema performance #7011

Open
lcarva opened this issue Sep 10, 2024 · 1 comment
Open

json.match_schema performance #7011

lcarva opened this issue Sep 10, 2024 · 1 comment
Labels

Comments

@lcarva
Copy link
Contributor

lcarva commented Sep 10, 2024

Short description

The json.match_schema function takes much longer when the JSON schema is significantly large.

I created a simple reproducer here: https://github.com/lcarva/opa-json-schema-perf
(Schema too large for rego playground)

The reproducer validates a small object against the CycloneDX SBOM JSON Schema (about 5k lines long).

$ opa eval --data . 'data.main.results' --profile --format=pretty
[
  [
    true,
    []
  ],
  [
    true,
    []
  ]
]
+------------------------------+-----------+
|            METRIC            |   VALUE   |
+------------------------------+-----------+
| timer_rego_load_files_ns     | 26615260  |
| timer_rego_module_compile_ns | 24751477  |
| timer_rego_module_parse_ns   | 26171100  |
| timer_rego_query_compile_ns  | 38743     |
| timer_rego_query_eval_ns     | 288952134 |
| timer_rego_query_parse_ns    | 37035     |
+------------------------------+-----------+
+--------------+----------+----------+--------------+-------------------+
|     TIME     | NUM EVAL | NUM REDO | NUM GEN EXPR |     LOCATION      |
+--------------+----------+----------+--------------+-------------------+
| 288.827434ms | 4        | 4        | 4            | main.rego:10      |
| 63.714µs     | 4        | 4        | 4            | main.rego:12      |
| 32.994µs     | 4        | 4        | 4            | main.rego:14      |
| 15.366µs     | 1        | 1        | 1            | data.main.results |
| 3.809µs      | 1        | 1        | 1            | main.rego:5       |
| 3.181µs      | 1        | 1        | 1            | schema.rego:3     |
| 2.385µs      | 1        | 1        | 1            | schema.rego:36    |
+--------------+----------+----------+--------------+-------------------+

main.rego:10 is the json.match_schema call where the CycloneDX schema is being used. main.rego:12 uses a much smaller schema. That's 288,827 vs 63 microseconds.

Version: 0.68.0
Build Commit: db53d77c482676fadd53bc67a10cf75b3d0ce00b
Build Timestamp: 2024-08-29T15:23:19Z
Build Hostname: 3aae2b82a15f
Go Version: go1.22.5
Platform: linux/amd64
WebAssembly: available

Steps To Reproduce

See description.

Expected behavior

Validation of object should not take longer than 1ms.

@lcarva lcarva added the bug label Sep 10, 2024
@anderseknert
Copy link
Member

Hi there! And thanks for filing this issue.

Looking into this briefly, and almost all of that time is spent in loading the JSON schema, not actually validating.
This loading isn't cached either, so each call is going to repeat loading the schema. Using a cached schema makes things... faster, to say the least. Notice the use of gojsonschema.NewSchema(sl) below, where the returned schema is reused:

package main

import (
	"fmt"
	"os"
	"time"
)
import "github.com/xeipuuv/gojsonschema"

func main() {
	now := time.Now()

	bs, err := os.ReadFile("schema.json")
	if err != nil {
		panic(err)
	}

	sl := gojsonschema.NewBytesLoader(bs)

	schema, err := gojsonschema.NewSchema(sl)
	if err != nil {
		panic(err)
	}

	dl := gojsonschema.NewStringLoader(`{"name": "John", "age": 30}`)

	result, err := schema.Validate(dl)
	if err != nil {
		panic(err)
	}

	fmt.Println(result.Valid())
	fmt.Println(time.Since(now))

	now = time.Now()

	dl = gojsonschema.NewStringLoader(`{"another": "object", "x": 1}`)

	result, err = schema.Validate(dl)
	if err != nil {
		panic(err)
	}

	fmt.Println(result.Valid())
	fmt.Println(time.Since(now))
}

Output

false
637.524583ms
false
14.709µs

I guess using the inter query cache for this built-in storing loaded schemas across decisions would be the way to go.

It wouldn't make your single opa eval call any faster though, as the first invocation would still need to load the schema.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants