Investigate performance issues with schema parsing #2862
Comments
Per our conversations over the past week, I think a good first step would be trying either:

In either case, there's some immediate benefit in doing (the least possible) legwork to generalize …
Analysis of parsing performance determined that creation of jsonschemas was a very small percentage of the parsing time, and the schemas are already stored in classes. The hologram `get_fields` code introspected the dataclass fields on every call to `from_dict`, and that consumed a lot of time. Instituting class-level caching of `mapped_fields` decreased schema parsing time by 44 seconds on our large test project. Removing the `deepcopy` in `fqn_search` decreased time by about 20 seconds. Adding code to support use of libyaml in pyyaml (and using an appropriate Python and pyyaml package) decreased time by about another 20 seconds. Removing the validation performed in `from_dict` reduced time (for the test project, which has no validation errors) by 20 seconds in schema parsing and 27 seconds in parsing overall, so if we find a more efficient validator there are opportunities for improvement there.

The other main opportunity for performance improvement is the `render_update` done for tests in schema files. Of the remaining 140 seconds of schema parsing in our test project, 76 seconds are spent updating test nodes via the macro rendering process. Much of this time goes to creating macro objects that act like jinja functions for inclusion in the "context" provided to jinja. Our large test project has about 42k tests and 1,450 macros, and the information in each macro is copied into a `MacroGenerator` object 42k times, once per test. We are considering a number of ways to speed this up, such as limiting the macro-generator functions produced on a per-test basis, marking some macros as 'static', etc.
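The `mapped_fields` change amounts to memoizing dataclass introspection at the class level. A minimal sketch of the idea, with hypothetical names (`cached_mapped_fields`, `_FIELD_CACHE`) rather than hologram's actual internals:

```python
import dataclasses
from typing import Dict, List, Tuple, Type

# Cache keyed by class, so introspection runs once per class instead of
# once per from_dict call.
_FIELD_CACHE: Dict[Type, List[Tuple[str, dataclasses.Field]]] = {}

def cached_mapped_fields(cls: Type) -> List[Tuple[str, dataclasses.Field]]:
    if cls not in _FIELD_CACHE:
        # dataclasses.fields() re-walks the class each time it is called;
        # doing that on every deserialization is the cost being avoided.
        _FIELD_CACHE[cls] = [(f.name, f) for f in dataclasses.fields(cls)]
    return _FIELD_CACHE[cls]

@dataclasses.dataclass
class Column:
    name: str
    description: str = ""

cached_mapped_fields(Column)  # introspects the class once
cached_mapped_fields(Column)  # every later call is a dict lookup
```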
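The libyaml speedup comes from pyyaml's C-accelerated loaders, which exist only when pyyaml was built against libyaml (hence the note about using an appropriate Python and pyyaml package). The usual guarded import looks like this:

```python
import yaml

# Prefer the C-accelerated SafeLoader; it is only present if pyyaml was
# compiled against libyaml, so fall back to the pure-Python loader.
try:
    from yaml import CSafeLoader as Loader
except ImportError:
    from yaml import SafeLoader as Loader

def load_yaml(contents: str):
    return yaml.load(contents, Loader=Loader)
```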
I'm going to close this issue now that we've merged #3034. Even though (by its own admission) the PR is only a partial resolution of the issues we began broadly identifying here, we'll be opening more follow-on issues (such as #3048) to track the next steps in our performance work. We know a lot more now, and we can identify targeted opportunities for improvement.
In a large dbt project, the schema parsing took 4 minutes out of 5.5 minutes total. (Schema parsing is parsing the yaml files in the model directories.) Currently we convert yaml to json, and use json schemas generated by hologram (a Python package which generates json schemas from Python classes) to validate the json.
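In rough outline, that flow is: load the yaml into plain Python data, then validate it against a generated schema with the jsonschema package (which hologram builds on). A minimal sketch with an illustrative hand-written schema, not an actual hologram-generated one:

```python
import yaml
import jsonschema

# Illustrative schema; real hologram-generated schemas are far larger,
# but the validation step looks the same.
MODEL_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["name"],
}

def parse_model_block(text: str) -> dict:
    data = yaml.safe_load(text)              # yaml -> plain dicts/lists
    jsonschema.validate(data, MODEL_SCHEMA)  # raises ValidationError on bad input
    return data
```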
We need to do finer-grained performance analysis and investigation to determine which parts are slowest, and come up with a plan to address the issues. Some possibilities include, but are not limited to: storing json schemas in a cache or on disk, using fastjsonschema, or using some other validation method such as Rx or pykwalify. Another possibility is a quick check for actionable parts of the yaml files: the project we are testing includes lots of yaml files with no actual effect on the parsed manifest.
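That last idea might look like a cheap pre-scan that skips full validation for files contributing nothing to the manifest. The key list below is purely illustrative, not dbt's actual rule:

```python
import yaml

# Hypothetical pre-filter: only these top-level keys are assumed to
# contribute to the parsed manifest.
ACTIONABLE_KEYS = {"models", "sources", "seeds", "snapshots", "macros", "exposures"}

def is_actionable(text: str) -> bool:
    """Return True if the yaml file has any top-level key worth parsing."""
    data = yaml.safe_load(text)
    if not isinstance(data, dict):
        return False
    return any(data.get(key) for key in ACTIONABLE_KEYS)
```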