Investigate performance issues with schema parsing #2862
Comments
Per our conversations over the past week, I think a good first step would be trying either:

In either case, there's some immediate benefit in doing (the least possible) legwork to generalize …
Analysis of parsing performance determined that creation of jsonschemas was a very small percentage of the parsing time, and the schemas are already stored in classes. The hologram `get_fields` code introspected the dataclass fields on every call to `from_dict`, and that consumed a lot of time. Instituting class-level caching of `mapped_fields` decreased schema parsing time by 44 seconds on our large test project. Removing the `deepcopy` in `fqn_search` decreased time by about 20 seconds. Adding code to support use of libyaml in pyyaml (and using an appropriate Python and pyyaml package) decreased time by about another 20 seconds. Removing the validation performed in `from_dict` reduced time (for the test project, which has no validation errors) by 20 seconds in schema parsing and 27 seconds in parsing overall, so if we find a more efficient validator there are opportunities for improvement there.

The other main opportunity for performance improvement is the `render_update` done for tests in schema files. Of the remaining 140 seconds of schema parsing in our test project, 76 seconds are spent updating test nodes via the macro rendering process. Much of this time goes to creating macro objects that act like jinja functions for inclusion in the "context" provided to jinja. Our large test project has about 42k tests and 1,450 macros, and the information in each macro is copied into a `MacroGenerator` object 42k times, once per test. We are considering a number of ways to speed this up, such as limiting the macro-generator functions produced on a per-test basis, marking some macros as 'static', etc.
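The `mapped_fields` change amounts to memoizing dataclass introspection at the class level. A minimal sketch of the idea, with hypothetical names (`cached_mapped_fields`, `_FIELD_CACHE`) rather than hologram's actual internals:

```python
import dataclasses
from typing import Dict, List, Tuple, Type

# Cache keyed by class, so introspection runs once per class instead of
# once per from_dict call.
_FIELD_CACHE: Dict[Type, List[Tuple[str, dataclasses.Field]]] = {}

def cached_mapped_fields(cls: Type) -> List[Tuple[str, dataclasses.Field]]:
    if cls not in _FIELD_CACHE:
        # dataclasses.fields() re-walks the class each time it is called;
        # doing that on every deserialization is the cost being avoided.
        _FIELD_CACHE[cls] = [(f.name, f) for f in dataclasses.fields(cls)]
    return _FIELD_CACHE[cls]

@dataclasses.dataclass
class Column:
    name: str
    description: str = ""

cached_mapped_fields(Column)  # introspects the class once
cached_mapped_fields(Column)  # every later call is a dict lookup
```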
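The libyaml speedup comes from pyyaml's C-accelerated loaders, which exist only when pyyaml was built against libyaml (hence the note about using an appropriate Python and pyyaml package). The usual guarded import looks like this:

```python
import yaml

# Prefer the C-accelerated SafeLoader; it is only present if pyyaml was
# compiled against libyaml, so fall back to the pure-Python loader.
try:
    from yaml import CSafeLoader as Loader
except ImportError:
    from yaml import SafeLoader as Loader

def load_yaml(contents: str):
    return yaml.load(contents, Loader=Loader)
```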
I'm going to close this issue now that we've merged #3034. Even though (by its own admission) the PR is only a partial resolution of the issues we began broadly identifying here, we'll be opening more follow-on issues (such as #3048) to track the next steps in our performance work. We know a lot more now, and we can identify targeted opportunities for improvement.
In a large dbt project, the schema parsing took 4 minutes out of 5.5 minutes total. (Schema parsing is parsing the yaml files in the model directories.) Currently we convert yaml to json, and use json schemas generated by hologram (a Python package which generates json schemas from Python classes) to validate the json.
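In rough outline, that flow is: load the yaml into plain Python data, then validate it against a generated schema with the jsonschema package (which hologram builds on). A minimal sketch with an illustrative hand-written schema, not an actual hologram-generated one:

```python
import yaml
import jsonschema

# Illustrative schema; real hologram-generated schemas are far larger,
# but the validation step looks the same.
MODEL_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["name"],
}

def parse_model_block(text: str) -> dict:
    data = yaml.safe_load(text)              # yaml -> plain dicts/lists
    jsonschema.validate(data, MODEL_SCHEMA)  # raises ValidationError on bad input
    return data
```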
We need to do finer-grained performance analysis and investigation to determine which parts are slowest, and come up with a plan to address the issues. Some possibilities include, but are not limited to: storing json schemas in a cache or on disk, using fastjsonschema, or using some other validation method such as Rx or pykwalify. Another possibility is a quick check for actionable parts of the yaml files: the project we are testing includes lots of yaml files with no actual effect on the parsed manifest.
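That last idea might look like a cheap pre-scan that skips full validation for files contributing nothing to the manifest. The key list below is purely illustrative, not dbt's actual rule:

```python
import yaml

# Hypothetical pre-filter: only these top-level keys are assumed to
# contribute to the parsed manifest.
ACTIONABLE_KEYS = {"models", "sources", "seeds", "snapshots", "macros", "exposures"}

def is_actionable(text: str) -> bool:
    """Return True if the yaml file has any top-level key worth parsing."""
    data = yaml.safe_load(text)
    if not isinstance(data, dict):
        return False
    return any(data.get(key) for key in ACTIONABLE_KEYS)
```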