More efficient catalog generation #93
Comments
This would also resolve the fact that […]
Adding to the discussion here, the current instance I'm working with has a single schema for […]. A simple […]. From what I understood, the challenge here would be to change dbt-spark's behavior to query […].
@felippecaso Ah no, just the reverse: we want to run just one […]. I don't know if there's anything we can do in the case where […].
Hey @jtcohen6,

Still, my point relates to the fact that […]. Those 50 would be the number of tables that dbt needs to understand metadata from, but […].

As always, I may also be missing something =)
@felippecaso I don't think you're missing anything, and you've got a really good point. It's slow for dbt to run a separate […].

I can imagine a future state of dbt, where it's able to marry the relation cache built at the start of the run with the catalog built at the end of the run, and actually update the former over the course of each model execution to produce the latter. In that future state, running […].

For my part, on other databases / warehouses / query engines, dbt expects to be able to run metadata queries by limiting its scope to the database/schema of interest. It's pretty reasonable to expect that returning metadata queries on a schema with 1000 objects should actually be fairly quick... so I'm inclined to see this as a limitation of Apache Spark, more than anything.

Is this behavior we need to make configurable? Could we offer an optional/configurable wildcard, in case (e.g.) all the tables in your source schema share a common prefix/suffix? Are there other secret SparkSQL approaches I don't know about (yet)?
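To illustrate the wildcard idea from the comment above: Spark's `show table extended` accepts a glob-style pattern after `like`, so a shared prefix could limit how many relations get listed. A minimal sketch, assuming a hypothetical `pattern` parameter (this is not an existing dbt-spark setting, and the schema/prefix names are made up):

```python
# Hypothetical sketch of the "configurable wildcard" idea discussed above.
# The `pattern` parameter is illustrative, not an existing dbt-spark option.
def list_relations_sql(schema: str, pattern: str = "*") -> str:
    # Spark SQL accepts a glob-style pattern after LIKE, e.g. 'erp_*'
    return f"show table extended in {schema} like '{pattern}'"

list_relations_sql("raw_source", "erp_*")
# "show table extended in raw_source like 'erp_*'"
```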
Picks up from #49
Background
Currently, `_get_one_catalog` runs `show tblproperties` and `describe extended` on every single relation, one by one. As a result, `dbt docs generate` can take several minutes to run.

All the same information is available at a schema level by running `show table extended in [schema] like '*'`, which dbt-spark knows as the `LIST_RELATIONS_MACRO_NAME`.
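For concreteness, a sketch of the two approaches side by side (the schema and table names below are placeholders, not taken from the issue):

```python
# Illustrative only: the per-relation queries dbt-spark issues today for each
# table, versus the single schema-level query proposed in this issue.
# "analytics" and "my_table" are placeholder names.
per_relation_queries = [
    "show tblproperties analytics.my_table",  # run once per relation
    "describe extended analytics.my_table",   # run once per relation
]

schema_level_query = "show table extended in analytics like '*'"  # run once per schema
```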
.Challenge
The tricky part is the formatting: the result of `show table extended ...` is structured, but strange.

As @aaronsteers helpfully found, we could use a regex string to parse out all the info we need from that result.
This will require changes to `_get_columns_for_catalog`, and possibly to `parse_describe_extended` as well.

Alternatives
Who will this benefit?
Faster docs generation for everybody!