Always set database = schema #91

Closed · wants to merge 1 commit
Conversation

jtcohen6 (Contributor)

Problems

  • If the catalog and the manifest are not in full agreement about the database value of a given node/relation, the auto-generated docs site will be missing information (see Alias database to allow matching between rows and Manifest #85)
  • Sources and snapshots that declare a schema/target_schema, but do not also declare an identical database/target_database, raise the exception we intended to appear only when a user has specified a model config database that differs from the model's configured schema (see Master not working with sources #89)

Approaches

  1. Always set database = ''

    • Upside: By checking if a node or relation has database != '', we can see if the user has manually set the database config, allowing us to raise an appropriate exception.
    • Downside: All relations now have the same database. When it comes time for docs generation, _get_one_catalog is passed a single database with multiple schemas, raising the exception 'Expected only one schema in spark _get_one_catalog'. That exception is well motivated: on Spark, we need to run a separate list_relations query for each database/schema (see the grouping sketch after this list).
  2. Always set database = schema

    • Upside: dbt should be able to know that relations in different schemas are in different databases for the purposes of running separate list_relations queries.
    • Upside: This is more in line with Spark's conceptual model.
    • Downside: If a relation's database != schema, there's no good way to know whether the user deliberately set the two to different values, or a source/snapshot simply declares a schema/target_schema that differs from the default database (target.database). We can't raise a helpful exception when the user is trying funky things with the database config; we'll just need to document this.
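
To make the difference concrete, here's a toy sketch of the grouping step that has to happen before the catalog queries run. The Relation class and group_by_database helper are hypothetical stand-ins, not dbt's actual data structures:

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Relation:
        database: str
        schema: str
        identifier: str

    def group_by_database(relations):
        # Map each database to the set of schemas that must be listed in it.
        grouped = defaultdict(set)
        for rel in relations:
            grouped[rel.database].add(rel.schema)
        return grouped

    # Approach 1 (database always ''): every schema collects under one database
    # key, which is the shape that trips the one-schema check in _get_one_catalog.
    approach_1 = [
        Relation(database="", schema="analytics", identifier="orders"),
        Relation(database="", schema="snapshots", identifier="orders_snapshot"),
    ]
    print(dict(group_by_database(approach_1)))   # {'': {'analytics', 'snapshots'}}

    # Approach 2 (database = schema): each database key maps to exactly one schema,
    # so each list_relations / catalog query covers a single schema.
    approach_2 = [
        Relation(database="analytics", schema="analytics", identifier="orders"),
        Relation(database="snapshots", schema="snapshots", identifier="orders_snapshot"),
    ]
    print(dict(group_by_database(approach_2)))   # {'analytics': {'analytics'}, 'snapshots': {'snapshots'}}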

I think I prefer the second approach, and I've tried to implement it here. I figured out a way to change the value of database via __post_init__ even though SparkRelation inherits from a frozen dataclass.
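
For reference, a minimal sketch of that pattern with toy classes (not the actual BaseRelation/SparkRelation definitions): a frozen dataclass blocks ordinary attribute assignment in __post_init__, so the write has to go through object.__setattr__.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class BaseRelation:
        database: str
        schema: str
        identifier: str

    @dataclass(frozen=True)
    class SparkRelation(BaseRelation):
        def __post_init__(self):
            # `self.database = self.schema` would raise FrozenInstanceError,
            # so write the field through object.__setattr__ instead.
            object.__setattr__(self, "database", self.schema)

    rel = SparkRelation(database="dbt", schema="analytics", identifier="orders")
    print(rel.database)  # analytics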

However, dbt docs generate is still returning this error:

Encountered an error while generating catalog: Compilation Error
  Expected only one schema in spark _get_one_catalog
dbt encountered 1 failure while writing the catalog

My best guess right now:

  • _get_catalog_schemas uses the create_from classmethod from BaseRelation to create info_schema_name_map, i.e. the object which establishes which relations exist in which schemas in which databases
  • create_from --> create_from_node does not respect my __post_init__ resetting of database = schema, and instead pulls the values of database, schema, etc. directly off the node attributes (illustrated in the sketch below)
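
To make that hypothesis concrete, here's a purely illustrative sketch with toy classes (not dbt-core's actual create_from_node): if the keys of the catalog map come straight off the node attributes, the hook on SparkRelation never gets a chance to rewrite them.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Node:
        database: str
        schema: str
        alias: str

    @dataclass(frozen=True)
    class SparkRelation:
        database: str
        schema: str
        identifier: str

        def __post_init__(self):
            object.__setattr__(self, "database", self.schema)

    def build_info_schema_map(nodes):
        # Hypothetical stand-in for info_schema_name_map construction:
        # keys are read directly from the node, bypassing SparkRelation entirely.
        return {(n.database, n.schema): {n.alias} for n in nodes}

    node = Node(database="dbt", schema="analytics", alias="orders")
    rel = SparkRelation(database=node.database, schema=node.schema, identifier=node.alias)

    print(rel.database)                   # 'analytics' -- __post_init__ applied
    print(build_info_schema_map([node]))  # {('dbt', 'analytics'): {'orders'}} -- not applied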

@beckjake I'd be thrilled if you could check this out and help me debug what's going wrong. As it is, we may need to revisit some of the catalog generation methods in light of #90.

jtcohen6 (Contributor, Author) commented May 29, 2020

Resolved by #92

jtcohen6 closed this May 29, 2020
kwigley deleted the fix/schema-db-confundity branch March 23, 2021
Successfully merging this pull request may close these issues: Master not working with sources