Always set database = schema #91

Closed · wants to merge 1 commit
Conversation

jtcohen6 (Contributor)

Problems

  • If the catalog and the manifest are not in full agreement about the database value of a given node/relation, the auto-generated docs site will be missing information (see Alias database to allow matching between rows and Manifest #85)
  • Sources and snapshots that declare a schema/target_schema, but do not also declare an identical database/target_database, raise the exception we intended to appear only when a user has specified a model config database that differs from the model's configured schema (see Master not working with sources #89)

Approaches

  1. Always set database = ''

    • Upside: By checking if a node or relation has database != '', we can see if the user has manually set the database config, allowing us to raise an appropriate exception.
    • Downside: All relations now have the same database. When it comes time for docs generation, _get_one_catalog is passed a single database with multiple schemas, raising the exception 'Expected only one schema in spark _get_one_catalog'. That exception is well motivated: on Spark, we need to run a separate list_relations query for each database/schema (see the grouping sketch after this list).
  2. Always set database = schema

    • Upside: dbt should be able to know that relations in different schemas are in different databases for the purposes of running separate list_relations queries.
    • Upside: This is more in line with Spark's conceptual model.
    • Downside: If a relation's database != schema, there's no good way to know whether the user deliberately set the two to different values, or a source/snapshot simply declares a schema/target_schema that differs from the default database (target.database). We can't raise a helpful exception when the user is trying funky things with the database config; we'll just need to document this.
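
To make the difference concrete, here's a toy sketch of the grouping step that has to happen before the catalog queries run. The Relation class and group_by_database helper are hypothetical stand-ins, not dbt's actual data structures:

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Relation:
        database: str
        schema: str
        identifier: str

    def group_by_database(relations):
        # Map each database to the set of schemas that must be listed in it.
        grouped = defaultdict(set)
        for rel in relations:
            grouped[rel.database].add(rel.schema)
        return grouped

    # Approach 1 (database always ''): every schema collects under one database
    # key, which is the shape that trips the one-schema check in _get_one_catalog.
    approach_1 = [
        Relation(database="", schema="analytics", identifier="orders"),
        Relation(database="", schema="snapshots", identifier="orders_snapshot"),
    ]
    print(dict(group_by_database(approach_1)))   # {'': {'analytics', 'snapshots'}}

    # Approach 2 (database = schema): each database key maps to exactly one schema,
    # so each list_relations / catalog query covers a single schema.
    approach_2 = [
        Relation(database="analytics", schema="analytics", identifier="orders"),
        Relation(database="snapshots", schema="snapshots", identifier="orders_snapshot"),
    ]
    print(dict(group_by_database(approach_2)))   # {'analytics': {'analytics'}, 'snapshots': {'snapshots'}}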

I think I prefer the second approach, and I've tried to implement it here. I figured out a way to change the value of database via __post_init__ even though SparkRelation inherits from a frozen dataclass.
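
For reference, a minimal sketch of that pattern with toy classes (not the actual BaseRelation/SparkRelation definitions): a frozen dataclass blocks ordinary attribute assignment in __post_init__, so the write has to go through object.__setattr__.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class BaseRelation:
        database: str
        schema: str
        identifier: str

    @dataclass(frozen=True)
    class SparkRelation(BaseRelation):
        def __post_init__(self):
            # `self.database = self.schema` would raise FrozenInstanceError,
            # so write the field through object.__setattr__ instead.
            object.__setattr__(self, "database", self.schema)

    rel = SparkRelation(database="dbt", schema="analytics", identifier="orders")
    print(rel.database)  # analytics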

However, dbt docs generate is still returning this error:

Encountered an error while generating catalog: Compilation Error
  Expected only one schema in spark _get_one_catalog
dbt encountered 1 failure while writing the catalog

My best guess right now:

  • _get_catalog_schemas uses the create_from classmethod from BaseRelation to create info_schema_name_map, i.e. the object which establishes which relations exist in which schemas in which databases
  • create_from --> create_from_node does not respect my __post_init__ resetting of database = schema, and instead pulls the values of database, schema, etc. directly off the node attributes (illustrated in the sketch below)
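
To make that hypothesis concrete, here's a purely illustrative sketch with toy classes (not dbt-core's actual create_from_node): if the keys of the catalog map come straight off the node attributes, the hook on SparkRelation never gets a chance to rewrite them.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Node:
        database: str
        schema: str
        alias: str

    @dataclass(frozen=True)
    class SparkRelation:
        database: str
        schema: str
        identifier: str

        def __post_init__(self):
            object.__setattr__(self, "database", self.schema)

    def build_info_schema_map(nodes):
        # Hypothetical stand-in for info_schema_name_map construction:
        # keys are read directly from the node, bypassing SparkRelation entirely.
        return {(n.database, n.schema): {n.alias} for n in nodes}

    node = Node(database="dbt", schema="analytics", alias="orders")
    rel = SparkRelation(database=node.database, schema=node.schema, identifier=node.alias)

    print(rel.database)                   # 'analytics' -- __post_init__ applied
    print(build_info_schema_map([node]))  # {('dbt', 'analytics'): {'orders'}} -- not applied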

@beckjake I'd be thrilled if you could check this out and help me debug what's going wrong. As it is, we may need to revisit some of the catalog generation methods in light of #90.

jtcohen6 (Contributor, Author) commented May 29, 2020

Resolved by #92

jtcohen6 closed this May 29, 2020
kwigley deleted the fix/schema-db-confundity branch March 23, 2021
Successfully merging this pull request may close these issues: Master not working with sources