dbt Constraints / model contracts #6271

Merged: 117 commits, merged Feb 15, 2023

Commits (changes from all commits)
e879cdd
start off the blueprints
sungchun12 Nov 7, 2022
bd1afc7
Merge branch 'main' of https://github.com/dbt-labs/dbt into feature/d…
sungchun12 Nov 9, 2022
4b2a881
Merge branch 'main' of https://github.com/dbt-labs/dbt into feature/d…
sungchun12 Nov 16, 2022
e8b52e0
test commit
sungchun12 Nov 16, 2022
01a07d3
working snowflake env
sungchun12 Nov 16, 2022
3b31a15
update manifest expectation
sungchun12 Nov 17, 2022
1a1f46a
add error handling
sungchun12 Nov 17, 2022
ebaa54c
clean up language
sungchun12 Nov 17, 2022
fd7a47a
constraints validator
sungchun12 Nov 17, 2022
7e3a9be
cleaner example
sungchun12 Nov 17, 2022
6df02da
better terminal output
sungchun12 Nov 17, 2022
bc3c5bc
add python error handling
sungchun12 Nov 17, 2022
4b901fd
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Nov 17, 2022
e29b571
add to manifest schema
sungchun12 Nov 17, 2022
7d085dc
add to schema config
sungchun12 Nov 17, 2022
e6559d4
clean up comments
sungchun12 Nov 18, 2022
ca89141
backwards compatible nodeconfig
sungchun12 Nov 18, 2022
2c4a4cf
remove comments
sungchun12 Nov 18, 2022
1975c6b
clean up more comments
sungchun12 Nov 18, 2022
f088a03
add changelog
sungchun12 Nov 18, 2022
7421caa
clarify error message
sungchun12 Nov 21, 2022
a7395bb
constraints list type
sungchun12 Nov 21, 2022
5d06524
fix grammar
sungchun12 Nov 21, 2022
380bd96
add global macros
sungchun12 Nov 21, 2022
e6e490d
clearer compile error
sungchun12 Nov 21, 2022
9c498ef
remove comments
sungchun12 Nov 21, 2022
bed1fec
fix tests in this file
sungchun12 Nov 21, 2022
8c466b0
conditional compile errors
sungchun12 Nov 22, 2022
547ad9e
add conditional check in ddl
sungchun12 Nov 22, 2022
52bd35b
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Nov 22, 2022
7582531
add macro to dispatch
sungchun12 Nov 22, 2022
7291094
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Nov 28, 2022
b87f57d
fix regressions in parsed
sungchun12 Nov 28, 2022
5529334
fix regressions in manifest tests
sungchun12 Nov 28, 2022
00f12c2
fix manifest test regressions
sungchun12 Nov 28, 2022
76bf69c
fix test_list regressions
sungchun12 Nov 28, 2022
2e51246
concise data_type terminal error
sungchun12 Nov 29, 2022
5891eb3
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Nov 29, 2022
5d2867f
remove placeholder function
sungchun12 Nov 29, 2022
801b2fd
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Nov 29, 2022
5d59cc1
fix failed regressions finally
sungchun12 Dec 2, 2022
92d2ea7
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Dec 2, 2022
4f747b0
Revert "Merge branch 'main' of https://github.com/dbt-labs/dbt into d…
sungchun12 Dec 2, 2022
eba0b6d
Revert "Revert "Merge branch 'main' of https://github.com/dbt-labs/db…
sungchun12 Dec 2, 2022
ae56da1
remove tmp.csv
sungchun12 Dec 2, 2022
de653e4
template test plans
sungchun12 Dec 5, 2022
cfc53b0
postgres columns spec macro
sungchun12 Dec 5, 2022
fc7230b
schema does not exist error handling
sungchun12 Dec 7, 2022
e1c72ac
update postgres adapter
sungchun12 Dec 7, 2022
b215b6c
remove comments
sungchun12 Dec 7, 2022
d550de4
first passing test
sungchun12 Dec 7, 2022
9d87463
fix postgres macro
sungchun12 Dec 7, 2022
ce85c96
add more passing tests
sungchun12 Dec 7, 2022
49a4120
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Dec 7, 2022
8ffb654
Add generated CLI API docs
FishtownBuildBot Dec 7, 2022
f2f2707
add disabled config test
sungchun12 Dec 7, 2022
096f3fd
column configs match
sungchun12 Dec 7, 2022
eae4e76
Merge branch 'dbt-constraints' of https://github.com/dbt-labs/dbt int…
sungchun12 Dec 7, 2022
b8c3812
test python error handling
sungchun12 Dec 7, 2022
bb1a6c3
adjust macro with rollback
sungchun12 Dec 7, 2022
b6dbcf6
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Dec 7, 2022
751cdc8
start postgres tests
sungchun12 Dec 8, 2022
6bbd797
remove begin commit
sungchun12 Dec 15, 2022
d364eeb
remove begin commit comments
sungchun12 Dec 15, 2022
4a58ece
passing expected compiled sql test
sungchun12 Dec 15, 2022
ac795dd
passing rollback test
sungchun12 Dec 15, 2022
d452cae
update changelog
sungchun12 Dec 16, 2022
baf18f0
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Dec 16, 2022
76c0e4f
fix artifacts regression
sungchun12 Dec 16, 2022
10ab3cb
modularize validator
sungchun12 Dec 18, 2022
6253ed0
PR feedback
sungchun12 Dec 18, 2022
307809d
verify database error occurs
sungchun12 Dec 19, 2022
ab4f396
focus on generic outcomes
sungchun12 Dec 19, 2022
b99e9be
fix global macro
sungchun12 Dec 20, 2022
5935201
rename to constraints_check
sungchun12 Dec 20, 2022
f59c9dd
missed a check rename
sungchun12 Dec 20, 2022
7e28a31
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Dec 24, 2022
3ddd666
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Jan 4, 2023
fabe2ce
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Jan 4, 2023
ffec7d7
validate at parse time
sungchun12 Jan 4, 2023
d338f33
raise error for modelparser only
sungchun12 Jan 5, 2023
e34c467
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Jan 5, 2023
44b2f18
better spacing in terminal output
sungchun12 Jan 5, 2023
17b1f8e
fix test regressions
sungchun12 Jan 6, 2023
c652367
fix manifest test regressions
sungchun12 Jan 6, 2023
f9e020d
these are parsing errors now
sungchun12 Jan 9, 2023
926e555
merge main
sungchun12 Jan 11, 2023
c6bd674
fix tests
sungchun12 Jan 11, 2023
bcc35fc
test passes in json log format
sungchun12 Jan 11, 2023
880ed43
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Jan 13, 2023
426789e
add column compile error handling
sungchun12 Jan 13, 2023
3d61eda
update global macros for column handling
sungchun12 Jan 13, 2023
f163b2c
remove TODO
sungchun12 Jan 13, 2023
bf45243
uppercase columns for consistency
sungchun12 Jan 17, 2023
dbef42b
more specific error handling
sungchun12 Jan 17, 2023
c4de8f3
migrate tests
sungchun12 Jan 17, 2023
ad07ced
clean up core tests
sungchun12 Jan 17, 2023
a501c29
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Jan 17, 2023
59b0298
Update core/dbt/include/global_project/macros/materializations/models…
sungchun12 Jan 17, 2023
e46066b
Revert "Update core/dbt/include/global_project/macros/materialization…
sungchun12 Jan 17, 2023
e80b8cd
update for pre-commit hooks
sungchun12 Jan 17, 2023
257eacd
update for black formatter
sungchun12 Jan 17, 2023
391bd3b
update for black formatter on all files
sungchun12 Jan 17, 2023
0441417
Merge remote-tracking branch 'origin/main' into dbt-constraints
jtcohen6 Jan 23, 2023
e0bcb25
Refactor functional tests
jtcohen6 Jan 30, 2023
dcf7062
Fixup formatting
jtcohen6 Jan 30, 2023
903a2cb
Dave feedback
jtcohen6 Jan 30, 2023
c40ee92
another one - dave
jtcohen6 Jan 30, 2023
111683a
the hits keep coming
jtcohen6 Jan 30, 2023
f8b16bc
adjust whitespace
dave-connors-3 Jan 30, 2023
97f0c6b
Merge branch 'main' of https://github.com/dbt-labs/dbt into dbt-const…
sungchun12 Jan 31, 2023
2a33baf
Light touchup
jtcohen6 Feb 1, 2023
6806a7c
Merge remote-tracking branch 'origin/main' into dbt-constraints
jtcohen6 Feb 1, 2023
1256e7b
Add more flexibility for spark
jtcohen6 Feb 2, 2023
0304dbf
Nearly there for spark
jtcohen6 Feb 2, 2023
b5b1699
Merge main
jtcohen6 Feb 14, 2023
d338faa
Try regenerating docs
jtcohen6 Feb 15, 2023
8 changes: 8 additions & 0 deletions .changes/unreleased/Features-20221118-141120.yaml
@@ -0,0 +1,8 @@
kind: Features
body: Data type constraints are now native to SQL table materializations. Enforce
columns are specific data types and not null depending on database functionality.
time: 2022-11-18T14:11:20.868062-08:00
custom:
Author: sungchun12
Issue: "6079"
PR: "6271"
1 change: 1 addition & 0 deletions core/dbt/contracts/graph/model_config.py
@@ -446,6 +446,7 @@ class NodeConfig(NodeAndTestConfig):
default_factory=Docs,
metadata=MergeBehavior.Update.meta(),
)
constraints_enabled: Optional[bool] = False
jtcohen6 (Contributor) commented on Jan 13, 2023:

I'm okay with this nomenclature for now. I'll be curious to hear feedback from beta testers.

One thing that's potentially misleading: In addition to enabling constraints (if they're defined), this also enables (requires) the verification of the number, order, and data types of columns.

A long time ago, there was an idea to call this strict: #1570. (While searching for this issue, I was proud to find that I still know @jwerderits' GitHub handle by memory.)

sungchun12 (Contributor, Author) commented on Jan 13, 2023:

I'm open to changing it after community feedback! There are exceptions that I flow through as DatabaseErrors vs. ParsingErrors, because we get error handling for free (think: data type validation, SQL column-count mismatches).

Contributor commented:

[For follow-up issues, out of scope for this PR]

After a bit more discussion, I'm thinking about renaming constraints_enabled to either:

  • stable: true|false
  • contracted: true|false

davehowell commented:

nit: If it has a default value False then it doesn't need to be an Option type, it could just be bool.

It's a shame the python typing lib uses the word Optional as a keyword; it's the default value that makes providing the attribute optional when instantiating the class.

On another note, a bool with a default value of None forces the use of Optional[bool], which only makes sense if you must have a lenient (public) API or if there are inheritance-compatibility issues.

Contributor commented:

@davehowell Good catch! It looks like @gshank already had the same thought, and implemented this change over in #7002, along with the config rename :)


# we validate that node_color has a suitable value to prevent dbt-docs from crashing
def __post_init__(self):
4 changes: 4 additions & 0 deletions core/dbt/contracts/graph/nodes.py
@@ -61,6 +61,7 @@
SnapshotConfig,
)


# =====================================================================
# This contains the classes for all of the nodes and node-like objects
# in the manifest. In the "nodes" dictionary of the manifest we find
@@ -146,6 +147,8 @@ class ColumnInfo(AdditionalPropertiesMixin, ExtensibleDbtClassMixin, Replaceable
description: str = ""
meta: Dict[str, Any] = field(default_factory=dict)
data_type: Optional[str] = None
constraints: Optional[List[str]] = None
constraints_check: Optional[str] = None
quote: Optional[bool] = None
tags: List[str] = field(default_factory=list)
_extra: Dict[str, Any] = field(default_factory=dict)
@@ -400,6 +403,7 @@ class CompiledNode(ParsedNode):
extra_ctes_injected: bool = False
extra_ctes: List[InjectedCTE] = field(default_factory=list)
_pre_injected_sql: Optional[str] = None
constraints_enabled: bool = False

@property
def empty(self):
2 changes: 2 additions & 0 deletions core/dbt/contracts/graph/unparsed.py
@@ -93,6 +93,8 @@ class HasDocs(AdditionalPropertiesMixin, ExtensibleDbtClassMixin, Replaceable):
description: str = ""
meta: Dict[str, Any] = field(default_factory=dict)
data_type: Optional[str] = None
constraints: Optional[List[str]] = None
constraints_check: Optional[str] = None
jtcohen6 (Contributor) commented on Jan 13, 2023:

Clarifying Q that I'm sure has been answered elsewhere: Which among our most popular data platforms actually support the totally flexible check constraint? Just Postgres + Databricks?

All good things we'll want to get documented, including which constraints are actually enforced. For most of our platforms, it's really only not null; the others are just for show :)

sungchun12 (Contributor, Author) commented:

Yes, just Postgres + Databricks. We'll include adapter-specific disclaimers in the linked docs PR: dbt-labs/docs.getdbt.com#2601

Contributor commented:

[For follow-up issues, out of scope for this PR]

This didn't occur to me on the first read-through, but is there a reason why we want this to be a separate configuration, rather than just a subset of the available constraints (on platforms that support it)?

Answering my own question: It looks like it's because Databricks requires that check constraints be applied after table creation via alter statements. (It looks like Postgres supports them within the create table statement.)

I still think I might prefer a configuration like:

constraints: Optional[List[Union[str, Dict[str, str]]]] = None
columns:
  - name: price
    data_type: numeric
    constraints:
      - not null
      - check: price > 0
        name: positive_price  # Postgres supports naming constraints

Postgres + Databricks also support table-level constraints that are composed of multiple columns. We should also support constraints at the model level, in addition to the column level. Should they be a config, or a non-config attribute?
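
For context, here is a rough SQL sketch of the platform difference being described; the table and constraint names are illustrative, not taken from this PR:

-- Postgres: a check constraint can be declared inline in CREATE TABLE
create table analytics.prices (
    price numeric not null check (price > 0)
);

-- Databricks: check constraints are added after the table exists, via ALTER
create table analytics.prices (price numeric not null);
alter table analytics.prices
    add constraint positive_price check (price > 0);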

docs: Docs = field(default_factory=Docs)
_extra: Dict[str, Any] = field(default_factory=dict)

Binary file modified core/dbt/docs/build/doctrees/environment.pickle
Binary file not shown.
Binary file modified core/dbt/docs/build/doctrees/index.doctree
Binary file not shown.
@@ -0,0 +1,51 @@
{%- macro get_columns_spec_ddl() -%}
{{ adapter.dispatch('get_columns_spec_ddl', 'dbt')() }}
{%- endmacro -%}

{% macro default__get_columns_spec_ddl() -%}
{{ return(columns_spec_ddl()) }}
{%- endmacro %}

{% macro columns_spec_ddl() %}
{# loop through user_provided_columns to create DDL with data types and constraints #}
{%- set user_provided_columns = model['columns'] -%}
(
{% for i in user_provided_columns %}
{% set col = user_provided_columns[i] %}
{% set constraints = col['constraints'] %}
{% set constraints_check = col['constraints_check'] %}
{{ col['name'] }} {{ col['data_type'] }} {% for x in constraints %} {{ x or "" }} {% endfor %} {% if constraints_check -%} check {{ constraints_check or "" }} {%- endif %} {{ "," if not loop.last }}
{% endfor %}
)
{% endmacro %}

{%- macro get_assert_columns_equivalent(sql) -%}
{{ adapter.dispatch('get_assert_columns_equivalent', 'dbt')(sql) }}
{%- endmacro -%}

{% macro default__get_assert_columns_equivalent(sql) -%}
{{ return(assert_columns_equivalent(sql)) }}
{%- endmacro %}

{% macro assert_columns_equivalent(sql) %}
b-per (Contributor) commented:

This macro is failing in Snowflake because get_columns_in_query() returns uppercase column names.

12:00:07    Please ensure the name and order of columns in your `yml` file match the columns in your SQL file.
12:00:07    Schema File Columns: ['int_column', 'float_column', 'bool_column', 'date_column']
12:00:07    SQL File Columns: ['INT_COLUMN', 'FLOAT_COLUMN', 'BOOL_COLUMN', 'DATE_COLUMN']

I could dispatch & replicate the macro and handle upper/lower-case conversion, but I am wondering if it should be handled in Core directly and not in the adapter.

In addition, shouldn't we check if columns are quoted to know if we should perform lower/upper-case conversion or not?

Contributor commented:

@b-per Great point. And that quote_columns config is specific to seeds - not to be confused with the column-level quote property: https://docs.getdbt.com/reference/resource-properties/quote

(Sorry everyone: #2986)

sungchun12 (Contributor, Author) commented:

@b-per Great feedback. I'll update the macro to uppercase all the columns across both the schema and sql files. I'll also figure out quote columns!

sungchun12 (Contributor, Author) commented:

Updated this to uppercase all columns, added extra uppercase checks and quotes to the columns in the SQL file, and added a quote: true config in test_constraints.py. @b-per can you pull in these changes and verify that it's working for Snowflake?
[screenshot]

{#- loop through user_provided_columns to get column names -#}
{%- set user_provided_columns = model['columns'] -%}
{%- set column_names_config_only = [] -%}
{%- for i in user_provided_columns -%}
{%- set col = user_provided_columns[i] -%}
{%- set col_name = col['name'] -%}
{%- set column_names_config_only = column_names_config_only.append(col_name) -%}
{%- endfor -%}
{%- set sql_file_provided_columns = get_columns_in_query(sql) -%}
Contributor commented:

Making an assumption here that column ordering is consistent (and deterministic) in all our data platforms.

Follow-on TODO: We prefer option 2 in https://github.com/dbt-labs/dbt-core/pull/6271/files#r1069332715; let's make sure dbt always ensures order itself, and doesn't make yaml ordering matter so much.

Contributor commented:

Thanks @MichelleArk for opening #6975!


{#- uppercase both schema and sql file columns -#}
{%- set column_names_config_upper= column_names_config_only|map('upper')|join(',') -%}
{%- set column_names_config_formatted = column_names_config_upper.split(',') -%}
{%- set sql_file_provided_columns_upper = sql_file_provided_columns|map('upper')|join(',') -%}
{%- set sql_file_provided_columns_formatted = sql_file_provided_columns_upper.split(',') -%}

{%- if column_names_config_formatted != sql_file_provided_columns_formatted -%}
{%- do exceptions.raise_compiler_error('Please ensure the name, order, and number of columns in your `yml` file match the columns in your SQL file.\nSchema File Columns: ' ~ column_names_config_formatted ~ '\nSQL File Columns: ' ~ sql_file_provided_columns_formatted ~ ' ' ) %}
{%- endif -%}

{% endmacro %}
@@ -25,6 +25,10 @@

create {% if temporary: -%}temporary{%- endif %} table
{{ relation.include(database=(not temporary), schema=(not temporary)) }}
{% if config.get('constraints_enabled', False) %}
{{ get_assert_columns_equivalent(sql) }}
{{ get_columns_spec_ddl() }}
{% endif %}
Comment on lines +28 to +31
Contributor commented:

Follow-up (related to ongoing discussion in #6750): We might want to distinguish between the "contract" and the "constraints." That distinction would make it possible for users to define (e.g.) check constraints on Postgres/Databricks, without providing the full column specification in a yaml file. (This is already the case in dbt-databricks.)

What ought to be part of the contract (for detecting breaking changes, #6869)? "Data shape" for sure (column names, types, nullability). Additional constraints? Tests? We need to decide.

Contributor commented:

Assigning myself to #6750 for further refinement, before we pick this up for implementation

as (
{{ sql }}
);
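
As a rough illustration (the model name, columns, and whitespace are made up for the example, not what dbt emits verbatim), with constraints_enabled the default macro above renders the column spec into the CTAS, roughly:

create table analytics.my_model
(
    id integer not null,
    price numeric
)
as (
    select 1 as id, 10.0 as price
);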
15 changes: 14 additions & 1 deletion core/dbt/parser/base.py
@@ -18,7 +18,7 @@
from dbt.contracts.graph.manifest import Manifest
from dbt.contracts.graph.nodes import ManifestNode, BaseNode
from dbt.contracts.graph.unparsed import UnparsedNode, Docs
from dbt.exceptions import DbtInternalError, ConfigUpdateError, DictParseError
from dbt.exceptions import DbtInternalError, ConfigUpdateError, DictParseError, ParsingError
from dbt import hooks
from dbt.node_types import NodeType, ModelLanguage
from dbt.parser.search import FileBlock
@@ -306,6 +306,19 @@ def update_parsed_node_config(
else:
parsed_node.docs = Docs(show=docs_show)

# If we have constraints_enabled in the config, copy to node level, for backwards
# compatibility with earlier node-only config.
if config_dict.get("constraints_enabled", False):
parsed_node.constraints_enabled = True

parser_name = type(self).__name__
if parser_name == "ModelParser":
original_file_path = parsed_node.original_file_path
error_message = "\n `constraints_enabled=true` can only be configured within `schema.yml` files\n NOT within a model file(ex: .sql, .py) or `dbt_project.yml`."
raise ParsingError(
f"Original File Path: ({original_file_path})\nConstraints must be defined in a `yml` schema configuration file like `schema.yml`.\nOnly the SQL table materialization is supported for constraints. \n`data_type` values must be defined for all columns and NOT be null or blank.{error_message}"
)
Comment on lines +309 to +320
Contributor commented:

Confirming that this is necessary only:

As @gshank commented earlier, parse time is the right time for this sort of validation, versus compile time, per our slightly nonstandard meanings of "parsing" + "compilation." I do see the downside that we return parse-time errors as soon as we hit them, rather than batching up all the errors we can and returning them at once. If it were a warning instead, then we could return them all together!

FWIW - I'm aligned with constraints_enabled being a config, with the idea that it should be possible to enable a broad swath of models (e.g. "everything in the marts/public folder") to have type-checked column specs. That's true even though it depends on the columns attribute, which is not a config, but a property specific to one model.

If we can't resolve that tech debt before the v1.5 release, I'd probably opt for removing the parse-time check, and letting the database return its own validation error at runtime. It's a less-nice UX to see the error later in the dev cycle, but I think it's better to offer the full flexibility around configuring constraints_enabled.

sungchun12 (Contributor, Author) commented:

You are correct on why this is a Parsing Error today. I'm fine with the less-nice UX in favor of constraints_enabled=true being available in all configs come v1.5.
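
For illustration, a hypothetical model file that this check would now reject at parse time (the model and column are made up):

-- models/my_model.sql: raises the ParsingError above, because constraints_enabled
-- belongs in a yml schema file, not in the model file or dbt_project.yml
{{ config(materialized='table', constraints_enabled=true) }}

select 1 as id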


# unrendered_config is used to compare the original database/schema/alias
# values and to handle 'same_config' and 'same_contents' calls
parsed_node.unrendered_config = config.build_config_dict(
77 changes: 76 additions & 1 deletion core/dbt/parser/schemas.py
@@ -119,6 +119,8 @@ def add(
column: Union[HasDocs, UnparsedColumn],
description: str,
data_type: Optional[str],
constraints: Optional[List[str]],
constraints_check: Optional[str],
meta: Dict[str, Any],
):
tags: List[str] = []
@@ -132,6 +134,8 @@
name=column.name,
description=description,
data_type=data_type,
constraints=constraints,
constraints_check=constraints_check,
meta=meta,
tags=tags,
quote=quote,
@@ -144,8 +148,10 @@ def from_target(cls, target: Union[HasColumnDocs, HasColumnTests]) -> "ParserRef"
for column in target.columns:
description = column.description
data_type = column.data_type
constraints = column.constraints
constraints_check = column.constraints_check
meta = column.meta
refs.add(column, description, data_type, meta)
refs.add(column, description, data_type, constraints, constraints_check, meta)
return refs


@@ -914,6 +920,75 @@ def parse_patch(self, block: TargetBlock[NodeTarget], refs: ParserRef) -> None:
self.patch_node_config(node, patch)

node.patch(patch)
self.validate_constraints(node)

def validate_constraints(self, patched_node):
error_messages = []
if (
patched_node.resource_type == "model"
and patched_node.config.constraints_enabled is True
):
validators = [
self.constraints_schema_validator(patched_node),
self.constraints_materialization_validator(patched_node),
self.constraints_language_validator(patched_node),
self.constraints_data_type_validator(patched_node),
]
error_messages = [validator for validator in validators if validator != "None"]

if error_messages:
original_file_path = patched_node.original_file_path
raise ParsingError(
f"Original File Path: ({original_file_path})\nConstraints must be defined in a `yml` schema configuration file like `schema.yml`.\nOnly the SQL table materialization is supported for constraints. \n`data_type` values must be defined for all columns and NOT be null or blank.{self.convert_errors_to_string(error_messages)}"
Contributor commented:

Only the SQL table materialization is supported for constraints.

Good on you for having the explicit callout here!

This does not work for:

  • Views, because databases don't allow data type specifications or constraints when creating views. (They only allow column names, and those aren't actually cross-checked, just used for implicit renaming.)
  • Incremental models - unless, of course, they have configured on_schema_change: 'fail'! Should we disallow the more capable mechanisms for handling schema changes, though? This feels like a pretty big limitation.

Unless!:

  1. We add a check for column types after a view or temp table has been created. (For views: This could be within-transaction and before the alter-rename-swap on "older-school" databases, or after a view has already been created on "newer-school" databases that support create or replace.) That feels like a pretty big limitation, though - a big part of the appeal of "constraints," as opposed to dbt test, is that they should be checked before the object has been created/replaced, or before expensive incremental processing has happened.
  2. We add a check for column names + types before the view / temp table has been created, by means of get_columns_in_query, a.k.a. where false limit 0. That's a "free" query on most databases (including BQ!), there's just a bit of risk in extracting metadata + doing our own data type comparisons, rather than pushing it all down to the database.

I lean toward option 2, as a way to expand this support beyond just tables. In a future where we support project-level governance rules ("all public models must have constraints enabled"), without that support, we'd also be requiring that every public model be materialized: table — and that feels like a pretty big limitation. I've already talked to a handful of folks who have grown quite fond of a view-driven "consumption layer," and feel it offers them an additional lever of control between the underlying data (table) and its downstream users.

sungchun12 (Contributor, Author) commented:

For Option 2, I'm aligned on approach to get incremental models working. Did you want this implemented in this PR or another?

Contributor commented:

Happy to have this in a follow-up issue/PR! I don't think we need it for the initial cut / enough to unlock a beta experience

Contributor commented:

[For follow-up issues, out of scope for this PR]

After a bit more investigation, we'll need to use a modified version of get_columns_in_query, that pulls data types off the database cursor/API, if we want to also enforce that data types match before a table/view is created/replaced.
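
For reference, the "free" metadata query behind get_columns_in_query is roughly a zero-row wrapper around the model's compiled SQL; a sketch (the inner select is illustrative):

select * from (
    select 1 as id, cast('2023-01-01' as date) as date_day  -- model's compiled SQL
) as dbt_sbq
where false
limit 0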

)

def convert_errors_to_string(self, error_messages: List[str]):
n = len(error_messages)
if not n:
return ""
if n == 1:
return error_messages[0]
error_messages_string = "".join(error_messages[:-1]) + f"{error_messages[-1]}"
return error_messages_string

def constraints_schema_validator(self, patched_node):
schema_error = False
if patched_node.columns == {}:
schema_error = True
schema_error_msg = "\n Schema Error: `yml` configuration does NOT exist"
schema_error_msg_payload = f"{schema_error_msg if schema_error else None}"
return schema_error_msg_payload

def constraints_materialization_validator(self, patched_node):
materialization_error = {}
if patched_node.config.materialized != "table":
materialization_error = {"materialization": patched_node.config.materialized}
materialization_error_msg = f"\n Materialization Error: {materialization_error}"
materialization_error_msg_payload = (
f"{materialization_error_msg if materialization_error else None}"
)
return materialization_error_msg_payload

def constraints_language_validator(self, patched_node):
language_error = {}
language = str(patched_node.language)
if language != "sql":
jtcohen6 (Contributor) commented on Jan 23, 2023:

[For follow-up issues, out of scope for this PR]

It should be possible to support dbt-py models, by saving the Python model in a temp table, and then inserting it into a preexisting table (created per contract). The downside of that approach is, the full (potentially expensive) transformation has already taken place before the contract could be validated.

As a general rule, let's treat all of the "validators" defined here as potential TODOs. We'll either want to remove them as limitations, or keep raising explicit errors (with potentially reworked wording).
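
A rough sketch of the idea floated above for dbt-py models, assuming a pre-created table carrying the contract and a temp relation holding the Python model's output (all names hypothetical, not part of this PR):

-- create the table per the contract first
create table analytics.my_py_model (id integer not null, score float);

-- the Python model's result lands in a temp relation, then is inserted,
-- so the contract is only enforced after the (potentially expensive) transformation
insert into analytics.my_py_model
select id, score from my_py_model__dbt_tmp;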

sungchun12 (Contributor, Author) commented on Jan 31, 2023:

@jtcohen6
I recommend the syntax be contract: true | false, because contracted invokes imagery of pregnancy at first glance (a beautiful human experience, but dissonant in the context of data contracts) and is in the past tense, which may confuse the reader. Whenever the term data contracts comes up, it's a noun, singular or plural, not a verb.

I'm on board with this example config, feels a lot more elegant compared to the first draft:

columns:
  - name: price
    data_type: numeric
    constraints:
      - not null
      - check: price > 0
        name: positive_price  # Postgres supports naming constraints

I'm aligned with you on get_columns_in_query relying on the database cursor/API for more robustness and cheaper operations.

I'm aligned on python models as long as we document/warn the user of performance hits as a result.

I'm aligned with you on "validators". My main goal is to provide maximally useful terminal logs that invoke clear and quick actions to fix. This makes iteration a delight rather than a chore.

Thanks for such a thorough review and eye towards the future. I'm juiced up to finish these PRs!

language_error = {"language": language}
language_error_msg = f"\n Language Error: {language_error}"
language_error_msg_payload = f"{language_error_msg if language_error else None}"
return language_error_msg_payload

def constraints_data_type_validator(self, patched_node):
data_type_errors = set()
for column, column_info in patched_node.columns.items():
if column_info.data_type is None:
data_type_error = {column}
data_type_errors.update(data_type_error)
data_type_errors_msg = (
f"\n Columns with `data_type` Blank/Null Errors: {data_type_errors}"
)
data_type_errors_msg_payload = f"{data_type_errors_msg if data_type_errors else None}"
return data_type_errors_msg_payload


class TestablePatchParser(NodePatchParser[UnparsedNodeUpdate]):
9 changes: 8 additions & 1 deletion plugins/postgres/dbt/include/postgres/macros/adapters.sql
@@ -9,7 +9,14 @@
{%- elif unlogged -%}
unlogged
{%- endif %} table {{ relation }}
as (
{% if config.get('constraints_enabled', False) %}
Contributor commented:

The initial piece (lines 8-12) is the same for both conditions. This if block should be moved down to just the piece that is the difference.

sungchun12 (Contributor, Author) commented:

done

{{ get_assert_columns_equivalent(sql) }}
{{ get_columns_spec_ddl() }} ;
insert into {{ relation }} {{ get_column_names() }}
Contributor commented:

On at least Postgres (at least recent versions), you do not need to provide the list of column names when inserting:

jerco=# create table dbt_jcohen.some_tbl (id int, other_col text);
CREATE TABLE
jerco=# insert into dbt_jcohen.some_tbl select 1 as id, 'blue' as color;
INSERT 0 1
jerco=# select * from dbt_jcohen.some_tbl;
 id | other_col
----+-----------
  1 | blue
(1 row)

On other databases (e.g. Redshift), it may indeed be required

sungchun12 (Contributor, Author) commented on Jan 13, 2023:

Good to know! I guess this is where dbt's opinions aim for explicit over implicit in compilation

{% else %}
as
{% endif %}
(
{{ sql }}
);
{%- endmacro %}
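
For illustration (relation and column names are examples, and the rendered whitespace will differ), with constraints enabled the Postgres macro above produces a create-then-insert rather than a CTAS, roughly:

create table analytics.my_model
(
    id integer not null,
    price numeric
);
insert into analytics.my_model (id, price)
(
    select 1 as id, 10.0 as price
);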
@@ -0,0 +1,23 @@
{% macro postgres__get_columns_spec_ddl() %}
jtcohen6 (Contributor) commented on Jan 13, 2023:

Is any part of this (+ other pg-specific code here) reusable? Any sense in making it the "default" implementation, within the "global project," instead of just for Postgres? Asking just to confirm — I'm sure you've already given it thought, across the many adapter-specific implementations!

sungchun12 (Contributor, Author) commented:

Yeah, this was a toughie to discern. I biased towards the global macro default bundling the data type DDL and wrapping the select statement because other adapters may benefit from it (think: BigQuery, Snowflake, etc.). At the same time, the Postgres implementation will likely be the most universally applicable, minus the check constraint, which isn't common across data warehouses. Overall, there is no good default implementation, just good template code to override in a specific adapter.

I recommend keeping it as is given the above, but given your team maintains this and you have a bigger picture view of how adapters will use this, I'm happy to make the postgres implementation the default.

{# loop through user_provided_columns to create DDL with data types and constraints #}
{%- set user_provided_columns = model['columns'] -%}
(
{% for i in user_provided_columns %}
{% set col = user_provided_columns[i] %}
{% set constraints = col['constraints'] %}
{% set constraints_check = col['constraints_check'] %}
{{ col['name'] }} {{ col['data_type'] }} {% for x in constraints %} {{ x or "" }} {% endfor %} {% if constraints_check -%} check {{ constraints_check or "" }} {%- endif %} {{ "," if not loop.last }}
Contributor commented:

A few things here that are interesting, and worth a second look — at least to document, if we don't decide to actually do anything about them.

When an explicit column list is provided, both on "older-school" databases that need to use create+insert and on databases that support column lists in CTAs:

  • The database does not verify that the column names defined in SQL match the column names defined in the config. AFAIK, it's standard database behavior to coerce the former into the latter.
  • The database will also try coercing types. If you return a date-type column in your SQL named date_day, but your yaml says string, the table won't fail to build—it will succeed, and include date_day as a string.
  • The order of columns specified becomes really important! There needs to be an exact match between the SQL query output and the yaml columns attribute. If there's a data type mismatch, okay, you get an error (which might be confusing to debug). The worst-case scenario is that you have two columns with the same data type, but accidentally swap the order. Imagine:
-- models/important_model.sql
select
  1 as id,
  10000 as revenue,
  100 as num_customers
version: 2
models:
  - name: important_model
    config:
      materialized: table
      constraints_enabled: True
    columns:
      - name: id
        data_type: numeric
      - name: num_customers
        data_type: int
      - name: revenue
        data_type: int

dbt run succeeds, and then:

jerco=# select * from dbt_jcohen.important_model;
 id | num_customers | revenue
----+---------------+---------
  1 |         10000 |     100
(1 row)

Average revenue per customer is going to look a bit off!

Two options here:

  1. We could stick in an additional check, either on the temp table mid-transaction ("older-school" databases) or via get_columns_in_query (create or replace databases), to verify that the column names match up exactly. That would also allow us to verify data types, and avoid unintended coercions.
  2. Within the insert statement (or elsewhere within the CTA), we could add another subquery layer to ensure we're selecting columns by name, rather than just implicitly by order:
insert into ...
  (id, num_customers, revenue)
  select id, num_customers, revenue from -- I added this
    (
    select
  1 as id,
  10000 as revenue,
  100 as num_customers
  ) model_subq; -- postgres requires named subqueries

The latter strategy is more or less what we do in incremental models, to handle cases where an incremental model's columns have been reordered since first run.

The good news: In all cases, we are guaranteed that the column names, types, and order in the resulting table match up exactly with the columns defined in the config. That is better, since it's static structured data that we know in advance, record in the manifest, and can use to communicate the "API" of this table to downstream consumers.

sungchun12 (Contributor, Author) commented:

I vote for option 1 because it incurs fewer database performance hits; option 2's subqueries depend a lot more on the database to optimize them (which varies).

I'll make this change!

dave-connors-3 (Contributor) commented:

I'm not sure we create any temp tables in the current implementation, so where would we do the mid-transaction check in option 1?

Contributor commented:

@dave-connors-3 The way we replace preexisting views/tables on Postgres + Redshift is by creating a "temp" relation (with a suffix __dbt_tmp, not always actually a temporary table), and then alter rename swapping the existing/old + new relations, before dropping the old one. That ensures zero downtime, and prevents the view/table from being missing if the model SQL runs into an error.

So we could try to stick another check in the middle of that process. But I actually think get_columns_in_query might be preferable across the board, because it will also catch mismatches before we run a potentially expensive query to create the table with new data.
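
A rough sketch of that zero-downtime swap on Postgres/Redshift; names and suffixes are illustrative, and the exact statements dbt emits may differ:

create table analytics.my_model__dbt_tmp as (
    select 1 as id  -- the new model SQL
);
alter table analytics.my_model rename to my_model__dbt_backup;
alter table analytics.my_model__dbt_tmp rename to my_model;
drop table analytics.my_model__dbt_backup;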

sungchun12 (Contributor, Author) commented:

I made the change by raising a compile error if the names and order of the columns don't match exactly in a global macro. I added 2 tests to prove this.

[screenshot]

{% endfor %}
)
{% endmacro %}

{% macro get_column_names() %}
{# loop through user_provided_columns to get column names #}
{%- set user_provided_columns = model['columns'] -%}
(
{% for i in user_provided_columns %}
{% set col = user_provided_columns[i] %}
{{ col['name'] }} {{ "," if not loop.last }}
{% endfor %}
)
{% endmacro %}