Updates to Model Contracts (v1.5 betas) #3100

Merged (10 commits) on Mar 30, 2023
43 changes: 25 additions & 18 deletions website/docs/docs/collaborate/publish/model-contracts.md
---

:::info Beta functionality
This functionality is new in v1.5! The syntax is mostly locked, but some small details are still liable to change.
:::

## Related documentation

## Why define a contract?

Defining a dbt model is as easy as writing a SQL `select` statement. Your query naturally produces a dataset with columns of names and types based on the columns you select and the transformations you apply.

While this is ideal for quick and iterative development, for some models, constantly changing the shape of the returned dataset poses a risk when other people and processes are querying that model. It's better to define a set of upfront "guarantees" that describe the shape of your model. We call this set of guarantees a "contract." While building your model, dbt will verify that the model's transformation will produce a dataset matching its contract, or it will fail to build.

Let's say you have a model with a query like:

<File name='models/marts/dim_customers.sql'>

```sql
final as (

select
customer_id,
customer_name,
-- ... many more ...
from ...

)

select * from final
```

</File>

To enforce a model's contract, set `enforced: true` under the `contract` configuration.

When enforced, your contract _must_ include every column's `name` and `data_type` (where `data_type` matches one that your data platform understands).

If your model is materialized as `table` or `incremental`, and depending on your data platform, you may optionally specify additional [constraints](resource-properties/constraints), such as `not_null` (containing zero null values).

<File name="models/marts/customers.yml">

```yaml
models:
  - name: dim_customers
    config:
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null
      - name: customer_name
        data_type: string
      ...
```

</File>

When building a model with a defined contract, dbt will do two things differently:
1. dbt will run a "preflight" check to ensure that the model's query will return a set of columns with names and data types matching the ones you have defined. This check is agnostic to the order of columns specified in your model (SQL) or yaml spec.
2. dbt will include the column names, data types, and constraints in the DDL statements it submits to the data platform, which will be enforced while building or updating the model's table.
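For example (a hypothetical model body, assuming a `stg_customers` staging model exists), the SQL can select columns in a different order than the yaml spec lists them, and the preflight check still passes, because it matches on name and data type rather than position:

```sql
-- Hypothetical sketch: column order here is reversed relative to the yaml spec,
-- but the contract's preflight check matches columns by name + data_type,
-- not by position, so this still passes.
select
    customer_name,
    customer_id
from {{ ref('stg_customers') }}
```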

## FAQs

### Which models should have contracts?

Any model can define a contract. Defining contracts for "public" models that are being shared with other groups, teams, and (soon) dbt projects is especially important. For more, read about ["Model access"](model-access).

### How are contracts different from tests?

A model's contract defines the **shape** of the returned dataset.

[Tests](tests) are a more flexible mechanism for validating the content of your model. So long as you can write the query, you can run the test. Tests are also more configurable via `severity` and custom thresholds and are easier to debug after finding failures. The model has already been built, and the relevant records can be materialized in the data warehouse by [storing failures](resource-configs/store_failures).
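For instance (an illustrative sketch; the column name and values are hypothetical), a generic `accepted_values` test validates row contents in a way that a contract cannot:

```yml
models:
  - name: dim_customers
    columns:
      - name: customer_status
        tests:
          - accepted_values:
              values: ['active', 'churned', 'paused']
```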

As a parallel in software APIs: the structure of the API response is the contract. Quality and reliability ("uptime") are also very important attributes of an API, but they are not part of the contract per se. Only a change to the response's structure indicates a breaking change, requiring a version bump.

### Where are contracts supported?

At present, model contracts are supported for:
- SQL models (not yet Python)
- Models materialized as `table`, `view`, and `incremental` (with `on_schema_change: append_new_columns`)
- On the most popular data platforms — but which `constraints` are supported/enforced varies by platform
19 changes: 11 additions & 8 deletions website/docs/guides/migration/versions/02-upgrading-to-v1.5.md
---

:::info
v1.5 is currently available as a **beta prerelease**.
:::

### Resources

:::info

Planned release date: April 27, 2023

:::

dbt Core v1.5 is a feature release, with two significant additions planned:
1. Models as APIs &mdash; the first phase of [multi-project deployments](https://github.com/dbt-labs/dbt-core/discussions/6725)
2. An initial Python API for dbt-core supporting programmatic invocations at parity with the CLI.


### For consumers of dbt artifacts (metadata)

The manifest schema version will be updated to `v9`. Specific changes:
- Addition of `groups` as a top-level key
- Addition of `access` as a top-level node config for models
- Addition of `group` and `contract` as node configs

### For maintainers of adapter plugins

For more detailed information and to ask any questions, please visit [dbt-core/discussions/6624](https://github.com/dbt-labs/dbt-core/discussions/6624).
> **Review comment (author):** Stealing from #3052 - no reason to delay getting this info live


## New and changed documentation

:::info
More to come!
:::

### Publishing models as APIs
> **Review comment (author):** Some of the nomenclature here is liable to change, as we go through a formal internal naming process. No updates just yet for this PR; I expect to be ready with those updates before the RC (April 13).

- [Model contracts](model-contracts)
- [Model access](model-access)
- [Model versions](model-versions)

### dbt-core Python API
- Auto-generated documentation ([#2674](https://github.com/dbt-labs/docs.getdbt.com/issues/2674)) for dbt-core CLI & Python API for programmatic invocations
1 change: 1 addition & 0 deletions website/docs/reference/node-selection/methods.md
Because state comparison is complex, and everyone's project is different, dbt supports subselectors of `state:modified`:
- `state:modified.relation`: Changes to `database`/`schema`/`alias` (the database representation of this node), irrespective of `target` values or `generate_x_name` macros
- `state:modified.persisted_descriptions`: Changes to relation- or column-level `description`, _if and only if_ `persist_docs` is enabled at each level
- `state:modified.macros`: Changes to upstream macros (whether called directly or indirectly by another macro)
- `state:modified.contract`: Changes to a model's [contract](resource-configs/contract), which currently include the `name` and `data_type` of `columns`. Removing or changing the type of an existing column is considered a breaking change, and will raise an error.
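As a sketch of how this subselector can be used (the artifact path is illustrative), you could select just the models whose contracts have changed relative to a previous production run:

```shell
# List models whose contract changed vs. saved production artifacts
dbt ls --select state:modified.contract --state path/to/prod-artifacts

# Rebuild those models plus everything downstream of them
dbt build --select "state:modified.contract+" --state path/to/prod-artifacts
```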

Remember that `state:modified` includes _all_ of the criteria above, as well as some extra resource-specific criteria, such as modifying a source's `freshness` or `quoting` rules or an exposure's `maturity` property. (View the source code for the full set of checks used when comparing [sources](https://github.com/dbt-labs/dbt-core/blob/9e796671dd55d4781284d36c035d1db19641cd80/core/dbt/contracts/graph/parsed.py#L660-L681), [exposures](https://github.com/dbt-labs/dbt-core/blob/9e796671dd55d4781284d36c035d1db19641cd80/core/dbt/contracts/graph/parsed.py#L768-L783), and [executable nodes](https://github.com/dbt-labs/dbt-core/blob/9e796671dd55d4781284d36c035d1db19641cd80/core/dbt/contracts/graph/parsed.py#L319-L330).)

114 changes: 63 additions & 51 deletions website/docs/reference/resource-configs/contract.md
- [Defining `columns`](resource-properties/columns)
- [Defining `constraints`](resource-properties/constraints)


:::info Beta functionality
This functionality is new in v1.5! The syntax is mostly locked, but some small details are still liable to change.
:::

# Definition

When the `contract` configuration is enforced, dbt will ensure that your model's returned dataset exactly matches the attributes you have defined in yaml:
- `name` and `data_type` for every column
- additional [`constraints`](resource-properties/constraints), as supported for this materialization + data platform

The `data_type` defined in your yaml file should match a data type recognized by your data platform. dbt does not do any type aliasing itself; if your data platform recognizes both `int` and `integer` as corresponding to the same type, then they will return a match.

## Example

<File name='models/dim_customers.yml'>

```yml
models:
- name: dim_customers
config:
contract:
enforced: true
columns:
- name: customer_id
data_type: int
constraints:
- type: not_null
- name: customer_name
data_type: string
```

</File>

<File name='models/dim_customers.sql'>

Let's say your model is defined as:
```sql
select
'abc123' as customer_id,
'My Best Customer' as customer_name
```

</File>

When you `dbt run` your model, _before_ dbt has materialized it as a table in the database, you will see this error:
```txt
# example error message
Compilation Error in model dim_customers (models/dim_customers.sql)
Contracts are enabled for this model. Please ensure the name, data_type, and number of columns in your `yml` file match the columns in your SQL file.
Schema File Columns: customer_id INT, customer_name TEXT
SQL File Columns: customer_id TEXT, customer_name TEXT
```

## Support
> **Review comment (author):** This is a more-positive spin on "Limitations" :) but if you think that's a clearer way of expressing it, we can switch out one for the other (or for something else)
>
> **Review comment (contributor):** It looks good. Always more pleasing to the reader to put a positive spin on anything. "Don't apologize for keeping them waiting; thank them for their patience" type of experience.

At present, model contracts are supported for:
- SQL models (not yet Python)
- Models materialized as `table`, `view`, and `incremental` (with `on_schema_change: append_new_columns`)
- On the most popular data platforms — but which [`constraints`](resource-properties/constraints) are supported/enforced varies by platform

### Incremental models and `on_schema_change`

Why require that incremental models also set [`on_schema_change`](incremental-models#what-if-the-columns-of-my-incremental-model-change), and why to `append_new_columns`?

Imagine:
- You add a new column to both the SQL and the yaml spec
- You don't set `on_schema_change`, or you set `on_schema_change: 'ignore'`
- dbt doesn't actually add that new column to the existing table — and the upsert/merge still succeeds, because it does that upsert/merge on the basis of the already-existing "destination" columns only (this is long-established behavior)
- The result is a delta between the yaml-defined contract and the actual table in the database, which means the contract is now incorrect!

Why `append_new_columns`, rather than `sync_all_columns`? Because removing existing columns is a breaking change for contracted models!
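Putting these requirements together, a contracted incremental model's configuration might look like this (a sketch reusing the `dim_customers` example from above):

```yml
models:
  - name: dim_customers
    config:
      materialized: incremental
      on_schema_change: append_new_columns
      contract:
        enforced: true
```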

### Catching breaking changes

When you use the `state:modified` selection method in Slim CI, dbt will detect changes to model contracts, and raise an error if any of those changes could be breaking for downstream consumers.

Breaking changes include:
- Removing an existing column
- Changing the `data_type` of an existing column
- (Future) Removing or modifying one of the `constraints` on an existing column

```txt
dbt.exceptions.ModelContractError: Contract Error in model dim_customers (models/dim_customers.sql)
There is a breaking change in the model contract because column definitions have changed; you may need to create a new version. See: https://docs.getdbt.com/docs/collaborate/publish/model-versions
```

Adding new columns, or adding new constraints to existing columns, is not considered a breaking change.
4 changes: 3 additions & 1 deletion website/docs/reference/resource-properties/columns.md

Because columns are not resources, their `tags` and `meta` properties are not true configurations. They do not inherit the `tags` or `meta` values of their parent resources. However, you can select a generic test, defined on a column, using tags applied to its column or top-level resource; see [test selection examples](test-selection-examples#run-tests-on-tagged-columns).

Columns may optionally define a `data_type`, which is necessary for:
- Enforcing a model [contract](resource-configs/contract)
- Use in other packages or plugins, such as the [`external`](resource-properties/external) property of sources and [`dbt-external-tables`](https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/)