New best practice guide for clone #4542

Merged
merged 16 commits into from
Nov 30, 2023
Changes from 12 commits
86 changes: 86 additions & 0 deletions website/docs/best-practices/clone-incremental-models.md
@@ -0,0 +1,86 @@
---
title: "Clone incremental models as the first step of your CI job"
id: "clone-incremental-models"
description: Learn how to clone incremental models as the first step of your CI job.
displayText: Clone incremental models as the first step of your CI job
hoverSnippet: Learn how to clone incremental models for CI jobs.
---

Before you begin, be aware of a few conditions:
- `dbt clone` is only available with dbt version 1.6 and newer. Refer to our [upgrade guide](/docs/dbt-versions/upgrade-core-in-cloud) for help enabling newer versions in dbt Cloud.
- This strategy only works for warehouses that support zero-copy cloning (otherwise, `dbt clone` will just create pointer views).
- Some teams may want to test that their incremental models run in both incremental mode and full-refresh mode.

Imagine you've created a [Slim CI job](/docs/deploy/continuous-integration) in dbt Cloud and it is configured to:

- Defer to your production environment.
- Run the command `dbt build --select state:modified+` to run and test all of the models you've modified and their downstream dependencies.
- Trigger whenever a developer on your team opens a PR against the main branch.

<Lightbox src="/img/best-practices/slim-ci-job.png" width="70%" title="Example of a slim CI job with the above configurations" />

Now imagine your dbt project looks something like this in the DAG:

<Lightbox src="/img/best-practices/dag-example.png" width="70%" title="Sample project DAG" />

When you open a pull request (PR) that modifies `dim_wizards`, your CI job will kick off and build _only the modified models and their downstream dependencies_ (in this case, `dim_wizards` and `fct_orders`) into a temporary schema that's unique to your PR.

This build mimics the behavior of what will happen once the PR is merged into the main branch. It ensures you're not introducing breaking changes, without needing to build your entire dbt project.

## What happens when one of the modified models (or one of their downstream dependencies) is an incremental model?

Because your CI job builds modified models into a PR-specific schema, the first execution of `dbt build --select state:modified+` builds the modified incremental model in its entirety: the model does _not yet exist in the PR-specific schema_, so [is_incremental will be false](/docs/building-a-dbt-project/building-models/configuring-incremental-models#understanding-the-is_incremental-macro). In effect, you're running in `full-refresh` mode.
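
To make this behavior concrete, here is a minimal sketch of an incremental model. It assumes a hypothetical incremental `fct_orders` built from a staging model named `stg_orders` with `order_id`, `ordered_at`, and `amount` columns; none of these details come from the sample project above. The `is_incremental()` block only applies when the target table already exists in the schema being built into and `--full-refresh` isn't passed, so on the first CI run in a fresh PR schema the filter is skipped and the whole table is rebuilt.

```sql
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

select
    order_id,
    ordered_at,
    amount
from {{ ref('stg_orders') }}  -- hypothetical upstream model

{% if is_incremental() %}
-- Rendered only when the target table already exists and --full-refresh is not set
where ordered_at > (select max(ordered_at) from {{ this }})
{% endif %}
```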

This can be suboptimal because:
- Incremental models are typically your largest datasets, so building them in their entirety takes a long time, which slows down development and can incur high warehouse costs.
- There are situations where a `full-refresh` of the incremental model passes successfully in your CI job, but an _incremental_ build of that same table in prod would fail when the PR is merged into main (think schema drift where [on_schema_change](/docs/build/incremental-models#what-if-the-columns-of-my-incremental-model-change) config is set to `fail`).

You can alleviate these problems by zero-copy cloning the relevant, pre-existing incremental models into your PR-specific schema as the first step of the CI job using the `dbt clone` command. This way, the incremental models already exist in the PR-specific schema when you first execute the command `dbt build --select state:modified+`, so the `is_incremental` macro will return `true`.

You'll have two commands for your dbt Cloud CI check to execute:
1. Clone all of the pre-existing incremental models that have been modified or are downstream of another model that has been modified: `dbt clone --select state:modified+,config.materialized:incremental,state:old`
2. Build all of the models that have been modified and their downstream dependencies: `dbt build --select state:modified+`

Because of your first clone step, the incremental models selected in your `dbt build` on the second step will run in incremental mode.

<Lightbox src="/img/best-practices/clone-command.png" width="70%" title="Clone command in the CI config" />

Your CI jobs will run faster, and you're more accurately mimicking the behavior of what will happen once the PR has been merged into main.

## Additional help

**Relevant `dbt clone` Slack thread:** https://dbt-labs.slack.com/archives/C05FWBP9X1U/p1692830261651829

Collaborator comment: This is an internal Slack thread and would not be relevant to readers. It was only on the issue to add more color to what content we should include, so we should cut this.


### From the "Better CI for better data quality" Coalesce talk

Collaborator comment: Do we feel like this chunk adds any value? If it's just repetitive, happy to cut it!

"If you use the incremental materialization in your dbt project, you should consider cloning your relevant, pre-existing incremental models into your PR-specific schema as the first step of your CI check. This will force your second step to run in incremental mode (where is_incremental is true) because now the models already exist in your PR-specific schema (via cloning). This is beneficial because it more accurately mimics what will happen when you merge your changes into production and it will save time and money by not rebuilding your incremental models (which are often large data sets) from scratch for every PR that modifies them." -Grace Goheen, dbt Labs Product Manager

### Expansion on "think schema drift where [on_schema_change](/docs/build/incremental-models#what-if-the-columns-of-my-incremental-model-change) config is set to `fail`" from above

Imagine you have an incremental model `my_incremental_model` with the following config:

```sql
{{
    config(
        materialized='incremental',
        unique_key='unique_id',
        on_schema_change='fail'
    )
}}
```

Now, let’s say you open up a PR that adds a new column to `my_incremental_model`. In this case:
- An incremental build will fail.
- A `full-refresh` will succeed.
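
As an illustration, assume `my_incremental_model` selects from a hypothetical upstream model `stg_events` with `unique_id` and `updated_at` columns, and the PR adds a column called `new_column`. All of these names are placeholders, not taken from a real project. A sketch of the model after the change might look like this:

```sql
{{
    config(
        materialized='incremental',
        unique_key='unique_id',
        on_schema_change='fail'
    )
}}

select
    unique_id,
    updated_at,
    new_column  -- added in the PR; the existing table does not have this column
from {{ ref('stg_events') }}  -- hypothetical upstream model

{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

On an incremental run, the columns of this query no longer match the existing table, and `on_schema_change='fail'` turns that mismatch into an error. On a full refresh, the table is rebuilt from scratch, so there is no existing schema to conflict with.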

If you have a daily production job that just executes `dbt build` without a `--full-refresh` flag, you will get a failure once the PR is merged into main and the job kicks off. So the question is: what do you want to happen in CI?
- Do you want to also get a failure in CI, so that you know that once this PR is merged into main you need to immediately execute a `dbt build --full-refresh --select my_incremental_model` in production in order to avoid a failure in prod? This will block your CI check from passing.
- Do you want your CI check to succeed, because once you do run a `full-refresh` for this model in prod you will be in a successful state? This may lead to unpleasant surprises: if you don’t remember that you need to execute a `dbt build --full-refresh --select my_incremental_model` in production, your production job will suddenly start failing after you merge this PR into main.

There’s probably no perfect solution here; it’s all just tradeoffs! Our preference would be to let the CI job fail and manually override the blocking branch protection rule. That way there are no surprises, and we can proactively run the appropriate command in production once the PR is merged.

### Expansion on "why `state:old`"

For brand new incremental models, you want them to run in `full-refresh` mode in CI, because they will run in `full-refresh` mode in production when the PR is merged into `main`. They also don't exist yet in the production environment... they're brand new!
If you don't specify `state:old`, you won't get an error, just a “No relation found in state manifest for…” message. So it technically works without `state:old`, but adding `state:old` is more explicit and means dbt won't even try to clone the brand new incremental models.
1 change: 1 addition & 0 deletions website/sidebars.js
@@ -1059,6 +1059,7 @@ const sidebarSettings = {
"best-practices/materializations/materializations-guide-7-conclusion",
],
},
"best-practices/clone-incremental-models",
"best-practices/writing-custom-generic-tests",
"best-practices/best-practice-workflows",
"best-practices/dbt-unity-catalog-best-practices",