Merge branch 'current' into mckenna-kenney-test-smarter-part-2
joellabes authored Dec 5, 2024
2 parents f483572 + e6da85d commit f0e8261
Showing 25 changed files with 362 additions and 80 deletions.
2 changes: 1 addition & 1 deletion website/docs/docs/build/exposures.md
@@ -69,7 +69,7 @@ dbt test -s +exposure:weekly_jaffle_report

When we generate the dbt Explorer site, you'll see the exposure appear:
When we generate the [dbt Explorer site](/docs/collaborate/explore-projects), you'll see the exposure appear:

<Lightbox src="/img/docs/building-a-dbt-project/dbt-explorer-exposures.jpg" title="Exposures has a dedicated section, under the 'Resources' tab in dbt Explorer, which lists each exposure in your project."/>
<Lightbox src="/img/docs/building-a-dbt-project/dag-exposures.png" title="Exposures appear as nodes in the dbt Explorer DAG. It displays an orange 'EXP' indicator within the node. "/>
2 changes: 0 additions & 2 deletions website/docs/docs/build/hooks-operations.md
Expand Up @@ -40,8 +40,6 @@ Hooks are snippets of SQL that are executed at different times:

Hooks are a more advanced capability that enables you to run custom SQL and leverage database-specific actions, beyond what dbt makes available out of the box with standard materializations and configurations.

<Snippet path="hooks-to-grants" />

If (and only if) you can't leverage the [`grants` resource-config](/reference/resource-configs/grants), you can use `post-hook` to perform more advanced workflows:

* Need to apply `grants` in a more complex way, which the dbt Core `grants` config doesn't (yet) support.
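
For illustration, a `post-hook` that applies grants beyond what the `grants` config can express might look like the following sketch. The model, role, and share names are hypothetical, and the exact grant syntax depends on your warehouse (this assumes Snowflake-style grants):

```sql
{{ config(
    materialized='table',
    post_hook=[
        "grant select on table {{ this }} to role reporting_read_only",
        "grant select on table {{ this }} to share partner_share"
    ]
) }}

select * from {{ ref('stg_orders') }}
```
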
65 changes: 53 additions & 12 deletions website/docs/docs/build/incremental-microbatch.md
@@ -36,7 +36,7 @@ Each "batch" corresponds to a single bounded time period (by default, a single d

This is a powerful abstraction that makes it possible for dbt to run batches [separately](#backfills), concurrently, and [retry](#retry) them independently.

### Example
## Example

A `sessions` model aggregates and enriches data that comes from two other models:
- `page_views` is a large, time-series table. It contains many rows, new records almost always arrive after existing ones, and existing records rarely update. It uses the `page_view_start` column as its `event_time`.
@@ -175,22 +175,63 @@ It does not matter whether the table already contains data for that day. Given t

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_filters.png" title="Each batch of sessions filters page_views to the matching time-bound batch, but doesn't filter sessions, performing a full scan for each batch."/>
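
Concretely, when dbt builds the batch for 2024-10-01, the reference to `page_views` is wrapped in a filter on its `event_time` column. A sketch of the generated SQL (the schema name is a placeholder, and exact quoting and timestamp literals vary by adapter):

```sql
-- roughly what {{ ref('page_views') }} compiles to for the 2024-10-01 batch
select * from (
    select * from analytics.page_views
    where page_view_start >= '2024-10-01'
      and page_view_start < '2024-10-02'
) as page_views_filtered
```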

### Relevant configs
## Relevant configs

Several configurations are relevant to microbatch models, and some are required:

| Config | Type | Description | Default |
|----------|------|---------------|---------|
| [`event_time`](/reference/resource-configs/event-time) | Column (required) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A |
| `begin` | Date (required) | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A |
| `batch_size` | String (required) | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A |
| `lookback` | Integer (optional) | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` |
| Config | Description | Default | Type | Required |
|----------|---------------|---------|------|---------|
| [`event_time`](/reference/resource-configs/event-time) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A | Column | Required |
| `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/event_time.png" title="The event_time column configures the real-world time of this record"/>
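
Putting these together, a daily-grain microbatch model that uses all four configs might be declared like the following sketch (the model body and the `page_view_id` / `page_view_url` columns are illustrative; `lookback=3` reprocesses the three batches before the latest one on each run):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2023-10-01',
    batch_size='day',
    lookback=3
) }}

select
    page_view_id,
    page_view_start as session_start,
    page_view_url
from {{ ref('page_views') }}
```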

### Required configs for specific adapters

Some adapters require additional configurations for the microbatch strategy because each adapter implements it differently.

The following table lists the required configurations for the specific adapters, in addition to the standard microbatch configs:

| Adapter | `unique_key` config | `partition_by` config |
|----------|------------------|--------------------|
| [`dbt-postgres`](/reference/resource-configs/postgres-configs#incremental-materialization-strategies) | ✅ Required | N/A |
| [`dbt-spark`](/reference/resource-configs/spark-configs#incremental-models) | N/A | ✅ Required |
| [`dbt-bigquery`](/reference/resource-configs/bigquery-configs#merge-behavior-incremental-models) | N/A | ✅ Required |

For example, if you're using `dbt-postgres`, configure `unique_key` as follows:

<File name="models/sessions.sql">

```sql
-- unique_key is required for the microbatch strategy on dbt-postgres
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    unique_key='sales_id',
    event_time='transaction_date',
    begin='2023-01-01',
    batch_size='day'
) }}

select
    sales_id,
    transaction_date,
    customer_id,
    product_id,
    total_amount
from {{ source('sales', 'transactions') }}
```

</File>

In this example, `unique_key` is required because `dbt-postgres` microbatch uses the `merge` strategy, which needs a `unique_key` to identify which rows in the data warehouse need to get merged. Without a `unique_key`, dbt won't be able to match rows between the incoming batch and the existing table.
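
On `dbt-spark` and `dbt-bigquery`, each batch is written by replacing the matching partition(s) rather than merging on a key, which is why `partition_by` is required instead. A `dbt-bigquery` sketch of the same model (the partition granularity should generally match `batch_size`):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='transaction_date',
    begin='2023-01-01',
    batch_size='day',
    partition_by={
        'field': 'transaction_date',
        'data_type': 'date',
        'granularity': 'day'
    }
) }}

select
    sales_id,
    transaction_date,
    customer_id,
    product_id,
    total_amount
from {{ source('sales', 'transactions') }}
```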

### Full refresh

As a best practice, we recommend configuring `full_refresh: False` on microbatch models so that they ignore invocations with the `--full-refresh` flag. If you need to reprocess historical data, do so with a targeted backfill that specifies explicit start and end dates.
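
In a config block, that protection is a single extra line (a minimal sketch; the other configs mirror the examples above):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2023-10-01',
    batch_size='day',
    full_refresh=false
) }}
```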

### Usage
## Usage

**You must write your model query to process (read and return) exactly one "batch" of data**. This is a simplifying assumption and a powerful one:
- You don’t need to think about `is_incremental()` filtering
@@ -207,7 +248,7 @@ During standard incremental runs, dbt will process batches according to the curr

**Note:** If there’s an upstream model that configures `event_time`, but you *don’t* want the reference to it to be filtered, you can specify `ref('upstream_model').render()` to opt out of auto-filtering. This isn't generally recommended — most models that configure `event_time` are fairly large, and if the reference is not filtered, each batch will perform a full scan of this input table.
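
For example, a sketch of opting out for one input while keeping batch filtering on another (the `upstream_model` reference and the column names are placeholders):

```sql
with batched_views as (
    -- auto-filtered to the current batch, because page_views configures event_time
    select * from {{ ref('page_views') }}
),

reference_data as (
    -- .render() opts out of auto-filtering; every batch scans the full table
    select * from {{ ref('upstream_model').render() }}
)

select
    batched_views.page_view_id,
    batched_views.page_view_start,
    reference_data.attribute_value
from batched_views
left join reference_data
    on batched_views.lookup_key = reference_data.lookup_key
```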

### Backfills
## Backfills

Whether to fix erroneous source data or retroactively apply a change in business logic, you may need to reprocess a large amount of historical data.

Expand All @@ -222,13 +263,13 @@ dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_backfill.png" title="Configure a lookback to reprocess additional batches during standard incremental runs"/>

### Retry
## Retry

If one or more of your batches fail, you can use `dbt retry` to reprocess _only_ the failed batches.

![Partial retry](https://github.com/user-attachments/assets/f94c4797-dcc7-4875-9623-639f70c97b8f)

### Timezones
## Timezones

For now, dbt assumes that all values supplied are in UTC:
