BigQuery - Unable to use the "set_sql_header" macro to declare variables for incremental models #2940

pcasteran · 2020-12-06T20:34:51Z

Describe the bug

I used the set_sql_header macro to declare a BigQuery scripting variable and initialize it so it can be used in the context of my model. The model materialization is of type incremental with incremental_strategy="insert_overwrite" and creates a partitioned table.
The initial run (i.e. not incremental) works fine. However a subsequent run (incremental) fails due to syntax error in the executed query.

Steps To Reproduce

Here is the model definition:

{{
  config(
    materialized="incremental",
    incremental_strategy="insert_overwrite",
    partition_by={
      "field": "insertion_time",
      "data_type": "timestamp"
    }
  )
}}

{% call set_sql_header(config) %}
DECLARE my_var INT64 DEFAULT 42;
{%- endcall %}

SELECT
  CURRENT_TIMESTAMP() AS insertion_time,
  my_var AS my_int_column

Here is the executed query for the initial run:

query

/* {"app": "dbt", "dbt_version": "0.18.1", ...} */

DECLARE my_var INT64 DEFAULT 42;

  create or replace table `<my_project>`.`<my_dataset>`.`my_incremental_model`
  partition by date(insertion_time)
  
  OPTIONS(
      description=""""""
    )
  as (
SELECT
  CURRENT_TIMESTAMP() AS insertion_time,
  my_var AS my_int_column
  );

And here is the executed query for the incremental run that fails:

query

/* {"app": "dbt", "dbt_version": "0.18.1", ...} */
 
      -- generated script to merge partitions into `<my_project>`.`<my_dataset>`.`my_incremental_model`
      declare dbt_partitions_for_replacement array<date>;
      declare _dbt_max_partition timestamp;

      set _dbt_max_partition = (
          select max(insertion_time) from `<my_project>`.`<my_dataset>`.`my_incremental_model`
      );

      -- 1. create a temp table
      
DECLARE my_var INT64 DEFAULT 42;

  create or replace table `<my_project>`.`<my_dataset>`.`my_incremental_model__dbt_tmp`
  partition by date(insertion_time)
  OPTIONS(
      expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour),
      description=""""""
    )
  as (
SELECT
  CURRENT_TIMESTAMP() AS insertion_time,
  my_var AS my_int_column
  );

<rest of the query omitted>

We can see that the definition and initialization of my variable happens after the ones for the _dbt_max_partition variable. Unfortunately, this is done in two steps for _dbt_max_partition : first a DECLARE statement then a SET statement.
However, as per the BigQuery documentation on variable definition:

Variable declarations must appear at the start of a script, prior to any other statements

Expected behavior

I should be able to declare variables that are available in my model, whether it's an initial or incremental run.

A simple solution to achieve this, could be merging the DECLARE and SET statements used for the _dbt_max_partition variable into one single statement (DECLARE + DEFAULT) as this:

declare _dbt_max_partition timestamp default(
          select max(insertion_time) from `<my_project>`.`<my_dataset>`.`my_incremental_model`
      );

I can submit a PR with that change if the proposed solution is accepted.

Screenshots and log output

Completed with 1 error and 0 warnings:

Database Error in model my_incremental_model (models/perso_test/my_incremental_model.sql)
  Variable declarations are allowed only at the start of a block or script at [14:1]

System information

Which database are you using dbt with?

The output of dbt --version:

installed version: 0.18.1
   latest version: 0.18.1

Up to date!

Plugins:
  - redshift: 0.18.1
  - postgres: 0.18.1
  - snowflake: 0.18.1
  - bigquery: 0.18.1

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2020-12-07T21:53:35Z

I knew someone would run into this one, someday.

My first thought was to suggest a new config, declare, that would accept an array of values and then makes sure to declare the user's variables (along with the materialization's) once, all together, at the top of the script.

The approach you've suggested is much better than that:

A simple solution to achieve this, could be merging the DECLARE and SET statements used for the _dbt_max_partition variable into one single statement (DECLARE + DEFAULT) as this:
declare _dbt_max_partition timestamp default(
          select max(insertion_time) from `<my_project>`.`<my_dataset>`.`my_incremental_model`
     );

We could take this one step further: #2596 introduced {{adapter.get_partitions_metadata(table)}}, a mechanism to get partition info via a legacy SQL metadata query (= free), rather than select max(partition_col) in standard SQL (= not free).

Here's what I'd love to see here:

{% set partition_metadata = adapter.get_partitions_metadata(target_relation) %}
{% set _dbt_max_partition = ... some agate/jinja magic to get the max value from partition_metadata ... %}
declare _dbt_max_partition {{ partition_by.data_type }} default '{{ _dbt_max_partition }}';

As far as improving the "dynamic" insert_overwrite incremental strategy, such an approach would both:

enable you to declare variables within set_sql_header
save you money

Is that code you're interested in contributing @pcasteran?

pcasteran · 2020-12-08T21:57:29Z

Yes @jtcohen6, I can have a look at it this weekend or maybe next week.

pcasteran added bug Something isn't working triage labels Dec 6, 2020

pcasteran changed the title ~~BigQuery - Unable to use the set_sql_header macro to declare variables for incremental models~~ BigQuery - Unable to use the "set_sql_header" macro to declare variables for incremental models Dec 6, 2020

jtcohen6 added good_first_issue Straightforward + self-contained changes, good for new contributors! bigquery and removed triage labels Dec 7, 2020

pcasteran mentioned this issue Dec 24, 2020

Fix _dbt_max_partition declaration and initialization for BigQuery incremental models #2976

Merged

4 tasks

jtcohen6 added this to the Margaret Mead milestone Jan 19, 2021

jtcohen6 closed this as completed in #2976 Feb 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BigQuery - Unable to use the "set_sql_header" macro to declare variables for incremental models #2940

BigQuery - Unable to use the "set_sql_header" macro to declare variables for incremental models #2940

pcasteran commented Dec 6, 2020 •

edited

Loading

jtcohen6 commented Dec 7, 2020 •

edited

Loading

pcasteran commented Dec 8, 2020

BigQuery - Unable to use the "set_sql_header" macro to declare variables for incremental models #2940

BigQuery - Unable to use the "set_sql_header" macro to declare variables for incremental models #2940

Comments

pcasteran commented Dec 6, 2020 • edited Loading

Describe the bug

Steps To Reproduce

Expected behavior

Screenshots and log output

System information

jtcohen6 commented Dec 7, 2020 • edited Loading

pcasteran commented Dec 8, 2020

pcasteran commented Dec 6, 2020 •

edited

Loading

jtcohen6 commented Dec 7, 2020 •

edited

Loading