Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQuery - Unable to use the "set_sql_header" macro to declare variables for incremental models #2940

Closed
1 of 5 tasks
pcasteran opened this issue Dec 6, 2020 · 2 comments · Fixed by #2976
Closed
1 of 5 tasks
Labels
bigquery bug Something isn't working good_first_issue Straightforward + self-contained changes, good for new contributors!
Milestone

Comments

@pcasteran
Copy link
Contributor

pcasteran commented Dec 6, 2020

Describe the bug

I used the set_sql_header macro to declare a BigQuery scripting variable and initialize it so it can be used in the context of my model. The model materialization is of type incremental with incremental_strategy="insert_overwrite" and creates a partitioned table.
The initial run (i.e. not incremental) works fine. However a subsequent run (incremental) fails due to syntax error in the executed query.

Steps To Reproduce

Here is the model definition:

{{
  config(
    materialized="incremental",
    incremental_strategy="insert_overwrite",
    partition_by={
      "field": "insertion_time",
      "data_type": "timestamp"
    }
  )
}}

{% call set_sql_header(config) %}
DECLARE my_var INT64 DEFAULT 42;
{%- endcall %}

SELECT
  CURRENT_TIMESTAMP() AS insertion_time,
  my_var AS my_int_column

Here is the executed query for the initial run:

query
/* {"app": "dbt", "dbt_version": "0.18.1", ...} */

DECLARE my_var INT64 DEFAULT 42;

  create or replace table `<my_project>`.`<my_dataset>`.`my_incremental_model`
  partition by date(insertion_time)
  
  OPTIONS(
      description=""""""
    )
  as (
SELECT
  CURRENT_TIMESTAMP() AS insertion_time,
  my_var AS my_int_column
  );

And here is the executed query for the incremental run that fails:

query
/* {"app": "dbt", "dbt_version": "0.18.1", ...} */
 
      -- generated script to merge partitions into `<my_project>`.`<my_dataset>`.`my_incremental_model`
      declare dbt_partitions_for_replacement array<date>;
      declare _dbt_max_partition timestamp;

      set _dbt_max_partition = (
          select max(insertion_time) from `<my_project>`.`<my_dataset>`.`my_incremental_model`
      );

      -- 1. create a temp table
      
DECLARE my_var INT64 DEFAULT 42;

  create or replace table `<my_project>`.`<my_dataset>`.`my_incremental_model__dbt_tmp`
  partition by date(insertion_time)
  OPTIONS(
      expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour),
      description=""""""
    )
  as (
SELECT
  CURRENT_TIMESTAMP() AS insertion_time,
  my_var AS my_int_column
  );

<rest of the query omitted> 

We can see that the definition and initialization of my variable happens after the ones for the _dbt_max_partition variable. Unfortunately, this is done in two steps for _dbt_max_partition : first a DECLARE statement then a SET statement.
However, as per the BigQuery documentation on variable definition:

Variable declarations must appear at the start of a script, prior to any other statements

Expected behavior

I should be able to declare variables that are available in my model, whether it's an initial or incremental run.

A simple solution to achieve this, could be merging the DECLARE and SET statements used for the _dbt_max_partition variable into one single statement (DECLARE + DEFAULT) as this:

declare _dbt_max_partition timestamp default(
          select max(insertion_time) from `<my_project>`.`<my_dataset>`.`my_incremental_model`
      );

I can submit a PR with that change if the proposed solution is accepted.

Screenshots and log output

Completed with 1 error and 0 warnings:

Database Error in model my_incremental_model (models/perso_test/my_incremental_model.sql)
  Variable declarations are allowed only at the start of a block or script at [14:1]

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.18.1
   latest version: 0.18.1

Up to date!

Plugins:
  - redshift: 0.18.1
  - postgres: 0.18.1
  - snowflake: 0.18.1
  - bigquery: 0.18.1
@pcasteran pcasteran added bug Something isn't working triage labels Dec 6, 2020
@pcasteran pcasteran changed the title BigQuery - Unable to use the set_sql_header macro to declare variables for incremental models BigQuery - Unable to use the "set_sql_header" macro to declare variables for incremental models Dec 6, 2020
@jtcohen6
Copy link
Contributor

jtcohen6 commented Dec 7, 2020

I knew someone would run into this one, someday.

My first thought was to suggest a new config, declare, that would accept an array of values and then makes sure to declare the user's variables (along with the materialization's) once, all together, at the top of the script.

The approach you've suggested is much better than that:

A simple solution to achieve this, could be merging the DECLARE and SET statements used for the _dbt_max_partition variable into one single statement (DECLARE + DEFAULT) as this:

declare _dbt_max_partition timestamp default(
          select max(insertion_time) from `<my_project>`.`<my_dataset>`.`my_incremental_model`
     );

We could take this one step further: #2596 introduced {{adapter.get_partitions_metadata(table)}}, a mechanism to get partition info via a legacy SQL metadata query (= free), rather than select max(partition_col) in standard SQL (= not free).

Here's what I'd love to see here:

{% set partition_metadata = adapter.get_partitions_metadata(target_relation) %}
{% set _dbt_max_partition = ... some agate/jinja magic to get the max value from partition_metadata ... %}
declare _dbt_max_partition {{ partition_by.data_type }} default '{{ _dbt_max_partition }}';

As far as improving the "dynamic" insert_overwrite incremental strategy, such an approach would both:

  • enable you to declare variables within set_sql_header
  • save you money

Is that code you're interested in contributing @pcasteran?

@jtcohen6 jtcohen6 added good_first_issue Straightforward + self-contained changes, good for new contributors! bigquery and removed triage labels Dec 7, 2020
@pcasteran
Copy link
Contributor Author

Yes @jtcohen6, I can have a look at it this weekend or maybe next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery bug Something isn't working good_first_issue Straightforward + self-contained changes, good for new contributors!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants