Rationalize incremental materialization #141

Merged · 16 commits · Feb 19, 2021
Changes from 6 commits
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,10 @@

### Breaking changes
- Users of the `http` and `thrift` connection methods need to install extra requirements: `pip install dbt-spark[PyHive]` ([#109](https://github.com/fishtown-analytics/dbt-spark/pull/109), [#126](https://github.com/fishtown-analytics/dbt-spark/pull/126))
- Incremental models have `incremental_strategy: append` by default. This strategy inserts new records
without updating or overwriting existing ones. To update or overwrite existing records, use `merge` or `insert_overwrite` instead, depending
on the file format, connection method, and attributes of your underlying data (a sketch of restoring the previous default follows below). dbt will try to raise a helpful error
if you configure a strategy that is not supported for a given file format or connection. ([#140](https://github.com/fishtown-analytics/dbt-spark/pull/140), [#141](https://github.com/fishtown-analytics/dbt-spark/pull/141))
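To keep the previous behavior, set the strategy explicitly in the model config. A minimal sketch, using a hypothetical model:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by=['date_day'],
    file_format='parquet'
) }}

-- hypothetical model body; partitions returned here replace existing ones
select * from {{ ref('events') }}
```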

### Under the hood
- Enable `CREATE OR REPLACE` support when using Delta. Instead of dropping and recreating the table, dbt keeps the existing table and adds a new version, as supported by Delta. The table stays available while the pipeline runs, and its history remains trackable (see the sketch below).
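A rough illustration of the idea, with hypothetical table names (Delta Lake's `CREATE OR REPLACE TABLE` writes a new table version rather than dropping the old one):

```sql
-- Readers continue to see the previous version until this statement completes,
-- and earlier versions remain queryable through Delta's history / time travel.
create or replace table analytics.my_model
using delta
as select * from analytics.stg_events;
```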
47 changes: 33 additions & 14 deletions README.md
@@ -161,48 +161,67 @@ The following configurations can be supplied to models run with the dbt-spark plugin:
| partition_by | Partition the created table by the specified columns. A directory is created for each partition. | Optional | `partition_1` |
| clustered_by | Each partition in the created table will be split into a fixed number of buckets by the specified columns. | Optional | `cluster_1` |
| buckets | The number of buckets to create while clustering | Required if `clustered_by` is specified | `8` |
| incremental_strategy | The strategy to use for incremental models (`append`, `insert_overwrite`, or `merge`). | Optional (default: `append`) | `merge` |
| persist_docs | Whether dbt should include the model description as a table `comment` | Optional | `{'relation': true}` |


**Incremental Models**

dbt has a number of ways to build models incrementally, called "incremental strategies." Some strategies depend on certain file formats, connection types, and other model configurations:
- `append` (default): Insert new records without updating or overwriting any existing data.
- `insert_overwrite`: If `partition_by` is specified, overwrite partitions in the table with new data. (Be sure to re-select _all_ of the relevant data for a partition.) If no `partition_by` is specified, overwrite the entire table with new data. [Cannot be used with `file_format: delta`. Not available on Databricks SQL Endpoints. For atomic replacement of Delta tables, use the `table` materialization.]
- `merge`: Match records based on a `unique_key`; update old records, insert new ones. (If no `unique_key` is specified, all new data is inserted, similar to `append`.) [Requires `file_format: delta`. Available only on Databricks Runtime.]

Examples:

```sql
{{ config(
materialized='incremental',
incremental_strategy='append'
) }}


-- All rows returned by this query will be appended to the existing table

select * from {{ ref('events') }}
{% if is_incremental() %}
where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by=['date_day'],
    file_format='parquet'
) }}

-- Every partition returned by this query will overwrite existing partitions

select
    date_day,
    count(*) as users

from {{ ref('events') }}
{% if is_incremental() %}
where date_day > (select max(date_day) from {{ this }})
{% endif %}
group by 1
```

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    partition_by=['date_day'],
    unique_key='event_id',
    file_format='delta'
) }}

-- Existing events, matched on `event_id`, will be updated
-- New events will be appended

select * from {{ ref('events') }}
{% if is_incremental() %}
where date_day > (select max(date_day) from {{ this }})
{% endif %}
```
89 changes: 55 additions & 34 deletions dbt/include/spark/macros/materializations/incremental.sql
@@ -1,5 +1,5 @@
{% macro get_insert_overwrite_sql(source_relation, target_relation) %}

{%- set dest_columns = adapter.get_columns_in_relation(target_relation) -%}
{%- set dest_cols_csv = dest_columns | map(attribute='quoted') | join(', ') -%}
insert overwrite table {{ target_relation }}
@@ -8,6 +8,17 @@

{% endmacro %}


{% macro get_insert_into_sql(source_relation, target_relation) %}

{%- set dest_columns = adapter.get_columns_in_relation(target_relation) -%}
{%- set dest_cols_csv = dest_columns | map(attribute='quoted') | join(', ') -%}
insert into table {{ target_relation }}
select {{dest_cols_csv}} from {{ source_relation.include(database=false, schema=false) }}

{% endmacro %}
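For reference, and not part of the diff, the two insert macros render to statements roughly like the following. The relation names are hypothetical, and the partition clause on the overwrite form is an assumption, since that part of the macro is collapsed in this view:

```sql
-- get_insert_into_sql: plain append of the staged rows
insert into table analytics.my_model
select `event_id`, `event_ts` from my_model__dbt_tmp;

-- get_insert_overwrite_sql: replaces only the partitions produced by the query
-- (with spark.sql.sources.partitionOverwriteMode = DYNAMIC)
insert overwrite table analytics.my_model
partition (date_day)
select `event_id`, `event_ts`, `date_day` from my_model__dbt_tmp;
```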


{% macro dbt_spark_validate_get_file_format() %}
{#-- Find and validate the file format #}
{%- set file_format = config.get("file_format", default="parquet") -%}
Expand All @@ -24,59 +35,79 @@
{% do return(file_format) %}
{% endmacro %}


{% macro dbt_spark_validate_get_incremental_strategy(file_format) %}
{#-- Find and validate the incremental strategy #}
{%- set strategy = config.get("incremental_strategy", default="insert_overwrite") -%}
{%- set strategy = config.get("incremental_strategy", default="append") -%}

{% set invalid_strategy_msg -%}
Invalid incremental strategy provided: {{ strategy }}
Expected one of: 'append', 'merge', 'insert_overwrite'
{%- endset %}

{% set invalid_merge_msg -%}
Invalid incremental strategy provided: {{ strategy }}
You can only choose this strategy when file_format is set to 'delta'
{%- endset %}

{% set invalid_insert_overwrite_delta_msg -%}
Invalid incremental strategy provided: {{ strategy }}
You cannot use this strategy when file_format is set to 'delta'
Use the 'append' or 'merge' strategy instead
{%- endset %}

{% set invalid_insert_overwrite_endpoint_msg -%}
Invalid incremental strategy provided: {{ strategy }}
You cannot use this strategy when connecting via endpoint
Use the 'append' or 'merge' strategy instead
{%- endset %}

{% if strategy not in ['append', 'merge', 'insert_overwrite'] %}
{% do exceptions.raise_compiler_error(invalid_strategy_msg) %}
{%-else %}
{% if strategy == 'merge' and file_format != 'delta' %}
{% do exceptions.raise_compiler_error(invalid_merge_msg) %}
{% endif %}
{% if strategy == 'insert_overwrite' and file_format == 'delta' %}
{% do exceptions.raise_compiler_error(invalid_insert_overwrite_delta_msg) %}
{% endif %}
{% if strategy == 'insert_overwrite' and target.endpoint %}
{% do exceptions.raise_compiler_error(invalid_insert_overwrite_endpoint_msg) %}
{% endif %}
{% endif %}

{% do return(strategy) %}
{% endmacro %}

{% macro dbt_spark_validate_merge(file_format) %}
{% set invalid_file_format_msg -%}
You can only choose the 'merge' incremental_strategy when file_format is set to 'delta'
{%- endset %}

{% if file_format != 'delta' %}
{% do exceptions.raise_compiler_error(invalid_file_format_msg) %}
{% endif %}

{% endmacro %}


{% macro spark__get_merge_sql(target, source, unique_key, dest_columns, predicates=none) %}
{# ignore dest_columns - we will just use `*` #}

{% set merge_condition %}
{% if unique_key %}
on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
{% else %}
on false
Contributor: This feels awkward, why not just use an INSERT INTO statement?

Contributor Author: This is in line with default dbt behavior here.

As I see it, the only reason we implement spark__get_merge_sql is to benefit from the improved wildcard `update *` / `insert *`.

I agree it feels a bit silly; this is identical to the append strategy / INSERT INTO statement. The alternative is to keep requiring a unique_key with merge. In any case, I'd rather err closer to the side of default behavior.

{% endif %}
{% endset %}

merge into {{ target }} as DBT_INTERNAL_DEST
using {{ source.include(schema=false) }} as DBT_INTERNAL_SOURCE
{{ merge_condition }}
when matched then update set *
when not matched then insert *
{% endmacro %}
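For illustration of the thread above, and not part of the diff, the rendered statement in the two cases looks roughly like this (relation names are hypothetical):

```sql
-- With unique_key='event_id': matched rows are updated, new rows inserted
merge into analytics.events as DBT_INTERNAL_DEST
using events__dbt_tmp as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.event_id = DBT_INTERNAL_DEST.event_id
when matched then update set *
when not matched then insert *;

-- Without a unique_key: `on false` means nothing ever matches, so every
-- source row falls through to the insert clause (same effect as append)
merge into analytics.events as DBT_INTERNAL_DEST
using events__dbt_tmp as DBT_INTERNAL_SOURCE
on false
when matched then update set *
when not matched then insert *;
```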


{% macro dbt_spark_get_incremental_sql(strategy, source, target, unique_key) %}
{%- if strategy == 'append' -%}
{#-- insert new records into existing table, without updating or overwriting #}
{{ get_insert_into_sql(source, target) }}
{%- elif strategy == 'insert_overwrite' -%}
{#-- insert statements don't like CTEs, so support them via a temp view #}
{{ get_insert_overwrite_sql(source, target) }}
{%- elif strategy == 'merge' -%}
{#-- merge all columns with databricks delta - schema changes are handled for us #}
{{ get_merge_sql(target, source, unique_key, dest_columns=none, predicates=none) }}
{%- endif -%}
Contributor: Maybe raise an error if it doesn't match any of the cases?

Contributor Author: Good call. It's already raised earlier on:
https://github.com/fishtown-analytics/dbt-spark/blob/c8e3770e077e8c54026156b14e61133ef59fa7ff/dbt/include/spark/macros/materializations/incremental.sql#L65-L66

Just the same, I'll add another explicit exception here, just in case users override some macros but not others.
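That explicit exception is not in this commit yet; a minimal sketch of what such a defensive fallback could look like, assuming the dispatch macro keeps the structure above:

```sql
{% macro dbt_spark_get_incremental_sql(strategy, source, target, unique_key) %}
  {%- if strategy == 'append' -%}
    {{ get_insert_into_sql(source, target) }}
  {%- elif strategy == 'insert_overwrite' -%}
    {{ get_insert_overwrite_sql(source, target) }}
  {%- elif strategy == 'merge' -%}
    {{ get_merge_sql(target, source, unique_key, dest_columns=none, predicates=none) }}
  {%- else -%}
    {#-- guard against overridden macros passing an unvalidated strategy #}
    {% do exceptions.raise_compiler_error('Invalid incremental strategy: ' ~ strategy) %}
  {%- endif -%}
{% endmacro %}
```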


@@ -85,31 +116,21 @@

{% materialization incremental, adapter='spark' -%}
{#-- Validate early so we don't run SQL if the file_format is invalid --#}
{%- set file_format = dbt_spark_validate_get_file_format() -%}
{#-- Validate early so we don't run SQL if the strategy is invalid --#}
{%- set strategy = dbt_spark_validate_get_incremental_strategy(file_format) -%}
{%- set unique_key = config.get('unique_key', none) -%}

{%- set full_refresh_mode = (flags.FULL_REFRESH == True) -%}

{% set target_relation = this %}
{% set existing_relation = load_relation(this) %}
{% set tmp_relation = make_temp_relation(this) %}

{% if strategy == 'insert_overwrite' and config.get('partition_by') %}
{% call statement() %}
set spark.sql.sources.partitionOverwriteMode = DYNAMIC
{% endcall %}
{% endif %}

{% call statement() %}
set spark.sql.hive.convertMetastoreParquet = false
Contributor: Good to see this good :)

{% endcall %}

{{ run_hooks(pre_hooks) }}

{% if existing_relation is none %}
11 changes: 0 additions & 11 deletions test/integration/spark-databricks-http.dbtspec
@@ -9,16 +9,6 @@ target:
connect_retries: 5
connect_timeout: 60
projects:
- overrides: incremental
paths:
"models/incremental.sql":
materialized: incremental
body: "select * from {{ source('raw', 'seed') }}"
facts:
base:
rowcount: 10
added:
rowcount: 20
- overrides: snapshot_strategy_check_cols
dbt_project_yml: &file_format_delta
# we're going to UPDATE the seed tables as part of testing, so we must make them delta format
@@ -40,4 +30,3 @@
test_dbt_data_test: data_test
test_dbt_ephemeral_data_tests: data_test_ephemeral_models
test_dbt_schema_test: schema_test

10 changes: 0 additions & 10 deletions test/integration/spark-databricks-odbc-cluster.dbtspec
@@ -10,16 +10,6 @@ target:
connect_retries: 5
connect_timeout: 60
projects:
- overrides: incremental
paths:
"models/incremental.sql":
materialized: incremental
body: "select * from {{ source('raw', 'seed') }}"
facts:
base:
rowcount: 10
added:
rowcount: 20
- overrides: snapshot_strategy_check_cols
dbt_project_yml: &file_format_delta
# we're going to UPDATE the seed tables as part of testing, so we must make them delta format
10 changes: 0 additions & 10 deletions test/integration/spark-databricks-odbc-sql-endpoint.dbtspec
@@ -10,16 +10,6 @@ target:
connect_retries: 5
connect_timeout: 60
projects:
- overrides: incremental
paths:
"models/incremental.sql":
materialized: incremental
body: "select * from {{ source('raw', 'seed') }}"
facts:
base:
rowcount: 10
added:
rowcount: 20
- overrides: snapshot_strategy_check_cols
dbt_project_yml: &file_format_delta
# we're going to UPDATE the seed tables as part of testing, so we must make them delta format
11 changes: 0 additions & 11 deletions test/integration/spark-thrift.dbtspec
@@ -7,17 +7,6 @@ target:
connect_retries: 5
connect_timeout: 60
schema: "analytics_{{ var('_dbt_random_suffix') }}"
projects:
- overrides: incremental
paths:
"models/incremental.sql":
materialized: incremental
body: "select * from {{ source('raw', 'seed') }}"
facts:
base:
rowcount: 10
added:
rowcount: 20
sequences:
test_dbt_empty: empty
test_dbt_base: base