Rationalize incremental materialization #141
Conversation
Force-pushed from 50092a0 to 6f7e1f2
Unfortunately, I don't think it is possible with the current impl. The adapter config does not know anything about the connection when it is created :(
Thanks for looking into it @kwigley! I suppose this would be a compelling thing that we could do, if we split off the Databricks pieces of ...
Thanks for unraveling this @jtcohen6. Looking good. I think the most important thing is that we make sure that we test the edge cases. Would love to dive into this a bit deeper, but I'm currently a bit swamped with work :)
{% endif %}

{% call statement() %}
  set spark.sql.hive.convertMetastoreParquet = false
Good to see this go :)
{%- else -%}
{#-- merge all columns with databricks delta - schema changes are handled for us #}
{%- elif strategy == 'merge' -%}
{#-- merge all columns with databricks delta - schema changes are handled for us #}
  {{ get_merge_sql(target, source, unique_key, dest_columns=none, predicates=none) }}
{%- endif -%}
Maybe raise an error if it doesn't match any of the cases?
Good call. It's already raised earlier on:
https://github.com/fishtown-analytics/dbt-spark/blob/c8e3770e077e8c54026156b14e61133ef59fa7ff/dbt/include/spark/macros/materializations/incremental.sql#L65-L66
Just the same, I'll add another explicit exception here, just in case users override some macros but not others.
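For illustration, a minimal sketch of what that explicit exception could look like in the macro's final branch, assuming dbt's `exceptions.raise_compiler_error` helper (the exact message and placement in the merged code may differ):

```sql
{%- else -%}
  {#-- no SQL for this strategy: fail loudly rather than silently emitting nothing --#}
  {% set no_sql_for_strategy_msg -%}
    No known SQL for the incremental strategy provided: {{ strategy }}
  {%- endset %}
  {%- do exceptions.raise_compiler_error(no_sql_for_strategy_msg) -%}
{%- endif -%}
```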
{% if unique_key %}
    on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
{% else %}
    on false
This feels awkward, why not just use an `INSERT INTO` statement?
This is in line with default dbt behavior here. As I see it, the only reason we implement `spark__get_merge_sql` is to benefit from the improved wildcard `update *`/`insert *`. I agree it feels a bit silly: this is identical to the `append` strategy / `INSERT INTO` statement. The alternative is to keep requiring a `unique_key` with `merge`. In any case, I'd rather err closer to the side of default behavior.
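For anyone following along, a rough sketch (not copied from the macro; table and column names are made up) of the two shapes of merge this produces on Delta:

```sql
-- with a unique_key: matched rows are updated, new rows are inserted
merge into analytics.my_model as DBT_INTERNAL_DEST
using my_model__dbt_tmp as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id
when matched then update set *
when not matched then insert *;

-- without a unique_key: the join condition never matches, so every source row
-- is inserted, which is effectively the same as append / INSERT INTO
merge into analytics.my_model as DBT_INTERNAL_DEST
using my_model__dbt_tmp as DBT_INTERNAL_SOURCE
on false
when not matched then insert *;
```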
Thanks for the review @Fokko! I just reorganized these macro files a bit as well.
I completely agree. I have a "suite" of "test cases" in a local project that I used to work through all the edge cases:
I'm struggling a bit with how to implement this in a more-automated fashion:
FWIW, we don't have automated tests today for incremental strategies, either, so I think this still constitutes a step in the right direction.
As discussed with @kwigley yesterday, the ideal type of test to write for this kind of functionality is a "true" integration test, like the ones we have in Core. To that end, it would be great to include and expose ...
@kwigley I gave a go at writing custom integration tests for the changes in this PR, using the work in dbt-labs/dbt-adapter-tests#13. I'd love to get your help here—perhaps a bit of pair-programming? :) Once we can manage to get these tests running, I'll feel really good about merging and releasing the additions to the adapter testing suite, even if it's missing some of the bells and whistles. (E.g. I've decided that ...)
Force-pushed from 8e3bd9d to b4e05a0
Force-pushed from b4e05a0 to 9bbc61b
@jtcohen6 can you take a peek at why ...
Ugh. Here's where we're at: ODBC connections to Databricks clusters aren't respecting the ... (Those ...) One thing we could do: tell folks to set ...
self.assertTablesEqual(
    "insert_overwrite_no_partitions", "expected_overwrite")
self.assertTablesEqual(
    "insert_overwrite_partitions", "expected_upsert")
🤦
… feature/rationalize-incremental
thanks for working through the tests with me! I'm happy with shipping this with the intention of spending more time with adapter integration tests.
resolves #140
resolves #133
Description
This PR implements, with alterations, several of the recommendations from #140 (much fuller description there) to rationalize the `incremental` materialization on `dbt-spark`. In particular:
- `append`, which provides better consistency with other adapters' default behavior. This strategy works across all file formats, connection types, and Spark platforms (see the sketch below).
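As a loose illustration (table and column names are made up, and this is not copied from the macro), an incremental run with the `append` strategy boils down to inserting the temp relation into the existing table:

```sql
-- roughly what strategy = 'append' executes on an incremental run
insert into table analytics.my_model
select `id`, `updated_at`, `payload`
from my_model__dbt_tmp
```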
Notes
- Fix failing `test_dbt_incremental`: I think this PR represents a genuine change in behavior! It brings it in line with the default adapter tests, which can only be a step in the right direction :)
- I'm not sure about how to implement recommendation (5), which proposed that we should set `file_format: delta` + `incremental_strategy: merge` as defaults if `method == 'odbc'` (replacing `parquet` + `insert_overwrite`, respectively). A bit more on this below.

Leaving these notes below, even though they're not material to the changes in this PR:
We could add this logic just within the Jinja macros (e.g. within `file_format_clause`, `dbt_spark_validate_get_file_format`, `dbt_spark_validate_get_incremental_strategy`), but then it wouldn't be consistent with the values in `manifest.json`, since the file format default is set in Python:
https://github.com/fishtown-analytics/dbt-spark/blob/2f7c2dcb419b66d463bb83212287bd29280d6964/dbt/adapters/spark/impl.py#L32-L33
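To make that trade-off concrete, here is a purely hypothetical sketch of the macro-side version, assuming the connection method were reachable from the Jinja context as `target.method` (it may not be); this is exactly the kind of default that would then diverge from what `manifest.json` reports:

```sql
{% macro file_format_clause() %}
  {#-- hypothetical: default the file format from the connection method --#}
  {%- set file_format = config.get('file_format') -%}
  {%- if file_format is none -%}
    {%- set file_format = 'delta' if target.method == 'odbc' else 'parquet' -%}
  {%- endif -%}
  using {{ file_format }}
{% endmacro %}
```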
@kwigley Any chance you could advise on whether it would be possible to set that config default dynamically, based on the connection method? Or if that's a terrible idea? :)
Even if we can't get that last piece working, I still think we should move forward with 4/5, and advise Databricks users to simply set those two configs in `dbt_project.yml`. This is what we advise today, anyway.
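Presumably something along these lines; shown here as a per-model config block for illustration (the `dbt_project.yml` entry would set the same two keys under `models:`):

```sql
{{ config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='merge'
) }}
```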
Checklist
- I have updated the `CHANGELOG.md` and added information about my change to the "dbt next" section.