
Rationalize incremental materialization #141

Merged · 16 commits from feature/rationalize-incremental · Feb 19, 2021

Conversation

jtcohen6
Contributor

@jtcohen6 jtcohen6 commented Jan 12, 2021

resolves #140
resolves #133

Description

This PR implements, with alterations, several of the recommendations from #140 (much fuller description there) to rationalize the 'incremental' materialization on dbt-spark.

In particular:

  • New default strategy, append, that provides better consistency with other adapters' default behavior. This strategy works across all file formats, connection types, and Spark platforms. (See the sketch after this list.)
  • Raise better errors to enforce which incremental strategies are actually supported, given the model's file format and the connection method in use
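
For illustration, here is roughly what the append strategy boils down to. This sketch mirrors the shape of the macros in this repo, but treat it as an approximation rather than the exact merged code:

{% macro get_insert_into_sql(source_relation, target_relation) %}
    {#-- Append only ever inserts new rows; it never updates or replaces --#}
    {%- set dest_columns = adapter.get_columns_in_relation(target_relation) -%}
    {%- set dest_cols_csv = dest_columns | map(attribute='quoted') | join(', ') -%}
    insert into table {{ target_relation }}
    select {{ dest_cols_csv }} from {{ source_relation }}
{% endmacro %}

Because it compiles to a plain INSERT INTO, it behaves the same on parquet, delta, or any other file format, over any connection type.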

Notes

  • Fix failing test_dbt_incremental: I think this PR represents a genuine change in behavior! It brings it in line with the default adapter tests, which can only be a step in the right direction :)

  • I'm not sure how to implement recommendation (5), which proposed that we set file_format: delta + incremental_strategy: merge as defaults if method == 'odbc' (replacing parquet + insert_overwrite, respectively). A bit more on this below:

Leaving these notes below, even though they're not material to the changes in this PR:

We could add this logic just within the Jinja macros (e.g. within file_format_clause, dbt_spark_validate_get_file_format, dbt_spark_validate_get_incremental_strategy), but then it wouldn't be consistent with the values in manifest.json, since the file format default is set in python:

https://github.com/fishtown-analytics/dbt-spark/blob/2f7c2dcb419b66d463bb83212287bd29280d6964/dbt/adapters/spark/impl.py#L32-L33
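
For context, that default is evaluated at class-definition time in Python, before any connection exists, which is why it can't currently depend on method. A paraphrased sketch (import path and class layout approximate, not a verbatim copy of impl.py):

from dataclasses import dataclass
from typing import Optional

from dbt.adapters.base import AdapterConfig

@dataclass
class SparkConfig(AdapterConfig):
    # Evaluated when the module is imported, long before a connection
    # is opened, so it cannot vary with the connection method
    file_format: str = 'parquet'
    location_root: Optional[str] = None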

@kwigley Any chance you could advise on whether it would be possible to set that config default dynamically, based on the connection method? Or if that's a terrible idea? :)

Even if we can't get that last piece working, I still think we should move forward with recommendations (4) and (5), and advise Databricks users to simply set, in dbt_project.yml:

models:
  +file_format: delta
  +incremental_strategy: merge

This is what we advise today, anyway.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Jan 12, 2021
@jtcohen6 jtcohen6 force-pushed the feature/rationalize-incremental branch from 50092a0 to 6f7e1f2 Compare January 12, 2021 15:01
@kwigley

kwigley commented Jan 13, 2021

Any chance you could advise on if it would be possible to set that config default dynamically

Unfortunately, I don't think it is possible with the current implementation. The adapter config does not know anything about the connection when it is created :(

@jtcohen6
Contributor Author

Thanks for looking into it @kwigley!

I suppose this would be a compelling thing we could do if we split off the Databricks pieces of dbt-spark into a new adapter plugin, dbt-databricks, that inherits from the current plugin, adds certain functionality, and overrides certain defaults.

Contributor

@Fokko Fokko left a comment


Thanks for unraveling this @jtcohen6. Looking good. I think the most important thing is that we make sure we test the edge cases. I would love to dive into this a bit deeper, but I'm currently a bit swamped with work :)

{% endif %}

{% call statement() %}
set spark.sql.hive.convertMetastoreParquet = false
Contributor

Good to see this go :)

{%- else -%}
{#-- merge all columns with databricks delta - schema changes are handled for us #}
{%- elif strategy == 'merge' -%}
{#-- merge all columns with databricks delta - schema changes are handled for us #}
{{ get_merge_sql(target, source, unique_key, dest_columns=none, predicates=none) }}
{%- endif -%}
Contributor

Maybe raise an error if it doesn't match any of the cases?

Contributor Author

Good call. It's already raised earlier on:
https://github.com/fishtown-analytics/dbt-spark/blob/c8e3770e077e8c54026156b14e61133ef59fa7ff/dbt/include/spark/macros/materializations/incremental.sql#L65-L66

Just the same, I'll add another explicit exception here, just in case users override some macros but not others.
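
For the record, the extra guard would look something like this at the bottom of the dispatching macro (a sketch; the exact message wording in the merged code may differ):

{%- else -%}
  {% set no_sql_for_strategy_msg -%}
    No known SQL for the incremental strategy provided: {{ strategy }}
  {%- endset %}
  {%- do exceptions.raise_compiler_error(no_sql_for_strategy_msg) -%}
{%- endif -%}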

{% if unique_key %}
on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
{% else %}
on false
Contributor

This feels awkward, why not just use an INSERT INTO statement?

Contributor Author

This is in line with default dbt behavior here.

As I see it, the only reason we implement spark__get_merge_sql at all is to benefit from the improved wildcard `update *` / `insert *` syntax.

I agree it feels a bit silly: without a unique_key, this is identical to the append strategy's INSERT INTO statement (see the compiled example below). The alternative is to keep requiring a unique_key with merge. In any case, I'd rather err closer to the side of default behavior.
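
Concretely, with no unique_key the compiled merge looks something like this (relation names illustrative):

merge into analytics.my_model as DBT_INTERNAL_DEST
using my_model__dbt_tmp as DBT_INTERNAL_SOURCE
on false
when matched then update set *
when not matched then insert *

Since on false never matches, every source row falls through to insert *, which is behaviorally a plain INSERT INTO.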

@jtcohen6
Contributor Author

Thanks for the review @Fokko! I just reorganized these macro files a bit as well.

I think the most important thing is that we make sure we test the edge cases.

I completely agree. I have a "suite" of "test cases" in a local project that I used to work through all the edge cases:

Each case below was exercised against Apache Spark, a Databricks cluster, and a Databricks endpoint:

  • bad_file_format
  • bad_strategy
  • bad_merge_not_delta
  • bad_insert_overwrite_delta
  • default_append
  • insert_overwrite_no_partitions
  • insert_overwrite_partitions
  • delta_append
  • delta_merge_no_key
  • delta_merge_unique_key

Key:

  • ✅ Works
  • ❌ Raises an explicit, clear compilation error
  • ❔ dbt is unopinionated: no explicit error; it may work or may return a database error (e.g. using delta in Apache Spark, or forthcoming support for create temp view via an endpoint)

I'm struggling a bit with how to implement this in a more-automated fashion:

  • It's slightly more involved than a unit test. I tried extending unit/test_macros.py without success; these macros are more interlocking and require dbt-specific Jinja features like `return`.
  • It's slightly more granular than our suite of adapter integration tests. Ideally we'd have a way to iterate through a sequence, changing one config each time, and catch exceptions exactly where we hope for them (see the sketch below).
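
To make that concrete, the loop I have in mind would look something like this hypothetical pytest sketch; the project fixture and run_model_with_config helper don't exist yet and are purely illustrative:

import pytest

# (file_format, incremental_strategy, expect a compilation error?)
CASES = [
    ("parquet", "append", False),
    ("parquet", "merge", True),           # merge requires delta
    ("delta", "insert_overwrite", True),  # not supported with delta
    ("delta", "merge", False),
]

@pytest.mark.parametrize("file_format,strategy,should_raise", CASES)
def test_incremental_strategies(project, file_format, strategy, should_raise):
    # `project` would be a fixture wrapping a one-model dbt project
    if should_raise:
        with pytest.raises(Exception):
            project.run_model_with_config(
                file_format=file_format, incremental_strategy=strategy)
    else:
        project.run_model_with_config(
            file_format=file_format, incremental_strategy=strategy)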

FWIW, we don't have automated tests for incremental strategies today either, so I think this still constitutes a step in the right direction.

@jtcohen6 jtcohen6 marked this pull request as ready for review January 14, 2021 12:58
@jtcohen6
Contributor Author

As discussed with @kwigley yesterday, the ideal type of test to write for this kind of functionality is a "true" integration test, like the ones we have in Core (dbt/test/integration).

To that end, it would be great to include and expose base.py as an available module in pytest-dbt-adapter. It's relatively stable boilerplate code that provides the foundation for writing integration or functional tests on any adapter. Breaking it out into its own module is also a prerequisite to someday achieving true modularity for the dbt-core/-postgres/-redshift/-snowflake/-bigquery tests.

@jtcohen6
Contributor Author

jtcohen6 commented Feb 3, 2021

@kwigley I gave a go at writing custom integration tests for the changes in this PR, using the work in dbt-labs/dbt-adapter-tests#13. I'd love to get your help here—perhaps a bit of pair-programming? :)

Once we can manage to get these tests running, I'll feel really good about merging and releasing the additions to the adapter testing suite, even if it's missing some of the bells and whistles. (E.g. I've decided that having use_profile leverage the .dbtspec target, while a very cool idea, is not strictly necessary for this first version; it's the same environment variables at the end of the day.)

@kwigley kwigley force-pushed the feature/rationalize-incremental branch 3 times, most recently from 8e3bd9d to b4e05a0 Compare February 17, 2021 16:35
@kwigley kwigley force-pushed the feature/rationalize-incremental branch from b4e05a0 to 9bbc61b Compare February 17, 2021 16:47
@kwigley

kwigley commented Feb 18, 2021

@jtcohen6 can you take a peek at why test_insert_overwrite_apache_spark is failing?

@jtcohen6
Contributor Author

Ugh. Here's where we're at: ODBC connections to Databricks clusters aren't respecting the set spark.sql.sources.partitionOverwriteMode = DYNAMIC parameter. I have no idea why. Does pyodbc generate a new connection for each executed SQL statement, with new default parameters each time?

(Those set statements aren't supported at all by the new SQL Analytics endpoints. This issue is exclusive to method: odbc + cluster.)

One thing we could do: Tell folks to set spark.sql.sources.partitionOverwriteMode = DYNAMIC within their Databricks clusters (docs). I'll try this in a few minutes. If it works, it may be a better idea anyway, rather than having the incremental materialization try to set this parameter each time.
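
For reference, this is the behavior the insert_overwrite strategy depends on (table names illustrative):

-- what the materialization issues at the start of the session:
set spark.sql.sources.partitionOverwriteMode = DYNAMIC;

-- with DYNAMIC, only the partitions present in the select are replaced:
insert overwrite table my_model
select * from my_model__dbt_tmp;

-- under the default (STATIC), the same statement replaces the entire table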

self.assertTablesEqual(
    "insert_overwrite_no_partitions", "expected_overwrite")
self.assertTablesEqual(
    "insert_overwrite_partitions", "expected_upsert")

🤦

Contributor Author

You're good!!! For a very dumb reason, this worked in Databricks when it really shouldn't have. I fixed via:
[Screenshot of the fix, Feb 19, 2021]

jtcohen6 added a commit to dbt-labs/docs.getdbt.com that referenced this pull request Feb 19, 2021
@jtcohen6 jtcohen6 requested a review from kwigley February 19, 2021 14:06

@kwigley kwigley left a comment


Thanks for working through the tests with me! I'm happy to ship this with the intention of spending more time on adapter integration tests.
