Add workaround for temporary tables in remote database when running incremental model #326

guenp · 2024-01-30T08:24:35Z

A few months ago, an issue was raised in the MotherDuck community Slack: the CREATE TEMPORARY TABLE statement issued by incremental models was causing slowdowns in dbt runs.

Summary of the underlying issue:

When using a remote MotherDuck DuckDB instance, all temporary tables are created locally.
dbt incremental models create a temporary table for appending tuples to an existing table.
This means that whenever an incremental model is run on a MotherDuck database ("md:my_database"), a new table is created locally, tuples are transported from the server into the new local table, and then sent back to the server when the final INSERT INTO is executed. This adds a nontrivial round-trip time, which causes the slowdown.

This PR contains a quick suggestion for a workaround until we support temporary tables in MotherDuck. It creates a "mock" temporary table in the database and deletes it afterwards.
Some thoughts for @jwills and @nicku33:

The mock temp table is created in the same database/schema as the target table(s). We might want to change that? DuckDB currently stores temp tables in a schema named temp.
It might make sense to move the logic to the MotherDuck extension and instead drop the table server-side whenever the connection closes so we don't have to rely on dbt to run successfully after a mock temp table is created
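
To make the round trip concrete, here is a minimal Python sketch of the statement sequence described above. It is not the macro from this PR: the function name `incremental_statements`, the `__dbt_tmp` suffix, and the exact SQL text are assumptions made for illustration. For a remote database it emits a regular ("mock") table plus an explicit DROP instead of CREATE TEMPORARY TABLE.

```python
def incremental_statements(target: str, select_sql: str, is_remote: bool) -> list[str]:
    """Return the SQL an incremental append would run (illustrative only)."""
    tmp = f"{target}__dbt_tmp"  # hypothetical naming scheme
    if is_remote:
        # Mock temp table: a regular table that lives next to the target,
        # so the rows never leave the server.
        create = f"CREATE TABLE {tmp} AS {select_sql}"
    else:
        create = f"CREATE TEMPORARY TABLE {tmp} AS {select_sql}"
    statements = [create, f"INSERT INTO {target} SELECT * FROM {tmp}"]
    if is_remote:
        # Nothing drops a regular table automatically, so do it explicitly.
        statements.append(f"DROP TABLE IF EXISTS {tmp}")
    return statements
```

The explicit DROP is the fragile part: if the run aborts before reaching it, the mock table stays behind.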

jwills · 2024-01-30T15:08:36Z

The approach here makes sense to me-- how are we testing this out?

guenp · 2024-01-30T21:45:27Z

Well, I'm working on adding a unit test for this, based on the manual test I ran. I noticed there are test files test_motherduck.py and test_incremental.py, but they're still empty. I think it's out of scope for this PR to implement the test_incremental.py fully for all backends. I can add it to test_motherduck.py for now. Does that sound good?

Also, I saw that the md tox environment is currently not tested in CI/CD (for obvious reasons), so I'll just aim for a local pass if that works for you.

jwills · 2024-01-31T00:04:11Z

That sounds great— and yes, I trust that you will not break your own product here. 😉

guenp · 2024-01-31T22:46:10Z

Thanks @jwills for the review! I'll address your comments shortly. I also wanted to note that I talked to @nicku33 about this:

The mock temp table is created in the same database/schema as the target table(s). We might want to change that?

We decided to go with a separate schema dbt_temp for the temp_relations used by incremental models. This way users won't accidentally query them in BI tools.

What are your thoughts?

jwills · 2024-01-31T22:47:31Z

I am good with that!

jwills · 2024-02-01T00:09:52Z

@guenp apologies to pile this on, but would you mind changing this code to also use your new is_motherduck function? Just realized I forgot I had introduced some duplication of this already! https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/environments/local.py#L53
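
A check like this can live in one place and be reused by both the adapter and the local environment. The sketch below is an assumption about its shape, not the code in this PR: it treats a database path as a MotherDuck path when its URL scheme is `md` or `motherduck`, and the `Credentials` class here is a hypothetical stand-in.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Credentials:
    """Hypothetical stand-in for the adapter's credentials object."""

    path: str = ":memory:"

    @property
    def is_motherduck(self) -> bool:
        # "md:my_database" parses with scheme "md"; plain file paths
        # and ":memory:" have no scheme at all.
        return urlparse(self.path).scheme in {"md", "motherduck"}
```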

guenp · 2024-02-01T01:26:42Z

Ah yes, I totally forgot about that, thank you for the reminder! It's done.

I've now pushed all my changes. Mainly what I've added since you last reviewed, in addition to addressing your comments:

DuckDBAdapter.get_temp_relation_path(model). This method is used in the incremental.sql macro and will create a new unique name for a temp_relation that is to be dropped at the end of the macro, with identifier dbt_temp.<schema_name>__<target_relation_name>.
DuckDBAdapter.post_model_hook(config, context). This method fixes a memory leak issue that arises if the incremental macro doesn't run successfully (e.g. when it runs into a CompilationError as it does in this test) and never reaches the lines where it deletes the temp_relation. The method makes sure to drop the temp_relation if it still exists after the model failed (see the try-except block that intercepts this here).
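
The two pieces fit together roughly as follows. This is a simplified sketch, not the adapter code: `TempRelationTracker`, `temp_relation_path`, and `run_model` are hypothetical names, and the "database" is just a set of table names. It shows the naming scheme `dbt_temp.<schema_name>__<target_relation_name>` and why cleanup has to happen even when the model raises.

```python
class TempRelationTracker:
    """Hypothetical helper that remembers mock temp tables until they are dropped."""

    def __init__(self, temp_schema: str = "dbt_temp") -> None:
        self.temp_schema = temp_schema
        self.pending: set[str] = set()

    def temp_relation_path(self, schema: str, identifier: str) -> str:
        # Prefix with the target schema so equally named models in
        # different schemas do not collide inside the shared temp schema.
        path = f"{self.temp_schema}.{schema}__{identifier}"
        self.pending.add(path)
        return path

    def post_model_hook(self, tables: set[str]) -> None:
        # Drop whatever the model left behind, whether it succeeded or not.
        for path in self.pending:
            tables.discard(path)
        self.pending.clear()


def run_model(tracker: TempRelationTracker, tables: set[str], fail: bool) -> None:
    tmp = tracker.temp_relation_path("main", "events")
    try:
        tables.add(tmp)  # stands in for CREATE TABLE
        if fail:
            raise RuntimeError("error raised before the macro's own DROP")
        tables.discard(tmp)  # the macro's normal cleanup
    finally:
        tracker.post_model_hook(tables)
```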

The behavior now looks something like this:
(1) While running unit test:

(2) After unit test is finished and all test-related and "temp" tables are cleaned up:

@nicku33 and I decided against dropping dbt_temp in the end because it might interfere with concurrent processes.
We also discussed making the name dbt_temp configurable. I was thinking of implementing this via the model config, but do you know if there is a more global way to configure this?

jwills · 2024-02-01T01:29:54Z

The model config is the right place to use for setting/overriding the temp schema name, and that setting can be set globally for all models if need be by specifying it in e.g. the dbt_project.yml file; see https://docs.getdbt.com/reference/model-configs#model-specific-configurations
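
As a rough illustration of that precedence, the sketch below resolves the temp schema name from a model-level setting, then a project-level setting, then the default. The key name `temp_schema_name` and the function `resolve_temp_schema` are assumptions for this example, and the two plain dictionaries greatly simplify how dbt actually merges configs.

```python
DEFAULT_TEMP_SCHEMA = "dbt_temp"


def resolve_temp_schema(project_config: dict, model_config: dict) -> str:
    """Pick the temp schema: model setting, then project setting, then default."""
    # "temp_schema_name" is an assumed key name for this sketch.
    key = "temp_schema_name"
    return model_config.get(key) or project_config.get(key) or DEFAULT_TEMP_SCHEMA
```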

guenp · 2024-02-01T01:43:56Z

Oops, I didn't actually run the unit tests with the default profile. 😁
It looks like it failed because of the post_model_hook! I need to drop for the day but will push a fix & also implement the temp schema name config.

guenp · 2024-02-01T17:49:51Z

All checks are green! I also added tests/functional/plugins/test_motherduck.py to the MotherDuck tests. I didn't realize your CI/CD ran against a live version of the service, so this will help make sure I don't break our product ;)

jwills · 2024-02-01T20:57:42Z

Thank you very much @guenp!

guenp · 2024-02-01T21:02:07Z

Yay! 🎉 Could we hold off on pushing a release until @nicku33 has had the chance to test it on a real-life workload? We just want to make sure this workaround is good on our end as well. Thanks!

jwills · 2024-02-01T21:03:45Z

of course, will wait for you all to sign off on it first

guenp · 2024-02-15T22:28:56Z

We had the chance to do a full end to end test with "real life" data so this is signed off on our end!
FYI here's what we tested:

Using a source from an internal share, make a reasonably sized aggregate model, on the order of 1M rows for the incremental run. Run it once to get a datetime aggregate for, say, 14 days, then modify it so that the incremental logic kicks in for a good-sized row set and has to update rows.
Same, but for missing rows: modify the target table so that new rows are filled in upon incremental refresh.
Same, but for deleted rows: modify the source (maybe a copy of a bunch of rows) so that the update produces the correct gaps.
Test the above for the two different incremental strategies: delete+insert and append.
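
For readers unfamiliar with the two strategies named in the last item, the sketch below contrasts them on a tiny table. It uses Python's built-in `sqlite3` module purely as a stand-in for DuckDB/MotherDuck, and the table and column names are made up. It is not the test that was run.

```python
import sqlite3


def run_incremental(strategy: str) -> list[tuple]:
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE target (day TEXT, total INTEGER)")
    con.executemany("INSERT INTO target VALUES (?, ?)", [("d1", 10), ("d2", 20)])
    # New batch: d2 is an updated row, d3 is a missing row to be filled in.
    con.execute("CREATE TABLE batch (day TEXT, total INTEGER)")
    con.executemany("INSERT INTO batch VALUES (?, ?)", [("d2", 25), ("d3", 30)])
    if strategy == "delete+insert":
        # Remove rows that the batch replaces, keyed on the unique column.
        con.execute("DELETE FROM target WHERE day IN (SELECT day FROM batch)")
    con.execute("INSERT INTO target SELECT * FROM batch")
    rows = con.execute("SELECT day, total FROM target ORDER BY day, total").fetchall()
    con.close()
    return rows
```

With `append`, the old `d2` row stays alongside the new one. With `delete+insert`, it is replaced.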

guenp added 2 commits January 30, 2024 00:06

add workaround for temporary tables in remote databases

5d7f7fa

clarify inline comment

6e91597

jwills reviewed Jan 30, 2024

dbt/include/duckdb/macros/materializations/incremental.sql

add is_motherduck property to credentials

61b4a3e

guenp added 5 commits January 30, 2024 16:27

Add UT to test incremental model on MotherDuck

bd85518

consolidate MotherDuck plugin tests

4dc717c

clarify docstsring

fae6efd

clarify docstring

1adda92

use py311 for md tox env

83727cc

guenp commented Jan 31, 2024

tox.ini

guenp added 4 commits January 30, 2024 17:23

clean up temp tables for incrementals, add db creation and cleanup for md tests

6393537

add some helpful inline comments

4758526

add more cleanup to UT, add schema temp to target database for temp table workaround

d7dc736

create temp schema if needed

c3f4376

jwills reviewed Jan 31, 2024

Create remote temporary tables in a separate schema dbt_temp

26b0923

guenp added 6 commits January 31, 2024 16:35

Don't use MD_CONNECT, instead use SET motherduck_token

66c496e

Don't drop temp schema after test ends

0044725

use adapter.is_motherduck

21ef4dc

Set md profile path back to md:test, make database_name fixture

f4ac34c

use credentials.is_motherduck in LocalEnvironment

4dcb585

Reverse change to tox.ini for CI

87a1e6a

guenp added 7 commits January 31, 2024 23:29

make temp schema name configurable, fix bugs for local in-memory tests

22b430d

formatting

be71a2d

Add test for temp schema name config

63eb849

address mypy issues

e3dc75c

add _temp_schema_name attribute to adapter

3ec5527

update docstring

f857608

Add plugin test to MotherDuck tox environment

fd6e033

guenp requested a review from jwills February 1, 2024 17:49

guenp commented Feb 1, 2024

dbt/adapters/duckdb/impl.py

Update dbt/adapters/duckdb/impl.py

0ed05bf

jwills reviewed Feb 1, 2024

dbt/include/duckdb/macros/materializations/incremental.sql

remove superfluous need_drop_temp variable and add temp_relation to to_drop within if-statement

15e61b4

jwills merged commit c14627a into duckdb:master Feb 1, 2024
29 of 30 checks passed

guenp deleted the guen/eco-23-dbt-incremental-broken-on-md branch February 1, 2024 21:02

guenp mentioned this pull request Apr 12, 2024

Add pre-model hook for cleaning up remote temporary table on MotherDuck #375

Merged
