
Add support for Snowflake's transient tables #946

Closed
jon-rtr opened this issue Aug 21, 2018 · 5 comments


jon-rtr commented Aug 21, 2018

Feature

Feature description

dbt currently supports permanent and temporary tables. To give users an option for reducing the storage consumed by Snowflake tables, it should also support transient tables: https://docs.snowflake.net/manuals/user-guide/tables-temp-transient.html#comparison-of-table-types

  • permanent: CREATE TABLE ... AS
  • transient: CREATE TRANSIENT TABLE ... AS
  • temporary: CREATE TEMPORARY TABLE ... AS
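For concreteness, the transient variant is a one-word change to the DDL dbt already generates. A hypothetical example (the table and source names are made up):

    create transient table analytics.dim_customers as (
        select *
        from raw.customers
    );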

Who will this benefit?

Anyone using the Snowflake data warehouse.

Because transient tables do not have a Fail-safe period, they provide a good option for managing the cost of very large tables used to store transitory data; however, the data in these tables cannot be recovered after the Time Travel retention period passes.

jon-rtr changed the title from "add support for snowflake's transient tables" to "Add support for Snowflake's transient tables" on Aug 21, 2018
drewbanin commented

@jon-rtr this is really cool! We've noticed that catalog queries get very slow on Snowflake after a series of dbt runs. Do you think this flag would also have any bearing on the catalog queries? My suspicion is that after permanent tables are dropped, their metadata still kicks around in the catalog.

Regardless, I think this is worth implementing on the basis of cost alone, and it should be a pretty straightforward update to the table materialization. Thanks for the report!

drewbanin added the snowflake and good_first_issue labels on Aug 22, 2018

jon-rtr commented Aug 22, 2018

I can ask our Snowflake rep. Could you give some more details on what you mean by "catalog queries"? Do you have a good example?

What are your thoughts on how the solution should be architected? Do you think it should be a snowflake customization of the table materialization? Should there be a new flag to dbt run like --transient?
https://github.com/fishtown-analytics/dbt/blob/bb5d211c942a0c2ea4892d620c248c83c77d6475/dbt/include/global_project/macros/materializations/table/table.sql#L5
https://github.com/fishtown-analytics/dbt/blob/bb5d211c942a0c2ea4892d620c248c83c77d6475/dbt/include/global_project/macros/materializations/table/table.sql#L23

drewbanin commented

dbt queries information_schema.tables to figure out if relations already exist. If a relation doesn't exist, then dbt will create it. If the relation does exist, then dbt will do a "smart" thing based on the type of the relation (view vs. table), flags provided to dbt (--non-destructive/--full-refresh), and the type of model being built (table/view/incremental).
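For reference, that existence check is roughly a catalog lookup like the one below; the schema name is illustrative and the exact query dbt issues may differ:

    select table_name, table_type
    from information_schema.tables
    where table_schema = 'ANALYTICS';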

I'm curious if selecting from information_schema.tables is made slower if a bunch of permanent tables are created and deleted in a given warehouse. I could imagine that transient tables might work differently under the hood, but that's just a guess on my end.

I think the right way to implement this would be to add a config option for models. If transient is set to true, then dbt will use the transient option while building tables. dbt uses this macro to create tables on Snowflake. We can update this macro to account for model configs as we do for Redshift sort/dist keys here: https://github.com/fishtown-analytics/dbt/blob/development/dbt/include/global_project/macros/adapters/redshift.sql#L30
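A rough sketch of what that macro change could look like, following the Redshift pattern linked above; the macro signature and the transient config name here are assumptions, not a final API:

    {% macro snowflake__create_table_as(temporary, relation, sql) -%}
      {#- read the optional per-model config; default to a permanent table -#}
      {%- set transient = config.get('transient', default=false) -%}

      create {% if temporary -%} temporary {%- elif transient -%} transient {%- endif %} table {{ relation }} as (
        {{ sql }}
      );
    {%- endmacro %}

A model would then opt in from its own .sql file with something like {{ config(materialized='table', transient=true) }}.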

drewbanin commented

I'll add: we're moving away from flags like --non-destructive and --full-refresh because they apply to the whole graph. Model configs are a much more granular way of applying custom configs to a specific set of models.

jrandrews commented
Making everything transient on Snowflake is probably the right default, although for very large models that are generated once and then incrementally updated, I can see that it might be nice to have an override option.

e.g., if I have some giant 100-billion-row table that I could in theory regenerate from raw tables, but regenerating it might take 24 hours, I might want an option to make it permanent so the DBAs (maybe I am the DBA 🙂?) can use Fail-safe for that particular table for a faster recovery.
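For example, assuming the per-model transient config sketched above lands, that one giant table could opt back out while everything else stays transient by default (the model and source names are made up):

    {{ config(materialized='incremental', transient=false) }}

    -- keep this model permanent so Fail-safe covers it
    select * from raw.events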
