Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support TTL for BigQuery tables #2697

Closed
kconvey opened this issue Aug 11, 2020 · 5 comments · Fixed by #2711
Closed

Support TTL for BigQuery tables #2697

kconvey opened this issue Aug 11, 2020 · 5 comments · Fixed by #2711
Labels
bigquery enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!

Comments

@kconvey
Copy link
Contributor

kconvey commented Aug 11, 2020

Describe the feature

Add a BigQuery adapter specific config option ttl (or similar name) which denotes how long the produced table will exist, before being cleaned up automatically by BigQuery. This will take advantage of the expiration_timestamp and the BQ functions TIMESTAMP_ADD() CURRENT_TIMESTAMP(), and INTERVAL (with ttl as an argument)

Describe alternatives you've considered

  • Do this as a post-processing step

    • Would work, but this gives better guarantees that any table you create, if configured properly, will not stick around beyond it's TTL.
  • Configuring a model's TTL #1472 was the idea of having dbt keep track of how long models have existed, and only update once expired, and called itself TTL, but is pretty different than this proposal.

Additional context

AFAIK this is a BigQuery only feature (did some brief googling around and didn't find anything). This feature is already used for temporary tables in the BQ adapter, which are somewhat arbitrarily set to 12 hour TTLs. The same feature is not used for the other support databases, which have a TEMPORARY option which deletes tables at the end of a session, instead.

Who will this benefit?

  • Anyone who has requirements on how long their data can be kept around.
  • Developers configuring projects which may produce a lot of clutter (and can set it to expire in a few hours while developing).

Are you interested in contributing this feature?

This can be upstreamed from a locally maintained fork without too many changes (so definitely).

@kconvey kconvey added enhancement New feature or request triage labels Aug 11, 2020
@jtcohen6
Copy link
Contributor

@kconvey Thanks for the writeup! I'm supportive of adding this as a model config. What do you think about time_to_expiration?

As you rightly observe, we already sort of have it, in that temporary tables are actually non-temporary tables with an arbitrary 12-hour TTL. I'd actually like to change that, and to use real temporary tables (i.e. BigQuery scripts) wherever we currently use 12-hours-to-live tables. We tried to do it in #2140, but we found we would have needed to rework snapshots in a way that was out of scope. It's definitely out of scope for this feature, too, but it's still on my mind :)

@jtcohen6 jtcohen6 added bigquery good_first_issue Straightforward + self-contained changes, good for new contributors! and removed triage labels Aug 14, 2020
@kconvey
Copy link
Contributor Author

kconvey commented Aug 17, 2020

Planning to create the PR for this pretty soon, but I'm curious how you feel about adding support in profiles.yml @jtcohen6 ? An obvious use case for this would be when someone is developing, but doesn't want to manually clean up their tables / have them persist, they can set a dev profile to have a time_to_expiration

@jtcohen6
Copy link
Contributor

jtcohen6 commented Aug 17, 2020

That's a good use case. I'd be more inclined to support it, still with time_to_expiration as a model config, via dbt_project.yml code like:

models:
  +time_to_expiration: "{{ ('8' if target.name == 'dev' else '1000') | as_number }}"

I think configs belong in profiles.yml if:

  • they relate to the specifics mechanisms by which dbt connects to the database
  • they are likely to differ for each individual developer
  • they should not be checked into version control

@kconvey
Copy link
Contributor Author

kconvey commented Aug 17, 2020

Wasn't sure if that context was available in dbt_project.yml or not, but I'll play around with it a bit. It does seem like you could make the default something false to not set a time_to_expiration, which seems like it would be a requirement to have the right behavior in some cases (not everyone wants their table to expire at some point).

@jtcohen6
Copy link
Contributor

Yes, good call to include some kind of truthy condition. I think you could get the code above to return none if you instead wrote:

models:
  +time_to_expiration: "{{ ('8' if target.name == 'dev' else 'none') | as_native }}"

Better safe than sorry, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants