
[CT-3194] [Feature] add another way to calculate source freshness - via a SQL query #8797

Closed
3 tasks done
graciegoheen opened this issue Oct 9, 2023 · 7 comments · Fixed by #11163
Labels
enhancement New feature or request Impact: Orch user docs [docs.getdbt.com] Needs better documentation
Milestone: v1.10

@graciegoheen
Contributor

graciegoheen commented Oct 9, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

I think for external tables the information_schema.tables.last_altered column reflects the time of the last refresh (or last ALTER), not the last time a file arrived.

So, for example, if an external table has stale data and an ALTER TABLE ... REFRESH is performed but no new files are added (because the pipeline dropping files into the S3 bucket is broken, for example), I think the last_altered column will have the timestamp of when the ALTER TABLE ... REFRESH was performed. If this feature will cover external tables and my understanding of the metadata is correct, this could lead to the dbt metadata reporting that the external table is fresher than it actually is, obscuring an issue (a broken pipeline).

The directory (if enabled) of the external stage would have the list of files and when they were added to the stage. So I'm pretty much just wondering whether this feature will be a behind-the-scenes query of the information_schema, or whether the freshness source can be a user-defined query like originally outlined here #7012

Originally posted by @sp-tkerlavage in dbt-labs/dbt-snowflake#785 (comment)

We currently support 2 ways to generate source freshness (via warehouse metadata tables & via a loaded_at_field). We should support a 3rd way to generate source freshness - via a loaded_at_query.

Option 1: freshness config added (get freshness from metadata warehouse tables)

sources:
  - name: my_source
    freshness:
      warn_after:
        count: 1
        period: day
      error_after:
        count: 5
        period: day

Option 2: freshness config added with loaded_at_field (get freshness from select max(loaded_at_field) … from this …)

sources:
  - name: my_source
    freshness:
      warn_after:
        count: 1
        period: day
      error_after:
        count: 5
        period: day
    loaded_at_field: _etl_loaded_at

Option 3: freshness config added with "how does dbt consider a source to be fresh"? (get equivalent of "max(loaded_at_field)" from executing a custom query)

sources:
  - name: my_source
    freshness:
      warn_after:
        count: 1
        period: day
      error_after:
        count: 5
        period: day
      loaded_at_query: my_freshness_calc # config name TBD
{% freshness my_freshness_calc(source, table) %}

some SQL that returns a structure similar to our current query

{% endfreshness %}

The built-in one is collect_freshness: https://github.com/dbt-labs/dbt-adapters/blob/6c41bedf27063eda64375845db6ce5f7535ef6aa/dbt/include/global_project/macros/adapters/freshness.sql#L4-L16
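For reference, the shape a custom query would need to reproduce is what the built-in macro returns: one row with a max_loaded_at and a snapshotted_at column. This is a paraphrased sketch of that macro (see the link above for the canonical version):

```sql
{% macro collect_freshness(source, loaded_at_field, filter) %}
  {% call statement('collect_freshness', fetch_result=True, auto_begin=False) -%}
    select
      max({{ loaded_at_field }}) as max_loaded_at,
      {{ current_timestamp() }} as snapshotted_at
    from {{ source }}
    {% if filter %}
    where {{ filter }}
    {% endif %}
  {% endcall %}
  {{ return(load_result('collect_freshness')) }}
{% endmacro %}
```

A loaded_at_query would presumably replace only the select statement inside the statement call, while keeping the same two-column result contract.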

Describe alternatives you've considered

No response

Who will this benefit?

Folks using external tables as sources

Are you interested in contributing this feature?

No response

Anything else?

No response

@graciegoheen graciegoheen added the enhancement New feature or request label Oct 9, 2023
@github-actions github-actions bot changed the title [Feature] add another way to calculate source freshness - via a SQL query [CT-3194] [Feature] add another way to calculate source freshness - via a SQL query Oct 9, 2023
@graciegoheen
Contributor Author

  • What happens if you supply both a loaded_at_field and a loaded_at_query
    • throw an error?
    • use the config "closest" to the source table definition?
  • We should consider the following case:
version: 2

sources:
  - name: jaffle_shop
    database: raw
    freshness: # default freshness
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _etl_loaded_at

    tables:
      - name: orders
        loaded_at_query: ...

      - name: customers # this will use the freshness defined above
...

@graciegoheen
Contributor Author

graciegoheen commented Oct 10, 2023

From @sp-tkerlavage

In Snowflake an external stage can have a directory which essentially stores a list of files in the external storage (S3 Bucket, etc) along with other useful information like relative_path, url, size, MD5, and last_modified. https://docs.snowflake.com/en/user-guide/data-load-dirtables

This directory table can be automatically refreshed when files are added via SQS https://docs.snowflake.com/en/user-guide/data-load-dirtables-auto

The directory is not a standalone object; it's just a layer on top of a stage. It can be queried like this:

SELECT relative_path, 
    last_modified, 
    CONVERT_TIMEZONE('UTC', last_modified) last_modified_utc
FROM DIRECTORY( '@database.schema.stage_name' )

However, you would have to parse the relative path to extract the name of the source table. So if, for example, you have a bucket that is partitioned like this: <bucket>/source/source_table/yyyy/mm/dd/hh, you would have to do something like this:

SELECT relative_path, 
    SPLIT_PART(relative_path, '/', 1) source_database,
    SPLIT_PART(relative_path, '/', 2) source_table,
    SPLIT_PART(relative_path, '/', 3) year,
    ...
    last_modified, 
    CONVERT_TIMEZONE('UTC', last_modified) last_modified_utc
FROM DIRECTORY( '@prod_cdc.meta.dms' )
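Building on the query above, a per-table freshness timestamp could then be derived from the directory with an aggregation like this (a sketch; the stage name and partition layout are assumed from the example above):

```sql
SELECT SPLIT_PART(relative_path, '/', 2) AS source_table,
       MAX(CONVERT_TIMEZONE('UTC', last_modified)) AS max_loaded_at
FROM DIRECTORY( '@prod_cdc.meta.dms' )
GROUP BY 1
```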

I actually have an incremental model for this directory parsing logic in my own project, so that may be a cleaner approach. However, I'm not sure if ref'ing a model in a freshness test is even possible. And even if it is, the model would have to be run first. But again, I'm not entirely clear on the parsing order, so it may not be possible to ref a model in a freshness test at all.

@dataders
Contributor

in conversation with @graciegoheen, we decided that we aren't currently aware of enough distinct use cases for custom SQL-defined freshness beyond supporting freshness of external tables in dbt-snowflake.

For now, I'm closing this in favor of dbt-labs/dbt-snowflake#1061. If we come across more use cases in the future, we'll address them as they come up. Perhaps this framing will be useful then.

@dataders dataders closed this as not planned Won't fix, can't repro, duplicate, stale May 28, 2024
@graciegoheen
Contributor Author

graciegoheen commented Nov 18, 2024

I'm going to re-open, as I've heard from a couple of folks that they use side control tables to track updates and the status of their main large tables.

For example:

select max(start_ts)
from audit_tables
where table_name='my_source_table'
and status='completed'
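Wiring such an audit-table check into the proposed config might look like this (the loaded_at_query name and shape follow the proposal above and are not final syntax):

```yaml
sources:
  - name: my_source
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
      # hypothetical config name from the proposal above; final syntax TBD
      loaded_at_query: |
        select max(start_ts)
        from audit_tables
        where table_name = 'my_source_table'
          and status = 'completed'
```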

@graciegoheen graciegoheen reopened this Nov 18, 2024
@ElBob

ElBob commented Dec 6, 2024

+1 @graciegoheen, that's exactly what our org needs to be able to pull source freshness for some tables that are populated by an upstream team.

For context, the jobs that team runs to populate those tables execute once a day, but load the table in "micro loads" of 1-60 records every couple of seconds until the full load has completed. That can cause queries to show a partial dataset if you query the table during the load job, so source freshness against the table itself will tell us that the data has been updated before the load has completed.

An audit table (similar to what you posted) is available with metadata; if we could use that to determine the source freshness for the table with a query just like your example, we could provide our consumers with much more accurate information and use the freshness selectors to help protect people from these partial loads.

@moodgorning

This could also be useful for surfacing partial errors in your data. In our case we get a lot of data from our IoT nodes. If there is an outage affecting only one customer, only that customer's data would fall behind; other customers' data would be fine, so the issue wouldn't trigger on source freshness. Downstream transformations would then build on the stale data from that one lagging customer and include invalid data. A custom query would allow checking that everyone's data is fresh, not just that at least someone's data is fresh.
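One way to express that (a sketch; the table and column names here are hypothetical) is to report the source's freshness as that of its least-fresh customer, so a single lagging customer trips the threshold:

```sql
select min(last_loaded_at) as max_loaded_at
from (
    select customer_id,
           max(loaded_at) as last_loaded_at
    from raw.iot_readings  -- hypothetical per-customer source table
    group by customer_id
)
```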

@graciegoheen graciegoheen added the user docs [docs.getdbt.com] Needs better documentation label Dec 13, 2024
@graciegoheen graciegoheen added this to the v1.10 milestone Dec 18, 2024
@FishtownBuildBot
Collaborator

Opened a new issue in dbt-labs/docs.getdbt.com: dbt-labs/docs.getdbt.com#6695
