[CT-3033] [Spike] Explore support multiple unit test materialization strategies: CTE vs 'seed'-based #8499
Labels: enhancement (New feature or request), unit tests (Issues related to built-in dbt unit testing functionality)
github-actions bot changed the title from "[Spike] Explore support multiple unit test materialization strategies: CTE vs 'seed'-based" to "[CT-3033] [Spike] Explore support multiple unit test materialization strategies: CTE vs 'seed'-based" on Aug 25, 2023
This may roll to the next sprint.
Another limitation of the CTE approach is mentioned in dbt-labs/dbt-redshift#807: it does not support some rather common aggregation functions.
dbeatty10 added the unit tests label (Issues related to built-in dbt unit testing functionality) on Sep 20, 2024
From the discussion thread: #8275 (reply in thread)
There are two main high-level implementation approaches for unit testing in dbt:

1. CTE-based: interpolate the given inputs and expected outputs into the model query as CTEs ("CTE trickery"), so nothing is persisted in the warehouse.
2. Seed-based: materialize the given inputs and expected outputs in the warehouse, and query the result of the model SQL run against the persisted input fixtures. Once the unit test finishes, clean up any persisted fixtures from the warehouse.

I think both are technically feasible and would actually have pretty similar implementations under the hood: either using a materialization that leverages existing ephemeral logic for the 'CTE trickery' route, or actually materializing inputs and the 'actual' test model in the warehouse using the existing seed materialization.
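As an illustrative sketch only (not dbt's actual implementation), the CTE-based strategy can be simulated in plain Python against SQLite: the `orders` relation, the fixture rows, and the model SQL below are all hypothetical stand-ins for a compiled dbt model and its unit-test fixtures.

```python
import sqlite3

# Hypothetical compiled model SQL; in dbt, {{ ref('orders') }} would have
# been resolved to the relation name 'orders'.
MODEL_SQL = "select customer_id, count(*) as order_count from orders group by customer_id"

# Hypothetical given-input fixture rows for the 'orders' relation.
ORDERS_FIXTURE = [(1, 100), (1, 101), (2, 102)]

def render_cte_test(model_sql, fixture_rows):
    """Inline the fixture rows as a CTE named 'orders' so the model SQL
    reads mocked rows instead of a real table (CTE-based strategy)."""
    values = " union all ".join(
        f"select {cid} as customer_id, {oid} as order_id"
        for cid, oid in fixture_rows
    )
    return f"with orders as ({values}) {model_sql}"

conn = sqlite3.connect(":memory:")  # nothing is persisted in the "warehouse"
actual = conn.execute(render_cte_test(MODEL_SQL, ORDERS_FIXTURE)).fetchall()
expected = [(1, 2), (2, 1)]
assert sorted(actual) == sorted(expected)
```

Because the fixture lives entirely inside one query, this route needs no setup or teardown DDL, which is why it is the simpler starting point.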
Tradeoffs:
Actually materializing the input/actual datasets is a more accurate representation of how the models are run in production than the CTE-based approach, and would support a larger set of SQL/dbt functionality than CTEs. For example, syntax used in sql_headers may not be valid in a standalone query, and certain types can be inserted but not actually declared in a standalone query (dbt-labs/dbt-project-evaluator#290). Do any other limitations come to mind? The tradeoff is performance: actually materializing fixtures/expected/actual in the warehouse, querying them to obtain a diff, and deleting them reliably at the end of the test run all add up to additional latency.
Next steps
So far we've started with the CTE approach, mostly for the sake of simplicity, but I do believe it'd be very worthwhile to spike the seed-based approach and quantify more precisely how much slower and more complex that approach would be. @gshank also suggested exploring implementing both strategies and either selecting the strategy based on user configuration or on the presence of certain conditions (e.g. a sql_header, or a particular type on the model being tested). I think a non-CTE strategy would also be necessary to test complex or custom materializations end-to-end (#8275 (reply in thread)).
Let's implement the seed-based strategy in a spike to understand:
Ultimately let's use those learnings to recommend whether we should implement unit tests with: