Add new macros for diff calculation, and unit tests (#99) #101
Conversation
* Add macro for new hash-based comparison strategy
* split out SF-focused version of macro
* Fix change to complex object
* Fix overuse of star
* switch from compare rels to compare queries
* provide wrapping parens
* switch to array of columns for PK
* split unit tests into own files, change unit tests to array pk
* tidy up get_comp_bounds
* fix arg rename
* add quick_are_queries_identical and unit tests
* Move data tests into own directory
* Add test for multiple PKs
* fix incorrect unit test configs
* make data types for id and id_2 big enough nums
* Mock event_time response
* fix hardcoded value in quick_are_qs_identical
* Add unit tests for null handling (still broken)
* Rename columns to be more unique
* Steal surrogate key macro from utils
* Use generated surrogate key across the board in place of PK
* rm my profile reference
* Update quick_are_queries_identical.sql
* Add diagram explaining comparison bounds
* Add comments explaining warehouse-specific optimisations
* cross-db support
* subq
* no postgres or redshift for a sec
* add default var values for compare wrappers
* avoid lateral alias reference for BQ
* BQ doesn't support count(arg1, arg2)
* re-enable redshift
* Alias subq for redshift
* remove extra comma
* add row status of nonunique_pk
* remove redundant test and wrapper model
* Create json-y tests for snowflake
* Add workaround for redshift to support count num rows in status
* skip incompatible tests
* Fix redshift lack of bool_or support in window funcs
* add skip exclusions for everything else
* fix incorrect skip tag application
* Move user configs to project.yml from profiles
* Temporarily disable unpassable redshift tests
* add temp skip to circle's config.yml
* forgot tag: method
* Temporarily skip reworked_compare_all_statuses_different_column_set
* Skip another test redshift
* disable unsupported tests BQ
* postgres too?
* Fixes for postgres
* namespace macros
* It's a postgres problem, not a redshift problem
* Handle postgres 63 char limit
* Add databricks
* Rename tests to data_tests
* Found a better workaround for missing count distinct window
* actually call the macro
* disable syntax-failing tests on dbx
I did a first review with a few comments.
```sql
{% macro default__is_quick_are_queries_identical_supported() %}
    {{ return (False) }}
{% endmacro %}
```
What about the BigQuery one as described in the article, with `BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t)))`?
🤦 yep you're spot on ty
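For anyone following along, a minimal sketch of that BigQuery approach (table names are hypothetical; the actual macro builds the two sides from the configured queries):

```sql
-- Reduce each result set to a single fingerprint, then compare the two.
-- Caveat: bit_xor cancels out rows that appear an even number of times,
-- so this is a fast probabilistic check, not a proof of equality.
select
    (
        select bit_xor(farm_fingerprint(to_json_string(a)))
        from (select * from my_dataset.table_a) as a
    ) = (
        select bit_xor(farm_fingerprint(to_json_string(b)))
        from (select * from my_dataset.table_b) as b
    ) as are_queries_identical
```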
macros/reworked_compare.sql (Outdated)
```sql
select
    *,
    {{ audit_helper._count_num_rows_in_status() }} as dbt_audit_num_rows_in_status,
    dense_rank() over (partition by dbt_audit_row_status order by dbt_audit_surrogate_key, dbt_audit_pk_row_num) as dbt_audit_sample_number
```
Could we use `row_number()` instead of `dense_rank()` to be 100% sure that we won't get 2 rows with the same number?
This is intentional - modified rows appear twice, once for the left version and once for the right. I want to be sure I'm pulling both of them so am using dense_rank instead of row_number. I've added a comment to explain this though!
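To illustrate with hypothetical keys why `dense_rank()` is the right choice for sampling here:

```sql
-- Within the 'modified' partition, the a-side and b-side copies of a row
-- share the same surrogate key, so dense_rank() gives both the same number:
--
--   dbt_audit_surrogate_key | side | dense_rank | row_number
--   key_1                   | a    | 1          | 1
--   key_1                   | b    | 1          | 2
--   key_2                   | a    | 2          | 3
--   key_2                   | b    | 2          | 4
--
-- A filter like dbt_audit_sample_number <= 1 then keeps both copies of
-- key_1, whereas row_number() would keep only one side of the pair.
```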
macros/reworked_compare.sql (Outdated)
```sql
{% macro reworked_compare(a_query, b_query, primary_key_columns=[], columns=[], event_time=None, sample_limit=20) %}
```
There doesn't seem to be any macro called just `compare()` today, so we could maybe use this name instead of `reworked_compare`.

Agree that `reworked_compare` doesn't sound optimal. If we don't want to use `compare()`, we should describe what this one does differently (optimized, or faster, or whatever) rather than just "reworked".
```sql
{% macro is_quick_are_queries_identical_supported() %}
    {{ return (adapter.dispatch('is_quick_are_queries_identical_supported', 'audit_helper')()) }}
{% endmacro %}
```
This is not called anywhere. Is it supposed to be called on its own? Or should we add it to `default__quick_are_queries_identical` and raise an exception when we are running the query on an unsupported adapter?
Also wondering this! My preference would be the latter, to avoid requiring the caller to call `is_quick_are_queries_identical_supported` before they can safely call `quick_are_queries_identical`.
Good shout! Will do the latter 👍
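One possible shape for that guard, as a sketch rather than the final implementation (the macro body is hypothetical apart from the dispatch-and-raise pattern):

```sql
{% macro default__quick_are_queries_identical(query_a, query_b, columns=[], event_time=None) %}
    {# Guard: fail loudly instead of silently returning a wrong answer
       on adapters without a fast hash-based comparison. #}
    {% if not audit_helper.is_quick_are_queries_identical_supported() %}
        {{ exceptions.raise_compiler_error(
            "quick_are_queries_identical is not supported on the " ~ target.type ~ " adapter"
        ) }}
    {% endif %}
    {# ...adapter-specific comparison SQL would go here... #}
{% endmacro %}
```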
```sql
dense_rank() over (partition by dbt_audit_row_status order by dbt_audit_surrogate_key, dbt_audit_pk_row_num)
+ dense_rank() over (partition by dbt_audit_row_status order by dbt_audit_surrogate_key desc, dbt_audit_pk_row_num desc)
- 1
```
Have we done some analysis on whether `count(distinct)` is faster when supported?
Not yet - will come back to this as we get closer to the compare changes beta, we can swap out behaviour as needed
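For readers puzzling over the snippet above: within each partition, a forward `dense_rank()` plus a backward `dense_rank()` minus one equals the number of distinct ordering keys, which works around missing `count(distinct) over (...)` support. A worked example with hypothetical keys:

```sql
-- Three rows over two distinct keys (k1, k1, k2), all in one partition:
--
--   key | dense_rank asc | dense_rank desc | asc + desc - 1
--   k1  | 1              | 2               | 2
--   k1  | 1              | 2               | 2
--   k2  | 2              | 1               | 2
--
-- Every row gets 2, the count of distinct keys, without needing
-- count(distinct) as a window function (unsupported on e.g. Redshift).
```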
This reverts commit d28f3e1.
* add BQ support for quick_are_queries_identical
* explain why using dense_rank
* remove the compile step to avoid compilation error
* Don't throw incompatible quick compare error during parse
* add where clause to check we're not assuming its absence
* enable first basic struct tests
* Skip raising exception during parsing
* json_build_object doesn't work on rs
* changed behaviour redshift
* skip complex structs on rs for now
* temp disable all complex structs
* skip some currently failing bq tests
* Properly exclude tests to skip, add comments
* dbx too
* rename reworked_compare to compare_and_classify_query_results
```sql
{% macro compare_and_classify_query_results(a_query, b_query, primary_key_columns=[], columns=[], event_time=None, sample_limit=20) %}
```
`event_time` needs to instead be `{"a_query_event_time": 'some_col', "b_query_event_time": 'maybe_the_same_col_maybe_not'}`
```sql
    but it's a good way to quickly verify identical results if that's what you're expecting.
    */

{% macro quick_are_queries_identical(query_a, query_b, columns=[], event_time=None) %}
```
Since there isn't a `slow_is_queries_identical`, let's drop the `quick_`. It's cleaner.

Actually no: keep the `quick_` to preserve its purpose. It would be bad for this to be implemented everywhere by running a very slow comparison and aggregating its results.
Change to a two-part dict of event_times later
* disable tests for unrelated adapters
* Avoid lateral column aliasing
* First cross-db complex struct fixture
* Add final fixtures
* Initial work on dbx compatibility
* remove lateral column alias dbx
* cast everything as string before hashing
* add comment, enable all tests again
* rename to dbt_audit_in_a instead of in_a
* Protect against missing PK columns
* gitignore package-lock.yml
* add dbx variant of simple structs
…t-labs/dbt-audit-helper into joellabes-audit-helper-revamp
* change to getting comparison bounds for queries not relations
* add test for introspective queries
Description & motivation
It's possible to calculate diffs faster by using hashes (as described by the IL team here). Additionally, by calculating aggregate results and outputting them alongside a subset of summary results, it's possible to skip running other queries at all.
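A rough sketch of the idea, with hypothetical table and column names (the real macros build the surrogate key and row hash from the configured primary key and column list):

```sql
-- Hash each row once, then classify rows via a full outer join on the PK hash.
with a_hashed as (
    select
        md5(cast(id as varchar)) as surrogate_key,
        md5(cast(col_1 as varchar) || '|' || cast(col_2 as varchar)) as row_hash
    from table_a
),
b_hashed as (
    select
        md5(cast(id as varchar)) as surrogate_key,
        md5(cast(col_1 as varchar) || '|' || cast(col_2 as varchar)) as row_hash
    from table_b
)
select
    coalesce(a.surrogate_key, b.surrogate_key) as surrogate_key,
    case
        when a.row_hash = b.row_hash then 'identical'
        when a.surrogate_key is null then 'added'
        when b.surrogate_key is null then 'removed'
        else 'modified'
    end as row_status
from a_hashed as a
full outer join b_hashed as b
    on a.surrogate_key = b.surrogate_key
```

Because each row is reduced to two short hashes up front, the join and the aggregate summary can be computed in one pass instead of comparing every column pairwise.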
I also added a bunch of unit tests! They're really good, more people should be talking about this.
To do:
* reworked_compare
* Switch to main branch of dbt-core to pick up the unit test sorting fixes ("Fix: Order-insensitive unit test equality assertion for expected/actual with multiple nulls", dbt-core#10202). Not doing this: the fix will be out next week, and installing from git led to weird wheel issues I don't want to debug.