Add new macros for diff calculation, and unit tests (#99) #101
Conversation
* Add macro for new hash-based comparison strategy
* split out SF-focused version of macro
* Fix change to complex object
* Fix overuse of star
* switch from compare rels to compare queries
* provide wrapping parens
* switch to array of columns for PK
* split unit tests into own files, change unit tests to array pk
* tidy up get_comp_bounds
* fix arg rename
* add quick_are_queries_identical and unit tests
* Move data tests into own directory
* Add test for multiple PKs
* fix incorrect unit test configs
* make data types for id and id_2 big enough nums
* Mock event_time response
* fix hardcoded value in quick_are_qs_identical
* Add unit tests for null handling (still broken)
* Rename columns to be more unique
* Steal surrogate key macro from utils
* Use generated surrogate key across the board in place of PK
* rm my profile reference
* Update quick_are_queries_identical.sql
* Add diagram explaining comparison bounds
* Add comments explaining warehouse-specific optimisations
* cross-db support
* subq
* no postgres or redshift for a sec
* add default var values for compare wrappers
* avoid lateral alias reference for BQ
* BQ doesn't support count(arg1, arg2)
* re-enable redshift
* Alias subq for redshift
* remove extra comma
* add row status of nonunique_pk
* remove redundant test and wrapper model
* Create json-y tests for snowflake
* Add workaround for redshift to support count num rows in status
* skip incompatible tests
* Fix redshift lack of bool_or support in window funcs
* add skip exclusions for everything else
* fix incorrect skip tag application
* Move user configs to project.yml from profiles
* Temporarily disable unpassable redshift tests
* add temp skip to circle's config.yml
* forgot tag: method
* Temporarily skip reworked_compare_all_statuses_different_column_set
* Skip another test redshift
* disable unsupported tests BQ
* postgres too?
* Fixes for postgres
* namespace macros
* It's a postgres problem, not a redshift problem
* Handle postgres 63 char limit
* Add databricks
* Rename tests to data_tests
* Found a better workaround for missing count distinct window
* actually call the macro
* disable syntax-failing tests on dbx
I did a first review with a few comments.
```sql
{% macro default__is_quick_are_queries_identical_supported() %}
    {{ return (False) }}
{% endmacro %}
```
What about the BigQuery one as described in the article, with `BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t)))`?
🤦 yep you're spot on ty
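For anyone following along, a minimal sketch of that BigQuery approach (table names are hypothetical; the actual macro builds the two sides from the configured queries):

```sql
-- Reduce each result set to a single fingerprint, then compare the two.
-- Caveat: bit_xor cancels out rows that appear an even number of times,
-- so this is a fast probabilistic check, not a proof of equality.
select
    (
        select bit_xor(farm_fingerprint(to_json_string(a)))
        from (select * from my_dataset.table_a) as a
    ) = (
        select bit_xor(farm_fingerprint(to_json_string(b)))
        from (select * from my_dataset.table_b) as b
    ) as are_queries_identical
```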
macros/reworked_compare.sql (Outdated)
```sql
select
    *,
    {{ audit_helper._count_num_rows_in_status() }} as dbt_audit_num_rows_in_status,
    dense_rank() over (partition by dbt_audit_row_status order by dbt_audit_surrogate_key, dbt_audit_pk_row_num) as dbt_audit_sample_number
```
Could we use `row_number()` instead of `dense_rank()` to be 100% sure that we won't get 2 rows with the same number?
This is intentional - modified rows appear twice, once for the left version and once for the right. I want to be sure I'm pulling both of them so am using dense_rank instead of row_number. I've added a comment to explain this though!
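To illustrate with hypothetical keys why `dense_rank()` is the right choice for sampling here:

```sql
-- Within the 'modified' partition, the a-side and b-side copies of a row
-- share the same surrogate key, so dense_rank() gives both the same number:
--
--   dbt_audit_surrogate_key | side | dense_rank | row_number
--   key_1                   | a    | 1          | 1
--   key_1                   | b    | 1          | 2
--   key_2                   | a    | 2          | 3
--   key_2                   | b    | 2          | 4
--
-- A filter like dbt_audit_sample_number <= 1 then keeps both copies of
-- key_1, whereas row_number() would keep only one side of the pair.
```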
macros/reworked_compare.sql (Outdated)
```sql
{% macro reworked_compare(a_query, b_query, primary_key_columns=[], columns=[], event_time=None, sample_limit=20) %}
```
There doesn't seem to be any macro called just `compare()` today, so we could maybe use this name instead of `reworked_compare`.

Agree that `reworked_compare` doesn't sound optimal. If we don't want to use `compare()`, we should describe what this one does differently (optimized, or faster, or whatever) rather than just "reworked".
```sql
{% macro is_quick_are_queries_identical_supported() %}
    {{ return (adapter.dispatch('is_quick_are_queries_identical_supported', 'audit_helper')()) }}
{% endmacro %}
```
This is not called anywhere. Is it supposed to be called on its own? Or should we add it to `default__quick_are_queries_identical` and raise an exception when we are running the query on an unsupported adapter?
Also wondering this! My preference would be the latter, to avoid requiring the caller to call `is_quick_are_queries_identical_supported` before they can safely call `quick_are_queries_identical`.
Good shout! Will do the latter 👍
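One possible shape for that guard, as a sketch rather than the final implementation (the macro body is hypothetical apart from the dispatch-and-raise pattern):

```sql
{% macro default__quick_are_queries_identical(query_a, query_b, columns=[], event_time=None) %}
    {# Guard: fail loudly instead of silently returning a wrong answer
       on adapters without a fast hash-based comparison. #}
    {% if not audit_helper.is_quick_are_queries_identical_supported() %}
        {{ exceptions.raise_compiler_error(
            "quick_are_queries_identical is not supported on the " ~ target.type ~ " adapter"
        ) }}
    {% endif %}
    {# ...adapter-specific comparison SQL would go here... #}
{% endmacro %}
```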
```sql
dense_rank() over (partition by dbt_audit_row_status order by dbt_audit_surrogate_key, dbt_audit_pk_row_num)
+ dense_rank() over (partition by dbt_audit_row_status order by dbt_audit_surrogate_key desc, dbt_audit_pk_row_num desc)
- 1
```
Have we done some analysis on whether `count(distinct)` is faster when supported?
Not yet - will come back to this as we get closer to the compare changes beta, we can swap out behaviour as needed
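For readers puzzling over the snippet above: within each partition, a forward `dense_rank()` plus a backward `dense_rank()` minus one equals the number of distinct ordering keys, which works around missing `count(distinct) over (...)` support. A worked example with hypothetical keys:

```sql
-- Three rows over two distinct keys (k1, k1, k2), all in one partition:
--
--   key | dense_rank asc | dense_rank desc | asc + desc - 1
--   k1  | 1              | 2               | 2
--   k1  | 1              | 2               | 2
--   k2  | 2              | 1               | 2
--
-- Every row gets 2, the count of distinct keys, without needing
-- count(distinct) as a window function (unsupported on e.g. Redshift).
```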
This reverts commit d28f3e1.
* add BQ support for quick_are_queries_identical
* explain why using dense_rank
* remove the compile step to avoid compilation error
* Don't throw incompatible quick compare error during parse
* add where clause to check we're not assuming its absence
* enable first basic struct tests
* Skip raising exception during parsing
* json_build_object doesn't work on rs
* changed behaviour redshift
* skip complex structs on rs for now
* temp disable all complex structs
* skip some currently failing bq tests
* Properly exclude tests to skip, add comments
* dbx too
* rename reworked_compare to compare_and_classify_query_results
```sql
{% macro compare_and_classify_query_results(a_query, b_query, primary_key_columns=[], columns=[], event_time=None, sample_limit=20) %}
```
`event_time` needs to instead be `{"a_query_event_time": 'some_col', "b_query_event_time": 'maybe_the_same_col_maybe_not'}`
```sql
    but it's a good way to quickly verify identical results if that's what you're expecting.
    */

{% macro quick_are_queries_identical(query_a, query_b, columns=[], event_time=None) %}
```
Since there isn't a `slow_is_queries_identical`, let's drop the `quick_`. It's cleaner.

Actually no: keep the `quick_` to preserve its purpose. It would be bad for this to be implemented everywhere by running a very slow comparison and aggregating its results.
Change to a two-part dict of event_times later
* disable tests for unrelated adapters
* Avoid lateral column aliasing
* First cross-db complex struct fixture
* Add final fixtures
* Initial work on dbx compatibility
* remove lateral column alias dbx
* cast everything as string before hashing
* add comment, enable all tests again
* rename to dbt_audit_in_a instead of in_a
* Protect against missing PK columns
* gitignore package-lock.yml
* add dbx variant of simple structs
…t-labs/dbt-audit-helper into joellabes-audit-helper-revamp
* change to getting comparison bounds for queries not relations
* add test for introspective queries
Description & motivation
It's possible to calculate diffs faster by using hashes (as described by the IL team here). Additionally, by calculating aggregate results and outputting them alongside a subset of summary results, it's possible to skip running other queries at all.
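A rough sketch of the idea, with hypothetical table and column names (the real macros build the surrogate key and row hash from the configured primary key and column list):

```sql
-- Hash each row once, then classify rows via a full outer join on the PK hash.
with a_hashed as (
    select
        md5(cast(id as varchar)) as surrogate_key,
        md5(cast(col_1 as varchar) || '|' || cast(col_2 as varchar)) as row_hash
    from table_a
),
b_hashed as (
    select
        md5(cast(id as varchar)) as surrogate_key,
        md5(cast(col_1 as varchar) || '|' || cast(col_2 as varchar)) as row_hash
    from table_b
)
select
    coalesce(a.surrogate_key, b.surrogate_key) as surrogate_key,
    case
        when a.row_hash = b.row_hash then 'identical'
        when a.surrogate_key is null then 'added'
        when b.surrogate_key is null then 'removed'
        else 'modified'
    end as row_status
from a_hashed as a
full outer join b_hashed as b
    on a.surrogate_key = b.surrogate_key
```

Because each row is reduced to two short hashes up front, the join and the aggregate summary can be computed in one pass instead of comparing every column pairwise.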
I also added a bunch of unit tests! They're really good, more people should be talking about this.
To do:
* reworked_compare
* Switch to main branch of dbt-core to pick up the unit test sorting fixes ("Fix: Order-insensitive unit test equality assertion for expected/actual with multiple nulls", dbt-core#10202). Not doing this: the fix will be out next week, and installing from git led to weird wheel issues I don't want to debug.