Add new macros for diff calculation, and unit tests #99

joellabes · 2024-05-14T04:43:27Z

Description & motivation

It's possible to calculate diffs faster by using hashes (as described by the IL team here). Additionally, by calculating aggregate results and outputting them alongside a subset of summary results, it's often possible to skip running the other queries entirely.
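To make the idea concrete, here is a minimal sketch in plain Python rather than the package's SQL/Jinja macros. It hashes each row after casting every value to a string, uses an aggregate comparison of those hashes as a cheap "are these identical?" check, and only then classifies rows by primary key. The function names (`row_hash`, `quick_are_identical`, `classify_rows`), the `<null>` sentinel, and every status label other than `nonunique_pk` are assumptions made for illustration, not the macros' actual interface.

```python
import hashlib
from collections import Counter


def row_hash(row, columns):
    """Hash one row: cast every value to a string, then hash the joined result."""
    # "<null>" is an assumed sentinel so that None and "" hash differently.
    parts = ["<null>" if row[c] is None else str(row[c]) for c in columns]
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def quick_are_identical(rows_a, rows_b, columns):
    """Order-independent aggregate check: do both sides hold the same row hashes?"""
    hashes_a = Counter(row_hash(r, columns) for r in rows_a)
    hashes_b = Counter(row_hash(r, columns) for r in rows_b)
    return hashes_a == hashes_b


def classify_rows(rows_a, rows_b, pk_columns, columns):
    """Give each primary key a status by comparing row hashes across both sides."""

    def index(rows):
        by_key = {}
        for r in rows:
            key = tuple(r[c] for c in pk_columns)
            by_key.setdefault(key, []).append(row_hash(r, columns))
        return by_key

    a, b = index(rows_a), index(rows_b)
    statuses = {}
    for key in a.keys() | b.keys():
        hashes_a, hashes_b = a.get(key, []), b.get(key, [])
        if len(hashes_a) > 1 or len(hashes_b) > 1:
            statuses[key] = "nonunique_pk"  # duplicate key on either side
        elif not hashes_b:
            statuses[key] = "removed"       # only present in A
        elif not hashes_a:
            statuses[key] = "added"         # only present in B
        elif hashes_a[0] == hashes_b[0]:
            statuses[key] = "identical"
        else:
            statuses[key] = "modified"
    return statuses


rows_a = [{"id": 1, "v": "x"}, {"id": 2, "v": None}, {"id": 3, "v": "z"}]
rows_b = [{"id": 1, "v": "x"}, {"id": 2, "v": "y"}, {"id": 4, "v": "w"}]

# Run the cheap aggregate check first; classify only when it reports a difference.
if not quick_are_identical(rows_a, rows_b, ["id", "v"]):
    result = classify_rows(rows_a, rows_b, ["id"], ["id", "v"])
    for key in sorted(result):
        print(key, result[key])
```

When the aggregate check reports a match, the row-level classification can be skipped altogether, which is the saving described above. In a warehouse the same shape would be expressed with SQL hash functions and aggregates rather than Python dictionaries.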

I also added a bunch of unit tests! They're really good, more people should be talking about this.

Checklist

I have verified that these changes work locally
I have updated the README.md (if applicable)
I have added tests & descriptions to my models (and macros if applicable)

* Add new macros for diff calculation, and unit tests (#99)
* Add macro for new hash-based comparison strategy
* split out SF-focused version of macro
* Fix change to complex object
* Fix overuse of star
* switch from compare rels to compare queries
* provide wrapping parens
* switch to array of columns for PK
* split unit tests into own files, change unit tests to array pk
* tidy up get_comp_bounds
* fix arg rename
* add quick_are_queries_identical and unit tests
* Move data tests into own directory
* Add test for multiple PKs
* fix incorrect unit test configs
* make data types for id and id_2 big enough nums
* Mock event_time response
* fix hardcoded value in quick_are_qs_identical
* Add unit tests for null handling (still broken)
* Rename columsn to be more unique
* Steal surrogate key macro from utils
* Use generated surrogate key across the board in place of PK
* rm my profile reference
* Update quick_are_queries_identical.sql
* Add diagram explaining comparison bounds
* Add comments explaining warehouse-specific optimisations
* cross-db support
* subq
* no postgres or redshift for a sec
* add default var values for compare wrappers
* avoid lateral alias reference for BQ
* BQ doesn't support count(arg1, arg2)
* re-enable redshift
* Alias subq for redshift
* remove extra comma
* add row status of nonunique_pk
* remove redundant test and wrapper model
* Create json-y tests for snowflake
* Add workaround for redshift to support count num rows in status
* skip incompatible tests
* Fix redshift lack of bool_or support in window funcs
* add skip exclusions for everything else
* fix incorrect skip tag application
* Move user configs to project.yml from profiles
* Temporarily disable unpassable redshift tests
* add temp skip to circle's config.yml
* forgot tag: method
* Temporarily skip reworked_compare_all_statuses_different_column_set
* Skip another test redshift
* disable unsupported tests BQ
* postgres too?
* Fixes for postgres
* namespace macros
* It's a postgres problem, not a redshift problem
* Handle postgres 63 char limit
* Add databricks
* Rename tests to data_tests
* Found a better workaround for missing count distinct window
* actually call the macro
* disable syntax-failing tests on dbx
* try to install core from main to get sorting fix
* Revert "try to install core from main to get sorting fix" This reverts commit d28f3e1.
* Audit helper code review changes
* add BQ support for qucik are queries identical
* explain why using dense_rank
* remove the compile step to avoid compilation error
* Don't throw incompatible quick compare error during parse
* add where clause to check we're not assuming its absence
* enable first basic struct tests
* Skip raising exception during parsing
* json_build_object doesn't work on rs
* changed behaviour redshift
* skip complex structs on rs for now
* temp disable all complex structs
* skip some currently failoing bq tests
* Properly exclude tests to skip, add comments
* dbx too
* rename reworked_compare to compare_and_classify_query_results
* Rename files
* rename macro file
* Add relation_focused macros
* Add BQ-specific generate_set_results for hashes, enable json tests
* Implement hash comparisons for BQ and DBX (#103)
* disable tests for unrelated adapters
* Avoid lateral column aliasing
* First cross-db complex struct fixture
* Add final fixtures
* Initial work on dbx compatibility
* remove lateral column alias dbx
* cast everything as string before hashing
* add comment, enable all tests again
* rename to dbt_audit_in_a instead of in_a
* Protect against missing PK columns
* gitignore package-lock.yml
* add dbx variant of simple structs
* Rename private macros to have _ prefix
* Fix get comparison bounds (#104)
* change to getting comparison bounds for queries not relations
* add test for introspective queries
* Make compare query columns multi pk (#105)
* rm packagelock.yml

joellabes added 30 commits April 19, 2024 16:56

Add macro for new hash-based comparison strategy

b93fa49

split out SF-focused version of macro

d3dfa77

Fix change to complex object

1a6c35f

Fix overuse of star

4a7f120

switch from compare rels to compare queries

87afbe9

provide wrapping parens

e754ab7

switch to array of columns for PK

e6be75c

split unit tests into own files, change unit tests to array pk

60fe426

tidy up get_comp_bounds

886728d

fix arg rename

b53db58

add quick_are_queries_identical and unit tests

0d766d6

Merge branch 'dbt-labs:main' into master

63571ba

Move data tests into own directory

c8ccf59

Add test for multiple PKs

58751e6

fix incorrect unit test configs

022b91b

Merge branch 'master' of https://github.com/joellabes/dbt-audit-helper

831c595

make data types for id and id_2 big enough nums

bef6e18

Mock event_time response

0f1e09e

fix hardcoded value in quick_are_qs_identical

33e4c50

Add unit tests for null handling (still broken)

0df1b6f

Rename columsn to be more unique

9a75fc9

Steal surrogate key macro from utils

8157600

Use generated surrogate key across the board in place of PK

0e78f25

rm my profile reference

f59b411

Update quick_are_queries_identical.sql

ab7d8b9

Add diagram explaining comparison bounds

120ac18

Add comments explaining warehouse-specific optimisations

c275056

cross-db support

311fbdc

subq

ac63521

no postgres or redshift for a sec

ffae04f

joellabes added 6 commits May 17, 2024 11:03

remove extra comma

7e3e171

add row status of nonunique_pk

df95fca

remove redundant test and wrapper model

9523db8

Create json-y tests for snowflake

a506d72

Add workaround for redshift to support count num rows in status

a7542a8

skip incompatible tests

eb2cfcd

joellabes changed the base branch from main to joellabes-audit-helper-revamp May 18, 2024 03:47

joellabes added 7 commits May 18, 2024 15:59

Fix redshift lack of bool_or support in window funcs

10392b0

add skip exclusions for everything else

8c9690c

fix incorrect skip tag application

1cf1887

Move user configs to project.yml from profiles

319a967

Temporarily disable unpassable redshift tests

698aa99

add temp skip to circle's config.yml

a255d43

forgot tag: method

a9a47c1

graciegoheen mentioned this pull request May 21, 2024

[Bug] unit tests' comparisons are sometimes sensitive to the order of records returned dbt-labs/dbt-core#10167

Closed

2 tasks

joellabes added 13 commits May 22, 2024 13:46

Temporarily skip reworked_compare_all_statuses_different_column_set

ec2d142

Skip another test redshift

fe91fd1

disable unsupported tests BQ

77f6a50

postgres too?

df73001

Fixes for postgres

12e307d

namespace macros

f217168

It's a postgres problem, not a redshift problem

88f2be8

Handle postgres 63 char limit

ad6e9d8

Add databricks

669bb69

Rename tests to data_tests

0c192a9

Found a better workaround for missing count distinct window

317e4d7

actually call the macro

0d1a1de

disable syntax-failing tests on dbx

559f8d5

joellabes merged commit 9da3c51 into dbt-labs:joellabes-audit-helper-revamp May 27, 2024
1 check passed
