lf/issue-49 compare all columns macro for testing #50

leoebfolsom · 2022-07-15T14:03:42Z

Description & motivation

This PR creates a macro compare_all_columns which can be included in a dbt test suite to test whether any values in any columns have changed.

The output of compare_all_columns allows for various testing configurations, including but not limited to:

Flagging only direct conflicts as test failures.
Ignoring very recent values.
Filtering and grouping of test results based on model attributes.

Issue this PR is meant to address: #49

To resolve some CircleCI testing challenges, I upgraded the Postgres version defined in .circleci/config.yml from 9.6.5 to 14.0.

Checklist

I have verified that these changes work locally
I have updated the README.md (if applicable)
I have added tests & descriptions to my models (and macros if applicable)

leoebfolsom · 2022-07-18T13:41:22Z

I've run into this error when trying to build a test for compare_all_columns. It's not surprising to me that dbt won't allow me to have two seeds with the same name. However, this would be a perfect solution for what I'm trying to test: comparing the same table between two targets (dev and prod). Is there a recommended approach for this kind of thing?

dbt.exceptions.CompilationException: Compilation Error
  dbt found two seeds with the name "data_compare_relations__a_relation".
  
  Since these resources have the same name, dbt will be unable to find the correct resource
  when looking for ref("data_compare_relations__a_relation").
  
  To fix this, change the name of one of these resources:
  - seed.audit_helper_integration_tests.data_compare_relations__a_relation (seeds/prod_schema/data_compare_relations__a_relation.csv)
  - seed.audit_helper_integration_tests.data_compare_relations__a_relation (seeds/data_compare_relations__a_relation.csv)
13:38:31  Encountered an error:
Compilation Error
  dbt found two seeds with the name "data_compare_relations__a_relation".
  
  Since these resources have the same name, dbt will be unable to find the correct resource
  when looking for ref("data_compare_relations__a_relation").
  
  To fix this, change the name of one of these resources:
  - seed.audit_helper_integration_tests.data_compare_relations__a_relation (seeds/prod_schema/data_compare_relations__a_relation.csv)
  - seed.audit_helper_integration_tests.data_compare_relations__a_relation (seeds/data_compare_relations__a_relation.csv)

Exited with code exit status 2

I also asked this question in Slack.

joellabes

This is a really good start 🤩 I've picked a lot of holes in the specifics, but the bones of it are great and I like where it's going!

I'll come back to the docs and integration tests once the foundations are in place, don't want to get you to make a bunch of changes when other feedback might change the expectations there too.

Also I jumped all over the place during this review so if the references to other comments are unclear then lmk and I can fix them up!

macros/pop_columns.sql

macros/compare_column_values.sql

joellabes · 2022-07-25T08:35:21Z

macros/compare_all_columns.sql

+  {% set columns_to_compare=audit_helper.pop_columns(model_name, exclude_columns) %}
+
+  {% set old_etl_relation_query %}
+      select * from {{prod_schema}}.{{ model_name }}


This worries me a bit - I understand why it's necessary, but I am hoping there's a safer way.

In general, hardcoding table references is a no-no. In particular, I'm worried that people who have overridden generate_alias_name to be environment aware will run into problems.

Might ask around and see if there's anything to be done with api.Relation.create

Alternatively, it could be solved by the other comments I'm about to write which relate to whether this is sufficiently generic

Reading api.Relation.create docs and found this:

This object should always be used instead of interpolating values with {{ schema }}.{{ table }} directly.

so ... "I'm in these docs and I don't like it"

OK, I'm not totally sure whether the way I've updated the code fully addresses the issue you raise, but please have a look; I think it may do the trick.

joellabes · 2022-07-25T08:45:02Z

macros/compare_all_columns.sql

@@ -0,0 +1,59 @@
+{% macro compare_all_columns(model_name, primary_key, prod_schema, exclude_columns, updated_at_column, exclude_recent_hours, direct_conflicts_only) -%}


As hinted in the below comment, I can see why all of these params are useful, but it feels like a very specific implementation to match what you're trying to do.

I think that a better macro signature would look much more like the existing compare_column_values macro:

Suggested change

- {% macro compare_all_columns(model_name, primary_key, prod_schema, exclude_columns, updated_at_column, exclude_recent_hours, direct_conflicts_only) -%}
+ {% macro compare_every_columns_values(a_query, b_query, primary_key) -%}

(Note: I considered adding direct_conflicts_only as well, then decided against it - there'll be a comment about that too!)

For you to use this internally at scale, you would then make a macro which generated your a_query and b_query and passed those into this macro:

{% set a_query = generate_comparison_query(..., is_prod=False) %}
{% set b_query = generate_comparison_query(..., is_prod=True) %}

By doing this, you don't have to worry about accounting for how other projects generate their schema and model names - if your project always has the same model names everywhere then you can hardcode {{ prod_schema }}.{{ model_name }} and I promise not to complain 😉

Heard. You gave me some really good ideas, and I agree that boolean columns unlock a lot of flexibility. I'm getting there, but there's still some work to do, and possibly some follow-up questions, on this one.

As you'll see in the readme and elsewhere, I'm trying out an approach where the test generates a row for every primary key x column combination. It's a lot of rows, but if you write a reasonable filter on your test, it gives you really good information, and you can even join the results to other tables!

The other option is that for someone who truly just wants a summary, they can turn the macro into a CTE in their test file and then summarize away.

No comment yet on the hardcoded model/schema names; I need to dig in more to figure out a viable alternative.
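To make the "one row per primary key x column" idea concrete, here is a minimal Python sketch of the comparison logic. It is not the macro's SQL: the function name `compare_every_column` and the output fields `missing_from_a`, `missing_from_b` and `values_match` are hypothetical names chosen for illustration, and the sample rows are made up.

```python
from typing import Any, Dict, Iterable, List


def compare_every_column(
    a_rows: Iterable[Dict[str, Any]],
    b_rows: Iterable[Dict[str, Any]],
    primary_key: str,
) -> List[Dict[str, Any]]:
    """Return one result row per (primary key, column) pair.

    Hypothetical illustration only; field names are assumptions.
    """
    a_by_key = {row[primary_key]: row for row in a_rows}
    b_by_key = {row[primary_key]: row for row in b_rows}

    # Compare every non-key column that appears on either side.
    all_rows = list(a_by_key.values()) + list(b_by_key.values())
    columns = sorted({col for row in all_rows for col in row} - {primary_key})

    results = []
    for key in sorted(set(a_by_key) | set(b_by_key)):
        in_a, in_b = key in a_by_key, key in b_by_key
        for column in columns:
            a_value = a_by_key[key].get(column) if in_a else None
            b_value = b_by_key[key].get(column) if in_b else None
            results.append({
                "primary_key": key,
                "column_name": column,
                "missing_from_a": not in_a,
                "missing_from_b": not in_b,
                # Only rows present on both sides can match.
                "values_match": in_a and in_b and a_value == b_value,
            })
    return results


# Made-up sample data standing in for the dev and prod versions of a model.
dev = [{"id": 1, "status": "new", "amount": 10}, {"id": 2, "status": "paid", "amount": 20}]
prod = [{"id": 1, "status": "new", "amount": 12}, {"id": 3, "status": "void", "amount": 0}]

rows = compare_every_column(dev, prod, primary_key="id")

# A "direct conflicts only" filter: the key exists on both sides but the values differ.
conflicts = [
    r for r in rows
    if not r["missing_from_a"] and not r["missing_from_b"] and not r["values_match"]
]
```

Because every result row carries boolean flags, the caller decides what counts as a failure by filtering, which is the same role a where clause plays in a test built on the macro's output.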

macros/compare_all_columns.sql

joellabes · 2022-07-25T08:49:00Z

macros/compare_all_columns.sql

+    */
+    ( {{ audit_query }} )
+    {% if not loop.last %}
+      union


A tiny thing, but union all means that the DB engine doesn't have to run a distinct evaluation over your result set. In your case the rows are already guaranteed to be distinct, because you're bringing through a different column name every time. https://stackoverflow.com/questions/49925/what-is-the-difference-between-union-and-union-all

Suggested change

- union
+ union all

Leaving this open so I can dig in more / learn rather than just making the change! Thanks!
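For anyone else digging into the difference, here is a small sketch using Python's built-in sqlite3 module. The `audit` table and its columns are hypothetical stand-ins for the per-column audit queries, and it assumes the primary key is unique within each branch. It shows that when every branch carries its own column_name literal, union and union all return the same rows, and that union only removes anything when branches can produce identical rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for the result of one per-column audit query.
conn.execute("create table audit (primary_key integer, matches integer)")
conn.executemany("insert into audit values (?, ?)", [(1, 1), (2, 0)])

# One branch per compared column, each tagged with a different column_name.
per_column = "select '{col}' as column_name, primary_key, matches from audit"
branches = [per_column.format(col=c) for c in ("status", "amount")]

union_rows = conn.execute(" union ".join(branches)).fetchall()
union_all_rows = conn.execute(" union all ".join(branches)).fetchall()

# The column_name literal differs per branch, so no row repeats across branches
# and the deduplication step performed by union has nothing to remove.
same_result = sorted(union_rows) == sorted(union_all_rows)

# Contrast: identical rows in two branches are collapsed by union only.
deduped = conn.execute("select 1 as x union select 1").fetchall()
kept = conn.execute("select 1 as x union all select 1").fetchall()

print(len(union_rows), len(union_all_rows), len(deduped), len(kept))
```

The results are the same either way here, so the only thing union adds is the extra distinct pass, which is the cost the suggestion avoids.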

macros/compare_all_columns.sql

README.md

joellabes

just the gitignore and one comment, then we're ready to rock. congrats 🤩

joellabes · 2022-09-07T03:54:41Z

integration_tests/logfile

@@ -0,0 +1,218 @@
+2022-08-31 12:57:58.197 PDT [2839] LOG:  starting PostgreSQL 14.5 (Homebrew) on x86_64-apple-darwin21.5.0, compiled by Apple clang version 13.1.6 (clang-1316.0.21.2.5), 64-bit


This should be in .gitignore 🙏

joellabes · 2022-09-07T03:56:00Z

README.md

+{{ 
+  audit_helper.compare_all_columns(
+    a_relation=ref('stg_customers'), -- in a test, this ref will compile as your dev or PR schema.
+    b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'), -- you can explicitly write a relation to select your production schema, or any other db/schema/table you'd like to use for comparison testing.


I'd probably add a note here saying you can also hard code a table name - for interactive queries where someone is writing one-off code, it's not unreasonable to hardcode a table string for expediency's sake. If you're building it into a CI cycle, then yes please make a proper source/relation/etc

joellabes

thanks for coming on this adventure with me!

Leo Folsom added 8 commits July 14, 2022 13:57

create a macro, test__compare_all_columns, which can be used in a custom dbt test

02e725d

cant get my local project to recognize the new macro

4c60d83

Merge branch 'lf/compare-all-columns' into lf/issue-49--compare_all_columns_macro_for_testing

Merging the proposed update from PR 47 into this one since it is an upstream dependency.

bb9862e

rename macro to compare_all_columns

93c447b

fix jinja syntax, namely curly braces

9e77cbb

add to readme

ab84ac3

small update to readme

2b02847

integration test compare_all_columns

8fe9a54

Leo Folsom added 11 commits July 18, 2022 09:54

add conflict seed, remove logging stuff from macro

3f0a8d4

add additional seeds and rename seeds

92d3bd8

update readme to reflect that compare_all_columns doesn't write results to logs automatically, to align it with the other existing macros

d74f587

add exclude_columns optional arg to compare_all_columns

ee91af8

create separate pop_columns macro and use it in compare_all_columns

c9cbd4f

exclude argument not being recognized as i would expect, wip

9bc5578

tidy up refactoring of pop_columns

394cff9

add placeholder for new macros in integration_tests/models/schema.yml

bfd4062

add args for direct_conflict_only and exclude_recent_hours to remove testing error noise

8d071cb

update readme with additional args

0f03b05

readme formatting

965df11

joellabes requested changes Jul 25, 2022

leoebfolsom mentioned this pull request Jul 26, 2022

add column_name to output of compare_column_values #47

Merged

3 tasks

Leo Folsom added 8 commits July 26, 2022 10:37

Merge branch 'main' into lf/issue-49--compare_all_columns_macro_for_testing

e524a3f

remove pop_columns and use get_filtered_columns_in_relation instead

80bc6ce

adjust implementation of get_filtered_columns_in_relation to work for compare_all_columns and compare_relations

40db5ff

update readme, create compare_column_values_count to support compare_all_columns with count columns

5c4c288

switch count approach to verbose and let the user decide how they want to analyze the results

4560b5c

compare_column_values_verbose creates a tall table with one row per primary key per column and boolean columns

361fc5f

spruce up readme

34802ba

fix whitespace in dbt_project.yml

478dfce

leoebfolsom added 21 commits August 29, 2022 09:52

roll back postgres version

48810d3

change postgres version but use circleci instead of cimg

9ba804f

switch back to cimg, try postgres 10.20

6a9def4

try circleci/postgres:10.20

4876f13

try specifying postgres db auth and environment

ae85d86

try adding adapter.quote to column_name in final subquery of compare_all_columns

0be0b2e

undo that

8eba10c

cast column_to_compare to text to satisfy redshift; drop analyses to get redshift to pass locally

59fdea4

restore analyses

f42fc59

save smoke test

422f48b

test out removing adapter.quote from final select in compare_column_values_verbose.sql

d3ae413

remove all adapter quotes from compare_column_values_verbose

32c5195

change caps of seed data to get snowflake to work

b141af3

update cast text to cast string for bigquery

fb600af

try using adapter.quote

fa1c5a2

add if else to work around postgres casting issue

1337c95

add redshift to if statement regarding casting to text

5fdb982

fix null_in_a and null_in_b to exclude rows where the pk is missing

d26a841

solve bugs discovered while adding test data, mainly related to coalescing null values to false, and errors in seed data

8c1e7c8

remove aws binary pkg file

e3deac3

update readme

2ecb37c

leoebfolsom requested a review from joellabes August 31, 2022 07:25

add a test demoing the use of a where clause in a compare_all_columns dbt test

0e8340d

joellabes requested changes Sep 7, 2022

leoebfolsom added 3 commits September 6, 2022 21:25

add logfile to gitignore

155cd69

include note about hard-coded relations

791f99c

remove stale logfile

7677cac

leoebfolsom requested a review from joellabes September 7, 2022 04:40

joellabes approved these changes Sep 7, 2022

joellabes merged commit bd58775 into dbt-labs:main Sep 7, 2022
