Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lf/issue-49 compare all columns macro for testing #50

Conversation

leoebfolsom
Copy link
Contributor

@leoebfolsom leoebfolsom commented Jul 15, 2022

Description & motivation

This PR creates a macro compare_all_columns which can be included in a dbt test suite to test whether any values in any columns have changed.

The output of compare_all_columns allows for various testing configurations, including but not limited to:

  • Flagging only direct conflicts as test failures.
  • Ignoring very recent values.
  • Filtering and grouping of test results based on model attributes.

Issue this PR is meant to address: #49

To resolve some Circle CI testing challenges, upgraded Postgres version defined in .circleci/config.yml from 9.6.5 to 14.0.

Checklist

  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)

@leoebfolsom
Copy link
Contributor Author

leoebfolsom commented Jul 18, 2022

I've run into this error when trying to build a test for compare_all_columns. It's not surprising to me that dbt won't allow me to have two seeds with the same name. However, this would be a perfect solution for what I'm trying to test: comparing the same table between two targets (dev and prod). Is there a recommended approach for this kind of thing?

dbt.exceptions.CompilationException: Compilation Error
  dbt found two seeds with the name "data_compare_relations__a_relation".
  
  Since these resources have the same name, dbt will be unable to find the correct resource
  when looking for ref("data_compare_relations__a_relation").
  
  To fix this, change the name of one of these resources:
  - seed.audit_helper_integration_tests.data_compare_relations__a_relation (seeds/prod_schema/data_compare_relations__a_relation.csv)
  - seed.audit_helper_integration_tests.data_compare_relations__a_relation (seeds/data_compare_relations__a_relation.csv)
13:38:31  Encountered an error:
Compilation Error
  dbt found two seeds with the name "data_compare_relations__a_relation".
  
  Since these resources have the same name, dbt will be unable to find the correct resource
  when looking for ref("data_compare_relations__a_relation").
  
  To fix this, change the name of one of these resources:
  - seed.audit_helper_integration_tests.data_compare_relations__a_relation (seeds/prod_schema/data_compare_relations__a_relation.csv)
  - seed.audit_helper_integration_tests.data_compare_relations__a_relation (seeds/data_compare_relations__a_relation.csv)

Exited with code exit status 2

I also asked this question in Slack.

Copy link
Contributor

@joellabes joellabes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really good start 🤩 I've picked a lot of holes in the specifics, but the bones of it are great and I like where it's going!

I'll come back to the docs and integration tests once the foundations are in place, don't want to get you to make a bunch of changes when other feedback might change the expectations there too.

Also I jumped all over the place during this review so if the references to other comments are unclear then lmk and I can fix them up!

macros/pop_columns.sql Outdated Show resolved Hide resolved
macros/compare_column_values.sql Outdated Show resolved Hide resolved
{% set columns_to_compare=audit_helper.pop_columns(model_name, exclude_columns) %}

{% set old_etl_relation_query %}
select * from {{prod_schema}}.{{ model_name }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worries me a bit - I understand why it's necessary, but I am hoping there's a safer way.

In general, hardcoding table references is a no-no. In particular, I'm worried that people who have overridden generate_alias_name to be environment aware will run into problems.

Might ask around and see if there's anything to be done with api.Relation.create

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, it could be solved by the other comments I'm about to write which relate to whether this is sufficiently generic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading api.Relation.create docs and found this:

This object should always be used instead of interpolating values with {{ schema }}.{{ table }} directly.

so ... "I'm in these docs and I don't like it"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm not totally sure if the way I've updated the code this fully addresses the issue you raise, but please have a look, I think it may do the trick.

@@ -0,0 +1,59 @@
{% macro compare_all_columns(model_name, primary_key, prod_schema, exclude_columns, updated_at_column, exclude_recent_hours, direct_conflicts_only) -%}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As hinted in the below comment, I can see why all of these params are useful, but it feels like a very specific implementation to match what you're trying to do.

I think that a better macro signature would look much more like the existing compare_column_values macro:

Suggested change
{% macro compare_all_columns(model_name, primary_key, prod_schema, exclude_columns, updated_at_column, exclude_recent_hours, direct_conflicts_only) -%}
{% macro compare_every_columns_values(a_query, b_query, primary_key) -%}

(Note: I considered adding direct_conflicts_only as well, then decided against it - there'll be a comment about that too!)

For you to use this internally at scale, you would then make a macro which generated your a_query and b_query and passed those into this macro:

{% set a_query = generate_comparison_query(..., is_prod=False) %}
{% set b_query = generate_comparison_query(..., is_prod=True) %}

By doing this, you don't have to worry about accounting for how other projects generate their schema and model names - if your project always has the same model names everywhere then you can hardcode {{ prod_schema }}.{{ model_name }} and I promise not to complain 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heard. You gave me some really good ideas, and I agree that boolean columns unlocks a lot of flexibility. I'm getting there, but still some work to do, and possibly some follow-up questions, on this one.

As you'll see in the readme and elsewhere, I'm trying out an approach where the test generates a row for every primary key x column combination. It's a lot of rows, but if you write a reasonable filter on your test, it gives you really good information, and you can even join the results to other tables!

The other option is that for someone who truly just wants a summary, they can turn the macro into a CTE in their test file and then summarize away.

No comment yet on the hardcoded model/schema names; I need to dig in more to figure out a viable alternative.

macros/compare_all_columns.sql Outdated Show resolved Hide resolved
*/
( {{ audit_query }} )
{% if not loop.last %}
union
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A tiny thing, but union all means that the DB engine doesn't have to run a distinct evaluation over your result set, which in your case is guaranteed because you're bringing through a different column name every time. https://stackoverflow.com/questions/49925/what-is-the-difference-between-union-and-union-all

Suggested change
union
union all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this open so I can dig in more / learn rather than just making the change! Thanks!

macros/compare_all_columns.sql Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@leoebfolsom leoebfolsom requested a review from joellabes August 31, 2022 07:25
Copy link
Contributor

@joellabes joellabes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just the gitignore and one comment, then we're ready to rock. congrats 🤩

@@ -0,0 +1,218 @@
2022-08-31 12:57:58.197 PDT [2839] LOG: starting PostgreSQL 14.5 (Homebrew) on x86_64-apple-darwin21.5.0, compiled by Apple clang version 13.1.6 (clang-1316.0.21.2.5), 64-bit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in .gitignore 🙏

{{
audit_helper.compare_all_columns(
a_relation=ref('stg_customers'), -- in a test, this ref will compile as your dev or PR schema.
b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'), -- you can explicitly write a relation to select your production schema, or any other db/schema/table you'd like to use for comparison testing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably add a note here saying you can also hard code a table name - for interactive queries where someone is writing one-off code, it's not unreasonable to hardcode a table string for expediency's sake. If you're building it into a CI cycle, then yes please make a proper source/relation/etc

Copy link
Contributor

@joellabes joellabes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for coming on this adventure with me! :shipit:

@joellabes joellabes merged commit bd58775 into dbt-labs:main Sep 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants