Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2573] [Feature] Support column-level tests on nested data, natively #7613

Open
3 tasks done
adamcunnington-mlg opened this issue May 12, 2023 · 13 comments
Open
3 tasks done
Labels
dbt tests Issues related to built-in dbt testing functionality enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors

Comments

@adamcunnington-mlg
Copy link

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Context

When adding column metadata, for example, to a model YML file, it is already possible to reference a nested column in order to add a description. E.g.

columns:
- name: foo.bar
  description: baz

Presumably, DBT is doing something clever and traversing dot notation. This works today and if configured as such and supported by the adapter, this description will be written back to the database. Notably, this works regardless of whether foo is a struct or an array (using BQ terminology here but the same concepts exist in other DBs). I assume this works regardless of depth too but I've not explicitly verified this. Generic tests will even work if foo is a struct (again, presumably with arbitrary depth).

However, generic tests won't work in the case of a repeated/array field - and this is not surprising - different SQL would be needed to unnest repeated fields. E.g. this will cause an error during the not_null test execution.

  - name: foo.bar  # foo is an array
    tests:
    - not_null

Someone from the community came up with an implementation for a nested not null test a while ago. It makes for some good inspiration but it has fallen short of what it should have been! The nested data logic should not be limited to not_null - it should be a generic wrapper test that calls out to some other generic test. This can be achieved by having the test wrap do nothing more than setting the model variable to be a sub-query. DBT actually does the exact same thing when you provide config/where. This way, the nested data test can be a totally generic wrapper, constructing a single column of data and then calling the relevant test. The only limitation is not all columns can be passed to the downstream test which the downstream test would ideally like to have in the case of store_failures = True but this is a small (and reasonable limitation).

I have an alternative implementation to this test that works in a generic way and we are using it for several downstream generic tests. The only limitation is that the test has to have some logic like the below which explicitly calls the relevant test:

  {% if test == "accepted_values" %}
{{ test_accepted_values(model, "column", kwargs.get("values"), kwargs.get("quote", True))}}
  {% elif test == "not_null" -%}
{{ test_not_null(model, "column") }}
  {% elif test == "relationships" %}
{{ test_relationships(model, "column", kwargs.get("to"), kwargs.get("field")) }}
  {% elif test == "unique" %}
{{ test_unique(model, "column") }}
  {%- endif %}

As far as I can tell, within a jinja-only context, this can't be gotten around for two reasons:

  1. Jinja doesn't allow you to pass/expand/unpack kwargs - e.g. some_test(**kwargs)
  2. My test variable cannot be used to dynamically determine the macro to call - jinja's not quite object-oriented enough for that.

The Feature Request

DBT core should do the SQL wrapper before it calls an existing generic test. It should do something similar to what my jinja is doing, but far easier in python, and basically parse the foo.bar.baz column name, ask the adapter for the column metadata and then traverse it and build the subquery before calling the test. It's pretty straight forward stuff - just a case of adding CROSS JOIN UNNEST(<x>) AS <y> whenever it encounters a repeated field. There's slightly more to it than that but the community post does a great job of explaining the logic and I'm happy to provide our improvement too.

Describe alternatives you've considered

Doing it myself via a custom generic test that is a clever wrapper - but as described in main part of the post, has some unavoidable limitations - as well as being a missed opportunity to benefit the rest of the community.

Who will this benefit?

Everyone that wants to test nested data. Huge impact!

Are you interested in contributing this feature?

Not familiar with dbt-core - hopefully someone can take the logic and just implement it in the right place, having thought through any implications.

Anything else?

No response

@adamcunnington-mlg adamcunnington-mlg added enhancement New feature or request triage labels May 12, 2023
@github-actions github-actions bot changed the title [Feature] Support column-level tests on nested data, natively [CT-2573] [Feature] Support column-level tests on nested data, natively May 12, 2023
@adamcunnington-mlg
Copy link
Author

I'm not sure whether a nested data test wrapper is also possible to produce - in a generic fashion - for table-level tests - but we're also exploring this.

Regardless, this FR currently focusses on column-level functionality - which is very doable - and I think very sensible!

@adamcunnington-mlg
Copy link
Author

adamcunnington-mlg commented May 18, 2023

Here's our wrapper:

{%- test nested_data_test_wrapper(model, column_name, test) -%}
  {%- set match = modules.re.match("^(?:.+\s)?`?(?P<database>.+?)`?\.`?(?P<schema>.+?)`?\.`?(?P<table>.+?)`?(?:\s.+)?$",
                                   model|string) -%}
  {%- set relation = adapter.get_relation(match.database, match.schema, match.table) -%}
  {%- set column_name_hierarchy = column_name.split(".")|map("trim", "`")|list -%}
  {%- set columns = adapter.get_columns_in_relation(relation) -%}
  {%- set parent_column = columns|selectattr("name", "equalto", column_name_hierarchy[0])|list -%}
  {%- set ns = namespace(ancestor_columns=[], unnest_alias="",
                         from_sql="FROM (SELECT ROW_NUMBER() OVER () AS row_number, * FROM %s)" % model) -%}
  {%- for child_column in parent_column recursive -%}
    {%- set ns.child_column_name = child_column.quoted -%}
    {%- set ns.ancestor_columns = ns.ancestor_columns + [ns.child_column_name] -%}
    {%- set ns.unnest_alias = ns.ancestor_columns|join(".")|string -%}
    {%- if child_column.mode == "REPEATED" -%}
      {%- set ns.from_sql = ns.from_sql + " CROSS JOIN UNNEST (%s) AS %s" % (ns.unnest_alias, ns.child_column_name) -%}
      {%- set ns.ancestor_columns = [ns.child_column_name] -%}
    {%- endif -%}
    {%- if child_column.fields -%}
      {{ loop(child_column.fields|selectattr("name", "equalto", column_name_hierarchy[loop.depth])|list) }}
    {%- endif -%}
  {%- endfor -%}
  {%- set model = "(SELECT row_number, %s AS %s %s)" % (ns.unnest_alias, ns.child_column_name, ns.from_sql) -%}
  {%- if test == "accepted_values" -%}  -- column-level test
{{ test_accepted_values(model, ns.child_column_name, kwargs.get("values"), kwargs.get("quote", True)) }}
  {%- elif test == "dbt_expectations.expect_column_values_to_be_in_set" -%}  -- column-level test
{{ dbt_expectations.test_expect_column_values_to_be_in_set(model, ns.child_column_name, kwargs.get("value_set"),
                                                           quote_values=kwargs.get("quote_values"),
                                                           row_condition=kwargs.get("row_condition")) }}
  {%- elif test == "dbt_expectations.expect_column_values_to_be_null" -%}  -- column-level test
{{ dbt_expectations.test_expect_column_values_to_be_null(model, ns.child_column_name,
                                                         row_condition=kwargs.get("row_condition")) }}
  {%- elif test == "dbt_expectations.expect_column_values_to_match_regex" -%}  -- column-level test
{{ dbt_expectations.test_expect_column_values_to_match_regex(model, ns.child_column_name, kwargs.get("regex"),
                                                             row_condition=kwargs.get("row_condition"),
                                                             is_raw=kwargs.get("is_raw", False),
                                                             flags=kwargs.get("flags", "")) }}
  {%- elif test == "dbt_utils.expression_is_true" -%}  -- column-level test
{{ dbt_utils.test_expression_is_true(model, kwargs.get("expression"), ns.child_column_name) }}
  {%- elif test == "dbt_utils.mutually_exclusive_ranges" -%}  -- model-level test
    {%- set partition_by = ["row_number"] -%}
    {%- for partition_by_column_name in kwargs.get("partition_by", []) -%}
      {{ partition_by.append("%s.%s" % ns.child_column_name, partition_by_column_name) }}
    {%- endfor -%}
{{ dbt_utils.test_mutually_exclusive_ranges(model, lower_bound_column=kwargs.get("lower_bound_column"),
                                            upper_bound_column=kwargs.get("upper_bound_column"),
                                            partition_by=partition_by.join(","),
                                            gaps=kwargs.get("gaps", "allowed"),
                                            zero_length_range_allowed=kwargs.get("zero_length_range_allowed", False)) }}
  {%- elif test == "dbt_utils.unique_combination_of_columns" -%}  -- model-level test
    {%- set combination_of_columns = ["row_number"] -%}
    {%- for combination_of_columns_column_name in kwargs.get("combination_of_columns", []) -%}
      {{ combination_of_columns.append("%s.%s" % ns.child_column_name, combination_of_columns_column_name) }}
    {%- endfor -%}
{{ dbt_utils.test_unique_combination_of_columns(model, combination_of_columns,
                                                quote_columns=kwargs.get("quote_columns", False)) }}
  {%- elif test == "not_null" -%}  -- column-level test
{{ test_not_null(model, ns.child_column_name) }}
  {%- elif test == "relationships" -%}  -- column-level test
{{ test_relationships(model, ns.child_column_name, kwargs.get("to"), kwargs.get("field")) }}
  {%- elif test == "unique" -%}  -- column-level test
{{ test_unique(model, ns.child_column_name) }}
  {%- endif -%}
{%- endtest -%}

and example usage:

columns:
- name: foo.bar.baz
  tests:
  - nested_data_test_wrapper:
      config:
        where: "`some_other_column` IS NOT NULL" -- for example, just showing this still works
      test: accepted_values
      values:
      - value1
      - value2
      - value3
  - nested_data_test_wrapper:
      test: unique

@adamcunnington-mlg
Copy link
Author

Heh, no one is interested in this?

@jtcohen6 jtcohen6 self-assigned this Jun 25, 2023
@jtcohen6
Copy link
Contributor

@adamcunnington-mlg Sorry for the delay - I'm not uninterested :)

dbt-core doesn't really know anything about nested fields right know. dbt-bigquery knows a little bit (more thanks to dbt-labs/dbt-bigquery#738), but still not very much. As you say, it's worked out nicely that BigQuery lets you:

  • add descriptions to a nested field by treating it like a normal column named foo.bar
  • query that nested field as if it were a normal column named foo.bar, if foo is a struct — but not, as you rightly point out, if foo is an array

We have some open issues (related to our work on "model contracts") to better handle when foo is an array, and to teach dbt-core more about nested fields (by upstreaming the logic from dbt-bigquery):

Your proposal strikes me as totally reasonable:

This can be achieved by having the test wrap do nothing more than setting the model variable to be a sub-query. dbt actually does the exact same thing when you provide config/where. This way, the nested data test can be a totally generic wrapper, constructing a single column of data and then calling the relevant test.

Considerations:

  • As you mention, only the (un)nested field would appear in the query result, limiting the size but potentially also the usefulness of --store-failures. I agree that feels like a reasonable tradeoff. For the purposes of testing, we treat the repeated record as if it were its own standalone table.
  • We'd need to call an adapter method/macro within the subquery to do the unnesting. For BigQuery and Trino it's [cross join] unnest; for SparkSQL it's expode; I'm sure there are more. FWIW, I don't think we can support Snowflake's lateral flatten here, since it does not have a real concept of structured (typed) nested fields—only variant types (JSON strings) where anything goes.
  • If we were to implement this within dbt-core, we'd need to special case . as an indicator of field nesting. That's true on BigQuery/Spark/etc, but it doesn't have that denotation on Postgres/Redshift/Snowflake, and we'd risk doing the wrong thing with someone's "weird.Column.Naming.Convention".

As far as the implementation - it looks like you took inspiration from JT's excellent discourse post several years ago. I think the goal would be to:

  • Rewrite portions of this in Python (for readability)
  • Add the logic within our "generic test builder" — same place that get_where_subquery is called — where the goal would be to interpolate the subquery and modify the column_name accordingly

While this is a neat problem to think through, I don't see it as very high priority for us to implement. If a member of the community wants to take a crack at this, I'd be all for it.

@jtcohen6 jtcohen6 added help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors dbt tests Issues related to built-in dbt testing functionality and removed triage labels Jun 26, 2023
@jtcohen6 jtcohen6 removed their assignment Jun 26, 2023
@adamcunnington-mlg
Copy link
Author

adamcunnington-mlg commented Jun 28, 2023

@jtcohen6 thanks for the super-considered response as always! It makes complete sense and sounds like the right implementation route.

I agree on this not being priority given there is a workaround and perhaps nested types are still not used as much as they should and thus is impacting the community less than it ought!

Could you comment on how table-level tests might work? Would it be that once dbt-core is aware of nesting, table-level tests could simply have reference columns within arrays and dbt-core would handle the column-level sub-querying? This feels more complex when a table-level test references multiple columns that are within different arrays / depths. This is probably invalid (logically) but something would still need to "handle" this.

The one thing i did want to call out though is that foo.bar.baz syntax is convenient for referencing columns within a repeated column (array) but you could argue it's unfortunate and ambiguous vs. struct/nested references. Some jsonPath implementations would prefer something like foo[*].bar to be explicit that foo is an array, not a struct. Indeed, it's slightly strange that I can annotate both nested and array columns with foo.bar but I understand why this works thanks to the adapter-level magic. What do you think dbt-core should expect in the future?

@hugohahn009
Copy link

Here's our wrapper:

{%- test nested_data_test_wrapper(model, column_name, test, from_model=None) -%}
  {%- set ns = namespace(field_selector_part="", column_list=[], unnest_identifier="") -%}
  {%- set ns.from_sql = "FROM %s" % model -%}
  {%- set hierarchy = column_name.split(".") -%}
  {%- set columns = adapter.get_columns_in_relation(model if from_model == None else from_model) -%}
  {%- set root = columns | selectattr("name", "equalto", hierarchy[0]) | list -%}
  {%- for child in root recursive -%}
    {%- set child_name = "`" ~ child.name ~ "`" -%}
    {%- set ns.column_list = ns.column_list + [child_name] -%}
    {%- set ns.unnest_identifier = ns.column_list | join(".") | string -%}
    {%- if child.mode == "REPEATED" -%}
      {%- set ns.from_sql = ns.from_sql ~ " CROSS JOIN UNNEST (%s) AS %s" % (ns.unnest_identifier, child_name) -%}
      {%- set ns.column_list = [child_name] -%}
    {%- endif -%}
    {%- if child.fields -%}
      {{ loop(child.fields | selectattr("name", "equalto", hierarchy[loop.depth]) | list) }}
    {%- endif -%}
  {%- endfor -%}
  {%- set model = "(SELECT %s AS `column` %s)" % (ns.unnest_identifier, ns.from_sql) -%}
  {%- if test == "accepted_values" -%}
{{ test_accepted_values(model, "column", kwargs.get("values"), kwargs.get("quote", True))}}
  {%- elif test == "not_null" -%}
{{ test_not_null(model, "column") }}
  {%- elif test == "relationships" -%}
{{ test_relationships(model, "column", kwargs.get("to"), kwargs.get("field")) }}
  {%- elif test == "unique" -%}
{{ test_unique(model, "column") }}
  {%- endif -%}
{%- endtest -%}

and example usage:

columns:
- name: foo.bar.baz
  tests:
  - nested_data_test_wrapper:
      config:
        where: "`some_other_column` IS NOT NULL" -- for example, just showing this still works
      test: accepted_values
      values:
      - value1
      - value2
      - value3
  - nested_data_test_wrapper:
      test: unique

@adamcunnington-mlg thank you very much for this proposition. I'm stealing your code because it's too good ! Great job 👌 🙏

@Jeniffen
Copy link

Jeniffen commented Aug 2, 2023

Here's our wrapper:

{%- test nested_data_test_wrapper(model, column_name, test, from_model=None) -%}
  {%- set ns = namespace(field_selector_part="", column_list=[], unnest_identifier="") -%}
  {%- set ns.from_sql = "FROM %s" % model -%}
  {%- set hierarchy = column_name.split(".") -%}
  {%- set columns = adapter.get_columns_in_relation(model if from_model == None else from_model) -%}
  {%- set root = columns | selectattr("name", "equalto", hierarchy[0]) | list -%}
  {%- for child in root recursive -%}
    {%- set child_name = "`" ~ child.name ~ "`" -%}
    {%- set ns.column_list = ns.column_list + [child_name] -%}
    {%- set ns.unnest_identifier = ns.column_list | join(".") | string -%}
    {%- if child.mode == "REPEATED" -%}
      {%- set ns.from_sql = ns.from_sql ~ " CROSS JOIN UNNEST (%s) AS %s" % (ns.unnest_identifier, child_name) -%}
      {%- set ns.column_list = [child_name] -%}
    {%- endif -%}
    {%- if child.fields -%}
      {{ loop(child.fields | selectattr("name", "equalto", hierarchy[loop.depth]) | list) }}
    {%- endif -%}
  {%- endfor -%}
  {%- set model = "(SELECT %s AS `column` %s)" % (ns.unnest_identifier, ns.from_sql) -%}
  {%- if test == "accepted_values" -%}
{{ test_accepted_values(model, "column", kwargs.get("values"), kwargs.get("quote", True))}}
  {%- elif test == "not_null" -%}
{{ test_not_null(model, "column") }}
  {%- elif test == "relationships" -%}
{{ test_relationships(model, "column", kwargs.get("to"), kwargs.get("field")) }}
  {%- elif test == "unique" -%}
{{ test_unique(model, "column") }}
  {%- endif -%}
{%- endtest -%}

and example usage:

columns:
- name: foo.bar.baz
  tests:
  - nested_data_test_wrapper:
      config:
        where: "`some_other_column` IS NOT NULL" -- for example, just showing this still works
      test: accepted_values
      values:
      - value1
      - value2
      - value3
  - nested_data_test_wrapper:
      test: unique

This is amazing - Thank you very much, this saved my day! 🙏

@adamcunnington-mlg
Copy link
Author

P.s. I have updated the test wrapper code above as there are a few bug fixes and additional tests added since

@msiddiq1400
Copy link

plz this need to be fixed, already spent a day, figuring out what the issue was, who can fix this issue with DBT ? @adamcunnington-mlg

@msiddiq1400
Copy link

is there any way around this, a quick fix ? @adamcunnington-mlg

@adamcunnington-mlg
Copy link
Author

@msiddiq1400 what issue are you talking about - you've not specified.

@vittorfp
Copy link

This feature seems to have already been implemented.

@adamcunnington-mlg
Copy link
Author

@vittorfp really? where - can you share link to updated docs? was this in 1.8/1.9?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dbt tests Issues related to built-in dbt testing functionality enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors
Projects
None yet
Development

No branches or pull requests

6 participants