[CT-2573] [Feature] Support column-level tests on nested data, natively #7613
Comments
I'm not sure whether a generic nested data test wrapper is also possible for table-level tests, but we're exploring this too. Regardless, this FR currently focuses on column-level functionality, which is very doable and, I think, very sensible!
Here's our wrapper:
and example usage:
columns:
  - name: foo.bar.baz
    tests:
      - nested_data_test_wrapper:
          config:
            where: "`some_other_column` IS NOT NULL"  # for example, just showing this still works
          test: accepted_values
          values:
            - value1
            - value2
            - value3
      - nested_data_test_wrapper:
          test: unique
Heh, no one is interested in this?
@adamcunnington-mlg Sorry for the delay - I'm not uninterested :)
We have some open issues (related to our work on "model contracts") to better handle when
Your proposal strikes me as totally reasonable:
Considerations:
As far as the implementation - it looks like you took inspiration from JT's excellent discourse post several years ago. I think the goal would be to:
While this is a neat problem to think through, I don't see it as very high priority for us to implement. If a member of the community wants to take a crack at this, I'd be all for it.
@jtcohen6 thanks for the super-considered response as always! It makes complete sense and sounds like the right implementation route. I agree this is not a priority, given there is a workaround and nested types are perhaps still not used as much as they should be, so this affects the community less than it ought to! Could you comment on how table-level tests might work? Would it be that once dbt-core is aware of nesting, table-level tests could simply reference columns within arrays and dbt-core would handle the column-level sub-querying? This feels more complex when a table-level test references multiple columns that are within different arrays / depths. This is probably invalid (logically) but something would still need to "handle" this. The one thing I did want to call out though is that
@adamcunnington-mlg thank you very much for this proposition. I'm stealing your code because it's too good! Great job 👌 🙏
This is amazing - Thank you very much, this saved my day! 🙏
P.S. I have updated the test wrapper code above, as there have been a few bug fixes and additional tests added since.
Please, this needs to be fixed. I already spent a day figuring out what the issue was. Who can fix this issue with dbt? @adamcunnington-mlg
Is there any way around this, a quick fix? @adamcunnington-mlg
@msiddiq1400 what issue are you talking about - you've not specified.
This feature seems to have already been implemented. |
@vittorfp really? where - can you share link to updated docs? was this in 1.8/1.9? |
Is this your first time submitting a feature request?
Describe the feature
Context
When adding column metadata, for example, to a model YML file, it is already possible to reference a nested column in order to add a description. E.g.
Presumably, DBT is doing something clever and traversing dot notation. This works today and if configured as such and supported by the adapter, this description will be written back to the database. Notably, this works regardless of whether foo is a struct or an array (using BQ terminology here but the same concepts exist in other DBs). I assume this works regardless of depth too but I've not explicitly verified this. Generic tests will even work if foo is a struct (again, presumably with arbitrary depth).
However, generic tests won't work in the case of a repeated/array field - and this is not surprising - different SQL would be needed to unnest repeated fields. E.g. this will cause an error during the not_null test execution.
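A minimal sketch of the difference in SQL shape, written in Python since that is what dbt-core is written in. The Field class, the build_nested_subquery helper, and the _u0-style aliases are hypothetical names for illustration only, not dbt-core APIs, and the schema dict stands in for whatever column metadata an adapter would return. Struct levels stay as plain dot access, while each repeated level adds a CROSS JOIN UNNEST (BigQuery-style syntax assumed):

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """Hypothetical stand-in for adapter column metadata."""
    name: str
    repeated: bool = False                        # True for ARRAY / REPEATED fields
    children: dict = field(default_factory=dict)  # nested fields keyed by name


def build_nested_subquery(relation: str, path: str, schema: dict) -> str:
    """Build a single-column subquery for a dotted column path.

    Raises KeyError if the path does not match the schema.
    """
    joins = []
    current = None   # SQL expression for the value reached so far
    fields = schema
    for part in path.split("."):
        node = fields[part]
        expr = part if current is None else f"{current}.{part}"
        if node.repeated:
            # Repeated field: unnest it and continue from the alias.
            alias = f"_u{len(joins)}"
            joins.append(f"CROSS JOIN UNNEST({expr}) AS {alias}")
            expr = alias
        current = expr
        fields = node.children
    leaf = path.split(".")[-1]
    sql = f"SELECT {current} AS {leaf} FROM {relation}"
    if joins:
        sql += " " + " ".join(joins)
    return f"({sql})"


# Example: foo is an array of structs, bar is a struct, baz is a scalar.
example_schema = {
    "foo": Field("foo", repeated=True, children={
        "bar": Field("bar", children={"baz": Field("baz")}),
    }),
}
print(build_nested_subquery("my_table", "foo.bar.baz", example_schema))
# (SELECT _u0.bar.baz AS baz FROM my_table CROSS JOIN UNNEST(foo) AS _u0)
```

One caveat with this shape: CROSS JOIN UNNEST drops rows whose array is empty or NULL, so a real implementation would need to decide whether that is acceptable for each test.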
Someone from the community came up with an implementation for a nested not null test a while ago. It makes for some good inspiration but it has fallen short of what it should have been! The nested data logic should not be limited to not_null - it should be a generic wrapper test that calls out to some other generic test. This can be achieved by having the test wrapper do nothing more than set the model variable to be a sub-query. dbt actually does the exact same thing when you provide config/where. This way, the nested data test can be a totally generic wrapper, constructing a single column of data and then calling the relevant test. The only limitation is that not all columns can be passed to the downstream test, which the downstream test would ideally like to have in the case of store_failures = True, but this is a small (and reasonable) limitation.
I have an alternative implementation to this test that works in a generic way and we are using it for several downstream generic tests. The only limitation is that the test has to have some logic like the below which explicitly calls the relevant test:
As far as I can tell, within a jinja-only context, this can't be gotten around for two reasons:
- The test variable cannot be used to dynamically determine the macro to call - jinja's not quite object-oriented enough for that.
The Feature Request
dbt-core should do the SQL wrapping before it calls an existing generic test. It should do something similar to what my jinja is doing, but far more easily in Python: parse the foo.bar.baz column name, ask the adapter for the column metadata, then traverse it and build the subquery before calling the test. It's pretty straightforward stuff - just a case of adding CROSS JOIN UNNEST(<x>) AS <y> whenever it encounters a repeated field. There's slightly more to it than that but the community post does a great job of explaining the logic and I'm happy to provide our improvement too.
Describe alternatives you've considered
Doing it myself via a custom generic test that is a clever wrapper - but, as described in the main part of the post, this has some unavoidable limitations, as well as being a missed opportunity to benefit the rest of the community.
Who will this benefit?
Everyone that wants to test nested data. Huge impact!
Are you interested in contributing this feature?
Not familiar with dbt-core - hopefully someone can take the logic and just implement it in the right place, having thought through any implications.
Anything else?
No response