[Bug] Unit test infers the wrong data type when the first record has some NULL values #821

leesiongchan · 2024-05-08T14:29:31Z

Is this a new bug in dbt-core?

I believe this is a new bug in dbt-core
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        rows:
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 1,
              is_closed: null, # <--- if we change both of these null values to true/false, the error goes away.
              is_validated: null, # <--- if we change both of these null values to true/false, the error goes away.
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:02:00",
              user_id: 1,
              is_closed: null,
              is_validated: true,
              user_action: validated,
            }
      - input: ref('stg_mycfg__user_snapshot')
        rows: []
    expect:
      rows:
        - { user_id: 1, is_closed: true, is_validated: true }
        - { user_id: 2, is_closed: null, is_validated: true }
        - { user_id: 3, is_closed: true, is_validated: null }
        - { user_id: 4, is_closed: null, is_validated: null }
        - { user_id: 5, is_closed: false, is_validated: true }
        - { user_id: 6, is_closed: true, is_validated: true }

When you run the test you will see error like this.

And here is the logs:

Input build

Expected result build

Interestingly, the incorrect casting happen in the dbt_internal_unit_test_expected cte but not the input cte.

Our current workaround is to use sql to cast the correct type, eg.

select
    '2024-01-01 10:01:00' as created_at,
    1 as user_id,
    null::boolean as is_closed,
    null::boolean as is_validated,
    'created' as user_action
union all
select '2024-01-01 10:02:00', 1, null, true, 'validated'

Expected Behavior

The test runner should be able to guess the data type correctly.

Steps To Reproduce

Provide unit test data with some null values in the first record.
Run the test
ERROR!

Relevant log output

Input build

Expected result build

Interestingly, the incorrect casting happen in the dbt_internal_unit_test_expected cte but not the input cte.

Environment

- OS: Ubuntu 24.04 WSL
- Python: 3.12.2
- dbt: 1.8.0-rc1

Which database adapter are you using with dbt?

redshift

Additional Context

No response

The text was updated successfully, but these errors were encountered:

dbeatty10 · 2024-05-09T20:52:40Z

Thanks for reporting this @leesiongchan !

I wasn't able to replicate this on my first try. Could you try out the files below and make any tweaks needed to trigger the error you saw?

Project files and commands

models/stg_mycfg__user_events.sql

select
    cast('2024-01-01 10:01:00' as timestamp) as created_at,
    1 as user_id,
    true as is_closed,
    false as is_validated,
    'created' as user_action

models/stg_mycfg__user_snapshot.sql

select 1 as id

models/int_mycfg__user_events_combined.sql

-- {{ ref("stg_mycfg__user_snapshot") }}
select * from {{ ref("stg_mycfg__user_events") }}

models/_unit.yml

unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        rows:
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 1,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:02:00",
              user_id: 2,
              is_closed: null,
              is_validated: true,
              user_action: validated,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 3,
              is_closed: true,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 4,
              is_closed: null,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 5,
              is_closed: false,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 6,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
      - input: ref('stg_mycfg__user_snapshot')
        rows: []
    expect:
      rows:
        - { user_id: 1, is_closed: true, is_validated: true }
        - { user_id: 2, is_closed: null, is_validated: true }
        - { user_id: 3, is_closed: true, is_validated: null }
        - { user_id: 4, is_closed: null, is_validated: null }
        - { user_id: 5, is_closed: false, is_validated: true }
        - { user_id: 6, is_closed: true, is_validated: true }

Run this command:

dbt build -s +stg_mycfg__user_events+1

leesiongchan · 2024-05-10T13:31:30Z

You have to change your first record with some NULL values: eg.

unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        rows:
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 1,
              is_closed: null,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:02:00",
              user_id: 2,
              is_closed: null,
              is_validated: true,
              user_action: validated,
            }
        ...

dbeatty10 · 2024-05-10T13:43:56Z

@leesiongchan This worked for me -- is it not working for you?

unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        rows:
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 1,
              is_closed: null,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:02:00",
              user_id: 2,
              is_closed: null,
              is_validated: true,
              user_action: validated,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 3,
              is_closed: true,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 4,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 5,
              is_closed: false,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 6,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
      - input: ref('stg_mycfg__user_snapshot')
        rows: []
    expect:
      rows:
        - { user_id: 1, is_closed: null, is_validated: null }
        - { user_id: 2, is_closed: null, is_validated: true }
        - { user_id: 3, is_closed: true, is_validated: null }
        - { user_id: 4, is_closed: true, is_validated: true }
        - { user_id: 5, is_closed: false, is_validated: true }
        - { user_id: 6, is_closed: true, is_validated: true }

leesiongchan · 2024-05-10T16:17:48Z

As long as the first record doesn't have any NULL values, the test runs as expected.

Oh... could this be dbt-redshift problem?

dbeatty10 · 2024-05-10T17:15:08Z

Ah! I can get the same error as you now.

It runs fine in dbt-postgres, but gives an error in dbt-redshift. See details below for a simplified example.

Reprex

models/some_input.sql

select
    1 as user_id,
    cast(true as boolean) as is_closed

models/my_model.sql

select * from {{ ref("some_input") }}

models/_unit.yml

unit_tests:

  - name: test__my_model
    model: my_model
    given:
      - input: ref('some_input')
        rows:
          - { user_id: 1, is_closed: null }
          - { user_id: 4, is_closed: true }
    expect:
      rows:
        - { user_id: 1, is_closed: null }
        - { user_id: 4, is_closed: true }

Run this command to see the error:

dbt build -s +my_model

Output:

17:09:03    Runtime Error in unit_test test__my_model (models/_unit.yml)
  An error occurred during execution of unit test 'test__my_model'. There may be an error in the unit test definition: check the data types.
   Database Error
    cannot cast type boolean to character varying

But if the order of the given inputs is swapped, then it runs successfully:

models/_unit.yml

unit_tests:

  - name: test__my_model
    model: my_model
    given:
      - input: ref('some_input')
        rows:
          - { user_id: 4, is_closed: true }
          - { user_id: 1, is_closed: null }
    expect:
      rows:
        - { user_id: 1, is_closed: null }
        - { user_id: 4, is_closed: true }

jnapoleon · 2024-05-11T19:34:38Z

I'm having a similar error where using redshift using csv's as the input and output

Looking at the expected output
It's casting something that should be numeric as a character
cast('0' as character varying(128)) as conversion_value,

When it should be casting it as whatever the output model schema is

dbeatty10 · 2024-05-13T13:03:59Z

Thanks for reporting an additional error scenario @jnapoleon 👍

Could you share the relevant example code as well so we can make sure it gets covered by any potential solution?

dbeatty10 · 2024-05-13T13:24:15Z

I tested the dbt project described in #821 against a variety of data platforms, and only Redshift failed. So depending on the nature of the solution, we may end up transferring this issue to dbt-redshift.

These all worked:

dbt-postgres
dbt-bigquery
dbt-snowflake
dbt-databricks

jnapoleon · 2024-05-15T14:21:21Z

Hey @dbeatty10 I've attached the logs

Let me know if you would rather me create a new issue?
log.txt

Thanks

John

dbeatty10 · 2024-05-15T16:32:24Z

Thanks for attaching your logs @jnapoleon 👍

Could you also provide a minimal code example like the "reprex" within #821 (comment)? That would allow us to determine if you are having the same issue or a different issue.

jnapoleon · 2024-05-15T18:22:03Z

@dbeatty10 I managed to resolve the issue, I don't think there was a bug with dbt.

There was some strange behaviour getting the data types using information schema from the created temporary table. It thought the date type was varchar for the temporary table but in the real final output model it was a numeric

The way I solved it was explicitly casting the fields as numeric in the final output model

joellabes · 2024-05-18T05:04:59Z

I am also seeing this. Because I'm doing cross-platform unit tests for the audit helper package, I have a diff of what snowflake generated vs redshift:

When nulls are present alongside numeric values, dbt is casting everything as a varchar(1), which makes unions between expected and actual impossible.

Here's the full file that ran in this CI run
reworked_compare_identical_tables_multiple_null_pk.txt

In my case, the input rows lead with the null but the expected output has a non-null value first.

unit_tests:
  - name: reworked_compare_identical_tables_multiple_null_pk
    model: unit_reworked_compare
    
    given:
      - input: ref('unit_test_model_a')
        rows:
          - { "id": , "col1": "abc", "col2": "def" }
          - { "id": , "col1": "hij", "col2": "klm" }
          - { "id": 3, "col1": "nop", "col2": "qrs" }
      - input: ref('unit_test_model_b')
        rows:
          - { "id": , "col1": "abc", "col2": "def" }
          - { "id": , "col1": "hij", "col2": "klm" }
          - { "id": 3, "col1": "nop", "col2": "qrs" }
        
    expect:
      rows:
        - {"dbt_audit_row_status": 'identical', 'id': 3, dbt_audit_num_rows_in_status: 1}
        - {"dbt_audit_row_status": 'nonunique_pk', 'id': , dbt_audit_num_rows_in_status: 2}
        - {"dbt_audit_row_status": 'nonunique_pk', 'id': , dbt_audit_num_rows_in_status: 2}

    overrides:
      vars:
        reworked_compare__columns: ['id', 'col1', 'col2']
        reworked_compare__event_time:
        reworked_compare__primary_key_columns: ['id']

VersusFacit · 2024-06-25T20:38:34Z

I made a little reprex for Redshift 💟 This will serve as a test in the PR to be

Note this does work on Postgres!

-- models/staging/a.sql
{{ config(materialized="table") }}

select 1 as id, 'a' as col1
union all
select 2, 'b'
union all
select 3, 'c'

-- models/staging/b.sql
{{config(materialized="table")}}

select * from {{ ref('a') }}

unit_tests:
  - name: test_simple

    model: b
    given:
      - input: ref('a')
        rows:
          - { "id":  , "col1": "d"}
          - { "id": , "col1": "e"}
          - { "id": 6, "col1": "f"}

    expect:
      rows:
        - {id:  , "col1": "d"}
        - {id:  , "col1": "e"}
        - {id: 6, "col1": "f"}

VersusFacit · 2024-07-02T10:40:37Z

Can't be merged without a unit test, so blocked until then. The work itself is expected to take 1 extra dev day and to be taken care of on Monday.

edit: moved forward with more deeply digging into the unit test framework for Core. Code coverage concerns (plus a holiday week and a support diversion) backed this up further, but we're not unblocked and the issue is in final review.

VersusFacit · 2024-07-12T11:15:35Z

As we wrap up this issue -- in final review -- I want to document that there is one case we are leaving "unsolved": No single row has all non-null values. An expected fixture like:

id,name
1,
,Jane

In this scenario, one can determine proper type for id from 1 and use Jane for column name. But Redshift's type interpolation on insertion relies on the first row inserted, and the null degrades to VARCHAR(1) by default. If you had type information, you could try to automate a column alter statement, but this is complex and there is no sensible entrypoint in the dbt Base Adapter to do so. (Moreover, even if we wanted to go down this path, we'd have to determine what this feature even should be)

Acknowledging this limitation, we have integrated a dbt.exceptions.ParsingError which is thrown when the expected fixture is lacking even just one row with all non-null values. That special row is to be used for first insertion into the database allowing Redshift (and others) to determine the proper types for column fields.

One object you may voice

Bite the bullet. O(m*n) scan the expected fixture and determine types from any canonical values in the columns. However, having looked into Base adapter, there is no expedient let alone cross-database way with the existing framework to address this without tearing out the library into a new set of modules. And what happens if the column is all nulls? There's really no way to determine what next. We'd just encourage users intentionally avoid that ambiguity at the start.

To do something more sophisticated is a lift we cannot prioritize right now.

The foibles of Redshift

Note, I did learn that this is a seemingly Redshift specific behavior. Other dbs like postgres and snowflake seem to have some more sophisticated type interpolation behaviors.

leesiongchan added bug Something isn't working triage labels May 8, 2024

leesiongchan changed the title ~~[Bug] Unit test inferred the wrong data type when first record has NULL value~~ [Bug] Unit test infers the wrong data type when the first record has some NULL values May 8, 2024

dbeatty10 added the unit tests Issues related to built-in dbt unit testing functionality label May 9, 2024

dbeatty10 added awaiting_response and removed triage labels May 9, 2024

github-actions bot added triage and removed awaiting_response labels May 10, 2024

dbeatty10 added awaiting_response and removed triage labels May 10, 2024

github-actions bot added triage and removed awaiting_response labels May 10, 2024

dbeatty10 removed the triage label May 10, 2024

graciegoheen transferred this issue from dbt-labs/dbt-core May 13, 2024

dbeatty10 mentioned this issue May 17, 2024

[Bug] unit tests' comparisons are sometimes sensitive to the order of records returned dbt-labs/dbt-core#10167

Closed

2 tasks

colin-rogers-dbt assigned mikealfare and unassigned mikealfare Jun 5, 2024

colin-rogers-dbt assigned VersusFacit Jun 20, 2024

This was referenced Jun 26, 2024

Adap 821/unit test infer wrong datatype #852

Merged

[Bug] Unit test fixture null values cause typing issues dbt-labs/dbt-core#10365

Closed

VersusFacit closed this as completed in #852 Jul 12, 2024

mikealfare linked a pull request Jul 17, 2024 that will close this issue

Revert "Adap 821/unit test infer wrong datatype" #860

Merged

VersusFacit mentioned this issue Jul 22, 2024

Adap 821/unit tests infer wrong datatype for None values in fixtures #885

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Unit test infers the wrong data type when the first record has some NULL values #821

[Bug] Unit test infers the wrong data type when the first record has some NULL values #821

leesiongchan commented May 8, 2024 •

edited by dbeatty10

Loading

dbeatty10 commented May 9, 2024

leesiongchan commented May 10, 2024

dbeatty10 commented May 10, 2024

leesiongchan commented May 10, 2024 •

edited

Loading

dbeatty10 commented May 10, 2024

Reprex

jnapoleon commented May 11, 2024

dbeatty10 commented May 13, 2024

dbeatty10 commented May 13, 2024

jnapoleon commented May 15, 2024

dbeatty10 commented May 15, 2024

jnapoleon commented May 15, 2024

joellabes commented May 18, 2024

VersusFacit commented Jun 25, 2024 •

edited

Loading

VersusFacit commented Jul 2, 2024 •

edited

Loading

VersusFacit commented Jul 12, 2024 •

edited

Loading

[Bug] Unit test infers the wrong data type when the first record has some NULL values #821

[Bug] Unit test infers the wrong data type when the first record has some NULL values #821

Comments

leesiongchan commented May 8, 2024 • edited by dbeatty10 Loading

Is this a new bug in dbt-core?

Current Behavior

Expected Behavior

Steps To Reproduce

Relevant log output

Environment

Which database adapter are you using with dbt?

Additional Context

dbeatty10 commented May 9, 2024

leesiongchan commented May 10, 2024

dbeatty10 commented May 10, 2024

leesiongchan commented May 10, 2024 • edited Loading

dbeatty10 commented May 10, 2024

Reprex

jnapoleon commented May 11, 2024

dbeatty10 commented May 13, 2024

dbeatty10 commented May 13, 2024

jnapoleon commented May 15, 2024

dbeatty10 commented May 15, 2024

jnapoleon commented May 15, 2024

joellabes commented May 18, 2024

VersusFacit commented Jun 25, 2024 • edited Loading

VersusFacit commented Jul 2, 2024 • edited Loading

VersusFacit commented Jul 12, 2024 • edited Loading

One object you may voice

The foibles of Redshift

leesiongchan commented May 8, 2024 •

edited by dbeatty10

Loading

leesiongchan commented May 10, 2024 •

edited

Loading

VersusFacit commented Jun 25, 2024 •

edited

Loading

VersusFacit commented Jul 2, 2024 •

edited

Loading

VersusFacit commented Jul 12, 2024 •

edited

Loading