Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Test for uniqueness passes when duplicates are present in BigQuery column #11067

Open
2 tasks done
steffan-roughsedge opened this issue Nov 28, 2024 · 2 comments
Open
2 tasks done
Labels
awaiting_response bug Something isn't working

Comments

@steffan-roughsedge
Copy link

steffan-roughsedge commented Nov 28, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When running the out-of-the-box 'unique' test the test will pass even when there are duplicates in the column if:

  • The test is running against a BigQuery table
  • The column being tested has exactly the same name as the table

Expected Behavior

The test should fail if the there are duplicates in the column being tested

Steps To Reproduce

Create table called my-project.my-dataset.orders with a single column named 'orders'. Insert duplicate values into the 'orders' column. Run the DBT 'unique' test against this column. The test passes.

Relevant log output

No response

Environment

- OS:Mac OSX Sequoia
- Python: 3.11.2
- dbt: 1.8.3
- Database: BigQuery

Which database adapter are you using with dbt?

bigquery

Additional Context

Looking at the generated code (shown below) for the test the issue occurs because the table in the first CTE is not aliased, therefore the subsequent GROUP BY in the main SELECT is grouping by the whole table.

A similar issue was previously logged here, however it looks like it was closed and the issue I am experiencing still remains.

with dbt_test__target as (

  select orders as unique_field
  from `my-project`.`my-dataset`.`orders` -- Add an alias for this table here
  where orders is not null

)

select
    unique_field,
    count(*) as n_records

from dbt_test__target
group by unique_field
having count(*) > 1
@steffan-roughsedge steffan-roughsedge added bug Something isn't working triage labels Nov 28, 2024
@dbeatty10
Copy link
Contributor

Thanks for reaching out @steffan-roughsedge !

I haven't been able to reproduce this myself. Could you provide project files similar to below to help us reproduce the issue you are seeing?

The original issue (#2061) and draft PR (#2075) got closed as stale, but we'd review a new pull request if anyone is interested in solving this for dbt-bigquery for both unique and not_null. We'd just need to give care to make sure it doesn't break anyone, especially given how rare the bug scenario is.

The relevant default adapter code for each of the current implementations is located here:

Depending on the details of the solution, we'd either choose to adopt it as the new default for all adapters or just provided targeted solution for dbt-bigquery.

Either way, any implementation should include relevant tests that include the scenario described above with the model name and column sharing the same name.

Project files

Create these files:

models/orders.sql

select struct(1 as id) as orders

models/_models.yml

models:
  - name: orders
    columns:
      - name: orders
        tests:
          - unique
          - not_null

Run these commands:

dbt build --full-refresh

Get this output:

$ dbt build --full-refresh

04:18:40  Running with dbt=1.8.8
04:19:05  Registered adapter: bigquery=1.8.3
04:19:06  Found 1 model, 2 data tests, 473 macros
04:19:06  
04:19:40  Concurrency: 10 threads (target='blue')
04:19:40  
04:19:40  1 of 3 START sql view model dbt_dbeatty.orders ................................. [RUN]
04:19:42  1 of 3 OK created sql view model dbt_dbeatty.orders ............................ [CREATE VIEW (0 processed) in 1.57s]
04:19:42  2 of 3 START test not_null_orders_orders ....................................... [RUN]
04:19:42  3 of 3 START test unique_orders_orders ......................................... [RUN]
04:19:46  3 of 3 PASS unique_orders_orders ............................................... [PASS in 3.50s]
04:19:46  2 of 3 PASS not_null_orders_orders ............................................. [PASS in 4.08s]
04:19:46  
04:19:46  Finished running 1 view model, 2 data tests in 0 hours 0 minutes and 39.89 seconds (39.89s).
04:19:46  
04:19:46  Completed successfully
04:19:46  
04:19:46  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting_response bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants