
dbt Constraints / model contracts #574

Merged (43 commits, Feb 17, 2023)
Conversation

b-per (Contributor) commented Dec 22, 2022

resolves #558

Description

Adds the ability to provide a list of columns for a model and force the model to a specific table schema. This PR also allows users to add not null constraints on columns.

Related Adapter Pull Requests

Must be reviewed with passing tests

Checklist

cla-bot added the cla:yes label Dec 22, 2022
github-actions (Contributor) commented
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-spark contributing guide.

b-per (Contributor, Author) commented Dec 23, 2022

There is a bit of complication with the Spark implementation:

  1. Spark doesn't support create or replace table as <SQL> with a schema. We get an Operation not allowed: Schema may not be specified in a Replace Table As Select (RTAS) error.
    • We could either drop the table first and then use create table as (with a schema and without replace), or do a create or replace table <schema> followed by an insert into (like it is done for Postgres/Redshift, and what I am doing here; see the sketch below).
  2. The previous point raises another issue, though. Spark doesn't support begin/end for SQL transactions, so whatever approach we take from the point above will leave the table either empty or dropped for some time, until the data insertion finishes.
  3. Finally, our current implementation of the Spark adapter doesn't allow sending multiple SQL statements separated by ;, as spark.sql() only accepts one SQL statement. Options would be:
    1. split the statements by ; and do a spark.sql(statement) call for each
    2. or modify the table materialization to do multiple call statement blocks when the create and the insert need to be run separately (what I am trying here)
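
A minimal sketch of that create-then-insert pattern, with hypothetical schema, table, and column names; the two statements would have to be submitted as separate spark.sql() calls, given the single-statement limitation above:

create table my_schema.my_model (
    id int not null,
    name string
)
using delta

insert into my_schema.my_model
select id, name
from my_schema.my_model__dbt_tmp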

Overall, all these limitations make the solution look quite brittle.

Also, our dbt-spark tests are still using the older paradigm with decorators. I got them working now, but it makes it difficult to reuse similar tests across adapters.

b-per (Contributor, Author) commented Dec 23, 2022

The CI tests fail when running

create or replace table test16717917161013806450_test_constraints.constraints_column_types (
    int_column int not null,
    float_column float,
    bool_column boolean,
    date_column date
)
using delta

and the error is Caused by: java.lang.ClassNotFoundException: Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html, but I don't know what it means.

The same query runs fine in Databricks on the hive_metastore.

sungchun12 (Contributor) commented
I've been talking with the people who maintain dbt-databricks, and it looks like they implemented constraints using the meta tag. I haven't dived deep into this yet, but it may be worth skimming to see if it resolves any of your comments above: databricks/dbt-databricks#71

b-per (Contributor, Author) commented Jan 3, 2023

The approach in the databricks adapter is to do a create table as followed by an alter to add constraints.

I can follow a similar approach here, but it means that, due to the lack of begin/end, it is possible that the table gets created properly but the alter statement then fails when adding the constraints. We would then have a table that has been refreshed with the new data despite the constraints defined in dbt not being satisfied. This is very different from the rest of the adapters.
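
For illustration, that create-then-alter sequence looks roughly like the following (hypothetical table, model, and constraint names); if the alter fails, the table has already been replaced with the new data:

create or replace table my_schema.my_model
using delta
as select * from my_schema.my_staging_model;

alter table my_schema.my_model
add constraint id_is_positive check (id > 0);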

sungchun12 (Contributor) commented
@b-per Okay, I see 2 paths here:

  1. Can you test out the below syntax to see if it'll work? I got this from ChatGPT (think: looks right but may be wrong).
START TRANSACTION;
-- execute some SQL statements here
COMMIT;


  2. We use a brand new approach, described below.
  • Save the original Spark table as a temp table (if it exists; if it's the first time running this, skip this step).
  • Create an empty Spark table with constraints.
  • Insert rows into the empty Spark table with constraints.
  • If it fails at any of these steps, we use SQL to revert any changes, rather than relying on the transaction mechanics traditionally found in things like Postgres.

b-per (Contributor, Author) commented Jan 4, 2023

I think that ChatGPT is a bit out of its league here 😄

I tried it and got an Operation not allowed: START TRANSACTION error. This is consistent with Databricks' statement in their docs that:

  • Databricks manages transactions at the table level. Transactions always apply to one table at a time
  • Databricks does not have BEGIN/END constructs that allow multiple operations to be grouped together as a single transaction. Applications that modify multiple tables commit transactions to each table in a serial fashion
  • You can combine inserts, updates, and deletes against a table into a single write transaction using MERGE INTO.
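
For reference, a single-table MERGE INTO of the kind described there, with hypothetical table and column names:

merge into my_schema.target_table as t
using my_schema.updates as s
on t.id = s.id
when matched and s.is_deleted then delete
when matched then update set t.value = s.value
when not matched then insert (id, value) values (s.id, s.value)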

So we'll have to go with approach 2.

But again, the lack of cross-table transactions is tricky. We can't:

  • save a backup of the original table and revert it if the new one fails
  • create the new table with a tmp name and, in a transaction, swap/rename the old one and the new one

The only thing I could think of, in order not to have any time where the table doesn't exist, would be to:

  1. create the new table with a tmp name, making sure that creating the table, loading the data, and adding the constraints all work
  2. do a create or replace table <original_table> deep clone <tmp_table> (docs; see the sketch below). This seems to be pretty inefficient from an IO standpoint (copying the whole dataset), but I think that using a shallow clone might not work when we perform the last step
  3. drop the tmp table
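
A sketch of steps 2 and 3, with hypothetical table names:

create or replace table my_schema.my_model
deep clone my_schema.my_model__tmp;

drop table my_schema.my_model__tmp;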

sungchun12 (Contributor) commented
@b-per I like your approach better: creating the new table as a temp table and then replacing the original table once everything is correct.

When it comes to performance, I don't see a way around the IO blockers, because we bumped into the same problems with the Redshift and Postgres implementations when it comes to inserting and copying data over. The tough thing for Spark will be manually building out the rollback logic for specific steps, and we'll have to explore how far Jinja can go in that department.

@jtcohen6 do you have pro tips in your spark experience with DDL strategies?

jtcohen6 (Contributor) commented
Sorry I missed this a few weeks ago! Gross.

We can verify column names (later: also data types) by running the model SQL query with where false limit 0, in advance of creating/replacing the real table.
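
For example, with a stand-in for the model's SQL, the dry-run check could look something like:

select * from (
    select 1 as id, cast(null as date) as updated_at
) as model_subquery
where false limit 0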

To enforce not null + check constraints, without first dropping the already-existing table... we have only a few bad options, and no good ones.

Those options, as I see them:

  1. Add constraints after the table is created. If it raises an error, the model has already been built, but we'll still be able to report the error and skip downstream models from building. (This is comparable to dbt test today.) The constraints will also prevent new bad data from being inserted/merged into incremental models.
  2. Accept either the risk of significant downtime (drop + recreate), or the risk of the model taking significantly longer (create table in temp location + apply constraints + deep-clone to new location if constraints succeed).
  3. Verify the checks ourselves, by saving the model SQL as a temporary view, and running actual SQL against it (a la dbt test), and then only create/replace a table from that temporary view if all checks pass. This mostly loses the value of the constraints as actually enforced by the data platform...

My vote would be for option 1. I'd be willing to document this as a known limitation of constraints on Spark/Databricks. Also, I believe this approach matches up most closely with the current implementation in dbt-databricks, which "persists" constraints after table creation (similar to persist_docs).

sungchun12 (Contributor) commented
Let's go with option 1 as this has the most reasonable tradeoffs. The other options have more cost than benefit. It's good to know we have the databricks implementation as a reference! Thanks for thinking this through Jerco!

sungchun12 (Contributor) commented Jan 31, 2023

@Fleid when your team reviews this, keep in mind Spark's limitations in how constraint behaviors work as a whole compared to Snowflake, Redshift, and Postgres, which are all easier to reason about than Spark.

We'll need your help troubleshooting the failed databricks test.

sungchun12 marked this pull request as ready for review January 31, 2023 16:23
{% for column_name in column_dict %}
{% set constraints_check = column_dict[column_name]['constraints_check'] %}
{% for constraint_check in constraints_check %}
{%- set constraint_hash = local_md5(column_name ~ ";" ~ constraint_check) -%}
Review comment (Contributor):

For later: Constraint hash is a sensible default, since we need a unique identifier. We may also want to let users define their own custom name for the constraint.

{% set constraints_check = column_dict[column_name]['constraints_check'] %}
{% for constraint_check in constraints_check %}
{%- set constraint_hash = local_md5(column_name ~ ";" ~ constraint_check) -%}
{% if not is_incremental() %}
Review comment (Contributor):

For later: dbt-labs/dbt-core#6755

(Maybe just drop a comment to that issue for now)

{%- set constraint_hash = local_md5(column_name ~ ";" ~ constraint_check) -%}
{% if not is_incremental() %}
{% call statement() %}
alter table {{ relation }} add constraint {{ constraint_hash }} check ({{ column_name }} {{ constraint_check }});
Review comment (Contributor):

Should users define the check including the column name, or not? In the current implementation, it is included, so it would be repeated here.

Reply (Contributor, Author):

I am first trying to get the tests passing with it included, but in my mind it doesn't make sense to add the column name. Technically, today we can put the check of a given column under another one, which seems odd to me.

Reply (Contributor):

Even though you can add a check to another column, we shouldn't limit developers. The check inline with the respective column provides a reasonable signal that there should be a 1:1 mapping, even if we don't enforce it.

Comment on lines 184 to 185
{% set constraints_check = column_dict[column_name]['constraints_check'] %}
{% for constraint_check in constraints_check %}
Review comment (Contributor):

In the current implementation, constraints_check is a string, not a list, so we shouldn't loop over it here. That's why we're only seeing the first character (() show up in the CI test!

alter table test16752556745428623819_test_constraints.my_model add constraint 50006a0485dbd5df8255e85e2c79411f check (id ();

We are planning to change this in the future by unifying these into a single constraints attribute: dbt-labs/dbt-core#6750
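
For comparison, the statement that should be generated would look something like this, with a hypothetical constraint name and check expression standing in for the truncated one above:

alter table test16752556745428623819_test_constraints.my_model
add constraint check_my_model_id check (id > 0);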

Reply (Contributor, Author):

I was just looking at it, yes, and I will update my code.
In my mind it could (or should) be a list; I am not sure why we would only allow one value (maybe worth considering for the future).

Reply (Contributor):

Agreed! It should be a list. TK in 6750

dbeatty10 changed the title from "Add support for constraints" to "dbt Constraints / model contracts" on Feb 14, 2023
jtcohen6 requested a review from MichelleArk on February 16, 2023 16:54
Successfully merging this pull request may close these issues.

[CT-1684] [Feature] Add constraints, data type, and check enforcement on SQL Tables