Test data contract against dataframes or views #175

Closed
john-bunge-ed opened this issue May 6, 2024 · 12 comments

@john-bunge-ed

Hi!

During my tests with this package I ran:

from datacontract.data_contract import DataContract

data_contract = DataContract(data_contract_file="datacontract.yaml")
run = data_contract.test()

I learnt that in datacontract.yaml I can specify a data source in the servers block.

Question: Do I also have the option to test the contract against a Python or PySpark dataframe at runtime, or against a (temporary) SQL view?

Motivation: I do not want to persist the data I am testing the contract against. I am asking this from a Databricks perspective, but the question is also relevant if I have a data transformation pipeline in Python that runs locally and I want to test the data contract against a specific dataframe created during that pipeline.

@jochenchrist
Contributor

Currently, you cannot test a dataframe at runtime.

With Databricks, the recommended way is to use a Unity Catalog or Hive table.
And you can define the spark session:

from datacontract.data_contract import DataContract

data_contract = DataContract(
  data_contract_file="/Volumes/.../datacontract.yaml",
  spark=spark
)
run = data_contract.test()
run.result

datacontract.yaml

servers:
  production:
    type: databricks
    host: dbc-xxx.cloud.databricks.com
    catalog: datacontract_test_2
    schema: orders_latest

@jochenchrist
Contributor

If this does not fit your needs, could you motivate your use case a bit, so that we can discuss if it makes sense to implement testing a Spark dataframe?

@john-bunge-ed
Author

Thanks a lot for the fast reply @jochenchrist

The use case is that after ingestion data is stored encrypted in a staging area. From there, data is then consumed by a workflow that decrypts it, makes some transformations, and stores it into Bronze tables (parts of it encrypted in a different way).

That is, the entire data as provided by the Provider is not persisted unencrypted, hence testing it against the contract needs to happen at runtime in said workflow.

@simonharrer
Contributor

So, one answer here could be that you export the data contract to a pydantic model, which will test your data.

But I like the idea of being able to execute the checks directly on the in-memory data of a dataframe. We could add a new server type such as dataframe that, when tested, requires the dataframe to be passed into the test method.
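
As an illustration of the first option, validating rows in memory against a typed model, here is a minimal sketch. It swaps in a standard-library dataclass for the pydantic model mentioned above so that it runs without extra dependencies. The `Order` model, its fields, and `validate_rows` are hypothetical and are not generated by datacontract-cli.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical model; in practice the fields would be derived from the contract.
@dataclass
class Order:
    order_id: str
    order_total: int

    def __post_init__(self) -> None:
        # Dataclasses do not enforce types on their own, so check explicitly.
        if not isinstance(self.order_id, str) or not self.order_id:
            raise ValueError("order_id must be a non-empty string")
        # bool is a subclass of int, so reject it before the int check.
        if isinstance(self.order_total, bool) or not isinstance(self.order_total, int):
            raise ValueError("order_total must be an integer")
        if self.order_total < 0:
            raise ValueError("order_total must not be negative")


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return one error message per row that violates the model."""
    errors = []
    for index, row in enumerate(rows):
        try:
            Order(**row)
        except (TypeError, ValueError) as exc:
            # TypeError covers missing or unexpected fields.
            errors.append(f"row {index}: {exc}")
    return errors


rows = [
    {"order_id": "A-1", "order_total": 1999},
    {"order_id": "", "order_total": 500},  # empty identifier
    {"order_id": "A-3"},  # missing field
]
errors = validate_rows(rows)
```

Nothing is persisted here: the rows are checked while they are still in memory, which is the property the use case above asks for.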

@john-bunge-ed
Author

Sounds awesome @simonharrer. Happy my use case appears to make sense to you guys.

@john-bunge-ed
Author

Hi @simonharrer, @jochenchrist
Do you guys have some sort of roadmap for this library? Any idea when this functionality of having "dataframe" as server type might be added to the library?
Best, John.

@jochenchrist
Contributor

jochenchrist commented May 29, 2024

I'll be working on this ticket now :)

@jochenchrist jochenchrist self-assigned this May 29, 2024
jochenchrist added a commit to datacontract/datacontract-specification that referenced this issue May 29, 2024
@jochenchrist
Contributor

Implemented and scheduled for v0.10.7.

@john-bunge-ed can you test the current development version for your use case?

@john-bunge-ed
Author

Hi @jochenchrist

This is great news! Thanks!

Will test this by the end of Friday and send you feedback!

@john-bunge-ed
Author

john-bunge-ed commented May 31, 2024

Hi @jochenchrist

In Databricks (Runtime 14.3), I ran:
pip install --pre datacontract-cli==0.10.6

from datacontract.data_contract import DataContract
datacontract_obj = DataContract(
    data_contract_file='datacontract.yaml',
    spark=spark)
datacontract_run = datacontract_obj.test()

Result:
[image]
My datacontract_run object contains no test results.

The same happens when I run
pip install datacontract-cli==0.10.6

Is there a release date for release v0.10.7?

@jochenchrist
Contributor

Hi, I just triggered a new release 0.10.7.

@john-bunge-ed
Author

My tests all passed. Works like a charm 👍
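
For readers arriving at this issue, a hedged sketch of how the released feature might be used. It assumes that the contract's servers block declares `type: dataframe` (the server type proposed above) and that the DataFrame is picked up from a temporary view whose name matches a model in the contract. Both points are assumptions here and should be checked against the datacontract-cli documentation. The helper name `check_dataframe_against_contract` and its parameters are hypothetical; only `DataContract(data_contract_file=..., spark=...)`, `test()`, and `result` appear in this thread.

```python
def check_dataframe_against_contract(spark, df, view_name, contract_file="datacontract.yaml"):
    """Test an in-memory Spark DataFrame against a data contract.

    Assumption: the contract uses a server of `type: dataframe`, and
    `view_name` matches a model name defined in the contract.
    """
    # Imported lazily so that defining this helper does not require the package.
    from datacontract.data_contract import DataContract

    # Register the DataFrame as a session-scoped temporary view; no data is persisted.
    df.createOrReplaceTempView(view_name)

    data_contract = DataContract(data_contract_file=contract_file, spark=spark)
    run = data_contract.test()
    return run.result
```

In a pipeline this would be called right after the decryption and transformation steps, for example `check_dataframe_against_contract(spark, decrypted_df, "orders")`, where `decrypted_df` and `"orders"` are placeholders.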
