Test data contract against dataframes or views #175
Comments
Currently, you cannot test a dataframe at runtime. With Databricks, the recommended way is to use a Unity Catalog or Hive table.
If this does not fit your needs, could you motivate your use case a bit, so that we can discuss whether it makes sense to implement testing a Spark dataframe?
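For reference, a server block of the kind discussed here might look like the following hypothetical excerpt of a datacontract.yaml. The host, catalog, and schema values are placeholders, and the field names are assumed to follow the Data Contract Specification's server object for Databricks, so treat the exact keys as an assumption:

```yaml
# Hypothetical datacontract.yaml excerpt: a Databricks server pointing
# at a Unity Catalog table (all names are placeholders).
servers:
  production:
    type: databricks
    host: dbc-xxxxxxxx.cloud.databricks.com
    catalog: my_catalog
    schema: bronze
```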
Thanks a lot for the fast reply @jochenchrist. The use case is that after ingestion, data is stored encrypted in a staging area. From there, data is consumed by a workflow that decrypts it, applies some transformations, and stores it into Bronze tables (parts of it encrypted in a different way). That is, the data as provided by the provider is never persisted unencrypted, so testing it against the contract needs to happen at runtime in that workflow.
So, one answer here could be that you export the data contract to a pydantic model, which will test your data. But I like the idea of having the possibility to directly execute the checks on in-memory data of a dataframe. We could add a new server type such as …
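The pydantic workaround mentioned above can be sketched as follows. The `Order` model and its fields are hypothetical placeholders standing in for whatever the contract export would produce, not the actual export output:

```python
# Sketch: validating in-memory rows with a pydantic model, as one would
# after exporting a data contract to pydantic. "Order" and its fields
# are illustrative assumptions, not taken from a real contract.
from pydantic import BaseModel, ValidationError

class Order(BaseModel):
    order_id: str
    amount: float

def invalid_rows(rows):
    """Return the ids of rows that fail the model's type checks."""
    bad = []
    for row in rows:
        try:
            Order(**row)
        except ValidationError:
            bad.append(row["order_id"])
    return bad

rows = [
    {"order_id": "o1", "amount": 9.99},
    {"order_id": "o2", "amount": "not-a-number"},  # violates the float field
]
print(invalid_rows(rows))  # → ['o2']
```

This catches schema violations row by row without persisting anything, which matches the encrypted-staging constraint described above.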
Sounds awesome @simonharrer. Happy my use case appears to make sense to you guys. |
Hi @simonharrer, @jochenchrist |
I'll be working on this ticket now :) |
Implemented and scheduled for v0.10.7. @john-bunge-ed, can you test the current development version for your use case?
This is great news! Thanks! Will test this by the end of Friday and send you feedback! |
Hi, I just triggered a new release.
My tests all passed. Works like a charm 👍 |
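The runtime check confirmed above can be sketched roughly as follows, assuming datacontract-cli's Python API (a `DataContract` object that accepts a `SparkSession`) and that the temp view name must match the model name in the contract. The view name `orders`, the contract path, and the exact result handling are illustrative assumptions:

```python
# Sketch: testing an in-memory Spark dataframe against a data contract
# at runtime. Assumes datacontract-cli is installed and its DataContract
# class accepts a SparkSession; names here are placeholders.

def check_dataframe_against_contract(spark, df, contract_path="datacontract.yaml"):
    """Register the dataframe as a temp view and run the contract tests."""
    from datacontract.data_contract import DataContract  # requires datacontract-cli

    # The temp view name should match the model name in the contract.
    df.createOrReplaceTempView("orders")

    run = DataContract(data_contract_file=contract_path, spark=spark).test()
    if run.result != "passed":
        raise ValueError(f"Data contract checks failed: {run.result}")
    return run
```

Because nothing is written to a table, this fits the decrypt-transform workflow described earlier: the checks run against the dataframe while it exists only in memory.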
Hi!
During my tests with this package I learnt that in datacontract.yaml I can specify a data source in the server block.
Question: Do I also have the option to test the contract against a Python or PySpark dataframe at runtime, or against a (temporary) SQL view?
Motivation: I do not want to persist the data I am testing the contract against. I am asking this from a Databricks perspective, but the question is also relevant if I have a data transformation pipeline in Python that runs locally and I want to test the data contract against a specific dataframe created during that pipeline.