From de54159f04d483b258a84d6ff48bc4c510b9d2b7 Mon Sep 17 00:00:00 2001
From: aysegulcayir <49029525+aysegulcayir@users.noreply.github.com>
Date: Fri, 26 Jul 2024 10:33:08 +0200
Subject: [PATCH 1/7] Update README.md

---
 README.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/README.md b/README.md
index ff2299d..7e8fbc5 100644
--- a/README.md
+++ b/README.md
@@ -31,6 +31,15 @@ dfs = [df]
 results, brontabel_df, bronattribute_df, dqRegel_df = dq_suite.df_check(dfs, dq_rules, "showcase")
 ```
 
+# Code for Schema Validation
+
+If we want to make schema validation for column types of tables from Amsterdam Schema:
+
+fetch_schema_from_github()
+This function fetches the schema from GitHub for each table defined in dq_rules.
+
+generate_dq_rules_from_schema()
+This function generates data quality rules based on the schema fetched from GitHub.
 # Known exceptions
 The functions can run on Databricks using a Personal Compute Cluster or using a Job Cluster.
 Using a Shared Compute Cluster will results in an error, as it does not have the permissions that Great Expectations requires.

From 2415793596c8f8ab8e595f7cc946e68161d8db1a Mon Sep 17 00:00:00 2001
From: aysegulcayir <49029525+aysegulcayir@users.noreply.github.com>
Date: Fri, 26 Jul 2024 11:58:40 +0200
Subject: [PATCH 2/7] Update README.md

---
 README.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 7e8fbc5..a4209cf 100644
--- a/README.md
+++ b/README.md
@@ -35,11 +35,10 @@ results, brontabel_df, bronattribute_df, dqRegel_df = dq_suite.df_check(dfs, dq_
 
 If we want to make schema validation for column types of tables from Amsterdam Schema:
 
-fetch_schema_from_github()
-This function fetches the schema from GitHub for each table defined in dq_rules.
-
-generate_dq_rules_from_schema()
-This function generates data quality rules based on the schema fetched from GitHub.
+- Define validate_table_schema and validate_table_schema_url in dq_rules for table to be validated.
+- Use Amsterdam schema url for validate_table_schema_url
+The schema is fetched from GitHub for each table defined in dq_rules.
+With these schema inputs taken from dq_rules json, expect_column_values_to_be_of_type validation rule is generated for each column based on the schema
 # Known exceptions
 The functions can run on Databricks using a Personal Compute Cluster or using a Job Cluster.
 Using a Shared Compute Cluster will results in an error, as it does not have the permissions that Great Expectations requires.

From 27fb4efb86013f19ee614738841344a3a526d4e3 Mon Sep 17 00:00:00 2001
From: aysegulcayir <49029525+aysegulcayir@users.noreply.github.com>
Date: Fri, 26 Jul 2024 12:00:05 +0200
Subject: [PATCH 3/7] Update README.md

---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index a4209cf..5045193 100644
--- a/README.md
+++ b/README.md
@@ -36,9 +36,10 @@ results, brontabel_df, bronattribute_df, dqRegel_df = dq_suite.df_check(dfs, dq_
 If we want to make schema validation for column types of tables from Amsterdam Schema:
 
 - Define validate_table_schema and validate_table_schema_url in dq_rules for table to be validated.
-- Use Amsterdam schema url for validate_table_schema_url
+- Use Amsterdam schema url for validate_table_schema_url.
+
 The schema is fetched from GitHub for each table defined in dq_rules.
-With these schema inputs taken from dq_rules json, expect_column_values_to_be_of_type validation rule is generated for each column based on the schema
+With these schema inputs taken from dq_rules json, expect_column_values_to_be_of_type validation rule is generated for each column based on the schema.
 # Known exceptions
 The functions can run on Databricks using a Personal Compute Cluster or using a Job Cluster.
 Using a Shared Compute Cluster will results in an error, as it does not have the permissions that Great Expectations requires.

From 71bc034850c3f0903960b36310b84abe7e45bd69 Mon Sep 17 00:00:00 2001
From: aysegulcayir <49029525+aysegulcayir@users.noreply.github.com>
Date: Fri, 26 Jul 2024 13:22:12 +0200
Subject: [PATCH 4/7] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 5045193..49d18dc 100644
--- a/README.md
+++ b/README.md
@@ -31,7 +31,7 @@ dfs = [df]
 results, brontabel_df, bronattribute_df, dqRegel_df = dq_suite.df_check(dfs, dq_rules, "showcase")
 ```
 
-# Code for Schema Validation
+# Validate the schema of a table
 
 If we want to make schema validation for column types of tables from Amsterdam Schema:
 

From e91a78429c4c9ba6ac864aff06d7be24fef1dcba Mon Sep 17 00:00:00 2001
From: aysegulcayir <49029525+aysegulcayir@users.noreply.github.com>
Date: Fri, 26 Jul 2024 13:26:22 +0200
Subject: [PATCH 5/7] Update README.md

---
 README.md | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 49d18dc..70cac77 100644
--- a/README.md
+++ b/README.md
@@ -33,13 +33,16 @@ results, brontabel_df, bronattribute_df, dqRegel_df = dq_suite.df_check(dfs, dq_
 
 # Validate the schema of a table
 
-If we want to make schema validation for column types of tables from Amsterdam Schema:
+It is possible to validate the schema of an entire table to a schema definition from Amsterdam Schema in one go.
 
-- Define validate_table_schema and validate_table_schema_url in dq_rules for table to be validated.
-- Use Amsterdam schema url for validate_table_schema_url.
-
-The schema is fetched from GitHub for each table defined in dq_rules.
-With these schema inputs taken from dq_rules json, expect_column_values_to_be_of_type validation rule is generated for each column based on the schema.
+This is done by adding two fields to the "dq_rules" JSON when describing the table (See: https://github.com/Amsterdam/dq-suite-amsterdam/blob/main/dq_rules_example.json).
+
+You will need:
+
+- validate_table_schema: the id field of the table from Amsterdam Schema
+- validate_table_schema_url: the url of the table or dataset from Amsterdam Schema
+
+The schema definition is converted into column level expectations (expect_column_values_to_be_of_type) on run time.
 # Known exceptions
 The functions can run on Databricks using a Personal Compute Cluster or using a Job Cluster.
 Using a Shared Compute Cluster will results in an error, as it does not have the permissions that Great Expectations requires.

From 3269c35d27f69c0f5522241994490c727d2486cc Mon Sep 17 00:00:00 2001
From: ArthurKordes <75675106+ArthurKordes@users.noreply.github.com>
Date: Fri, 26 Jul 2024 13:40:41 +0200
Subject: [PATCH 6/7] Update README.md

---
 README.md | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 70cac77..db0a8ad 100644
--- a/README.md
+++ b/README.md
@@ -32,13 +32,9 @@ results, brontabel_df, bronattribute_df, dqRegel_df = dq_suite.df_check(dfs, dq_
 ```
 
 # Validate the schema of a table
-
-It is possible to validate the schema of an entire table to a schema definition from Amsterdam Schema in one go.
-
-This is done by adding two fields to the "dq_rules" JSON when describing the table (See: https://github.com/Amsterdam/dq-suite-amsterdam/blob/main/dq_rules_example.json).
+It is possible to validate the schema of an entire table to a schema definition from Amsterdam Schema in one go. This is done by adding two fields to the "dq_rules" JSON when describing the table (See: https://github.com/Amsterdam/dq-suite-amsterdam/blob/main/dq_rules_example.json).
 
 You will need:
-
 - validate_table_schema: the id field of the table from Amsterdam Schema
 - validate_table_schema_url: the url of the table or dataset from Amsterdam Schema
 

From cca36619578f5676bce1d5832de96669f58c5826 Mon Sep 17 00:00:00 2001
From: ArthurKordes <75675106+ArthurKordes@users.noreply.github.com>
Date: Fri, 26 Jul 2024 13:40:56 +0200
Subject: [PATCH 7/7] Update pyproject.toml

---
 pyproject.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pyproject.toml b/pyproject.toml
index 0e127c5..0eda86a 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "dq-suite-amsterdam"
-version = "0.5.0"
+version = "0.5.1"
 authors = [
   { name="Arthur Kordes", email="a.kordes@amsterdam.nl" },
   { name="Aysegul Cayir Aydar", email="a.cayiraydar@amsterdam.nl" }