Common SQLCheckOperators Various Functionality Update #25164

denimalpaca · 2022-07-19T18:21:01Z

The SQLCheckOperators under the Common provider have some issues since their initial release, including:

Lack of batching
A bug in the SQLTableCheckOperator that forces users to use only fully-aggregated checks
A bug in the output of the test failure exception in the SQLColumnCheckOperator that only ever returns the final SQL query built

This PR remedies these issues.

Commit adds a WHERE clause to the sql statement that allows for arbitrary batching in a given table.

When multiple table checks are given to the SQLTableCheckOperator and at least one is not a fully aggregate statement, a GROUP BY clause was previously needed. This commit updates the operator to use the get_pandas_df() method instead of _get_first() to return a pandas dataframe object that contains the check names and check results from the new style of query. The new style of query uses UNION ALL to run each test as its own SELECT statement, bypassing the need to do a GROUP BY.

Changed name of method from _get_failed_tests to _get_failed_checks to better match naming, and updated logic of the method to include an optional column param. The query in the column check operator is removed from the failed test exception message, as it was only ever showing the last query, instead of the relevant one(s). This is replaced by the column, which will be more useful in debugging.

potiuk · 2022-07-21T10:27:16Z

Some tests are failing

denimalpaca · 2022-07-21T13:51:19Z

Some tests are failing

Looking into this today, seems a test is missing a context parameter.

Without a table alias, the query does not run on Postgres and other databases. The alias is arbitrary and used only for proper query execution.

Commit adds a WHERE clause to the sql statement that allows for arbitrary batching in a given table.

When multiple table checks are given to the SQLTableCheckOperator and at least one is not a fully aggregate statement, a GROUP BY clause was previously needed. This commit updates the operator to use the get_pandas_df() method instead of _get_first() to return a pandas dataframe object that contains the check names and check results from the new style of query. The new style of query uses UNION ALL to run each test as its own SELECT statement, bypassing the need to do a GROUP BY.

Changed name of method from _get_failed_tests to _get_failed_checks to better match naming, and updated logic of the method to include an optional column param. The query in the column check operator is removed from the failed test exception message, as it was only ever showing the last query, instead of the relevant one(s). This is replaced by the column, which will be more useful in debugging.

Without a table alias, the query does not run on Postgres and other databases. The alias is arbitrary and used only for proper query execution.

…thub.com:denimalpaca/airflow into sql_check_operators_various_functionality_update

The table alias should be in the self.sql query build statement as that is where the table it needs to alias is defined.

Commit adds a WHERE clause to the sql statement that allows for arbitrary batching in a given table.

When multiple table checks are given to the SQLTableCheckOperator and at least one is not a fully aggregate statement, a GROUP BY clause was previously needed. This commit updates the operator to use the get_pandas_df() method instead of _get_first() to return a pandas dataframe object that contains the check names and check results from the new style of query. The new style of query uses UNION ALL to run each test as its own SELECT statement, bypassing the need to do a GROUP BY.

Changed name of method from _get_failed_tests to _get_failed_checks to better match naming, and updated logic of the method to include an optional column param. The query in the column check operator is removed from the failed test exception message, as it was only ever showing the last query, instead of the relevant one(s). This is replaced by the column, which will be more useful in debugging.

Without a table alias, the query does not run on Postgres and other databases. The alias is arbitrary and used only for proper query execution.

airflow/providers/common/sql/operators/sql.py

…thub.com:denimalpaca/airflow into sql_check_operators_various_functionality_update

Fixed bug in test where the dataframe column names did not match the operator's expected dataframe column names. Added more info to the SQLColumnCheckOperator's batch arg. Fixed the location of table aliasing in SQLTableCheckOperator.

Gives a clearer name to the parameter and adds templating to the SQLTableCheckOperator.

potiuk

Now it looks great! Thanks @denimalpaca for being so responsive to our comments :)

ashb · 2022-08-19T10:53:17Z

airflow/providers/common/sql/operators/sql.py

                for check_name, value in self.checks.items()
            ]
        )
+        partition_clause_statement = f"WHERE {self.partition_clause}" if self.partition_clause else ""
+        self.sql = f"SELECT check_name, check_result FROM ({checks_sql}) "
+        f"AS check_table {partition_clause_statement};"


Whoops. Missing + here. This string is just a literal, not part of self.sql (and thus the entire partition_clause_statement isn't used!

Just got back from vacation, I see this got merged without a fix here. I can open a separate PR to address it.

Yeah. Ash (nick-name Hawk-Eye) noticed it after we merged and started voting on it. I decided to continue to release as both operators were new but I would love to get it fixed for 1.2.0.

I hope you had good vacation @denimalpaca :D

ashb · 2022-08-19T10:53:55Z

airflow/providers/common/sql/operators/sql.py


-        self.sql = f"SELECT {check_mins_sql} FROM (SELECT {checks_sql} FROM {self.table});"
-        records = hook.get_first(self.sql)
+        records = hook.get_pandas_df(self.sql)


Why did we change from getting records to getting this as a pandas dataframe? This now places a hard requirement on using pandas for this operator, where as previously pandas was almost entirely optional.

So, this was changed because with hook.get_first, there was an issue with how the SQL was being written that caused only fully aggregated checks to be returned, unless the syntax of the SQL query was changed, but that would require either a fetch_all or get_pandas_df call as the new SQL needs to returned multiple lines. It seemed much easier and possibly more efficient to use pandas here, but if a fetch_all seems more reasonable this can be changed. Happy to explain more about the specific issue if curious.

I'd be curious to hear more about it :D

To expand a bit more, a check like col_a + col_b >= col_c would not work when there were multiple checks in the operator as the previous SELECT statement would then fail and require a GROUP BY clause iirc. So the check would either have to be in its own operator, or be amended like so: SUM(col_a) + SUM(col_b) >= SUM(col_c) which isn't quite the same check. So the query needed to be updated, and the one that I wound up using returns multiple rows. So get_first is no longer useful, and in the moment of writing it seemed that handling things with a pandas dataframe might be easier in the long term, if more complicated uses of the pulled in data were implemented. But as of now I see how it's unneeded.

Thanks.... Is the operator easy to understand by the users? I am afraid this is something we need good and clear howto and examples for it because people will have hard time using it and rais too many questions.

Ideally most users won't need to learn about why .fetch_all is being used instead of .get_first 🙃 . I'm working with some users of the operator right now and seeing what's complicated to make sure the docs are robust. I also have a working example DAG showing how to use the operator (with several more planned) here.

denimalpaca added 3 commits July 18, 2022 12:08

Add batching to SQL Check Operators

dc01866

Commit adds a WHERE clause to the sql statement that allows for arbitrary batching in a given table.

boring-cyborg bot added the area:providers label Jul 19, 2022

denimalpaca added 7 commits July 21, 2022 09:54

Add table alias to SQLTableCheckOperator query

66922f0

Without a table alias, the query does not run on Postgres and other databases. The alias is arbitrary and used only for proper query execution.

Fix formatting error in operator

554e8ba

Add batching to SQL Check Operators

cf90083

Commit adds a WHERE clause to the sql statement that allows for arbitrary batching in a given table.

Add table alias to SQLTableCheckOperator query

3c300e7

Without a table alias, the query does not run on Postgres and other databases. The alias is arbitrary and used only for proper query execution.

Fix formatting error in operator

5645b5d

denimalpaca force-pushed the sql_check_operators_various_functionality_update branch from 554e8ba to 5645b5d Compare July 21, 2022 15:30

denimalpaca added 7 commits July 21, 2022 11:30

Merge branch 'sql_check_operators_various_functionality_update' of gi…

98a28c3

…thub.com:denimalpaca/airflow into sql_check_operators_various_functionality_update

Move alias to proper query build statement

ee697e3

The table alias should be in the self.sql query build statement as that is where the table it needs to alias is defined.

Add batching to SQL Check Operators

1754fdc

Commit adds a WHERE clause to the sql statement that allows for arbitrary batching in a given table.

Add table alias to SQLTableCheckOperator query

24ce964

Without a table alias, the query does not run on Postgres and other databases. The alias is arbitrary and used only for proper query execution.

Fix formatting error in operator

987f6c2

denimalpaca force-pushed the sql_check_operators_various_functionality_update branch from 5645b5d to 987f6c2 Compare July 21, 2022 15:59

eladkal reviewed Jul 21, 2022

View reviewed changes

airflow/providers/common/sql/operators/sql.py Outdated Show resolved Hide resolved

denimalpaca added 6 commits July 21, 2022 12:07

Merge branch 'sql_check_operators_various_functionality_update' of gi…

a27fc2c

…thub.com:denimalpaca/airflow into sql_check_operators_various_functionality_update

Bug fixes and updates to test and operator

404eef5

Fixed bug in test where the dataframe column names did not match the operator's expected dataframe column names. Added more info to the SQLColumnCheckOperator's batch arg. Fixed the location of table aliasing in SQLTableCheckOperator.

Remove merge conflict lines

01bfb2f

Rename parameter batch to partition_clause

75d59d1

Gives a clearer name to the parameter and adds templating to the SQLTableCheckOperator.

Fix typo in docstring

0388ac4

Reformat operator file

719a830

potiuk approved these changes Jul 22, 2022

View reviewed changes

potiuk merged commit be7cb1e into apache:main Jul 22, 2022

eladkal modified the milestone: Airflow 2.3.4 Jul 24, 2022

This was referenced Aug 10, 2022

Status of testing Providers that were prepared on August 10, 2022 #25634

Closed

Status of testing Providers that were prepared on August 10, 2022 #25640

Closed

Status of testing Providers that were prepared on August 15, 2022 #25721

Closed

ashb reviewed Aug 19, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Common SQLCheckOperators Various Functionality Update #25164

Common SQLCheckOperators Various Functionality Update #25164

denimalpaca commented Jul 19, 2022

potiuk commented Jul 21, 2022

denimalpaca commented Jul 21, 2022 •

edited

Loading

potiuk left a comment

ashb Aug 19, 2022

denimalpaca Aug 25, 2022

potiuk Aug 25, 2022

potiuk Aug 25, 2022 •

edited

Loading

ashb Aug 19, 2022

denimalpaca Aug 25, 2022

potiuk Aug 25, 2022

denimalpaca Aug 25, 2022

potiuk Aug 27, 2022

denimalpaca Aug 29, 2022 •

edited

Loading

Common SQLCheckOperators Various Functionality Update #25164

Common SQLCheckOperators Various Functionality Update #25164

Conversation

denimalpaca commented Jul 19, 2022

potiuk commented Jul 21, 2022

denimalpaca commented Jul 21, 2022 • edited Loading

potiuk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

potiuk Aug 25, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

denimalpaca Aug 29, 2022 • edited Loading

Choose a reason for hiding this comment

denimalpaca commented Jul 21, 2022 •

edited

Loading

potiuk Aug 25, 2022 •

edited

Loading

denimalpaca Aug 29, 2022 •

edited

Loading