
Make pyodbc.Row and databricks.Row JSON-serializable via new make_serializable method #32319

Merged
merged 28 commits into apache:main from fix/make-odbc-result-serializable
Nov 17, 2023

Conversation

Joffreybvn
Contributor

@Joffreybvn Joffreybvn commented Jul 2, 2023

The ODBCHook returns pyodbc.Row objects when used with SQLExecuteQueryOperator to do SELECT queries, which cause serialization errors in the XCom backend. This PR follows the discussion started here.

A good place to implement the transformation of Row into tuples is after the execution of the handler, in the run() method. There, the raw data structure is available. Doing it later would mean dealing with potentially nested structures (e.g. a list of results).

Thus, I propose to add a method in the DBApiHook to make the result of a query serializable, so that subclasses can override it to implement their own logic.

I see that many Hooks based on DbApiHook have a custom run method. But considering @potiuk's comment about "the most standard the better", I propose to go for an extra internal method that Hooks can override, rather than copying the full run() method just to implement a small change. Consider also that pyodbc.Row may not be the only case where a custom object causes issues (Databricks could be rewritten to handle that issue at the Hook level), and that this pattern already exists with the serialize_cell method, which gets overridden by child hooks.
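For illustration (a hypothetical, stripped-down sketch, not the code in this PR - class and method bodies are simplified stand-ins), the proposed pattern boils down to an overridable hook-level method that run() applies to the handler's result before it is returned:

from __future__ import annotations

from typing import Any, Callable


class SketchDbApiHook:
    """Hypothetical sketch of the proposed DbApiHook pattern (not the real class)."""

    @staticmethod
    def _make_serializable(result: Any) -> Any:
        # Default: pass the handler's result through unchanged. Subclasses
        # (e.g. an ODBC hook) override this to convert driver-specific row
        # objects into plain, JSON-serializable ones.
        return result

    def run(self, sql: str, handler: Callable[[Any], Any] | None = None) -> Any:
        cursor = self._execute(sql)  # placeholder for the real connection/cursor handling
        if handler is None:
            return None
        result = handler(cursor)  # raw driver objects, e.g. pyodbc.Row
        return self._make_serializable(result)  # normalized before reaching XCom

    def _execute(self, sql: str) -> Any:
        raise NotImplementedError("sketch only - no real database access here")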


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@Joffreybvn Joffreybvn requested a review from eladkal as a code owner July 2, 2023 16:51
@Joffreybvn Joffreybvn changed the title Make ODBCHook result JSON-serializable WIP: Make ODBCHook result JSON-serializable Jul 2, 2023
@Joffreybvn Joffreybvn marked this pull request as draft July 2, 2023 16:52
@Joffreybvn Joffreybvn changed the title WIP: Make ODBCHook result JSON-serializable Make ODBCHook result JSON-serializable Jul 2, 2023
@Joffreybvn Joffreybvn marked this pull request as ready for review July 2, 2023 20:24
@Joffreybvn Joffreybvn force-pushed the fix/make-odbc-result-serializable branch from fb0dc89 to e83d5c2 Compare July 3, 2023 19:56
airflow/providers/odbc/hooks/odbc.py (outdated review comments, resolved)
airflow/providers/common/sql/hooks/sql.py (outdated review comments, resolved)
@potiuk
Member

potiuk commented Jul 4, 2023

Actually - there is no need to make it backwards-compatible. We could make it a breaking change for the ODBC provider and bump the major version - and if you make it a named tuple with the same attributes as Row, this will be a "not too breaking" change - it will mostly work for all current users.

Also, adding make_serializable in common.sql is not needed in this context, and that's actually very good, because otherwise you would have to add a dependency on a newer version of common.sql to make it work.

@Joffreybvn
Contributor Author

I quickly wrapped up an update to this PR. It's not complete yet. Let me suggest this:

> Actually - there is no need to make it backwards-compatible. We could make it a breaking change for the ODBC provider and bump the major version - and if you make it a named tuple with the same attributes as Row, this will be a "not too breaking" change - it will mostly work for all current users.

> Also, adding make_serializable in common.sql is not needed in this context, and that's actually very good, because otherwise you would have to add a dependency on a newer version of common.sql to make it work.

I admit I have never tried to json.dumps a namedtuple (I'll give it a try in Airflow tomorrow). If that works, could we expand the scope of this PR to Databricks - making the transformation to serializable objects at the Hook level + solving the consequences of that change in the DatabricksSqlOperator? It's doable with a simple tuple + description, and it's even easier with a namedtuple.
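(Side note for the reader, not part of the thread: a namedtuple does pass through json.dumps, since it is a tuple subclass - it just comes out as a plain JSON array, losing the field names. A quick check:)

import json
from collections import namedtuple

Row = namedtuple("Row", ["id", "name"])
print(json.dumps(Row(1, "alice")))  # prints: [1, "alice"]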

I'm suggesting that because I like the solution of adding a serialization method that other hooks can override. That way, the right tool is added in the right place, and it can be used for both ODBC and Databricks. Otherwise, doing this change without touching the common.sql hook feels like hacking around the run method (and it adds a new flavor of this method to the codebase).

What about a PR for ODBC and Databricks, with the serializable transformation behind a flag, which will depend on a newer version of the common.sql package? So that it's clean and doesn't break too many things.

@Joffreybvn Joffreybvn marked this pull request as draft July 4, 2023 22:20
@potiuk
Member

potiuk commented Jul 5, 2023

> I'm suggesting that because I like the solution of adding a serialization method that other hooks can override. That way, the right tool is added in the right place, and it can be used for both ODBC and Databricks. Otherwise, doing this change without touching the common.sql hook feels like hacking around the run method (and it adds a new flavor of this method to the codebase).

If we do it like this, then we should add "additional-dependencies" in provider.yaml, add common.sql >= NEXT_COMMON_SQL_VERSION, and bump the common.sql version to the next MINOR version in its provider.yaml.

This would effectively be a new feature of common.sql that those two providers would depend on, so making a new MINOR (i.e. feature) release of common.sql and making those two providers depend on it is the only way to do it. It's possible but a little dangerous, because all other providers that depend on common.sql could start relying on this feature (i.e. add their own make_serializable implementation) in the future without adding the >= - so the only "future-proof" way of adding this would also be to add a pre-commit that enforces that if a provider uses make_serializable, it also has common.sql >= NEW_COMMON_SQL_VERSION, or - alternatively - we add common.sql >= NEW_COMMON_SQL_VERSION in all providers that use it and make it a new expectation for all providers, regardless of whether they use the new feature (this also seems to be a workable solution and will likely also require a pre-commit to check it in the relevant provider.yaml files).

So yeah. It's possible to do it the way you propose - but it will require a bit of build/CI overhead to make it robust and prevent accidental mistakes in the future.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label (Stale PRs per the .github/workflows/stale.yml policy file) Aug 20, 2023
@github-actions github-actions bot closed this Aug 26, 2023
@potiuk potiuk reopened this Oct 28, 2023
@eladkal eladkal removed the stale label (Stale PRs per the .github/workflows/stale.yml policy file) Oct 28, 2023
@Joffreybvn Joffreybvn force-pushed the fix/make-odbc-result-serializable branch 2 times, most recently from aa8dbd1 to e321923 Compare October 28, 2023 19:18
@Joffreybvn Joffreybvn changed the title Make ODBCHook result JSON-serializable Make pyodbc.Row and databricks.Row JSON-serializable via new make_serializable method Oct 29, 2023
@Joffreybvn Joffreybvn force-pushed the fix/make-odbc-result-serializable branch from 4dec7e5 to 27acf50 Compare November 1, 2023 15:16
@Joffreybvn
Contributor Author

Joffreybvn commented Nov 1, 2023

PR is ready for review.

Currently:

  • Implements a new _make_serializable method in the DbApiHook, which can be overridden by maintainers to make the result of cursor.fetch() serializable
  • Implements this _make_serializable method in the ODBCHook, which solves the issue where pyodbc.Row objects (C++ objects, not serializable) are returned and make the XCom backend crash. Now, NamedTuples are returned.
  • Implements this _make_serializable method in the DatabricksSqlHook, and removes the fix previously implemented in the Databricks Operator.

I also added a static check: it raises an error when a subclass of DbApiHook overrides _make_serializable but depends on a common.sql provider which does not yet implement this method. The current implementation assumes that the version of common.sql where this PR will be added is 1.8.1.

Here is a preview before bumping the providers' version:
(screenshot: static check output, 2023-11-01)
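For illustration, a rough, hypothetical sketch (not the exact code merged here) of what the ODBC-side conversion looks like conceptually - building a NamedTuple type from the cursor description carried by each pyodbc.Row:

from typing import Any, NamedTuple


def _rows_to_namedtuples(result: Any) -> Any:
    """Hypothetical helper: convert pyodbc.Row objects into JSON-serializable NamedTuples.

    A pyodbc.Row exposes cursor_description ((name, type, ...) per column),
    which is enough to build an equivalent plain-Python row type.
    """
    if not result:
        return result
    rows = result if isinstance(result, list) else [result]
    fields = [(column[0], column[1]) for column in rows[0].cursor_description]
    row_cls = NamedTuple("Row", fields)  # type: ignore[misc]
    converted = [row_cls(*row) for row in rows]
    return converted if isinstance(result, list) else converted[0]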

@Joffreybvn Joffreybvn marked this pull request as ready for review November 1, 2023 15:46
@Joffreybvn Joffreybvn force-pushed the fix/make-odbc-result-serializable branch from 70285a4 to 6b3e2d8 Compare November 12, 2023 21:18
Member

@potiuk potiuk left a comment

This is really cool. It also seems to be fully backwards-compatible. @utkarsharma2 and @Lee-W - WDYT?

@Lee-W
Member

Lee-W commented Nov 16, 2023

> This is really cool. It also seems to be fully backwards-compatible. @utkarsharma2 and @Lee-W - WDYT?

Agree! Just read through the latest changes again. This is awesome. It makes the code much cleaner. I also think it's fully backwards-compatible.

@potiuk potiuk merged commit 064fc2b into apache:main Nov 17, 2023
71 checks passed
@ephraimbuddy ephraimbuddy added the changelog:skip label (changes that should be skipped from the changelog: CI, tests, etc.) Nov 20, 2023
@ephraimbuddy ephraimbuddy added this to the Airflow 2.8.0 milestone Nov 20, 2023
@bolkedebruin
Contributor

bolkedebruin commented Nov 26, 2023

I missed this - why was the non-standard make_serializable chosen over the standard serialize and deserialize methods? It also misses versioning, and adds to the pre-commit hooks that now check for this non-standard method. To me this looks like technical debt now :/

This also could have been implemented as a custom serializer and I am really inclined to have this reverted.

cc @potiuk @ephraimbuddy @Lee-W

@Joffreybvn
Contributor Author

Joffreybvn commented Nov 26, 2023

I'm willing to create a new PR and pause/drop this one.

But during the implementation of this PR, I could not create/instantiate pyodbc.Row objects (which are C++ objects) directly in Python - I wanted to do it for the unit tests. Thus I assumed it is not possible... (but I never had this case before, and maybe there's something to learn here!)

Quoting you:

  1. Add a serialize / deserialize method to the object
  2. Add a serializer/deserializer into airflow.serialization.serializers
  3. Decorate the object with @DataClass or @attr

Assuming creating a deserializer which returns a pyodbc.Row object is not possible, is it okay to return a NamedTuple?
Isn't it better to get rid of the C++ objects as early as possible, so that the user deals with NamedTuples all the way?

@potiuk
Member

potiuk commented Nov 26, 2023

I do not think it has anything to do with "standard" behaviour. This is an internal implementation detail of DBHooks - each DB hook might make its own decision on how to turn the rows it returns (already commonly structured via common.sql) into a serializable variant.

Could you please elaborate on what you mean by "missing versioning" and how you would implement it?

It is nothing that Airflow's serialization should be concerned about - this is really about "standardising" the behaviour of the DBAPI implementation. Python's DBAPI does not have very "strong" guarantees about what is returned, and some of the implementations (like Databricks) chose to return non-serializable objects, while most of the other DBAPI implementations chose to use tuples of rows + tuples of descriptions, which are serializable.

The big problem we were trying to solve with common.sql is to introduce a common interface for what all DBHooks will be returning.

In this case it's really not even something that IMHO "airflow" serialization should be concerned about (if this is what you are after). This is purely one-way serializing - we just want to make sure that whatever gets returned via DBAPI calls (and essentially via the Hook) is:

a) standard according to what our DBHook should return (i.e. following the tuple structure + tuples of metadata describing the rows)
b) serializable

In this sense we do not need anything else - like versioning - we just need to make sure that whatever gets returned can be sent via XCom.

But maybe I misunderstand what you want to achieve, @bolkedebruin? Could you please show an example of what you want to get, and of how it achieves what we want here?

@bolkedebruin
Contributor

bolkedebruin commented Nov 26, 2023

Hey @Joffreybvn, thanks for responding, and apologies for being late to the game and being a bit rough. It isn't required that deserialization returns the same thing that was provided to serialization. In other words, it is fine if you serialize a pyodbc.Row and deserialize it to a NamedTuple.

@potiuk All XCom serialization and deserialization goes through serde.py. So if you put something into XCom and want to have it returned, it goes through there. The original issue's stack trace said:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.9/site-packages/airflow/utils/json.py", line 91, in default
    return serialize(o)
  File "/opt/app-root/lib64/python3.9/site-packages/airflow/serialization/serde.py", line 171, in serialize
    raise TypeError(f"cannot serialize object of type {cls}")
TypeError: cannot serialize object of type <class 'pyodbc.Row'>

By dropping a serializer/deserializer into serializers for pyodbc.Row (and maybe one for Databricks) this could have been fixed. _make_serializable seems to create an intermediate format at the moment, which imho seems unnecessary.

BUT if your intention is to have a common "flattened" format that by specification is serializable by the backend (not just JSON, btw), then maybe the solution is renaming the method and changing the description of _make_serializable. If, as you mentioned, standardization is the intention, then the return value should be guaranteed to be serializable:

from airflow.serialization.serde import U

def _make_serializable(result: Any) -> U:
    ...

Which reads like the serialize methods of the serializers.

I realize that I might be overzealous and misreading due to all the 'serialize' here, but I did not see any consideration given to implementing this in the framework that is available for this (serde). My question is: can this not be solved a lot more simply, without the need to change the Operators? serde handles XCom for you. Why is the common _make_serializable method required?

@potiuk
Member

potiuk commented Nov 26, 2023

> Why is the common _make_serializable method required?

For the reason described above - a common "airflow standard" returned output of DBApiHook. All the DBApiHook implementations return some kind of tuple-like output. Some of them also return some kind of metadata description - sometimes embedded in the returned Row data as extra row content, sometimes not (there is no common standard for that).

And `_make_serializable` returns a standard output that is, well, serializable, in case the originally returned objects are, well, not serializable...

Yes - maybe the name could be better, but naming is the hardest thing in computer science, right next to cache invalidation.

I think at this stage, holding the release of several providers and common.sql for this is quite a bit too much.

Do you think it is worth the extra overhead in the release process? What would be a better name in this case?

@potiuk
Member

potiuk commented Nov 26, 2023

Also - I think you are really mixing up the problem here... It's not serialize/deserialize we are talking about. And IMHO the name reflects the intention very well. We are not trying to SERIALIZE/DESERIALIZE things... We are trying to return a value from DBApiHook that can actually be directly serialized.

So "make_serializable" is a very good name (and intent): it turns whatever gets returned by the external library into a common, serializable format.

Pretty good and matches the intention.

@bolkedebruin
Contributor

bolkedebruin commented Nov 26, 2023

> Also - I think you are really mixing up the problem here... It's not serialize/deserialize we are talking about. And IMHO the name reflects the intention very well. We are not trying to SERIALIZE/DESERIALIZE things... We are trying to return a value from DBApiHook that can actually be directly serialized.

You are converting from one format to another. This is effectively serialization - it's why serde's serialize gets called. By removing type information and not including versioning, you cannot reinstate the original object. This is force majeure for pyodbc.Row, but for others that might not be the case.

> So "make_serializable" is a very good name (and intent): it turns whatever gets returned by the external library into a common, serializable format.

The question again is: why is a "common format" required? Every DbAPI hook that has its own non-serializable format will need to include its own _make_serializable. This is what the serde serializers were invented for. Why is the common format required? What does it solve outside of serialization?

In other words, I currently see the common format as an unnecessary intermediate format for serialization, which complicates matters as it loses versioning and typing information. If there is a use for the common format outside serialization, then I can be convinced otherwise.

> Pretty good and matches the intention.

To a certain extent, yes, but the signature of the function does not match its intent.

>> Why is the common _make_serializable method required?

> For the reason described above - a common "airflow standard" returned output of DBApiHook. All the DBApiHook implementations return some kind of tuple-like output. Some of them also return some kind of metadata description - sometimes embedded in the returned Row data as extra row content, sometimes not (there is no common standard for that).

Why is this needed, and where is this exposed, outside of serialization? If it is only for serialization and the default implementation is to return the input, then imho it should be solved as a serializer. Can we be more specific about what _make_serializable is going to return? Any -> Any does not cut it. Also, if we include header information, which the odbc implementation seems to do, is that part of the specification? Let's make sure we then check for that.

> And `_make_serializable` returns a standard output that is, well, serializable, in case the originally returned objects are, well, not serializable...

This is what custom serializers do. Currently, the implementation here does not enforce that (Any).

> Yes - maybe the name could be better, but naming is the hardest thing in computer science, right next to cache invalidation.

:-)

> I think at this stage, holding the release of several providers and common.sql for this is quite a bit too much.

I disagree. This is going to be in for a long time, let's make sure we get the design right.

> Do you think it is worth the extra overhead in the release process? What would be a better name in this case?

The name is fine, although I could argue for normalize or make_namedtuple if we want to keep the function here. I can see some reasons why, but they are not overly convincing; I would appreciate an argument for why this shouldn't live in serde's serializers.

@potiuk
Member

potiuk commented Nov 26, 2023

> You are converting from one format to another. This is effectively serialization - it's why serde's serialize gets called. By removing type information and not including versioning, you cannot reinstate the original object.

We have absolutely no intention of doing that. This is a one-way DBAPI -> returned value conversion. We have absolutely no need to add complexity by having (effectively) all DBApi providers depend on the serde implementation for that. It adds absolutely unnecessary coupling and serves no purpose from the point of view of standardising DBApi.

> The question again is: why is a "common format" required? Every DbAPI hook that has its own non-serializable format will need to include its own _make_serializable. This is what the serde serializers were invented for. Why is the common format required? What does it solve outside of serialization?

Yes, there is a very good reason we have it.

It's mainly because of open-lineage integration (even if not implemented now, it's the way for it to be used in the future). Not everything revolves around serialization in Airflow. For DBApiHook, serialization is just an after-thought, which is applied much later in the process. Serialization does not make a "standard" description of the returned data. Serialization loses the semantic meaning of the returned data, where DBApiHook also returns metadata that could be used to make column lineage semantically meaningful. Serialization is a lower-layer construct that loses the semantic meaning of the data returned by the Hook. This was the original intention of implementing common.sql and standardising what the Hook returns. It's a building block on which a future task-flow implementation of Open Lineage might be built for SQL data processing.

> This is what custom serializers do. Currently, the implementation here does not enforce that (Any).

Yes - if you think the whole purpose of the interface is to serialize things. In our case we return a DBApi-common form of data that merely has a "property" of being serializable. It's an afterthought - merely a property that makes it possible to use the output directly as operator output so that it can be stored in XCom.

> I disagree. This is going to be in for a long time, let's make sure we get the design right.

I think we have the DBApi right - its purpose and reasoning have been discussed for about 1.5 years and went through several iterations. Again, _make_serializable is just an internal implementation detail making sure that whatever gets returned is ALSO serializable. You seem to be giving it much more meaning than it has.

> The name is fine, although I could argue for normalize or make_namedtuple if we want to keep the function here. I can see some reasons why, but they are not overly convincing; I would appreciate an argument for why this shouldn't live in serde's serializers.

To avoid unnecessary coupling - if I had to name a single good reason. Our providers have a deliberately wide range of supported Airflow versions they should be compatible with. We base a lot of assumptions on it. We cannot assume that we have a specific version of Airflow available for us. Using serde and implementing it now in the common.sql provider would mean that we will only be able to use it for Airflow 2.8, which would support it - and possibly even rely on some internal stuff in serde - and it would require a set of unit tests testing compatibility of various provider versions with various Airflow versions. It is a very bad idea to couple the two when all that we are talking about is converting "specific" objects returned by Databricks, via a "serde" module that serializes them, relying - possibly - on a specific serde version in a specific Airflow version.

That would effectively make the serde API and behaviour a Public API of Airflow that ALL DBHook providers should depend on - this coupling is not only unnecessary but also harmful and will slow us down. SERDE is (and please correct me if I am wrong) not part of the Public API of Airflow. Are we willing to make 3rd-party providers depend on serde? Which APIs? Are they already public in https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html ? Because what you are really asking for is to make 3rd-party providers that would like to implement DBApiHook depend on it.

I personally think it's a bad idea to add this coupling.

Now that I explained why not, I have one question that maybe you can answer.

Could you please explain what benefit implementing serde for this conversion would bring? I am looking for some good reasons why and, to be honest, cannot find any. But since I explained why not, I'd love to hear why.

@bolkedebruin
Contributor

bolkedebruin commented Nov 26, 2023

>> You are converting from one format to another. This is effectively serialization - it's why serde's serialize gets called. By removing type information and not including versioning, you cannot reinstate the original object.

> We have absolutely no intention of doing that. This is a one-way DBAPI -> returned value conversion. We have absolutely no need to add complexity by having (effectively) all DBApi providers depend on the serde implementation for that. It adds absolutely unnecessary coupling and serves no purpose from the point of view of standardising DBApi.

>> The question again is: why is a "common format" required? Every DbAPI hook that has its own non-serializable format will need to include its own _make_serializable. This is what the serde serializers were invented for. Why is the common format required? What does it solve outside of serialization?

> Yes, there is a very good reason we have it.

> It's mainly because of open-lineage integration (even if not implemented now, it's the way for it to be used in the future). Not everything revolves around serialization in Airflow. For DBApiHook, serialization is just an after-thought, which is applied much later in the process. Serialization does not make a "standard" description of the returned data. Serialization loses the semantic meaning of the returned data, where DBApiHook also returns metadata that could be used to make column lineage semantically meaningful.
> Serialization is a lower-layer construct that loses the semantic meaning of the data returned by the Hook.

While I think that extra metadata can be beneficial, it is not part of the spec at the moment. The ODBC implementation does include it; the Databricks hook doesn't. There is, btw, no reason that serialization by the standard framework can't do this.

> This was the original intention of implementing common.sql and standardising what the Hook returns. It's a building block on which a future task-flow implementation of Open Lineage might be built for SQL data processing.

Well, I am happy that you explain this, because I could not get that from the initial issue, the commits here, or the comments (even when re-reading them now). The PR starts with "Thus, I propose to add a method in the DBApiHook to make the result of a query serializable". So yes, I read it as "do serialization". Sorry for that.

>> This is what custom serializers do. Currently, the implementation here does not enforce that (Any).

> Yes - if you think the whole purpose of the interface is to serialize things. In our case we return a DBApi-common form of data that merely has a "property" of being serializable. It's an afterthought - merely a property that makes it possible to use the output directly as operator output so that it can be stored in XCom.

As mentioned above, the PR indeed read this way.

>> I disagree. This is going to be in for a long time, let's make sure we get the design right.

> I think we have the DBApi right - its purpose and reasoning have been discussed for about 1.5 years and went through several iterations. Again, _make_serializable is just an internal implementation detail making sure that whatever gets returned is ALSO serializable. You seem to be giving it much more meaning than it has.

Nitpicking: I don't see any reference to that discussion here. And maybe that's why I was pointed in the wrong direction.

>> The name is fine, although I could argue for normalize or make_namedtuple if we want to keep the function here. I can see some reasons why, but they are not overly convincing; I would appreciate an argument for why this shouldn't live in serde's serializers.

> To avoid unnecessary coupling - if I had to name a single good reason. Our providers have a deliberately wide range of supported Airflow versions they should be compatible with. We base a lot of assumptions on it. We cannot assume that we have a specific version of Airflow available for us. Using serde and implementing it now in the common.sql provider would mean that we will only be able to use it for Airflow 2.8, which would support it - and possibly even rely on some internal stuff in serde - and it would require a set of unit tests testing compatibility of various provider versions with various Airflow versions. It is a very bad idea to couple the two when all that we are talking about is converting "specific" objects returned by Databricks, via a "serde" module that serializes them, relying - possibly - on a specific serde version in a specific Airflow version.

The coupling wouldn't be explicit. A new Airflow release, yes, that would be required at the moment (there is some thought of moving serializers to their respective providers). No imports and no changes to the provider code would be required, though. Fewer unit tests actually, or the same.

> That would effectively make the serde API and behaviour a Public API of Airflow that ALL DBHook providers should depend on - this coupling is not only unnecessary but also harmful and will slow us down. SERDE is (and please correct me if I am wrong) not part of the Public API of Airflow. Are we willing to make 3rd-party providers depend on serde? Which APIs? Are they already public in https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html ? Because what you are really asking for is to make 3rd-party providers that would like to implement DBApiHook depend on it.

Thanks for the consideration. Good point. You are correct that it is not in the public API, although you could argue that the format of serialization is, and, by definition, being able to serialize/deserialize also. Nevertheless, there is no direct use of the serde API required. So maybe this point is kind of moot?

> Could you please explain what benefit implementing serde for this conversion would bring? I am looking for some good reasons why and, to be honest, cannot find any. But since I explained why not, I'd love to hear why.

While with the above points I am convinced the implementation is okay, for the sake of completeness, and assuming no need for an intermediate format (lineage):

  1. Standardized architecture for serialization / deserialization. Tried and tested by now.
  2. Forward compatible. To me the 'common format' currently still reads as not forward compatible when we deem that the format needs to change.
  3. Available for serialization by others. If something now returns a pyodbc.Row into XCom for some reason they would need to implement their own serializer again and possibly the results are not interchangeable due to format differences. The risk is low for DbAPI though.
  4. Arguably less complex and cleaner. This is what the serialization/deserialization would have looked like for a pyodbc.Row in the framework.
# (imports this serializer module would need, added for completeness; U and qualname
# follow the conventions of airflow.serialization.serde and airflow.utils.module_loading)
from typing import Any, NamedTuple, cast

from airflow.serialization.serde import U
from airflow.utils.module_loading import qualname

__version__ = 1


def serialize(o: object) -> tuple[U, str, int, bool]:
    import pyodbc

    o = cast(pyodbc.Row, o)

    columns: list[tuple[str, type]] = [col[:2] for col in o.cursor_description]
    row_object = NamedTuple("Row", columns)  # type: ignore[misc]
    row = row_object(*o)

    return row, qualname(o), __version__, True


def deserialize(classname: str, version: int, data: object) -> Any:
    import pyodbc

    if version > __version__:
        raise TypeError("serialized version is newer than class version")

    if classname == qualname(pyodbc.Row):
        return data

    raise TypeError(f"do not know how to deserialize {classname}")

And that is all there is to it - apart from tests. No need to adjust the providers, but yes, an Airflow update would be required.

@Joffreybvn sorry for the fuss! Good work.

def _make_serializable(result):
    """Transform the databricks Row objects into a JSON-serializable list of rows."""
    if result is not None:
        return [list(row) for row in result]
Member

Hey @Joffreybvn, I feel that if the handler callable returns a non-iterable then this would fail.

I call the run method like below and it's breaking my DAG

run(f"SELECT COUNT(*) FROM {table}", handler=lambda x: x.fetchone())

Error

packages/airflow/providers/databricks/hooks/databricks_sql.py", line 247, in <listcomp>
    return [list(row) for row in result]
TypeError: 'int' object is not iterable

Member

Correct. Unfortunately, for backwards-compatibility reasons it is pretty difficult for the DBAPIHook to handle all cases - including the case where we return either a list of results or a single result only.

I tried to explain it in https://github.com/apache/airflow/blob/main/airflow/providers/common/sql/doc/adr/0002-return-common-data-structure-from-dbapihook-derived-hooks.md#decision (including all the possible variants of returned values), and I think our handler definition is not precise enough to explain that this can happen (we were not as much into typing back then as we are now).

@Joffreybvn -> I think this means we have to hold on with the databricks/pyodbc implementation to add support for those cases. Will you have time to fix it quickly?
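(For illustration only - a hypothetical sketch of how both handler shapes could be accommodated; this is not the fix that was eventually applied:)

def _make_serializable(result):
    """Hypothetical sketch: normalize both fetchall()-style results (a list of
    Row objects) and fetchone()-style results (a single Row, or None)."""
    if result is None:
        return None
    if isinstance(result, list):
        return [list(row) for row in result]
    return list(result)  # a single Row is iterable over its column values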

Labels: area:providers, changelog:skip, provider:common-sql