Add ARRAY type to SqlAlchemyReader #257

Closed
criccomini opened this issue Jun 1, 2023 · 19 comments

@criccomini (Contributor)

See https://docs.sqlalchemy.org/en/14/core/type_basics.html#sqlalchemy.types.ARRAY for details. This isn't supported in SqlAlchemyReader right now.

@cpard commented Jun 6, 2023

SQLAlchemy has some weird behavior when it comes to ARRAY with more than one dimension. Technically, SQLAlchemy supports multidimensional arrays (see the ARRAY docs linked above); the way to do it is by passing dimensions=N to the ARRAY constructor.

While adding support for arrays in Recap, I model them as a Recap list; multidimensional arrays can then be defined as lists of lists of lists, ad infinitum (or almost, anyway).
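
To make the nesting concrete, here's a rough sketch of the mapping I have in mind. The Recap list type is shown as a plain-dict stand-in (hypothetical), since the only point here is the wrapping per dimension:

from sqlalchemy import ARRAY, Integer

def array_to_recap_list(array_type: ARRAY) -> dict:
    """Map a SQLAlchemy ARRAY to a (hypothetical) Recap list type,
    wrapping one list level per declared dimension."""
    # Plain-dict stand-in for Recap's list type.
    recap_type = {"type": "list", "values": str(array_type.item_type)}
    # dimensions=None means "unspecified"; treat it as one dimension.
    for _ in range((array_type.dimensions or 1) - 1):
        recap_type = {"type": "list", "values": recap_type}
    return recap_type

# ARRAY(Integer, dimensions=2) becomes a list of lists of INTEGER:
print(array_to_recap_list(ARRAY(Integer, dimensions=2)))
# {'type': 'list', 'values': {'type': 'list', 'values': 'INTEGER'}}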

To do that, we need access to the dimensions attribute of the ARRAY object within the SqlAlchemyReader, which should be straightforward, but here's where things get weird with SQLA.

It seems that when the table's metadata is reflected through the constructed engine, the dimensions attribute is gone. See the code below for a demonstration of this behavior.

from sqlalchemy import (
    MetaData,
    ARRAY,
    create_engine,
    inspect,
    Table,
    Column,
    Integer,
)

# Add your credentials below. Note: it has to be Postgres, because no other
# SQLAlchemy engine supports arrays.
user = ""
password = ""
host = ""
database = ""
port = ""

connection_string = f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"
meta = MetaData()

test = Table(
    "array_table",
    meta,
    Column("data", ARRAY(Integer, dimensions=2))
)

print("----------------Initial Table Metadata----------------")
print("Column Type: "+ str(test.columns[0].type))
print("Array Item Type: "+ str(test.columns[0].type.item_type))
print("Array Dimensions: "+ str(test.columns[0].type.dimensions))

print("----------------Initial Metadata Object Data----------------")
print("Column Type: "+ str(meta.tables['array_table'].columns['data'].type))
print("Array Item Type: "+ str(meta.tables['array_table'].columns['data'].type.item_type))
print("Array Dimensions: "+ str(meta.tables['array_table'].columns['data'].type.dimensions))

engine = create_engine(connection_string)
meta.create_all(engine)

print("----------------Copy Table using original Metadata----------------")
copy = Table('array_table', meta, autoload=True, autoload_with=engine)
print("Column Type: "+ str(copy.columns[0].type))
print("Array Item Type: "+ str(copy.columns[0].type.item_type))
print("Array Dimensions: "+ str(copy.columns[0].type.dimensions))

print("----------------Using Inspector----------------")
from sqlalchemy import inspect
inspector = inspect(engine)
print("Column Type: "+ str(inspector.get_columns('array_table')[0]['type']))
print("Array Item Type: "+ str(inspector.get_columns('array_table')[0]['type'].item_type))
print("Array Dimensions: "+ str(inspector.get_columns('array_table')[0]['type'].dimensions))

print("----------------Creating a Table but without the original Metadata Object available----------------")
meta_copy = MetaData()
copy = Table('array_table', meta_copy, autoload=True, autoload_with=engine)
print("Column Type: "+ str(copy.columns[0].type))
print("Array Item Type: "+ str(copy.columns[0].type.item_type))
print("Array Dimensions: "+ str(copy.columns[0].type.dimensions))

And a small output sample:

----------------Initial Table Metadata----------------
Column Type: ARRAY
Array Item Type: INTEGER
Array Dimensions: 2
----------------Initial Metadata Object Data----------------
Column Type: ARRAY
Array Item Type: INTEGER
Array Dimensions: 2
----------------Copy Table using original Metadata----------------
/Users/kostaspardalis/Projects/recap/recap/arraying.py:40: RemovedIn20Warning: Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings.  Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
  copy = Table('array_table', meta, autoload=True, autoload_with=engine)
Column Type: ARRAY
Array Item Type: INTEGER
Array Dimensions: 2
----------------Using Inspector----------------
Column Type: ARRAY
Array Item Type: INTEGER
Array Dimensions: None
----------------Creating a Table but without the original Metadata Object available----------------
Column Type: ARRAY
Array Item Type: INTEGER
Array Dimensions: None

As we can see in the last two parts of the output, using the inspector, or reflecting through the engine without the original metadata object, results in the loss of the dimension information.

I'm not sure how to address this, though. One option is to support only 1D arrays, which can work but feels lazy to me. Keeping the original metadata object around would help, since the information is preserved there, but there's no straightforward way to do that in our case; we can't expect Recap to operate in environments that have access to it.

Maybe we should reach out to SQLAlchemy about this? It doesn't look like desired behavior to me; what's the point of having less information after the engine has been created?

@criccomini WDYT?

@criccomini (Contributor Author)

I haven't had a chance to read the full comment, but I notice you're linking to SA 2.0. Recap currently uses 1.4. I suspect there's a difference in array handling between the two.

I opted for 1.4 because I assumed many SA drivers are still on 1.4, not 2.0. If the Snowflake, Hive Metastore, and BigQuery SA drivers all work with 2.0, then we should probably just upgrade. I'm not sure they're compatible, though.

@criccomini (Contributor Author)

Hmm, so the docs look the same for 1.4 and 2.0.

I poked around https://github.com/sqlalchemy/sqlalchemy/issues and StackOverflow, but didn't find much. I wonder if it's worth trying SA 2.0 to see if this issue is fixed. If not, I agree with you; it seems like a bug (unless I'm misunderstanding something about the semantics of inspect).

Anyway, 2.0 is worth a shot, I suppose.

@criccomini (Contributor Author)

Another thought: perhaps it's the driver? I wonder if psycopg3 (with SA 2.0) or pg8000 yields similar behavior.

@cpard commented Jun 6, 2023

Yeah, I linked to the wrong documentation version, but I'm developing locally with SA 1.4.15 (the one that Recap comes with).

I'll try with SA 2.0 to see if anything changes.

The problem with SA is that the bug might be in many different places: it might be an engine issue, for example, which in turn might be specific to the Postgres implementation, and so on.

@criccomini (Contributor Author) commented Jun 6, 2023

I wonder if it's because of this:

https://stackoverflow.com/questions/13888537/postgresql-multidimensional-arrays-in-sqlalchemy-not-sure-of-syntax

It's from 2012, but the fact that PG doesn't enforce dimensions might mean the driver isn't bothering.

> The current implementation does not enforce the declared number of dimensions either. Arrays of a particular element type are all considered to be of the same type, regardless of size or number of dimensions. So, declaring the array size or number of dimensions in CREATE TABLE is simply documentation; it does not affect run-time behavior.

PG 14 docs say the same.
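
This is easy to confirm against Postgres directly, without SQLAlchemy in the picture. A minimal psycopg2 sketch (connection details are placeholders):

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="test", user="postgres", password="")
cur = conn.cursor()

# Declare a 2-dimensional integer array...
cur.execute("CREATE TABLE dim_test (data integer[][])")
# ...but Postgres accepts a 1-dimensional value anyway, because the
# declared dimensionality is documentation only.
cur.execute("INSERT INTO dim_test VALUES (%s)", ([1, 2, 3],))

# The catalog records only the element type, not the declared dimensions,
# so reflection has nothing to recover.
cur.execute(
    "SELECT data_type FROM information_schema.columns "
    "WHERE table_name = 'dim_test'"
)
print(cur.fetchone())  # ('ARRAY',)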

@cpard commented Jun 7, 2023

I tried with SA 2.0 and it still behaves the same way: the dimensions metadata is lost after the table is reflected through the engine.

But even if it worked, both the Snowflake and BigQuery dialects require SA <2.0 :(

My limited experience with SA so far gives me mixed feelings about using it as an abstraction layer over databases. Its type system is good, but the implementation is tightly intertwined with the different dialects and engines. That's fine when you care about the ORM, since you'll most likely be working with one specific database in your project anyway, but our scope feels different to me.

For example, you can't really work with the ARRAY type without a PostgreSQL instance around, and although SA has a mock_engine, it's only good for generating SQL; inspection doesn't work there, so it's not that useful in the end.

In the end, Recap only cares about the metadata, right? If so, does it make sense to inherit the complexity of SA's tooling and functionality beyond its type system?

@criccomini (Contributor Author)

Yea, I went back and forth on that. I think the alternative would be implementing individual converters for every single DB we care about, using their dbapi drivers directly. It's definitely doable, but a bit annoying. WDYT?

@criccomini (Contributor Author)

K, I'm going to try a non-SQLAlchemy implementation that goes straight from dbapi to a Recap schema. I'm hoping ChatGPT can make the code less tedious to write.

@criccomini (Contributor Author)

@cpard for Hive Metastore, how do you want me to implement that? I can use PyHive, but that connects to Hive/Presto directly. It looks to me like Hive Metastore has a Thrift API. There's a client at https://github.com/quintoandar/hive-metastore-client, but it doesn't seem to expose metadata info, just partitions. This is kinda messy.

@cpard commented Jun 10, 2023

> Yea, I went back and forth on that. I think the alternative would be implementing individual converters for every single DB we care about, using their dbapi drivers directly. It's definitely doable, but a bit annoying. WDYT?

If we're going to constrain our interactions to the information_schema alone, and these will be primarily read operations (selects), then expressive middleware like SA is probably overkill and adds too heavy a dependency.

How do you feel about using https://github.com/tobymao/sqlglot#sql-execution, which has no dependencies and supports ~20 dialects (DBs)? From there, we can restrict interactions to the information_schema as a start. I'll do some digging into what extensions people have built on top of information_schema in the popular DBs and data warehouses, and we'll see if it's worth adding more on top.

WDYT?
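
For context, a small sketch of sqlglot's dialect translation (this is not Recap code; the SQL-execution feature linked above is a separate thing that runs over in-memory tables):

import sqlglot

# An information_schema query written once...
query = """
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'array_table'
"""

# ...can be transpiled to other dialects.
for dialect in ["snowflake", "bigquery", "duckdb"]:
    print(dialect, "=>", sqlglot.transpile(query, read="postgres", write=dialect)[0])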

@cpard commented Jun 10, 2023

> @cpard for Hive Metastore, how do you want me to implement that? I can use PyHive, but that connects to Hive/Presto directly. It looks to me like Hive Metastore has a Thrift API. There's a client at https://github.com/quintoandar/hive-metastore-client, but it doesn't seem to expose metadata info, just partitions. This is kinda messy.

I think the best way to do it is to use the Thrift API and generate a client from it. I went through the Thrift file, and there's metadata in there, from types to column statistics. I checked here: https://github.com/trinodb/hive-thrift/blob/master/src/main/thrift/hive_metastore.thrift

It's the version of the Thrift spec that Trino currently uses, and I think it's also the latest.

The Thrift interface will also give us out-of-the-box access to AWS Glue, Databricks Unity (they just announced an HMS API for it), and Google Dataproc.

Apparently, Iceberg (Tabular) has a REST API for their catalog. I still need to check Hudi (Onehouse), but I'm pretty sure they support HMS/Glue/etc. and don't have their own catalog implementation.

I'll try generating a client today from the Thrift definition that Trino currently uses.
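
A rough sketch of what using the generated client could look like, assuming a Python client generated from the hive_metastore.thrift above (the module and package names depend on the generated code):

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

# Assumes `thrift --gen py hive_metastore.thrift` produced this package.
from hive_metastore import ThriftHiveMetastore

transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9083))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ThriftHiveMetastore.Client(protocol)

transport.open()
try:
    # get_table returns a Table whose storage descriptor carries the
    # column names and Hive type strings we'd map to Recap types.
    table = client.get_table("default", "array_table")
    for col in table.sd.cols:
        print(col.name, col.type)
finally:
    transport.close()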

@cpard commented Jun 11, 2023

@criccomini check this #260

@criccomini (Contributor Author)

> Apparently, Iceberg (Tabular) has a REST API for their catalog

Yea, I got this one in #258

> How do you feel about using https://github.com/tobymao/sqlglot#sql-execution, which has no dependencies and supports ~20 dialects (DBs)?

Yea, I've played with it a bit. Recap was using it at one point. I don't think we need it, though. The columns that we need in information_schema are standard (I checked this already), so a simple dbapi query should work for a bunch of DBs.

The type translation that sqlglot provides might be useful, but then I feel like we're just swapping SQLAlchemy for sqlglot, and the same set of problems.

Notably, Oracle, Teradata, Vertica, and SQLite do not have information_schema. But again, I think that's fine. Since we're going down the path of implementing each individually (similar to what sqlglot did for SQL), we can handle these four (and others) as needed.
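
To make that concrete, a minimal sketch of the kind of dbapi query I mean, shown with psycopg2 (the type-to-Recap mapping is elided, and note that dbapi paramstyle varies by driver):

import psycopg2

# Any dbapi driver works here; psycopg2 is just the Postgres one.
conn = psycopg2.connect(host="localhost", dbname="test", user="postgres", password="")
cur = conn.cursor()

# information_schema.columns is standard SQL, so this query works
# across Postgres, Snowflake, BigQuery, and others.
cur.execute(
    """
    SELECT column_name, data_type, is_nullable, character_maximum_length
    FROM information_schema.columns
    WHERE table_schema = %s AND table_name = %s
    ORDER BY ordinal_position
    """,
    ("public", "array_table"),
)

for column_name, data_type, is_nullable, char_max_len in cur.fetchall():
    # Each row would be mapped to a Recap type here.
    print(column_name, data_type, is_nullable, char_max_len)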

@cpard commented Jun 12, 2023

> Apparently, Iceberg (Tabular) has a REST API for their catalog
>
> Yea, I got this one in #258
>
> How do you feel about using https://github.com/tobymao/sqlglot#sql-execution, which has no dependencies and supports ~20 dialects (DBs)?
>
> Yea, I've played with it a bit. Recap was using it at one point. I don't think we need it, though. The columns that we need in information_schema are standard (I checked this already), so a simple dbapi query should work for a bunch of DBs.
>
> The type translation that sqlglot provides might be useful, but then I feel like we're just swapping SQLAlchemy for sqlglot, and the same set of problems.
>
> Notably, Oracle, Teradata, Vertica, and SQLite do not have information_schema. But again, I think that's fine. Since we're going down the path of implementing each individually (similar to what sqlglot did for SQL), we can handle these four (and others) as needed.

Ah yeah, I didn't consider sqlglot for its translation functionality. I just liked that it can execute queries across all these dialects without any additional dependencies.

A simple dbapi would be ideal!

criccomini added a commit that referenced this issue Jun 14, 2023

After some discussion in #257, I decided to replace SQLAlchemy with `dbapi`
readers. The SQLAlchemy types weren't mapping as nicely as I'd hoped to Recap's.
Now, DB types are mapped straight to Recap types.

This PR adds a `PostgresqlReader` and `SnowflakeReader`. The PG reader has a
very basic test, but the Snowflake reader is untested. I plan to use `fakesnow`
for Snowflake in the future.

Closes #259
@criccomini (Contributor Author)

@cpard I've updated Recap to use dbapi instead of SQLAlchemy here:

#261

There are PG and Snowflake implementations. Both are very basic. I have tested the PG one a smidge, though; it seems to work.

@cpard commented Jun 14, 2023

That's awesome! I'll give it a try today!

@criccomini (Contributor Author)

Great, thanks! I haven't tested the Snowflake one at all, so it's likely to fail. Just LMK if/how it does and we can spruce it up.

I plan to write some tests for it, but I'm kinda hoping fakesnow gets this feature:

tekumara/fakesnow#2

@criccomini (Contributor Author)

Closing this issue as we're no longer using SQLAlchemy.
