Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SQL Support for ADBC Drivers #53869

Merged
merged 78 commits into from
Nov 22, 2023
Merged

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Jun 26, 2023

No description provided.

@WillAyd WillAyd requested a review from mroeschke as a code owner June 26, 2023 20:27
pandas/io/sql.py Outdated Show resolved Hide resolved
with tm.assert_produces_warning(UserWarning, match="the 'timedelta'"):
df.to_sql("test_arrow", conn, if_exists="replace", index=False)


@pytest.mark.db
@pytest.mark.parametrize("conn", all_connectable)
def test_dataframe_to_sql_arrow_dtypes_missing(conn, request, nulls_fixture):
if conn == "postgresql_adbc_conn":
request.node.add_marker(
pytest.skip("int8/datetime not implemented yet in adbc driver")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no such thing as an 8-bit integer in postgres, so this test wouldn't round trip in its current form. I think this being an int8 is just a test detail and not anything we actually care for. Can probably bump to int16, though that separately begs the question of how we want to add/extend the ADBC types

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the canonical way to add ADBC types?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to the mapping of an arrow type to a database-specific type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could in theory occur in the driver or could be handled by pandas itself. Since timestamp is a primitive in most databases I think that will be handled by the driver (plan on looking at this today/tomorrow myself for postgres)

Int8 is a little trickier. From the databases I've used I am not aware of 8 bit integers being all that common, so either the driver implements some compatability to another type (logically int16) or leaves it to the application to put its data into a supported type. My guess is the ADBC drivers may want to start explicitly and force end users to upcast to int16 if that's what they want, but @lidavidm knows best

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's a strong convention (like timestamp strings in SQLite) I'd like to support it natively, otherwise I think it would be best for the client to be explicit about what it wants (and if there's something the driver can do to help, we can do it - e.g. I think it's reasonable to optionally ingest int8 as SMALLINT in the driver, so long as you aren't concerned about perfect roundtripping)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we kinda do this already for numpy types -> sqlalchemy types so not opposed to adding that in pandas. I was just curious what the API for defining these is like.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int8 support was added in apache/arrow-adbc#858 and working on timestamp in apache/arrow-adbc#861; guessing these can make it into the next ADBC release and we can just start support at that as a minimum version to keep things easy

pandas/io/sql.py Outdated
stmt = f"SELECT * FROM {table_name}"

with self.con.cursor() as cur:
return cur(stmt).fetch_arrow_table().to_pandas()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be nice to at minimum support dtype_backend and return arrow backed types

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should always just return arrow backed types. Related to the other conversation around kwargs I am unsure of the best way to handle this. If we raise for non-default arguments this wouldn't work; alternately we could except the dtype_backend argument from raising for non-default arguments but arguably is heavy handed to require end users to specify that when they are already using the ADBC driver

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd have to add types_mapper=pd.ArrowDtype for this to work.

Not sure how I'd feel about arrow backed only, this makes sense but we went in a different way for other readers...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah didn't realize that. Thanks for the heads up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref #51846 for long-term

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the new library in install.rst would be good too

@WillAyd
Copy link
Member Author

WillAyd commented Jun 27, 2023

AFACIT the pyarrow drivers required pyarrow>=8.0.0 since they use the RecordBatchReader object, but our min version is still pinned at 6.0.0. @phofl I know you've upgraded us in the past with pyarrow - is it too soon to make the jump to 8.0?

@lidavidm
Copy link
Contributor

We could likely relax Python and PyArrow minimum versions further if needed

@phofl
Copy link
Member

phofl commented Jun 27, 2023

Technically we could upgrade for 2.1, but this depends on pdep 10 as well

@WillAyd
Copy link
Member Author

WillAyd commented Jun 27, 2023

What is the link between this and PDEP 10? The ADBC dbapi requires pyarrow to use but I don't think that impacts how we need to manage that on our end? Or am I overlooking something simple?

@phofl
Copy link
Member

phofl commented Jun 27, 2023

PDEP10 has some language around minimum supported Arrow versions

@mroeschke mroeschke added the IO SQL to_sql, read_sql, read_sql_query label Jun 27, 2023
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 28, 2023

But note that the PDEP-10 text is even more conservative, so based on that we would not be bumping to pyarrow 8 as a minimum version any time soon (only next year).

AFACIT the pyarrow drivers required pyarrow>=8.0.0 since they use the RecordBatchReader object, but our min version is still pinned at 6.0.0.

IMO there is really no problem in requiring a more recent pyarrow version for certain new features, such as pyarrow 8 for using adbc in our SQL IO, while the general minimum required version is lower.

pandas/io/sql.py Outdated Show resolved Hide resolved
pandas/io/sql.py Outdated Show resolved Hide resolved
@WillAyd
Copy link
Member Author

WillAyd commented Jun 28, 2023

IMO there is really no problem in requiring a more recent pyarrow version for certain new features, such as pyarrow 8 for using adbc in our SQL IO, while the general minimum required version is lower.

Yea I agree. That definitely adds complexity to test compat/xfailing, but I think can be reasonably implemented in another pass at this

with pg_dbapi.connect(uri) as conn:
df2 = pd.read_sql("pandas_table", conn)

The Arrow type system offers a wider array of types that can more closely match
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is important enough of a point to make in the changelog, but should also probably put in io.rst. Just wasn't sure how to best structure that yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to include in io.rst. I think it might be good to add a separate section in io.rst to talk about type mapping (if there isn't one already) and also include sqlalchemy type mapping

@WillAyd
Copy link
Member Author

WillAyd commented Nov 14, 2023

Alright all green and I think I've address comments. Let me know if anything is missing

pandas/io/sql.py Outdated Show resolved Hide resolved
pandas/io/sql.py Outdated Show resolved Hide resolved
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments otherwise looks good

@WillAyd
Copy link
Member Author

WillAyd commented Nov 17, 2023

OK all green and I think I addressed your feedback @mroeschke

@@ -335,7 +335,8 @@ lxml 4.9.2 xml XML parser for read
SQL databases
^^^^^^^^^^^^^

Installable with ``pip install "pandas[postgresql, mysql, sql-other]"``.
Traditional drivers are installable with ``pip install "pandas[postgresql, mysql, sql-other]"``. ADBC drivers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is no longer anymore since you added them to the postgresql and sql-other extra in the pyproject.toml

@@ -345,6 +346,8 @@ SQLAlchemy 2.0.0 postgresql, SQL support for dat
sql-other
psycopg2 2.9.6 postgresql PostgreSQL engine for sqlalchemy
pymysql 1.0.2 mysql MySQL engine for sqlalchemy
adbc-driver-postgresql 0.8.0 ADBC Driver for PostgreSQL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is in the postgresql extra now

@@ -345,6 +346,8 @@ SQLAlchemy 2.0.0 postgresql, SQL support for dat
sql-other
psycopg2 2.9.6 postgresql PostgreSQL engine for sqlalchemy
pymysql 1.0.2 mysql MySQL engine for sqlalchemy
adbc-driver-postgresql 0.8.0 ADBC Driver for PostgreSQL
adbc-driver-sqlite 0.8.0 ADBC Driver for SQLite
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is in the sql-other extra now

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comments otherwise LGTM. Any other thoughts @jorisvandenbossche

@mroeschke mroeschke merged commit 18f7daf into pandas-dev:main Nov 22, 2023
36 of 40 checks passed
@mroeschke
Copy link
Member

Nice! Thanks @WillAyd. Can have follow up PRs if needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants