Add SQL Support for ADBC Drivers (#53869)
* close to complete implementation

* working implementation for postgres

* sqlite implementation

* Added ADBC to CI

* Doc updates

* Whatsnew update

* Better optional dependency import

* min versions fix

* import updates

* docstring fix

* doc fixup

* Updates for 0.6.0

* fix sqlite name escaping

* more cleanups

* more 0.6.0 updates

* typo

* remove warning

* test_sql expectations

* revert whatsnew issues

* pip deps

* Suppress pyarrow warning

* Updated docs

* mypy fixes

* Remove stacklevel check from test

* typo fix

* compat

* Joris feedback

* Better test coverage with ADBC

* cleanups

* feedback

* checkpoint

* more checkpoint

* more skips

* updates

* implement more

* bump to 0.7.0

* fixups

* cleanups

* sqlite fixups

* pyarrow compat

* revert to using pip instead of conda

* documentation cleanups

* compat fixups

* Fix stacklevel

* remove unneeded code

* commit after drop in fixtures

* close cursor

* fix table dropping

* Bumped ADBC min to 0.8.0

* documentation

* doc updates

* more fixups

* documentation fixups

* fixes

* more documentation

* doc spacing

* doc target fix

* pyarrow warning compat

* feedback

* updated io documentation

* install updates
WillAyd authored Nov 22, 2023
1 parent 3f6b08b commit 18f7daf
Showing 15 changed files with 1,152 additions and 85 deletions.
2 changes: 2 additions & 0 deletions ci/deps/actions-310.yaml
@@ -56,5 +56,7 @@ dependencies:
- zstandard>=0.19.0

- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
- pyqt5>=5.15.8
- tzdata>=2022.7
2 changes: 2 additions & 0 deletions ci/deps/actions-311-downstream_compat.yaml
@@ -70,6 +70,8 @@ dependencies:
- pyyaml
- py
- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
- dataframe-api-compat>=0.1.7
- pyqt5>=5.15.8
- tzdata>=2022.7
2 changes: 2 additions & 0 deletions ci/deps/actions-311.yaml
@@ -56,5 +56,7 @@ dependencies:
- zstandard>=0.19.0

- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
- pyqt5>=5.15.8
- tzdata>=2022.7
2 changes: 2 additions & 0 deletions ci/deps/actions-39-minimum_versions.yaml
@@ -58,6 +58,8 @@ dependencies:
- zstandard=0.19.0

- pip:
- adbc-driver-postgresql==0.8.0
- adbc-driver-sqlite==0.8.0
- dataframe-api-compat==0.1.7
- pyqt5==5.15.8
- tzdata==2022.7
2 changes: 2 additions & 0 deletions ci/deps/actions-39.yaml
@@ -56,5 +56,7 @@ dependencies:
- zstandard>=0.19.0

- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
- pyqt5>=5.15.8
- tzdata>=2022.7
3 changes: 3 additions & 0 deletions ci/deps/circle-310-arm64.yaml
@@ -54,3 +54,6 @@ dependencies:
- xlrd>=2.0.1
- xlsxwriter>=3.0.5
- zstandard>=0.19.0
- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
4 changes: 3 additions & 1 deletion doc/source/getting_started/install.rst
@@ -335,7 +335,7 @@ lxml 4.9.2 xml XML parser for read
SQL databases
^^^^^^^^^^^^^

Installable with ``pip install "pandas[postgresql, mysql, sql-other]"``.
Traditional drivers are installable with ``pip install "pandas[postgresql, mysql, sql-other]"``.

========================= ================== =============== =============================================================
Dependency Minimum Version pip extra Notes
@@ -345,6 +345,8 @@ SQLAlchemy 2.0.0 postgresql, SQL support for dat
sql-other
psycopg2 2.9.6 postgresql PostgreSQL engine for sqlalchemy
pymysql 1.0.2 mysql MySQL engine for sqlalchemy
adbc-driver-postgresql 0.8.0 postgresql ADBC Driver for PostgreSQL
adbc-driver-sqlite 0.8.0 sql-other ADBC Driver for SQLite
========================= ================== =============== =============================================================

Other data sources
114 changes: 104 additions & 10 deletions doc/source/user_guide/io.rst
@@ -5565,9 +5565,23 @@ SQL queries
-----------

The :mod:`pandas.io.sql` module provides a collection of query wrappers to both
facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction
is provided by SQLAlchemy if installed. In addition you will need a driver library for
your database. Examples of such drivers are `psycopg2 <https://www.psycopg.org/>`__
facilitate data retrieval and to reduce dependency on DB-specific API.

Where available, users may first want to opt for `Apache Arrow ADBC
<https://arrow.apache.org/adbc/current/index.html>`_ drivers. These drivers
should provide the best performance, null handling, and type detection.

.. versionadded:: 2.2.0

Added native support for ADBC drivers

For a full list of ADBC drivers and their development status, see the `ADBC Driver
Implementation Status <https://arrow.apache.org/adbc/current/driver/status.html>`_
documentation.

Where an ADBC driver is not available or may be missing functionality,
users should opt for installing SQLAlchemy alongside their database driver library.
Examples of such drivers are `psycopg2 <https://www.psycopg.org/>`__
for PostgreSQL or `pymysql <https://github.com/PyMySQL/PyMySQL>`__ for MySQL.
For `SQLite <https://docs.python.org/3/library/sqlite3.html>`__ this is
included in Python's standard library by default.
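As a quick illustration of that last point, Python's stdlib ``sqlite3`` module is itself a DBAPI driver, so a SQLite connection needs no extra install. A minimal sketch (table and column names here are made up for the example):

```python
# Minimal sketch: the stdlib sqlite3 module is already a DBAPI driver, so
# SQLite needs no extra install. Table/column names are illustrative only.
import sqlite3

with sqlite3.connect(":memory:") as conn:
    conn.execute("CREATE TABLE data (a INTEGER, b TEXT)")
    conn.execute("INSERT INTO data VALUES (1, 'x')")
    rows = conn.execute("SELECT a, b FROM data").fetchall()

print(rows)  # [(1, 'x')]
```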
@@ -5600,6 +5614,18 @@ In the following example, we use the `SQLite <https://www.sqlite.org/index.html>
engine. You can use a temporary SQLite database where data are stored in
"memory".

To connect using an ADBC driver you will want to install the ``adbc_driver_sqlite`` package using your
package manager. Once installed, you can use the DBAPI interface provided by the ADBC driver
to connect to your database.

.. code-block:: python

   import adbc_driver_sqlite.dbapi as sqlite_dbapi

   # Create the connection
   with sqlite_dbapi.connect("sqlite:///:memory:") as conn:
       df = pd.read_sql_table("data", conn)

To connect with SQLAlchemy you use the :func:`create_engine` function to create an engine
object from a database URI. You only need to create the engine once per database you are
connecting to.
@@ -5675,9 +5701,74 @@ writes ``data`` to the database in batches of 1000 rows at a time:
SQL data types
++++++++++++++

:func:`~pandas.DataFrame.to_sql` will try to map your data to an appropriate
SQL data type based on the dtype of the data. When you have columns of dtype
``object``, pandas will try to infer the data type.
Ensuring consistent data type management across SQL databases is challenging.
Not every SQL database offers the same types, and even when they do the implementation
of a given type can vary in ways that have subtle effects on how types can be
preserved.

For the best odds at preserving database types, users are advised to use
ADBC drivers when available. The Arrow type system offers a wider array of
types that more closely match database types than the historical pandas/NumPy
type system. To illustrate, note this (non-exhaustive) listing of types
available in different databases and pandas backends:

+-----------------+-----------------------+----------------+---------+
|numpy/pandas |arrow |postgres |sqlite |
+=================+=======================+================+=========+
|int16/Int16 |int16 |SMALLINT |INTEGER |
+-----------------+-----------------------+----------------+---------+
|int32/Int32 |int32 |INTEGER |INTEGER |
+-----------------+-----------------------+----------------+---------+
|int64/Int64 |int64 |BIGINT |INTEGER |
+-----------------+-----------------------+----------------+---------+
|float32 |float32 |REAL |REAL |
+-----------------+-----------------------+----------------+---------+
|float64 |float64 |DOUBLE PRECISION|REAL |
+-----------------+-----------------------+----------------+---------+
|object |string |TEXT |TEXT |
+-----------------+-----------------------+----------------+---------+
|bool |``bool_`` |BOOLEAN | |
+-----------------+-----------------------+----------------+---------+
|datetime64[ns] |timestamp(us) |TIMESTAMP | |
+-----------------+-----------------------+----------------+---------+
|datetime64[ns,tz]|timestamp(us,tz) |TIMESTAMPTZ | |
+-----------------+-----------------------+----------------+---------+
| |date32 |DATE | |
+-----------------+-----------------------+----------------+---------+
| |month_day_nano_interval|INTERVAL | |
+-----------------+-----------------------+----------------+---------+
| |binary |BINARY |BLOB |
+-----------------+-----------------------+----------------+---------+
| |decimal128 |DECIMAL [#f1]_ | |
+-----------------+-----------------------+----------------+---------+
| |list |ARRAY [#f1]_ | |
+-----------------+-----------------------+----------------+---------+
| |struct |COMPOSITE TYPE | |
| | | [#f1]_ | |
+-----------------+-----------------------+----------------+---------+

.. rubric:: Footnotes

.. [#f1] Not implemented as of writing, but theoretically possible

If you are interested in preserving database types as best as possible
throughout the lifecycle of your DataFrame, you are encouraged to
leverage the ``dtype_backend="pyarrow"`` argument of :func:`~pandas.read_sql`

.. code-block:: ipython

   # for roundtripping
   with pg_dbapi.connect(uri) as conn:
       df2 = pd.read_sql("pandas_table", conn, dtype_backend="pyarrow")

This will prevent your data from being converted to the traditional pandas/NumPy
type system, which often converts SQL types in ways that make them impossible to
round-trip.

In case an ADBC driver is not available, :func:`~pandas.DataFrame.to_sql`
will try to map your data to an appropriate SQL data type based on the dtype of
the data. When you have columns of dtype ``object``, pandas will try to infer
the data type.
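The dtype-to-SQL mapping described above can be pictured with a small sketch. This is illustrative only, following the (non-exhaustive) table earlier in this section, and is not pandas' actual mapping code:

```python
# Illustrative only -- not pandas' real implementation. A dtype-to-SQL lookup
# for a SQLite target, following the table above; SQLite collapses all
# integer widths into INTEGER and all floating-point widths into REAL.
def sqlite_type_for(dtype_name: str) -> str:
    mapping = {
        "int16": "INTEGER",
        "int32": "INTEGER",
        "int64": "INTEGER",
        "float32": "REAL",
        "float64": "REAL",
        "object": "TEXT",
    }
    # fall back to TEXT for anything unrecognized
    return mapping.get(dtype_name, "TEXT")

print(sqlite_type_for("int64"))    # INTEGER
print(sqlite_type_for("float32"))  # REAL
```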

You can always override the default type by specifying the desired SQL type of
any of the columns by using the ``dtype`` argument. This argument needs a
@@ -5696,7 +5787,9 @@ default ``Text`` type for string columns:

Due to the limited support for timedelta's in the different database
flavors, columns with type ``timedelta64`` will be written as integer
values as nanoseconds to the database and a warning will be raised.
values as nanoseconds to the database and a warning will be raised. The only
exception to this is when using the ADBC PostgreSQL driver, in which case a
timedelta will be written to the database as an ``INTERVAL``.
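The nanosecond conversion mentioned above can be sketched with the stdlib alone (illustrative; pandas performs this conversion internally on write):

```python
# Sketch of the timedelta-to-integer-nanoseconds conversion described above,
# using only the stdlib; pandas applies this internally when writing.
from datetime import timedelta

td = timedelta(days=1, seconds=30)
nanos = int(td.total_seconds() * 1_000_000_000)
print(nanos)  # 86430000000000
```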

.. note::

@@ -5711,7 +5804,7 @@ default ``Text`` type for string columns:
Datetime data types
'''''''''''''''''''

Using SQLAlchemy, :func:`~pandas.DataFrame.to_sql` is capable of writing
Using ADBC or SQLAlchemy, :func:`~pandas.DataFrame.to_sql` is capable of writing
datetime data that is timezone naive or timezone aware. However, the resulting
data stored in the database ultimately depends on the supported data type
for datetime data of the database system being used.
@@ -5802,15 +5895,16 @@ table name and optionally a subset of columns to read.
.. note::

In order to use :func:`~pandas.read_sql_table`, you **must** have the
SQLAlchemy optional dependency installed.
ADBC driver or SQLAlchemy optional dependency installed.

.. ipython:: python

   pd.read_sql_table("data", engine)

.. note::

Note that pandas infers column dtypes from query outputs, and not by looking
ADBC drivers will map database types directly back to Arrow types. For other drivers,
note that pandas infers column dtypes from query outputs, and not by looking
up data types in the physical database schema. For example, assume ``userid``
is an integer column in a table. Then, intuitively, ``select userid ...`` will
return integer-valued series, while ``select cast(userid as text) ...`` will
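The point that query output, not the schema, drives inference can be demonstrated with the stdlib sqlite3 driver (a sketch; table and column names are made up):

```python
# Sketch of the inference caveat above: the same schema column comes back as
# int or str depending on the query expression, not the declared column type.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (userid INTEGER)")
conn.execute("INSERT INTO t VALUES (42)")
as_int = conn.execute("SELECT userid FROM t").fetchone()[0]
as_text = conn.execute("SELECT CAST(userid AS TEXT) FROM t").fetchone()[0]
conn.close()

print(type(as_int).__name__, type(as_text).__name__)  # int str
```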
91 changes: 91 additions & 0 deletions doc/source/whatsnew/v2.2.0.rst
@@ -89,6 +89,97 @@ a Series. (:issue:`55323`)
)
series.list[0]
.. _whatsnew_220.enhancements.adbc_support:

ADBC Driver support in to_sql and read_sql
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:func:`read_sql` and :meth:`~DataFrame.to_sql` now work with `Apache Arrow ADBC
<https://arrow.apache.org/adbc/current/index.html>`_ drivers. Compared to
traditional drivers used via SQLAlchemy, ADBC drivers should provide
significant performance improvements, better type support and cleaner
nullability handling.

.. code-block:: ipython

   import adbc_driver_postgresql.dbapi as pg_dbapi

   df = pd.DataFrame(
       [
           [1, 2, 3],
           [4, 5, 6],
       ],
       columns=['a', 'b', 'c']
   )
   uri = "postgresql://postgres:postgres@localhost/postgres"
   with pg_dbapi.connect(uri) as conn:
       df.to_sql("pandas_table", conn, index=False)

   # for roundtripping
   with pg_dbapi.connect(uri) as conn:
       df2 = pd.read_sql("pandas_table", conn)

The Arrow type system offers a wider array of types that can more closely match
what databases like PostgreSQL can offer. To illustrate, note this (non-exhaustive)
listing of types available in different databases and pandas backends:

+-----------------+-----------------------+----------------+---------+
|numpy/pandas |arrow |postgres |sqlite |
+=================+=======================+================+=========+
|int16/Int16 |int16 |SMALLINT |INTEGER |
+-----------------+-----------------------+----------------+---------+
|int32/Int32 |int32 |INTEGER |INTEGER |
+-----------------+-----------------------+----------------+---------+
|int64/Int64 |int64 |BIGINT |INTEGER |
+-----------------+-----------------------+----------------+---------+
|float32 |float32 |REAL |REAL |
+-----------------+-----------------------+----------------+---------+
|float64 |float64 |DOUBLE PRECISION|REAL |
+-----------------+-----------------------+----------------+---------+
|object |string |TEXT |TEXT |
+-----------------+-----------------------+----------------+---------+
|bool |``bool_`` |BOOLEAN | |
+-----------------+-----------------------+----------------+---------+
|datetime64[ns] |timestamp(us) |TIMESTAMP | |
+-----------------+-----------------------+----------------+---------+
|datetime64[ns,tz]|timestamp(us,tz) |TIMESTAMPTZ | |
+-----------------+-----------------------+----------------+---------+
| |date32 |DATE | |
+-----------------+-----------------------+----------------+---------+
| |month_day_nano_interval|INTERVAL | |
+-----------------+-----------------------+----------------+---------+
| |binary |BINARY |BLOB |
+-----------------+-----------------------+----------------+---------+
| |decimal128 |DECIMAL [#f1]_ | |
+-----------------+-----------------------+----------------+---------+
| |list |ARRAY [#f1]_ | |
+-----------------+-----------------------+----------------+---------+
| |struct |COMPOSITE TYPE | |
| | | [#f1]_ | |
+-----------------+-----------------------+----------------+---------+

.. rubric:: Footnotes

.. [#f1] Not implemented as of writing, but theoretically possible

If you are interested in preserving database types as best as possible
throughout the lifecycle of your DataFrame, you are encouraged to
leverage the ``dtype_backend="pyarrow"`` argument of :func:`~pandas.read_sql`

.. code-block:: ipython

   # for roundtripping
   with pg_dbapi.connect(uri) as conn:
       df2 = pd.read_sql("pandas_table", conn, dtype_backend="pyarrow")

This will prevent your data from being converted to the traditional pandas/NumPy
type system, which often converts SQL types in ways that make them impossible to
round-trip.

For a full list of ADBC drivers and their development status, see the `ADBC Driver
Implementation Status <https://arrow.apache.org/adbc/current/driver/status.html>`_
documentation.

.. _whatsnew_220.enhancements.other:

Other enhancements
2 changes: 2 additions & 0 deletions environment.yml
@@ -113,6 +113,8 @@ dependencies:
- pygments # Code highlighting

- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
- dataframe-api-compat>=0.1.7
- sphinx-toggleprompt # conda-forge version has stricter pins on jinja2
- typing_extensions; python_version<"3.11"
2 changes: 2 additions & 0 deletions pandas/compat/_optional.py
@@ -15,6 +15,8 @@
# Update install.rst & setup.cfg when updating versions!

VERSIONS = {
"adbc-driver-postgresql": "0.8.0",
"adbc-driver-sqlite": "0.8.0",
"bs4": "4.11.2",
"blosc": "1.21.3",
"bottleneck": "1.3.6",
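The ``VERSIONS`` mapping above feeds pandas' optional-dependency checks. A rough sketch of that kind of check, using only the stdlib (the function name and comparison logic are illustrative, not pandas' actual code):

```python
# Illustrative sketch, not pandas' actual implementation: compare an installed
# distribution's version against a declared minimum like those in VERSIONS.
from importlib.metadata import PackageNotFoundError, version

def meets_minimum(dist: str, minimum: str) -> bool:
    try:
        installed = version(dist)
    except PackageNotFoundError:
        return False

    # naive numeric-tuple comparison; real code should use a version parser
    def as_tuple(v: str) -> tuple:
        return tuple(int(p) for p in v.split(".") if p.isdigit())

    return as_tuple(installed) >= as_tuple(minimum)

print(meets_minimum("surely-not-a-real-dist", "0.8.0"))  # False
```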
