
SQL: read_sql functions should not inspect all tables of database #7396

Closed
jorisvandenbossche opened this issue Jun 8, 2014 · 3 comments · Fixed by #7581
Labels
IO SQL to_sql, read_sql, read_sql_query
Milestone

Comments

@jorisvandenbossche
Member

Related: @balancap's comment at #6416 (comment) and issue #7380, where you get a warning about a type that is not understood (probably in another table).

Current situation: when a PandasSQLAlchemy object is created (https://github.com/pydata/pandas/blob/v0.14.0/pandas/io/sql.py#L777), a full MetaData object is created and reflected (meaning all tables in the database are inspected and their schemas are stored as sqlalchemy Table objects). This happens every time read_sql/read_sql_query/read_sql_table is called.
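A minimal sketch of what that reflection step amounts to, using an in-memory SQLite database (the table names here are invented for illustration, not taken from pandas itself):

```python
import sqlalchemy as sa

# A throwaway database with two unrelated tables.
engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE users (id INTEGER, name TEXT)"))
    conn.execute(sa.text("CREATE TABLE orders (id INTEGER, total REAL)"))

# What pandas effectively does on every read_sql call: reflect the
# whole database, inspecting every table's schema.
meta = sa.MetaData()
meta.reflect(bind=engine)
print(sorted(meta.tables))  # ['orders', 'users']
```

Even a query touching only `users` pays the cost of inspecting `orders` too, which is where both the slowness and the unrelated-type warnings come from.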

Consequence:

  • this can be costly for a very large database or a remote server (ENH: SQL through SQLAlchemy - performance #6416 (comment)).
  • this can trigger warnings that have nothing to do with your current query or the table you want to read, e.g. when a type in some other table is not known to sqlalchemy (such as a PostGIS geometry column).

Possible solution:

  • I think the read_sql functions should never inspect all tables, but only the specified table (and read_sql_query not even that table, since with a plain query this information is not used; it is only needed for read_sql_table).
  • This can perhaps be achieved by using the only keyword in meta.reflect(engine, only=...).
  • For the OO API, we can discuss what the default should be (inspect all tables or not).
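The `only` keyword mentioned above limits reflection to the named tables. A hedged sketch of the proposed fix, again with an illustrative in-memory SQLite database:

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE users (id INTEGER, name TEXT)"))
    conn.execute(sa.text("CREATE TABLE orders (id INTEGER, total REAL)"))

meta = sa.MetaData()
# Reflect only the table the call actually needs; other tables (and any
# column types in them that sqlalchemy doesn't understand) are never touched.
meta.reflect(bind=engine, only=["users"])
print(sorted(meta.tables))  # ['users']
```

With this approach, read_sql_table would reflect exactly one table, and read_sql_query could skip reflection entirely.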

@mangecoeur @danielballan @hayd

@danielballan
Contributor

Sounds good. As for the OO API, if we take a cue from HDFStore, it should not inspect all tables. HDFStore.keys() can be slow for > 10 keys, so it is not run on instantiation.

In the case of HDFStore, the keys can be inspected once and cached, because only one user at a time can open the file for writing. Pandas doesn't cache them, but it's possible. (I'll mention that @nkeim has implemented some code for this; maybe others have too.) But for SQL tables, with multi-user access as a full feature and a common use case, any relevant inspection should be done at query-execution time, not at connection time. Maybe the only keyword can make this fast.
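Doing the inspection at query-execution time could look roughly like the following sketch (assuming a modern SQLAlchemy, 1.4+, where Table accepts autoload_with; the table and data are made up):

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE users (id INTEGER, name TEXT)"))
    conn.execute(sa.text("INSERT INTO users VALUES (1, 'a')"))

# Reflect a single table lazily, at the moment the query runs,
# rather than reflecting the whole database at connection time.
users = sa.Table("users", sa.MetaData(), autoload_with=engine)
with engine.connect() as conn:
    rows = conn.execute(sa.select(users)).fetchall()
print(rows)
```

Because nothing is cached at connection time, concurrent schema changes by other users are picked up on the next query.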

@jorisvandenbossche
Member Author

OK, for the query functions it is clear, I think, that they should not inspect the full database. I have a PR coming to fix that.

For the OO API, we should discuss further, together with the rest of the interface, how this should look.

@mangecoeur
Contributor

Sounds good. I think the OO API can be stabilised after the functional one is sorted.
