
SQL: read_sql functions should not inspect all tables of database #7396

Closed
jorisvandenbossche opened this issue Jun 8, 2014 · 3 comments · Fixed by #7581
Labels
IO SQL to_sql, read_sql, read_sql_query
Milestone

Comments

@jorisvandenbossche
Member

Related: @balancap's comment at #6416 (comment) and issue #7380, where you get a warning about a type that is not understood (probably in another table).

Current situation: when a PandasSQLAlchemy object is created (https://github.com/pydata/pandas/blob/v0.14.0/pandas/io/sql.py#L777), a full MetaData object is created and reflected (meaning all tables in the database are inspected and their schemas are stored as sqlalchemy Table objects). This happens every time read_sql/read_sql_query/read_sql_table is called.
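A minimal sketch of what that reflection step amounts to, using an in-memory SQLite database (the table names here are invented for illustration, not taken from pandas itself):

```python
import sqlalchemy as sa

# A throwaway database with two unrelated tables.
engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE users (id INTEGER, name TEXT)"))
    conn.execute(sa.text("CREATE TABLE orders (id INTEGER, total REAL)"))

# What pandas effectively does on every read_sql call: reflect the
# whole database, inspecting every table's schema.
meta = sa.MetaData()
meta.reflect(bind=engine)
print(sorted(meta.tables))  # ['orders', 'users']
```

Even a query touching only `users` pays the cost of inspecting `orders` too, which is where both the slowness and the unrelated-type warnings come from.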

Consequence:

  • this can be costly for a very large database or a remote server (ENH: SQL through SQLAlchemy - performance #6416 (comment)).
  • this can trigger warnings that have nothing to do with your current query or the table you want to read, e.g. when a type in some other table is not known to sqlalchemy (such as a PostGIS geometry column).

Possible solution:

  • I think the read_sql functions should never inspect all tables, but only the specified table (and read_sql_query not even that table, since with a plain query this information is not used; it is only needed for read_sql_table).
  • This can perhaps be achieved by using the only keyword in meta.reflect(engine, only=...).
  • For the OO API, we can discuss what the default should be (inspect all tables or not).
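The `only` keyword mentioned above limits reflection to the named tables. A hedged sketch of the proposed fix, again with an illustrative in-memory SQLite database:

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE users (id INTEGER, name TEXT)"))
    conn.execute(sa.text("CREATE TABLE orders (id INTEGER, total REAL)"))

meta = sa.MetaData()
# Reflect only the table the call actually needs; other tables (and any
# column types in them that sqlalchemy doesn't understand) are never touched.
meta.reflect(bind=engine, only=["users"])
print(sorted(meta.tables))  # ['users']
```

With this approach, read_sql_table would reflect exactly one table, and read_sql_query could skip reflection entirely.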

@mangecoeur @danielballan @hayd

@danielballan
Contributor

Sounds good. As for the OO API, if we take a cue from HDFStore, it should not inspect all tables. HDFStore.keys() can be slow for > 10 keys, so it is not run on instantiation.

In the case of HDFStore, the keys can be inspected once and cached, because only one user at a time can open the file for writing. Pandas doesn't cache them, but it's possible. (I'll mention that @nkeim has implemented some code for this; maybe others have too.) But for SQL tables, with multi-user access as a full feature and a common use case, any relevant inspection should be done at query-execution time, not at connection time. Maybe the only keyword can make this fast.
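Doing the inspection at query-execution time could look roughly like the following sketch (assuming a modern SQLAlchemy, 1.4+, where Table accepts autoload_with; the table and data are made up):

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE users (id INTEGER, name TEXT)"))
    conn.execute(sa.text("INSERT INTO users VALUES (1, 'a')"))

# Reflect a single table lazily, at the moment the query runs,
# rather than reflecting the whole database at connection time.
users = sa.Table("users", sa.MetaData(), autoload_with=engine)
with engine.connect() as conn:
    rows = conn.execute(sa.select(users)).fetchall()
print(rows)
```

Because nothing is cached at connection time, concurrent schema changes by other users are picked up on the next query.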

@jorisvandenbossche
Member Author

OK, for the query functions it is clear, I think, that they should not inspect the full database. I have a PR coming to fix that.

For the OO API, we should discuss further, together with the rest of the interface, how this should look.

@mangecoeur
Contributor

Sounds good. I think the OO API can be stabilised after the functional one is sorted.
