
QST: to_sql is unnecessarily blocked because of dependency on information_schema tables #52601

Open
match-gabeflores opened this issue Apr 11, 2023 · 6 comments
Labels
Bug IO SQL to_sql, read_sql, read_sql_query

Comments

@match-gabeflores

match-gabeflores commented Apr 11, 2023

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/64610269/sqlalchemy-hangs-during-insert-while-querying-information-schema-tables

Question about pandas

I'd like to bring attention to an issue where to_sql is blocked by an unrelated process, because pandas queries the information_schema metadata tables before inserting.

My example is specific to MSSQL (SQL Server), but I think it could apply to other database systems.

Steps to reproduce:

  • Run the Python code below first (it automatically creates the table).
  • Run the CREATE TABLE for a new, unrelated table inside an open transaction.
    • The key point is that this table is unrelated to the table you are inserting into with pandas.
  • Re-run the Python code to insert 10 records. This gets blocked because pandas can't query information_schema.

Other scenarios that lock information_schema cause the same blocking in pandas (e.g., rebuilding a clustered columnstore index on a large table).

Notes

In my opinion, we should be able to insert into an existing table regardless of what other processes are running.

Root cause

**SQL code to create the unrelated table**


begin tran 
-- in same database 
create table dbo.MyOtherTableUnrelatedProcess
(
	a int
)
--rollback

Code I used for testing the Python piece:

import pandas as pd
import sqlalchemy as sa
import urllib

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry', 'Isabel', 'Jack']
})

host = 'myhost'
schema = 'mydb'

params = urllib.parse.quote_plus("DRIVER={ODBC Driver 17 for SQL Server};"
                                 "SERVER=" + host + ";"
                                "DATABASE=" + schema + ";"
                               "trusted_connection=yes")

engine = sa.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params), fast_executemany=True)

print('writing')
# note: MyMainTable is blocked because of the open transaction in the unrelated
# process; to_sql can't query information_schema
df.to_sql('MyMainTable', con=engine, if_exists='append', index=False)
print('finished')

Checking the server, this is the hung query issued by pandas/SQLAlchemy:

SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
FROM [INFORMATION_SCHEMA].[TABLES]
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(@P1 AS NVARCHAR(max))
  AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(@P2 AS NVARCHAR(max))
ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
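
A minimal sketch of how the blocking can be confirmed from Python, assuming the engine object from the script above and VIEW SERVER STATE permission; it reads SQL Server's sys.dm_exec_requests DMV:

# Sketch only: lists sessions that are currently blocked and which session
# is blocking them (assumes the `engine` from the script above).
with engine.connect() as conn:
    blocked = conn.execute(sa.text(
        "SELECT session_id, blocking_session_id, wait_type, wait_time "
        "FROM sys.dm_exec_requests "
        "WHERE blocking_session_id <> 0"
    )).fetchall()
    for row in blocked:
        print(row)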

Pandas is blocked while loading into the table:
[screenshot]

@match-gabeflores match-gabeflores added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Apr 11, 2023
@phofl
Member

phofl commented Apr 11, 2023

Thanks for your report. Investigations welcome.

@phofl phofl added the IO SQL to_sql, read_sql, read_sql_query label Apr 11, 2023
@bjornasm

bjornasm commented Apr 24, 2023

Interesting, this solved my problem: I was getting "table 'xyz' already exists" despite having selected the option if_exists='append'. Thanks for the help. At the very least, a more helpful error message should be shown when the correct if_exists flag is selected, something like: "Is the database busy? Try waiting until the conflicting process has finished before trying again."

@jacobshaw42

Not sure if anyone else has tracked down the issue.

This is caused by check_case_sensitive in pandas.io.sql.py.

It happens because the get_table_names method in sqlalchemy.dialects.mssql.base.py does not use a NOLOCK hint.
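
As a minimal sketch (assuming the engine from the repro script above), the same metadata query can be reproduced directly through SQLAlchemy's reflection API, which is what the case-sensitivity check goes through:

# Sketch only: on the mssql dialect this reflection call issues the
# SELECT ... FROM [INFORMATION_SCHEMA].[TABLES] query shown earlier, and it
# hangs in the same way while the unrelated transaction holds its schema lock.
insp = sa.inspect(engine)
print(insp.get_table_names(schema="dbo"))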

@match-gabeflores
Author

match-gabeflores commented Aug 30, 2023

Thanks @jacobshaw42 for tracking it down!

Highlighting both lines below.

Do you want to create an issue on SqlAlchemy repo? Otherwise, I can.

Edit: I think it is something pandas should solve, actually. It should only do the case-sensitivity check for databases that are case-sensitive, and I'm also unsure whether it should be doing the check at all.
Alternatively, could SQLAlchemy read from the information_schema tables without getting blocked (maybe using READ UNCOMMITTED)?

@match-gabeflores
Author

match-gabeflores commented Aug 31, 2023

I tested changing the isolation level to READ UNCOMMITTED and confirmed that to_sql was able to finish running.

It would still be good to have a better solution; it's still unclear whether the fix belongs in SQLAlchemy or in pandas.

  • Maybe pandas should only do this case-sensitivity check for databases that are case-sensitive?
# https://docs.sqlalchemy.org/en/20/dialects/mssql.html#transaction-isolation-level
engine = sa.create_engine(
    "mssql+pyodbc:///?odbc_connect={}".format(params),
    fast_executemany=True,
    isolation_level="READ UNCOMMITTED",
)
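
A related sketch, assuming SQLAlchemy's documented Engine.execution_options(isolation_level=...) behavior: instead of building a separate engine, derive a copy of an existing engine with the relaxed isolation level, so only work routed through that copy reads uncommitted data:

# Sketch only: execution_options() returns a copy of the engine that shares
# the same connection pool; connections checked out through the copy use
# READ UNCOMMITTED, while the original engine keeps its default isolation level.
relaxed_engine = engine.execution_options(isolation_level="READ UNCOMMITTED")
df.to_sql('MyMainTable', con=relaxed_engine, if_exists='append', index=False)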

@jacobshaw42

@match-gabeflores

Thanks for the confirmation.

I am actually using dask, which takes the URI instead of the full engine. I attempted to pass

engine_kwargs={"fast_executemany":True, "isolation_level":"READ UNCOMMITTED"}

to the dask to_sql method, but it still got blocked (rough sketch of the attempt below).
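
Roughly what was attempted, as a sketch with placeholder values, assuming a dask version whose DataFrame.to_sql accepts engine_kwargs:

import dask.dataframe as dd

# Rough sketch of the attempt (placeholder table name and partition count);
# engine_kwargs are forwarded to create_engine, but in my case the metadata
# query still got blocked.
ddf = dd.from_pandas(df, npartitions=2)
ddf.to_sql(
    "MyMainTable",
    uri="mssql+pyodbc:///?odbc_connect={}".format(params),
    if_exists="append",
    index=False,
    engine_kwargs={"fast_executemany": True,
                   "isolation_level": "READ UNCOMMITTED"},
)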

I will create an issue under SQLAlchemy.

@mroeschke mroeschke added Bug and removed Usage Question Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 17, 2024