
QST: to_sql is unnecessarily blocked because of dependency on information_schema tables #52601

Open
match-gabeflores opened this issue Apr 11, 2023 · 6 comments
Labels
Bug IO SQL to_sql, read_sql, read_sql_query

Comments

@match-gabeflores

match-gabeflores commented Apr 11, 2023

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/64610269/sqlalchemy-hangs-during-insert-while-querying-information-schema-tables

Question about pandas

I'd like to bring attention to an issue where to_sql is blocked by an unrelated process, because pandas queries the information_schema metadata tables before inserting.

My example is specific to MSSQL (SQL Server), but I think it could apply to other database systems.

Steps to reproduce:

  • Run the Python code below first (it automatically creates the table).
  • Run the CREATE TABLE for a new, unrelated table inside an open transaction.
    • The key point is that this table is unrelated to the table you are inserting into with pandas.
  • Re-run the Python code to insert 10 records. This gets blocked because pandas can't query information_schema.

Other scenarios that lock information_schema cause the same blocking in pandas (e.g., rebuilding a clustered columnstore index on a large table).

Notes

In my opinion, we should be able to insert into an existing table regardless of what other processes are running.

Root cause

**SQL code to create the unrelated table**


begin tran 
-- in same database 
create table dbo.MyOtherTableUnrelatedProcess
(
	a int
)
--rollback

Code I used for testing the Python piece:

import pandas as pd
import sqlalchemy as sa
import urllib

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry', 'Isabel', 'Jack']
})

host = 'myhost'
schema = 'mydb'

params = urllib.parse.quote_plus("DRIVER={ODBC Driver 17 for SQL Server};"
                                 "SERVER=" + host + ";"
                                "DATABASE=" + schema + ";"
                               "trusted_connection=yes")

engine = sa.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params), fast_executemany=True)

print('writing')
# note: MyMainTable is blocked because of the open transaction in the unrelated
# process; to_sql can't query information_schema
df.to_sql('MyMainTable', con=engine, if_exists='append', index=False)
print('finished')

Checking the server, this is the hung query issued by pandas/SQLAlchemy:

SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
FROM [INFORMATION_SCHEMA].[TABLES]
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(@P1 AS NVARCHAR(max))
  AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(@P2 AS NVARCHAR(max))
ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
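
A minimal sketch of how the blocking can be confirmed from Python, assuming the engine object from the script above and VIEW SERVER STATE permission; it reads SQL Server's sys.dm_exec_requests DMV:

# Sketch only: lists sessions that are currently blocked and which session
# is blocking them (assumes the `engine` from the script above).
with engine.connect() as conn:
    blocked = conn.execute(sa.text(
        "SELECT session_id, blocking_session_id, wait_type, wait_time "
        "FROM sys.dm_exec_requests "
        "WHERE blocking_session_id <> 0"
    )).fetchall()
    for row in blocked:
        print(row)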

Pandas is blocked while loading into the table:
[screenshot]

@match-gabeflores match-gabeflores added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Apr 11, 2023
@phofl
Member

phofl commented Apr 11, 2023

Thanks for your report. Investigations welcome.

@phofl phofl added the IO SQL to_sql, read_sql, read_sql_query label Apr 11, 2023
@bjornasm

bjornasm commented Apr 24, 2023

Interesting, this solved my problem: I was getting "table 'xyz' already exists" despite having selected the option if_exists='append'. Thanks for the help. At the very least, a more helpful error message should be shown when the correct if_exists flag is selected, something like: "Is the database busy? Try waiting until the conflicting process has finished before trying again."

@jacobshaw42

Not sure if anyone else has tracked down the issue.

This is caused by check_case_sensitive in pandas.io.sql.py.

It happens because the get_table_names method in sqlalchemy.dialects.mssql.base.py does not use a NOLOCK hint.
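
As a minimal sketch (assuming the engine from the repro script above), the same metadata query can be reproduced directly through SQLAlchemy's reflection API, which is what the case-sensitivity check goes through:

# Sketch only: on the mssql dialect this reflection call issues the
# SELECT ... FROM [INFORMATION_SCHEMA].[TABLES] query shown earlier, and it
# hangs in the same way while the unrelated transaction holds its schema lock.
insp = sa.inspect(engine)
print(insp.get_table_names(schema="dbo"))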

@match-gabeflores
Author

match-gabeflores commented Aug 30, 2023

Thanks @jacobshaw42 for tracking it down!

Highlighting both lines below.

Do you want to create an issue on SqlAlchemy repo? Otherwise, I can.

Edit: I think it is something pandas should solve, actually. It should only do the case-sensitivity check for databases that are case-sensitive, and I'm also unsure whether it should be doing the check at all.
Alternatively, could SQLAlchemy read from the information_schema tables without getting blocked (maybe using READ UNCOMMITTED)?

@match-gabeflores
Author

match-gabeflores commented Aug 31, 2023

I tested changing the isolation level to READ UNCOMMITTED and confirmed that to_sql was able to finish running.

It would still be good to have a better solution; it's still unclear whether the fix belongs in SQLAlchemy or in pandas.

  • Maybe pandas should only do this case-sensitivity check for databases that are case-sensitive?
# https://docs.sqlalchemy.org/en/20/dialects/mssql.html#transaction-isolation-level
engine = sa.create_engine(
    "mssql+pyodbc:///?odbc_connect={}".format(params),
    fast_executemany=True,
    isolation_level="READ UNCOMMITTED",
)
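
A related sketch, assuming SQLAlchemy's documented Engine.execution_options(isolation_level=...) behavior: instead of building a separate engine, derive a copy of an existing engine with the relaxed isolation level, so only work routed through that copy reads uncommitted data:

# Sketch only: execution_options() returns a copy of the engine that shares
# the same connection pool; connections checked out through the copy use
# READ UNCOMMITTED, while the original engine keeps its default isolation level.
relaxed_engine = engine.execution_options(isolation_level="READ UNCOMMITTED")
df.to_sql('MyMainTable', con=relaxed_engine, if_exists='append', index=False)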

@jacobshaw42

@match-gabeflores

Thanks for the confirmation.

I am actually using dask, which takes the URI instead of the full engine. I attempted to pass

engine_kwargs={"fast_executemany":True, "isolation_level":"READ UNCOMMITTED"}

to the dask to_sql method, but it still got blocked (rough sketch of the attempt below).
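
Roughly what was attempted, as a sketch with placeholder values, assuming a dask version whose DataFrame.to_sql accepts engine_kwargs:

import dask.dataframe as dd

# Rough sketch of the attempt (placeholder table name and partition count);
# engine_kwargs are forwarded to create_engine, but in my case the metadata
# query still got blocked.
ddf = dd.from_pandas(df, npartitions=2)
ddf.to_sql(
    "MyMainTable",
    uri="mssql+pyodbc:///?odbc_connect={}".format(params),
    if_exists="append",
    index=False,
    engine_kwargs={"fast_executemany": True,
                   "isolation_level": "READ UNCOMMITTED"},
)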

I will create an issue under SQLAlchemy.

@mroeschke mroeschke added Bug and removed Usage Question Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 17, 2024