BUG: SQL writing can lead to data loss #6509
It is a fact that the dtype info is lost when using iterrows (and the data are maybe unnecessarily converted), but the column type info used to write the data to sql is constructed from the column types of the dataframe itself (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L484).

This will not work for sql, as the resulting types are int64 instead of int16/int32. But this is due to our mapping of the dtypes to sqlalchemy types, where all int types are mapped to Integer (no distinction between smallint/int/bigint). I am not that familiar with the sql conversion code. @mangecoeur, can you take a look at this?
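The mapping mentioned above could in principle distinguish integer widths. Here is a minimal sketch of such a finer-grained mapping; the helper name `sql_integer_type` and the plain-string SQL type names are illustrative assumptions, not pandas' actual code.

```python
import numpy as np

# Hypothetical sketch: map numpy signed-integer dtypes to distinct SQL
# integer types instead of collapsing them all to INTEGER. The function
# name and the string type names are illustrative, not pandas' code.
def sql_integer_type(dtype):
    dtype = np.dtype(dtype)
    if dtype.kind != 'i':
        raise ValueError("expected a signed integer dtype")
    return {1: "SMALLINT", 2: "SMALLINT", 4: "INTEGER", 8: "BIGINT"}[dtype.itemsize]

print(sql_integer_type(np.int16))  # SMALLINT
print(sql_integer_type(np.int32))  # INTEGER
print(sql_integer_type(np.int64))  # BIGINT
```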
One scenario where this can occur is when a DataFrame contains both int32 and float32 columns. Let's take this for a spin:

```python
import pandas as pd
import numpy as np

s1 = pd.Series(2**25 + 1, dtype=np.int32)
s2 = pd.Series(0.0, dtype=np.float32)
df = pd.DataFrame({'s1': s1, 's2': s2})

for r in df.iterrows():
    iterrows = list(r[1])
for t in df.itertuples():
    itertuples = list(t[1:])
for v in zip(iterrows, itertuples):
    print(v[0] - v[1])
```

The output shows a nonzero difference for the int column, which occurs because the last bit is lost when the row is converted to a float type.
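The precision loss in this scenario can be reproduced with the standard library alone: 2**25 + 1 needs 26 significant bits, while a float32 significand has only 24, so a round-trip through float32 drops the low bit. This is just a self-contained illustration of the rounding, not the pandas code path itself.

```python
import struct

# Round-trip an int through IEEE-754 binary32 ('f' format): the value
# 2**25 + 1 needs 26 significant bits, float32 holds 24, so the low bit
# is rounded away.
value = 2**25 + 1                               # 33554433
as_f32 = struct.unpack('f', struct.pack('f', value))[0]
print(as_f32)          # 33554432.0
print(value - as_f32)  # 1.0
```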
FWIW even storing as …
yep.. only shows up on 'bigger' values of ints when mixed.... IIRC @bashtage you caught this in to_stata...... The easy fix is to use itertuples; the other way is to use .values, but the problem with that is the same upfront dtype conversion.
@bashtage Thanks for the clarification! Now I see. I quickly tried using itertuples and this indeed fixes it (not the int32 -> int64 type conversion, but the data loss).
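For reference, a minimal sketch of why itertuples avoids the loss while iterrows does not: iterrows materializes each row as a Series with a single common dtype, while itertuples keeps per-column values. (The exact common dtype iterrows picks can vary between pandas versions; the key point is that the int32 value survives itertuples intact.)

```python
import numpy as np
import pandas as pd

# iterrows yields (index, Series) pairs; a mixed-dtype row is upcast to
# one common dtype. itertuples yields namedtuples with per-column values.
df = pd.DataFrame({'s1': pd.Series([2**25 + 1], dtype=np.int32),
                   's2': pd.Series([0.0], dtype=np.float32)})

row = next(df.iterrows())[1]             # one Series per row, common dtype
tup = next(df.itertuples(index=False))   # namedtuple, per-column values

print(row.dtype)     # a single upcast dtype for the whole row
print(int(tup.s1))   # 33554433 -- the int32 value survives
```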
I can do a PR. The other question remains whether we should try to preserve more of the dtypes (using smallint/int/bigint).
I tried to fix this on my side, also in the hope of more efficient inserts - see https://github.com/mangecoeur/pandas/tree/sql-dataloss-6509 - but I'm getting a lot of test failures for some reason, something to do with indexes being included or not. Is anyone else seeing this after changing to itertuples?
@mangecoeur If I only change the iterrows to itertuples (I think you also did some other changes in the branch you linked), then the tests run for me locally: https://github.com/jorisvandenbossche/pandas/compare/sql-itertuples
@jorisvandenbossche In theory the only other thing I changed was tweaking the tests, because I thought maybe we were writing the frame index where we should not be. Our versions are otherwise almost identical, except that I turned the for loop into a list comprehension. If yours works we should go with that.
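The list-comprehension form of the itertuples-based insert path might look like the sketch below: build the list of per-row dicts (as fed to an executemany-style call) in one expression. The variable names `keys` and `data_list` are illustrative assumptions, not the actual patch.

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of an itertuples-based insert path: one dict per
# row, keyed by column name, built with a list comprehension. Names are
# illustrative, not the actual pandas patch.
df = pd.DataFrame({'s1': pd.Series([2**25 + 1], dtype=np.int32),
                   's2': pd.Series([0.0], dtype=np.float32)})

keys = list(df.columns)
data_list = [dict(zip(keys, row)) for row in df.itertuples(index=False)]
print(data_list[0]['s1'])  # 33554433 -- no upcast, so no data loss
```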
@mangecoeur Travis is happy, so I will do a PR.
The problem was that Pandas' iterrows() converted int columns to float. The solution was using itertuples() instead. Check pandas-dev/pandas#6509
to_sql uses iterrows, which leads to data conversion and, in the case of mixed data types, can lead to data loss. (The same issue applies when using df.to_sql.)

I found the same bug in to_stata and have submitted a PR for it. to_sql is the only other location in pandas that uses iterrows, which should probably be avoided in all non-user code. It should be an easy fix using something like itertuples - I don't use the SQL code, so I'm not in a good position to fix this problem.