Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: SQL writing can lead to data loss #6509

Closed
bashtage opened this issue Feb 28, 2014 · 11 comments · Fixed by #6591
Closed

BUG: SQL writing can lead to data loss #6509

bashtage opened this issue Feb 28, 2014 · 11 comments · Fixed by #6591
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions IO SQL to_sql, read_sql, read_sql_query
Milestone

Comments

@bashtage
Copy link
Contributor

to_sql uses iterrows which leads to data conversion, and in the case of mixed data types, can lead to data loss.

s1 = pd.Series([0],dtype=np.float32)
s2 = pd.Series([2**27 + 1],dtype=np.int32)
df = pd.DataFrame({'float':s1, 'int':s2})

for row in df.iterrows():
    print row[1][1] - (2**27 + 1)

(The same issue applies when using df.to_sql)

I found the same bug in to_stata and have submitted a PR on it.

to_sql is the only other location in pandas the uses iterrows, which should probably be avoided in all non-user code.

It should be an easy fix using something like itertuple - I don't use the SQL code so I'm not in a good position to fix this problem.

@jreback
Copy link
Contributor

jreback commented Feb 28, 2014

@jorisvandenbossche
Copy link
Member

I can see that using itertuples instead of iterrows can maybe be the better solution, but I don't see how this can lead to data loss. Because the iterrows is used internally in the sql code, so between using iterrows and inserting the data in an sql table, no operations are performed on the data.

It is a fact that the dtype info is lost (and data are maybe unnecessarily converted), but the column type info to write the data to sql is constructed from the column types of the dataframe itself (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L484).
But it is true that the dtypes are not conserved between writing and reading to/from eg postgresql. So eg the example from your tests for to_stata will not work for sql:

s1 = pd.Series(2**9, dtype=np.int16)
s2 = pd.Series(2**17, dtype=np.int32)
s3 = pd.Series(2**33, dtype=np.int64)
original = pd.DataFrame({'int16': s1, 'int32': s2, 'int64': s3})
original.index.name = 'index'

formatted = original
formatted['int64'] = formatted['int64'].astype(np.float64)

original.to_sql("test_read_write2", engine)
written_and_read_again = sql.read_table("test_read_write", engine)

This will not work for sql, as the resulting types are int64 instead of int16/int32. But this is due to our mapping of the dtypes to sqlalchemy types, where just all int types are mapped to Integer (no distinction between smallint/int/bigint).

But I am not that familiar with the sql conversion code. @mangecoeur, can you take a look at this?

@bashtage
Copy link
Contributor Author

One scenario where this can occur is when a DataFrame contains both int32 and float32. When you use iterrows pandas decides to cast these to float32 which means that the integer no longer has 32 bits of precision (it now has about 25).

You cannot generate this by mixing int data - only by mixing int with float, and then with either float32 and int32 or float64 and int64.

Lets take this for a spin:

import pandas as pd
import numpy as np

s1 = pd.Series(2**25 + 1,dtype=np.int32)
s2 = pd.Series(0.0,dtype=np.float32)
df = pd.DataFrame({'s1': s1, 's2': s2})

for r in df.iterrows():
    iterrows = list(r[1])

for t in df.df.itertuples():
    itertuples = list(t[1:])

for v in zip(iterrows ,itertuples):
    print v[0] - v[1]

The output is

-1.0
0.0

Which occurs cein the last bit in the int32 is lost when it is upcast( sidecast??) to float32. pandas should always upcast any float32 mixed with any int32 to float64 which would preserve fidelity. But the same can happen with float64 and int64 -- fixing this would require float128, which isn't on offer.

@bashtage
Copy link
Contributor Author

FWIW even storing as float64 doesn't work since the data-loss occurs when iterrows is called.

@jreback
Copy link
Contributor

jreback commented Mar 10, 2014

yep..only shows up on 'bigger' values of ints when mixed....IIRC @bashtage you caught this in to_stata......

easy fix is to user itertuples (which doesn't coerce to a uniform series)

other way is to use a dtype='object' Series (maybe as a kw to iterrows). though I think itertuples might be faster anyhow

the problem with iterrows generally is that it calls self.values which by definition tries to cast everything to a compatible type (where compatible doesn't take into account precision), just a type that will 'hold' everything

@jorisvandenbossche
Copy link
Member

@bashtage Thanks for the clarification! Now I see, and I quickly tried using itertuples and this fixes this indeed (not the int32->int64 type conversion, but the data loss)

@jorisvandenbossche
Copy link
Member

I can do a PR. The other question remains, whether we should try to preserve more of the dtypes (using smallint/int/bigint).

@mangecoeur
Copy link
Contributor

I tried to fix this on my side, also in the hope of more efficient inserts - see https://github.com/mangecoeur/pandas/tree/sql-dataloss-6509 but I'm getting a lot of test failiures for some reason, something to do with indexes being included or not, is anyone else seeing this after changing to itertuples?

@jorisvandenbossche
Copy link
Member

@mangecoeur If I only change the iterrows to itertuples (I think you also did some other changes in the branch you linked), then the tests run for me locally: https://github.com/jorisvandenbossche/pandas/compare/sql-itertuples

@mangecoeur
Copy link
Contributor

@jorisvandenbossche in theory the only other thing I changed was trying to tweak the tests because I thought maybe we were writing the frame index where we should not be. Our versions are otherwise almost identical except that I turned the for loop into a list comprehension. if yours works we should go with that...

@jorisvandenbossche
Copy link
Member

@mangecoeur Travis is happy, so will do a PR

vitorbaptista added a commit to vitorbaptista/codigo-mestrado that referenced this issue Apr 15, 2015
The problem was that Pandas' iterrows() converted int columns to float. The
solution was using itertuples() instead.

Check pandas-dev/pandas#6509
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants