BUG: SQL writing can lead to data loss #6509

bashtage · 2014-02-28T17:13:47Z

to_sql uses iterrows which leads to data conversion, and in the case of mixed data types, can lead to data loss.

s1 = pd.Series([0],dtype=np.float32)
s2 = pd.Series([2**27 + 1],dtype=np.int32)
df = pd.DataFrame({'float':s1, 'int':s2})

for row in df.iterrows():
    print row[1][1] - (2**27 + 1)

(The same issue applies when using df.to_sql)

I found the same bug in to_stata and have submitted a PR on it.

to_sql is the only other location in pandas the uses iterrows, which should probably be avoided in all non-user code.

It should be an easy fix using something like itertuple - I don't use the SQL code so I'm not in a good position to fix this problem.

The text was updated successfully, but these errors were encountered:

jreback · 2014-02-28T17:16:10Z

cc @mangecoeur, @hayd, @jorisvandenbossche

jorisvandenbossche · 2014-03-10T19:53:05Z

I can see that using itertuples instead of iterrows can maybe be the better solution, but I don't see how this can lead to data loss. Because the iterrows is used internally in the sql code, so between using iterrows and inserting the data in an sql table, no operations are performed on the data.

It is a fact that the dtype info is lost (and data are maybe unnecessarily converted), but the column type info to write the data to sql is constructed from the column types of the dataframe itself (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L484).
But it is true that the dtypes are not conserved between writing and reading to/from eg postgresql. So eg the example from your tests for to_stata will not work for sql:

s1 = pd.Series(2**9, dtype=np.int16)
s2 = pd.Series(2**17, dtype=np.int32)
s3 = pd.Series(2**33, dtype=np.int64)
original = pd.DataFrame({'int16': s1, 'int32': s2, 'int64': s3})
original.index.name = 'index'

formatted = original
formatted['int64'] = formatted['int64'].astype(np.float64)

original.to_sql("test_read_write2", engine)
written_and_read_again = sql.read_table("test_read_write", engine)

This will not work for sql, as the resulting types are int64 instead of int16/int32. But this is due to our mapping of the dtypes to sqlalchemy types, where just all int types are mapped to Integer (no distinction between smallint/int/bigint).

But I am not that familiar with the sql conversion code. @mangecoeur, can you take a look at this?

bashtage · 2014-03-10T20:05:11Z

One scenario where this can occur is when a DataFrame contains both int32 and float32. When you use iterrows pandas decides to cast these to float32 which means that the integer no longer has 32 bits of precision (it now has about 25).

You cannot generate this by mixing int data - only by mixing int with float, and then with either float32 and int32 or float64 and int64.

Lets take this for a spin:

import pandas as pd
import numpy as np

s1 = pd.Series(2**25 + 1,dtype=np.int32)
s2 = pd.Series(0.0,dtype=np.float32)
df = pd.DataFrame({'s1': s1, 's2': s2})

for r in df.iterrows():
    iterrows = list(r[1])

for t in df.df.itertuples():
    itertuples = list(t[1:])

for v in zip(iterrows ,itertuples):
    print v[0] - v[1]

The output is

-1.0
0.0

Which occurs cein the last bit in the int32 is lost when it is upcast( sidecast??) to float32. pandas should always upcast any float32 mixed with any int32 to float64 which would preserve fidelity. But the same can happen with float64 and int64 -- fixing this would require float128, which isn't on offer.

bashtage · 2014-03-10T20:07:14Z

FWIW even storing as float64 doesn't work since the data-loss occurs when iterrows is called.

jreback · 2014-03-10T20:08:10Z

yep..only shows up on 'bigger' values of ints when mixed....IIRC @bashtage you caught this in to_stata......

easy fix is to user itertuples (which doesn't coerce to a uniform series)

other way is to use a dtype='object' Series (maybe as a kw to iterrows). though I think itertuples might be faster anyhow

the problem with iterrows generally is that it calls self.values which by definition tries to cast everything to a compatible type (where compatible doesn't take into account precision), just a type that will 'hold' everything

jorisvandenbossche · 2014-03-10T20:12:43Z

@bashtage Thanks for the clarification! Now I see, and I quickly tried using itertuples and this fixes this indeed (not the int32->int64 type conversion, but the data loss)

jorisvandenbossche · 2014-03-10T20:14:56Z

I can do a PR. The other question remains, whether we should try to preserve more of the dtypes (using smallint/int/bigint).

mangecoeur · 2014-03-10T20:15:56Z

I tried to fix this on my side, also in the hope of more efficient inserts - see https://github.com/mangecoeur/pandas/tree/sql-dataloss-6509 but I'm getting a lot of test failiures for some reason, something to do with indexes being included or not, is anyone else seeing this after changing to itertuples?

jorisvandenbossche · 2014-03-10T20:29:54Z

@mangecoeur If I only change the iterrows to itertuples (I think you also did some other changes in the branch you linked), then the tests run for me locally: https://github.com/jorisvandenbossche/pandas/compare/sql-itertuples

mangecoeur · 2014-03-10T20:37:08Z

@jorisvandenbossche in theory the only other thing I changed was trying to tweak the tests because I thought maybe we were writing the frame index where we should not be. Our versions are otherwise almost identical except that I turned the for loop into a list comprehension. if yours works we should go with that...

jorisvandenbossche · 2014-03-10T20:58:35Z

@mangecoeur Travis is happy, so will do a PR

The problem was that Pandas' iterrows() converted int columns to float. The solution was using itertuples() instead. Check pandas-dev/pandas#6509

jreback added Bug labels Feb 28, 2014

jreback added this to the 0.14.0 milestone Feb 28, 2014

jorisvandenbossche mentioned this issue Feb 28, 2014

ENH: sql support with SQLAlchemy - follow-up #6292

Closed

17 tasks

jorisvandenbossche mentioned this issue Mar 10, 2014

BUG: replace iterrows with itertuples in sql insert (GH6509) #6591

Merged

jorisvandenbossche closed this as completed in #6591 Mar 11, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: SQL writing can lead to data loss #6509

BUG: SQL writing can lead to data loss #6509

bashtage commented Feb 28, 2014

jreback commented Feb 28, 2014

jorisvandenbossche commented Mar 10, 2014

bashtage commented Mar 10, 2014

bashtage commented Mar 10, 2014

jreback commented Mar 10, 2014

jorisvandenbossche commented Mar 10, 2014

jorisvandenbossche commented Mar 10, 2014

mangecoeur commented Mar 10, 2014

jorisvandenbossche commented Mar 10, 2014

mangecoeur commented Mar 10, 2014

jorisvandenbossche commented Mar 10, 2014

BUG: SQL writing can lead to data loss #6509

BUG: SQL writing can lead to data loss #6509

Comments

bashtage commented Feb 28, 2014

jreback commented Feb 28, 2014

jorisvandenbossche commented Mar 10, 2014

bashtage commented Mar 10, 2014

bashtage commented Mar 10, 2014

jreback commented Mar 10, 2014

jorisvandenbossche commented Mar 10, 2014

jorisvandenbossche commented Mar 10, 2014

mangecoeur commented Mar 10, 2014

jorisvandenbossche commented Mar 10, 2014

mangecoeur commented Mar 10, 2014

jorisvandenbossche commented Mar 10, 2014