
Improve type handling in read_sql and read_sql_table #13049

Open
rsdenijs opened this issue May 1, 2016 · 17 comments
Labels: Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement, IO SQL (to_sql, read_sql, read_sql_query)

Comments

@rsdenijs

rsdenijs commented May 1, 2016

Problem

In pd.read_sql and pd.read_sql_table, when the chunksize parameter is set, Pandas builds each DataFrame with dtypes inferred from the data in that chunk. This is a problem when an INTEGER column contains null values in some chunks but not in others, leading the same column to be int64 in some chunks and float64 in others. A similar problem happens with strings.

In ETL processes, or simply when dumping large queries to disk in HDF5 format, the user currently has the burden of explicitly handling the type conversions of potentially many columns.

Solution?

Instead of guessing the type from a subset of the data, it should be possible to obtain the type information from the database and map it to the appropriate dtypes.

It is possible to obtain column information from SQLAlchemy when querying a full table by inspecting its metadata, but I was unsuccessful in finding a way to do it for a general query.
Although I am unaware of all the possible type problems that can arise, the DB-API does require cursor.description to specify whether each result column is nullable.
Pandas could use this information (optionally) to always interpret nullable numeric columns as floats and strings as object columns.
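
A minimal sketch of the problem, assuming an in-memory SQLite table (the table and column names are made up for illustration):

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, score INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)",
                     [(1, 10), (2, 20), (3, None), (4, 40)])

    # With chunksize=2 the first chunk has no NULLs and typically comes back
    # as int64, while the second chunk contains a NULL and comes back as float64.
    for chunk in pd.read_sql("SELECT * FROM events ORDER BY id", conn, chunksize=2):
        print(chunk.dtypes["score"])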

jreback added the Dtype Conversions and IO SQL labels on May 2, 2016
@jreback
Contributor

jreback commented May 2, 2016

The _wrap_result needs to incorporate the meta-data from the table and cast as appropriate (or potentially just pass it directly to .from_records).

@jorisvandenbossche

@jorisvandenbossche
Member

There is already the _harmonize_columns method (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L895) that is called in read_table after from_records is used. So the column information from the database is already used to some extent, but this method can possibly be improved.

However, the problem of e.g. possible NaNs in integer columns will not be solved by this, I think? The only way to be certain of always having a consistent dtype across chunks is to always convert integer columns to float (unless a not-null constraint is put on the column). I am not sure we should do that, as in many cases we would be converting integer columns without NaN to float unnecessarily.

@rsdenijs
Author

rsdenijs commented May 3, 2016

@jorisvandenbossche read_table could use the nullable information provided by SQLAlchemy. An integer column that is nullable could be cast to float in pandas.
In the case of read_query I did not find the column type from SQLAlchemy directly, but the type and nullable information are specified in the cursor description from the DB-API.

Cursor attributes
.description
This read-only attribute is a sequence of 7-item sequences.

Each of these sequences contains information describing one result column:

name
type_code
display_size
internal_size
precision
scale
null_ok

The first two items ( name and type_code ) are mandatory, the other five are optional and are set to None if no meaningful values can be provided.

This attribute will be None for operations that do not return rows or if the cursor has not had an operation invoked via the .execute*() method yet.

The type_code can be interpreted by comparing it to the Type Objects specified in the section below.

This is supported by most major drivers, with the exception of sqlite3 (for reasons I don't understand, since SQLite has no proper column types).
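
A small sketch of how that DB-API metadata could be read (a hypothetical helper; whether null_ok is actually populated depends on the driver, and sqlite3 leaves everything except the column name as None):

    def nullable_columns(cursor):
        # cursor.description is a sequence of 7-item sequences; item 0 is the
        # column name and item 6 is null_ok (True, False, or None if unknown).
        if cursor.description is None:
            return {}
        return {col[0]: col[6] for col in cursor.description}

    # Hypothetical usage with any DB-API connection `conn`:
    # cur = conn.cursor()
    # cur.execute("SELECT id, score FROM events")
    # print(nullable_columns(cur))   # e.g. {'id': False, 'score': True}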

@jorisvandenbossche
Member

read_table could use the nullable information provided by SQLAlchemy. An integer column that is nullable could be cast to float in pandas.

IMO the problem with this is that by default columns can hold NULLs, so I suppose in many cases people will not specify this, although in practice their columns may never hold NULLs. For all those cases the dtype of the returned column would now change, in many cases unnecessarily.

I am not saying the issue you raise is not a problem, because it certainly is, but I am considering what would be the best solution for all cases.

@rsdenijs
Author

rsdenijs commented May 3, 2016

I doubt that in serious environments non-nullable columns are left as nullable... but I guess we will never know. I think this could be handled by a keyword like use_metadata_nulls, or something with a better name.

@jorisvandenbossche
Member

jorisvandenbossche commented May 3, 2016

@rsdenijs That is quite possible, but the fact is that there are also a lot of less experienced people using pandas/SQL. The question then, of course, is to what extent we have to take those users into account for this issue.
(and it is actually more the problem that pandas cannot have integer columns with missing values ... but that is a whole other can of worms :-))

Anyway, trying to think of other ways to deal with this issue:

  • the issue with string columns and missing values should certainly be solvable I think (both can perfectly be object dtype, in that case we don't have the int/float issue)
  • We could also provide a way to specify dtypes in read_sql. But this would still be manual work, and probably does not have much advantage over just doing the astype after read_sql (maybe a little bit more convenience; see the sketch after this comment).
  • Something like what you suggest: keyword use_metadata_nulls to trigger this check. But it is always a tough balance between keeping the API simple and clear and providing the options you need.

Would you be interested in doing a PR for the first bullet point? This is in any case the non-controversial part, I think, and could already solve it for string columns (leaving only int columns to handle manually).
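
A rough sketch of the manual approach from the second bullet above (the helper name and column names are hypothetical, not an existing pandas API):

    import pandas as pd

    def read_sql_with_dtypes(sql, con, dtypes):
        # Read the full result, then coerce the dtypes chosen by the caller,
        # so the outcome does not depend on which values happen to be NULL.
        df = pd.read_sql(sql, con)
        return df.astype(dtypes)

    # df = read_sql_with_dtypes("SELECT * FROM events", conn,
    #                           {"id": "int64", "score": "float64"})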

@rsdenijs
Author

rsdenijs commented May 5, 2016

@jorisvandenbossche Actually, I might have been confused regarding the strings. String columns are always of type object, regardless of the presence of NaNs. For some reason I thought there was an actual string type in pandas. So although I would like to take a stab at it, I'm no longer sure what the goal would be.

Regarding the int types, I think that read_table (not read_query) should always inspect whether the column is nullable from the SQLAlchemy info. We are reading the col_type anyway, so why not check if it is nullable?
Specifically, I think the following part is bad when we are chunking, because we don't know whether later chunks will have nulls (in fact, I'm not sure it is ever achieving anything, as it is called after from_records, so pure int and pure bool columns should already have the right type):


                elif len(df_col) == df_col.count():
                    # No NA values, can convert ints and bools
                    if col_type is np.dtype('int64') or col_type is bool:
                        self.frame[col_name] = df_col.astype(
                            col_type, copy=False)

If for some reason we cannot verify from SQLAlchemy whether the column is nullable, then when chunking the default behaviour should IMO be to treat ints as floats.
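
A sketch of that nullable check via SQLAlchemy table reflection (illustrative table and engine; the reflection spelling may differ slightly between SQLAlchemy versions):

    import sqlalchemy as sa

    engine = sa.create_engine("sqlite:///:memory:")
    meta = sa.MetaData()
    sa.Table(
        "events", meta,
        sa.Column("id", sa.Integer, nullable=False),
        sa.Column("score", sa.Integer, nullable=True),
    )
    meta.create_all(engine)

    # read_sql_table-style code could reflect the table up front and decide
    # which integer columns may hold NULLs and therefore need float (or a
    # nullable integer dtype) in every chunk.
    reflected = sa.Table("events", sa.MetaData(), autoload_with=engine)
    for col in reflected.columns:
        print(col.name, col.type, "nullable:", col.nullable)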

@chananshgong

My problem is that even if the detection works, integers lose precision when cast to float, and my values are record IDs, so I need full 64-bit integer precision. Any workaround?
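
One possible workaround (a sketch, not from the thread): avoid the float64 round-trip entirely, for example by reading with coerce_float=False so the column stays object dtype, and then converting to pandas' nullable Int64 extension dtype (available since pandas 0.24), which keeps full 64-bit precision while still allowing missing values:

    import pandas as pd

    # df = pd.read_sql("SELECT record_id FROM events", conn, coerce_float=False)
    # df["record_id"] = df["record_id"].astype("Int64")

    # The same conversion shown on in-memory data; 2**53 + 1 is not exactly
    # representable as float64, so going through float would corrupt the ID.
    raw = pd.Series([1, None, 9007199254740993], dtype=object)
    print(raw.astype("Int64").iloc[2])   # 9007199254740993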

@konstantinmiller

It would be extremely helpful to be able to specify the types of columns as read_sql() input arguments! Could we maybe have at least that for the moment?

@jorisvandenbossche
Member

Yes, we can... if somebody makes a contribution to add it!
So a PR would be welcome to add a dtype argument to read_sql.

@sam-hoffman

sam-hoffman commented Apr 29, 2020

I'm interested in taking this on! Is a fix on this still welcome?

@jorisvandenbossche
Member

@sam-hoffman Contributions to improve type handling in SQL reading are certainly welcome, but I am not sure there is already a clear actionable conclusion from the above discussion (if I remember correctly; I didn't yet reread the whole thread). So maybe you can first propose more concretely what you would like to change?

@aaronlutz

I'm interested in taking this on! Is a fix on this still welcome?

@sam-hoffman please do!

@avinashpancham
Contributor

@jorisvandenbossche based on the above discussion I would propose to add a dtype arg for read_sql and read_sql_table. In #37546 I already added it for the read_sql_query function. Agree?

@silverdevelopper

Hello, what is the latest situation with this issue?

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@eirnym

eirnym commented Aug 10, 2024

As a temporary workaround for nullable int64 types I use the following, and I prefer to specify the type for each column explicitly.

dtype={
    'column': pd.Int64Dtype()
}
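
For context, a sketch of how that mapping is passed, assuming a pandas version in which read_sql_query accepts a dtype argument (the query and engine are illustrative only):

    import pandas as pd
    import sqlalchemy as sa

    engine = sa.create_engine("sqlite:///:memory:")
    df = pd.read_sql_query(
        'SELECT 1 AS "column" UNION ALL SELECT NULL',
        engine,
        dtype={"column": pd.Int64Dtype()},
    )
    print(df["column"].dtype)   # Int64, even though the result contains a NULL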

@tobwen

tobwen commented Sep 22, 2024

Seems like this is stale now?
