Suggestion: a dtype argument for read_sql #6798

Closed

Midnighter opened this issue Apr 4, 2014 · 19 comments
Labels
IO SQL to_sql, read_sql, read_sql_query

Comments

@Midnighter

The pandas.io.sql module has a convenient read_frame function which has been of great use to me. The function has a coerce_float argument. I propose making that more general, akin to the pandas.DataFrame constructor, which has a dtype argument. This would allow passing a dict with column names as keys and the desired data types as values. Columns not in the dict would be inferred normally.
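
A rough sketch of the kind of call I have in mind (the dtype keyword is hypothetical at this point; the table and column names are made up, and con is assumed to be an open database connection):

    from pandas.io import sql

    # Hypothetical: a per-column dtype mapping passed through to the reader.
    # Columns not listed in the dict would be inferred as they are today.
    frame = sql.read_frame(
        "SELECT id, label FROM my_table",
        con,
        dtype={"id": "int64", "label": str},  # 'dtype' does not exist yet
    )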

If this is not planned but desired I can look into patching that myself.

@jreback
Contributor

jreback commented Apr 4, 2014

Give the new SQL support in master/0.14 (coming soon) a try; see here: #6292

@jreback jreback added the SQL label Apr 4, 2014
@jreback
Contributor

jreback commented May 16, 2014

Not sure this is necessary, @jorisvandenbossche?

You can always convert_objects() if needed (but in general the dtypes are coerced properly).

@jorisvandenbossche
Member

@Midnighter Can you give a concrete example case where you would like to use such a keyword?

If there is good use for it, we can consider adding it, I think, but be aware that you already have coerce_float to convert all numeric-like strings to numerics, and parse_dates to parse datetime-like strings to datetimes. And in general the types in the database (ints, reals, strings, etc.) are already converted correctly to pandas types.
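
For illustration, the conversions that are already available look roughly like this (the engine URL, table and column names are made up):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///example.db")  # made-up connection

    # coerce_float turns decimal values into floats; parse_dates parses the
    # listed columns into datetime64. Other columns keep their inferred types.
    df = pd.read_sql_query(
        "SELECT * FROM my_table",
        engine,
        coerce_float=True,
        parse_dates=["created_at"],  # made-up column name
    )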

@Midnighter
Author

@jreback I'm ashamed to admit that I didn't get around to trying out the new SQL support in master yet.

@jorisvandenbossche My use case is a silly column that is of type string but actually only contains integers (or strings that represent integers, I should say). So yes, it's not a big deal to convert them after the data frame is returned, but I thought that adding the dtype argument would:

  1. Provide a clean interface when I want to mess with the data types manually (as in my situation).
  2. Mirror the DataFrame constructor.
  3. Avoid multiple conversion steps after column (array) creation.
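
For reference, the manual conversion I do after the frame comes back is roughly this ('num_as_str' being a made-up name for the offending column):

    # The column arrives as object dtype (strings that hold integers);
    # convert it by hand after the read.
    df["num_as_str"] = df["num_as_str"].astype(int)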

@jreback jreback changed the title from "Suggestion: a dtype argument for read_frame" to "Suggestion: a dtype argument for read_frame" on May 20, 2014
@jreback
Contributor

jreback commented May 20, 2014

@Midnighter

  1. You can check out df.convert_objects(convert_numeric=True) to do what you want (see the sketch after this list); this is by definition a user-initiated step; automatic conversion is for 'normal' cases, and this is not one.

  2. dtype in the DataFrame constructor is a single dtype, which is not normally what you want; you are probably looking to specify a dict of column->dtype.

  3. However, SQL normally can provide good dtype hints (even better than read_csv). Dtype conversion always involves multiple steps (it's actually more efficient that way, as these are vectorized operations).
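
A minimal sketch of the user-initiated step from point 1, together with the per-column mapping from point 2 (column names are made up; convert_objects was the API at the time and has since been deprecated):

    # Point 1: explicit, user-initiated conversion of object columns to numeric
    # (pandas ~0.14 API; later superseded by pd.to_numeric).
    df = df.convert_objects(convert_numeric=True)

    # Point 2: a per-column mapping, supported by later pandas as DataFrame.astype(dict).
    df = df.astype({"status": "int8", "price": "float64"})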

@jorisvandenbossche jorisvandenbossche changed the title from "Suggestion: a dtype argument for read_frame" to "Suggestion: a dtype argument for read_sql" on May 20, 2014
@jreback
Contributor

jreback commented Jul 7, 2014

@jorisvandenbossche close?

@Midnighter
Author

  1. That's really easy, thanks for the hint.

  2. I was sure that I did exactly that previously but I can't find the code any longer and can't seem to replicate it right now.

In general I'm very happy with the new SQL integration, great work!

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Nov 10, 2016
@halfmoonhalf

Say you have a MySQL table with a tinyint column. When it is downloaded by pandas.read_sql_table(), the column's dtype is int64, not int8. This could be a case where a dtype argument for read_sql is useful. Am I right?

Some details:

in MySQL:

    status tinyint(4) NOT NULL DEFAULT '0',   # <<--- tinyint

in pandas:

    df = pd.read_sql_table('xxx', engine)
    print(df.info())
    ...
    status    int64   # <<--- tinyint becomes int64
    ...
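
Until a dtype argument exists, the downcast has to happen after the read, e.g. (engine is an existing SQLAlchemy engine; 'xxx' and 'status' come from the example above):

    import pandas as pd

    df = pd.read_sql_table('xxx', engine)
    # tinyint arrives as int64; downcast by hand to recover the int8 width.
    df['status'] = df['status'].astype('int8')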

@tobyjohnson

I have a compelling use case for this functionality.

I have tables (in Impala, but I think the same problem would arise in MySQL) with columns of both integer and double type, all of which contain nulls. I want to write a short Python script that queries a table (the table name being a user-provided argument) and writes the table contents to a flat file. The intent is that the table schema might be unknown a priori to both the user and the script, so the script doesn't know which columns are integer vs. double.

So my code looks like:

    mydata = pd.read_sql('SELECT * FROM %s;' % sanitized_table_name, odbcConnection)
    mydata.to_csv(outputFile, sep='|', index=False)

But this fails to do what I need because any columns in the SQL database that are actually integer often contain nulls, so they get coerced to floats/doubles in the pandas dataframe. Then when I call to_csv(), these get printed with spurious numeric precision (like "123.0"). This then confuses the downstream application (which is a reload into Impala) because they are no longer recognized as integers.

I can't easily use the float_format argument when I call to_csv(), because my python script doesn't know which columns are meant to be integer and which columns are meant to be double.

If read_sql() took an argument like dtype=str, like read_csv() does, then my problem would be solved.
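
In the meantime, the workaround I can sketch looks roughly like this (it assumes a recent pandas with nullable integers, and that any float column whose non-null values are all whole numbers was really an integer column):

    import pandas as pd

    mydata = pd.read_sql('SELECT * FROM %s;' % sanitized_table_name, odbcConnection)

    # Columns that came back as float only because they contain NULLs:
    # cast them to the nullable Int64 dtype so to_csv writes "123" instead of "123.0".
    for col in mydata.select_dtypes(include='float').columns:
        non_null = mydata[col].dropna()
        if len(non_null) and (non_null == non_null.round()).all():
            mydata[col] = mydata[col].astype('Int64')

    mydata.to_csv(outputFile, sep='|', index=False)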

@kodonnell

Seconding @halfmoonhalf - my use case involved hundreds of one-hot encoded columns (i.e. 0 or 1), and it's a big memory burden to pull them in as int64s. If I remember rightly, there's a way to change such things in the cursor.

Workaround: iterate through in chunks, converting each chunk to the correct dtype, then append, etc. - roughly as in the sketch below.
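
Something along these lines (the chunk size and the list of one-hot columns are assumptions):

    import pandas as pd

    chunks = []
    for chunk in pd.read_sql(query, conn, chunksize=50000):
        # Downcast the one-hot columns chunk by chunk so memory never holds
        # the whole table at int64 width; 'onehot_cols' is a made-up list of names.
        chunk[onehot_cols] = chunk[onehot_cols].astype('int8')
        chunks.append(chunk)
    df = pd.concat(chunks, ignore_index=True)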

@jorisvandenbossche
Member

As I said elsewhere (#13049 (comment)), we would be happy to receive a contribution to make this better (e.g. a dtype keyword for read_sql).

But also, this will not necessarily solve memory problems, as this depends on the actual driver used, and pandas still gets the data from those drivers as Python objects.

@kodonnell

> But also, this will not necessarily solve memory problems, as this depends on the actual driver used, and pandas still gets the data from those drivers as Python objects.

With a naive implementation, yes. Though there are workarounds, e.g.

  • internally chunk, but convert each chunk to the correct type. Memory will only ever bloat by 1 / n_chunks of what it would with a complete read.
  • get more involved with drivers - e.g. you can specify the database type <-> python type mapping, or you could do some clever SQL cast(blah as <dtype>) sort of tricks (a rough sketch follows below). I wouldn't recommend this approach though, for maintenance etc.
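
For completeness, the CAST trick mentioned above might look something like this (table and column names are invented, and the exact CAST syntax depends on the database):

    # Push the conversion into the SQL itself so the driver already hands
    # back the desired type; not recommended for maintainability, as noted.
    df = pd.read_sql("SELECT CAST(flag AS CHAR) AS flag FROM events", conn)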

@konstantinmiller

The problem that I have is that whenever integer columns that contain NULLs get converted to float, you lose precision. Simply adding a dtype parameter won't help here, as pandas can't represent NULLs in integer columns. However, it would be cool if it were possible to replace NULLs with some constant value before creating the DataFrame, in addition to specifying dtype. :)

[As a workaround, it's of course also possible to COALESCE NULLs to a fixed constant.]

@polvoazul

> The problem that I have is that whenever integer columns that contain NULLs get converted to float, you lose precision. Simply adding a dtype parameter won't help here, as pandas can't represent NULLs in integer columns. However, it would be cool if it were possible to replace NULLs with some constant value before creating the DataFrame, in addition to specifying dtype. :)
>
> [As a workaround, it's of course also possible to COALESCE NULLs to a fixed constant.]

This is no longer true: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
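
For example, with the nullable integer dtype ('user_id' is a made-up column name):

    # NULLs survive as pd.NA instead of forcing the whole column to float64.
    df['user_id'] = df['user_id'].astype('Int64')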

@Mesqualito

Unfortunately I am not a pro (@jorisvandenbossche: "As I said elsewhere (#13049 (comment)), we would be happy to receive a contribution to make this better (e.g. a dtype keyword for read_sql).") - but I may "contribute" another use case for dtype specification in read_sql:

A lot of old-school ERP systems - which can be analyzed, corrected and migrated much better with the help of your great pandas library - keyed their program-internal pseudo-ENUMs to ints in the underlying database. Of course, some of those key-value pairs can be found in related tables - but unfortunately not all; a lot of those "ENUMs" are hard-coded.

To get useful information from the resulting dataframe, these integer values have to be replaced with something readable in the pandas Series. There are workarounds with cast, convert or inserting new columns - but specifying dtypes on read would be really handy here, e.g. turning a 1 into a '1' so it can easily be replaced in the next step, with .loc, Series.replace/regex, in multiple columns or DataFrame.replace (see the sketch below).
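
A minimal sketch of the kind of post-read replacement this would simplify (the mapping and the column name are invented):

    # Hard-coded pseudo-ENUM codes mapped to readable labels.
    status_labels = {1: 'open', 2: 'in progress', 3: 'closed'}
    df['status'] = df['status'].map(status_labels)

    # With a dtype keyword the column could instead arrive as strings ('1', '2', ...)
    # and be handled directly with Series.replace or DataFrame.replace.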

@pavlokurochka

Another use case: read_sql will coerce numeric strings with leading zeroes to float, which ruins US zip codes, for example.
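
A hedged workaround for that case (assumes five-digit codes, no NULLs, and a made-up 'zip' column):

    # Rebuild the leading zeros after the unwanted numeric coercion.
    df['zip'] = df['zip'].astype(int).astype(str).str.zfill(5)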

@hali-risclab

I would also like to request the dtype argument within read_sql. When reading in data from an Oracle DB, the table in Oracle defines certain columns as VARCHAR2, but the actual values are more suitable as CATEGORY in a pandas DataFrame. Writing extra code for chunking & astype('category') conversion is tedious, and my DBAs would not bend over backwards to change the table schema in Oracle just for my use case of reducing memory usage.

@wouter-vdb

Can be closed: the parameter was added in v2.0.0.
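
For anyone landing here later, with pandas >= 2.0 the call looks roughly like this (table and column names are examples):

    import pandas as pd

    df = pd.read_sql(
        "SELECT * FROM my_table",
        con,
        dtype={"status": "int8", "zip": "string", "kind": "category"},
    )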
