Suggestion: a dtype argument for read_sql #6798

Closed

Midnighter opened this issue Apr 4, 2014 · 19 comments
Labels
IO SQL to_sql, read_sql, read_sql_query

Comments

@Midnighter

The pandas.io.sql module has a convenient read_frame function which has been of great use to me. The function has a coerce_float argument. I propose making that more general, akin to the pandas.DataFrame constructor, which has a dtype argument. This would allow passing a dict with column names as keys and the desired data types as values. Columns not in the dict would be inferred normally.
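
A rough sketch of the kind of call I have in mind (the dtype keyword is hypothetical at this point; the table and column names are made up, and con is assumed to be an open database connection):

    from pandas.io import sql

    # Hypothetical: a per-column dtype mapping passed through to the reader.
    # Columns not listed in the dict would be inferred as they are today.
    frame = sql.read_frame(
        "SELECT id, label FROM my_table",
        con,
        dtype={"id": "int64", "label": str},  # 'dtype' does not exist yet
    )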

If this is not planned but desired I can look into patching that myself.

@jreback
Contributor

jreback commented Apr 4, 2014

Give the new SQL support in master/0.14 (coming soon) a try; see here: #6292

@jreback jreback added the SQL label Apr 4, 2014
@jreback
Contributor

jreback commented May 16, 2014

Not sure this is necessary, @jorisvandenbossche?

You can always convert_objects() if needed (but in general the dtypes are coerced properly).

@jorisvandenbossche
Member

@Midnighter Can you give a concrete example case where you would like to use such a keyword?

If there is good use for it, we can consider adding it, I think, but be aware that you already have coerce_float to convert all numeric-like strings to numerics, and parse_dates to parse datetime-like strings to datetimes. And in general the types in the database (ints, reals, strings, etc.) are already converted correctly to pandas types.
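
For illustration, the conversions that are already available look roughly like this (the engine URL, table and column names are made up):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///example.db")  # made-up connection

    # coerce_float turns decimal values into floats; parse_dates parses the
    # listed columns into datetime64. Other columns keep their inferred types.
    df = pd.read_sql_query(
        "SELECT * FROM my_table",
        engine,
        coerce_float=True,
        parse_dates=["created_at"],  # made-up column name
    )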

@Midnighter
Author

@jreback I'm ashamed to admit that I didn't get around to trying out the new SQL support in master yet.

@jorisvandenbossche My use case is a silly column that is of type string but actually only contains integers (or strings that represent integers, I should say). So yes, it's not a big deal to convert them after the data frame is returned, but I thought that adding the dtype argument would:

  1. Provide a clean interface when I want to mess with the data types manually (as in my situation).
  2. Mirror the DataFrame constructor.
  3. Avoid multiple conversion steps after column (array) creation.
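
For reference, the manual conversion I do after the frame comes back is roughly this ('num_as_str' being a made-up name for the offending column):

    # The column arrives as object dtype (strings that hold integers);
    # convert it by hand after the read.
    df["num_as_str"] = df["num_as_str"].astype(int)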

@jreback jreback changed the title from "Suggestion: a dtype argument for read_frame" to "Suggestion: a dtype argument for read_frame" on May 20, 2014
@jreback
Contributor

jreback commented May 20, 2014

@Midnighter

  1. You can check out df.convert_objects(convert_numeric=True) to do what you want (see the sketch after this list); this is by definition a user-initiated step; automatic conversion is for 'normal' cases, and this is not one.

  2. dtype in the DataFrame constructor is a single dtype, which is not normally what you want; you are probably looking to specify a dict of column->dtype.

  3. However, SQL normally can provide good dtype hints (even better than read_csv). Dtype conversion always involves multiple steps (it's actually more efficient that way, as these are vectorized operations).
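
A minimal sketch of the user-initiated step from point 1, together with the per-column mapping from point 2 (column names are made up; convert_objects was the API at the time and has since been deprecated):

    # Point 1: explicit, user-initiated conversion of object columns to numeric
    # (pandas ~0.14 API; later superseded by pd.to_numeric).
    df = df.convert_objects(convert_numeric=True)

    # Point 2: a per-column mapping, supported by later pandas as DataFrame.astype(dict).
    df = df.astype({"status": "int8", "price": "float64"})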

@jorisvandenbossche jorisvandenbossche changed the title from "Suggestion: a dtype argument for read_frame" to "Suggestion: a dtype argument for read_sql" on May 20, 2014
@jreback
Contributor

jreback commented Jul 7, 2014

@jorisvandenbossche close?

@Midnighter
Author

  1. That's really easy, thanks for the hint.

  2. I was sure that I did exactly that previously but I can't find the code any longer and can't seem to replicate it right now.

In general I'm very happy with the new SQL integration, great work!

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Nov 10, 2016
@halfmoonhalf

Say you have a MySQL table with a tinyint column. When it is downloaded by pandas.read_sql_table(), the column's dtype is int64, not int8. This could be a case where a dtype argument for read_sql is useful. Am I right?

Some details:

in MySQL:

    status tinyint(4) NOT NULL DEFAULT '0',   # <<--- tinyint

in pandas:

    df = pd.read_sql_table('xxx', engine)
    print(df.info())
    ...
    status    int64   # <<--- tinyint becomes int64
    ...
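
Until a dtype argument exists, the downcast has to happen after the read, e.g. (engine is an existing SQLAlchemy engine; 'xxx' and 'status' come from the example above):

    import pandas as pd

    df = pd.read_sql_table('xxx', engine)
    # tinyint arrives as int64; downcast by hand to recover the int8 width.
    df['status'] = df['status'].astype('int8')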

@tobyjohnson

I have a compelling use case for this functionality.

I have tables (in Impala, but I think the same problem would arise in MySQL) with columns of both integer and double type, all of which contain nulls. I want to write a short Python script that queries a table (the table name being a user-provided argument) and writes the table contents to a flat file. The intent is that the table schema might be unknown a priori to both the user and the script, so the script doesn't know which columns are integer vs. double.

So my code looks like:

    mydata = pd.read_sql('SELECT * FROM %s;' % sanitized_table_name, odbcConnection)
    mydata.to_csv(outputFile, sep='|', index=False)

But this fails to do what I need because any columns in the SQL database that are actually integer often contain nulls, so they get coerced to floats/doubles in the pandas dataframe. Then when I call to_csv(), these get printed with spurious numeric precision (like "123.0"). This then confuses the downstream application (which is a reload into Impala) because they are no longer recognized as integers.

I can't easily use the float_format argument when I call to_csv(), because my python script doesn't know which columns are meant to be integer and which columns are meant to be double.

If read_sql() took an argument like dtype=str, like read_csv() does, then my problem would be solved.
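
In the meantime, the workaround I can sketch looks roughly like this (it assumes a recent pandas with nullable integers, and that any float column whose non-null values are all whole numbers was really an integer column):

    import pandas as pd

    mydata = pd.read_sql('SELECT * FROM %s;' % sanitized_table_name, odbcConnection)

    # Columns that came back as float only because they contain NULLs:
    # cast them to the nullable Int64 dtype so to_csv writes "123" instead of "123.0".
    for col in mydata.select_dtypes(include='float').columns:
        non_null = mydata[col].dropna()
        if len(non_null) and (non_null == non_null.round()).all():
            mydata[col] = mydata[col].astype('Int64')

    mydata.to_csv(outputFile, sep='|', index=False)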

@kodonnell

Seconding @halfmoonhalf - my use case involved hundreds of one-hot encoded columns (i.e. 0 or 1), and it's a big memory burden to pull them in as int64s. If I remember rightly, there's a way to change such things in the cursor.

Workaround: iterate through in chunks, converting each chunk to the correct dtype, then append, etc. - roughly as in the sketch below.
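
Something along these lines (the chunk size and the list of one-hot columns are assumptions):

    import pandas as pd

    chunks = []
    for chunk in pd.read_sql(query, conn, chunksize=50000):
        # Downcast the one-hot columns chunk by chunk so memory never holds
        # the whole table at int64 width; 'onehot_cols' is a made-up list of names.
        chunk[onehot_cols] = chunk[onehot_cols].astype('int8')
        chunks.append(chunk)
    df = pd.concat(chunks, ignore_index=True)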

@jorisvandenbossche
Member

As I said elsewhere (#13049 (comment)), we would be happy to receive a contribution to make this better (e.g. a dtype keyword for read_sql).

But also, this will not necessarily solve memory problems, as this depends on the actual driver used, and pandas still gets the data from those drivers as Python objects.

@kodonnell

> But also, this will not necessarily solve memory problems, as this depends on the actual driver used, and pandas still gets the data from those drivers as Python objects.

With a naive implementation, yes. Though there are workarounds, e.g.

  • internally chunk, but convert each chunk to the correct type. Memory will only ever bloat by 1 / n_chunks of what it would with a complete read.
  • get more involved with drivers - e.g. you can specify the database type <-> python type mapping, or you could do some clever SQL cast(blah as <dtype>) sort of tricks (a rough sketch follows below). I wouldn't recommend this approach though, for maintenance etc.
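
For completeness, the CAST trick mentioned above might look something like this (table and column names are invented, and the exact CAST syntax depends on the database):

    # Push the conversion into the SQL itself so the driver already hands
    # back the desired type; not recommended for maintainability, as noted.
    df = pd.read_sql("SELECT CAST(flag AS CHAR) AS flag FROM events", conn)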

@konstantinmiller

The problem that I have is that whenever integer columns that contain NULLs get converted to float, you lose precision. Simply adding a dtype parameter won't help here, as pandas can't represent NULLs in integer columns. However, it would be cool if it were possible to replace NULLs with some constant value before creating the DataFrame, in addition to specifying dtype. :)

[As a workaround, it's of course also possible to COALESCE NULLs to a fixed constant.]

@polvoazul

> The problem that I have is that whenever integer columns that contain NULLs get converted to float, you lose precision. Simply adding a dtype parameter won't help here, as pandas can't represent NULLs in integer columns. However, it would be cool if it were possible to replace NULLs with some constant value before creating the DataFrame, in addition to specifying dtype. :)
>
> [As a workaround, it's of course also possible to COALESCE NULLs to a fixed constant.]

This is no longer true: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
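
For example, with the nullable integer dtype ('user_id' is a made-up column name):

    # NULLs survive as pd.NA instead of forcing the whole column to float64.
    df['user_id'] = df['user_id'].astype('Int64')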

@Mesqualito

Unfortunately I am not a pro (@jorisvandenbossche: "As I said elsewhere (#13049 (comment)), we would be happy to receive a contribution to make this better (e.g. a dtype keyword for read_sql).") - but I may "contribute" another use case for dtype specification in read_sql:

A lot of old-school ERP systems - which can be analyzed, corrected and migrated much better with the help of your great pandas library - keyed their program-internal pseudo-ENUMs to ints in the underlying database. Of course, some of those key-value pairs can be found in related tables - but unfortunately not all; a lot of those "ENUMs" are hard-coded.

To get useful information from the resulting dataframe, these integer values have to be replaced with something readable in the pandas Series. There are workarounds with cast, convert or inserting new columns - but specifying dtypes on read would be really handy here, e.g. turning a 1 into a '1' so it can easily be replaced in the next step, with .loc, Series.replace/regex, in multiple columns or DataFrame.replace (see the sketch below).
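
A minimal sketch of the kind of post-read replacement this would simplify (the mapping and the column name are invented):

    # Hard-coded pseudo-ENUM codes mapped to readable labels.
    status_labels = {1: 'open', 2: 'in progress', 3: 'closed'}
    df['status'] = df['status'].map(status_labels)

    # With a dtype keyword the column could instead arrive as strings ('1', '2', ...)
    # and be handled directly with Series.replace or DataFrame.replace.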

@pavlokurochka

Another use case: read_sql will coerce numeric strings with leading zeroes to float, which ruins US zip codes, for example.
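
A hedged workaround for that case (assumes five-digit codes, no NULLs, and a made-up 'zip' column):

    # Rebuild the leading zeros after the unwanted numeric coercion.
    df['zip'] = df['zip'].astype(int).astype(str).str.zfill(5)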

@hali-risclab

I would also like to request the dtype argument within read_sql. When reading in data from an Oracle DB, the table in Oracle defines certain columns as VARCHAR2, but the actual values are more suitable as CATEGORY in a pandas DataFrame. Writing extra code for chunking & astype('category') conversion is tedious, and my DBAs would not bend over backwards to change the table schema in Oracle just for my use case of reducing memory usage.

@wouter-vdb

Can be closed: the parameter was added in v2.0.0.
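
For anyone landing here later, with pandas >= 2.0 the call looks roughly like this (table and column names are examples):

    import pandas as pd

    df = pd.read_sql(
        "SELECT * FROM my_table",
        con,
        dtype={"status": "int8", "zip": "string", "kind": "category"},
    )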
