Suggestion: a dtype argument for read_sql #6798
Comments
give the new sql support in master/0.14 (coming soon) a try, see here: #6292
not sure this is necessary, @jorisvandenbossche? can always ...
@Midnighter Can you give a concrete example of a case where you would like to use such a keyword? If there is good use for it, we can consider adding it I think, but be aware that you already have ...
@jreback I'm ashamed to admit that I didn't get around to trying out the new SQL support in master yet. @jorisvandenbossche My use case is a silly column that is of type string but actually only contains integers (or strings that represent integers, I should say). So yes, it's not a big deal to convert them after the data frame is returned, but I thought that adding the dtype argument for read_frame would be worth suggesting.
@jorisvandenbossche close?
In general I'm very happy with the new SQL integration, great work!
Say you have a MySQL table with a TINYINT column: when it is downloaded by pandas.read_sql_table(), the column's dtype is int64, not int8. This could be a case where a dtype argument for read_sql is useful. Am I right?
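At the moment the only option is to downcast after the fact; here is a minimal sketch of that workaround, assuming a MySQL table named my_table with a TINYINT column called flag (both names are made up) and a SQLAlchemy engine:

```python
# Downcast-after-read workaround; "my_table", "flag" and the connection URL
# are placeholders, not taken from the original comment.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

df = pd.read_sql_table("my_table", engine)
print(df["flag"].dtype)              # int64: read_sql_table infers a wide integer type

df["flag"] = df["flag"].astype("int8")
print(df["flag"].dtype)              # int8
```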
I have a compelling use case for this functionality. I have tables (in Impala, but I think the same problem would arise in MySQL) with columns of both integer and double type, all of which contain nulls. I want to write a short Python script that queries a table (the table name being a user-provided argument) and writes the table contents to a flat file. The intent is that the table schema might be unknown a priori to both the user and the script, so the script doesn't know which columns are integer vs double. So my code looks like:

mydata = pd.read_sql('SELECT * FROM %s;' % sanitized_table_name, odbcConnection)

But this fails to do what I need because columns in the SQL database that are actually integer often contain nulls, so they get coerced to floats/doubles in the pandas dataframe. Then when I call to_csv(), these get printed with spurious numeric precision (like "123.0"). This then confuses the downstream application (which is a reload into Impala) because they are not recognized as integer any more. I can't easily use the float_format argument when I call to_csv(), because my Python script doesn't know which columns are meant to be integer and which are meant to be double. If read_sql() took an argument like dtype=str, like read_csv() does, then my problem would be solved.
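One possible workaround (a rough sketch, not from the original comment) is to detect float columns whose non-null values are all whole numbers and cast them to the nullable Int64 dtype before writing the CSV; the query and connection object here are placeholders:

```python
# Cast integer-looking float columns to nullable Int64 so to_csv() writes
# "123" instead of "123.0". Query, table name and connection are placeholders.
import pandas as pd

mydata = pd.read_sql("SELECT * FROM some_table", odbc_connection)

for col in mydata.select_dtypes(include="float"):
    vals = mydata[col].dropna()
    if len(vals) and (vals == vals.round()).all():   # looks like an integer column with NULLs
        mydata[col] = mydata[col].astype("Int64")

mydata.to_csv("dump.csv", index=False)
```

This only guesses at the intent of the schema; a dtype argument on read_sql() would avoid the guessing entirely.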
Seconding @halfmoonhalf - my use case involved hundreds of one-hot encoded columns (i.e. ...). Workaround: iterate through in chunks, converting each chunk to the correct dtype, then append, etc.
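For reference, the chunked workaround described above might look something like this sketch; the query, connection and the "is_" column-naming convention are invented for illustration:

```python
# Read in chunks, downcast the one-hot columns in each chunk, then concatenate.
import pandas as pd

chunks = []
for chunk in pd.read_sql("SELECT * FROM features", conn, chunksize=50_000):
    onehot_cols = [c for c in chunk.columns if c.startswith("is_")]  # hypothetical naming scheme
    chunk[onehot_cols] = chunk[onehot_cols].astype("int8")
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
```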
As I said elsewhere (#13049 (comment)), we would be happy to receive a contribution to make this better (e.g. a dtype keyword for read_sql). But also, this will not necessarily solve memory problems, as this depends on the actual driver used, and pandas still gets the data from those drivers as Python objects.
With a naive implementation, yes. Though there are workarounds, e.g. ...
The problem that I have is that whenever integer columns that contain NULLs get converted to float, you lose precision. However, simply adding a ... [As a workaround it's of course also possible to ...]
This is no longer true: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
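A quick illustration of the nullable integer dtype that page describes; with it, missing values no longer force an integer column to float:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)   # Int64
print(s)         # 1, 2, <NA> -- the integers keep their exact values
```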
Unfortunately I am not a pro (@jorisvandenbossche: "As I said elsewhere (#13049 (comment)), we would be happy to receive a contribution to make this better (e.g. a dtype keyword for read_sql).") - but I may "contribute" another use case for dtype specification in read_sql: a lot of old-school ERP systems - which can be analyzed, corrected and migrated much better with the help of your great pandas library - keyed their program-internal pseudo-ENUMs to ints in the particular database. Of course some of those key-value pairs can be found in related tables, but unfortunately not all; a lot of those "ENUMs" are hard-coded. To get useful information from the resulting dataframe, these integer values have to be replaced with something readable in the pandas Series. There are workarounds with cast, convert or inserting new columns, but specifying dtypes on read would be really handy here, e.g. turning a 1 into a '1' so it can be replaced easily in the next step, with .loc, Series.replace/regex, in multiple columns, or DataFrame.replace.
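A small sketch of the decoding step described above, with a made-up status column and mapping; the point is only that reading the keys as strings would let Series.replace run directly on the result:

```python
import pandas as pd

# Hard-coded pseudo-ENUM keys, as they might come out of an ERP table.
df = pd.DataFrame({"status": [1, 2, 3, 1]})

labels = {"1": "open", "2": "closed", "3": "archived"}   # invented mapping

# Today: cast to str first, then replace; a dtype on read would skip the cast.
df["status"] = df["status"].astype(str).replace(labels)
print(df["status"].tolist())   # ['open', 'closed', 'archived', 'open']
```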
Another use case: read_sql will coerce numeric strings with leading zeroes to float, which ruins US zip codes, for example.
I would also like to request the dtype argument within read_sql. When reading data from an Oracle DB, the table in Oracle defines certain columns as VARCHAR2, but the actual values are more suitable as CATEGORY in a pandas dataframe. Writing extra code for chunking and astype('category') conversion is tedious, and my DBAs would not bend over backwards to change the table schema in Oracle just for my use case of reducing memory usage.
Can be closed: the parameter was added in v2.0.0. |
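For anyone landing here later, usage looks roughly like this in pandas 2.0+ (the query, connection and column names are placeholders):

```python
import pandas as pd

df = pd.read_sql(
    "SELECT zip_code, status, amount FROM orders",
    conn,
    dtype={"zip_code": "string", "status": "category", "amount": "Int64"},
)
```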
The pandas.io.sql module has a convenient read_frame function which has been of great use to me. The function has an argument coerce_float. I propose to make that more general, akin to the pandas.DataFrame initialization method, which has an argument dtype. This would allow passing a dict with column names as keys and the desired data types as values. Columns not in the dict would be inferred normally. If this is not planned but desired, I can look into patching that myself.
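A sketch of the call shape the proposal describes; read_frame did not accept a dtype argument at the time, and the column names here are purely illustrative:

```python
from pandas.io import sql

df = sql.read_frame(
    "SELECT id, zip_code, amount FROM orders",
    con,
    dtype={"zip_code": str, "id": "int32"},   # proposed: columns not listed are inferred as usual
)
```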